1.1 Autoencoder networks and unsupervised learning
Gathering labeled datasets is onerous, so unsupervised learning is an important research area. Autoencoder networks encode the input into a latent space. They can be written in two parts, an encoder E mapping the input x to a latent code z = E(x), and a decoder D mapping the code back to a reconstruction D(z),
where the encoder and decoder networks are trained jointly to minimize the distance between the input and the output for some training set. This is called unsupervised learning. The latent space captures much of the information from the input, and so it can be used for downstream tasks. Convolutional autoencoder networks that combine downsampling and upsampling layers can be used to learn latent representations of spatial data [33, 34]. Unsupervised learning has also been done with 2D ConvNets in other ways, such as solving jigsaw puzzles made up of image fragments, and learning to predict the identity of images within a large unlabeled database [6, 3, 4].
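As a concrete illustration of the two-part structure, the following is a minimal sketch (our own toy example, not the networks used here) of a linear autoencoder whose encoder and decoder are trained jointly by gradient descent on the reconstruction error; all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 200 samples lying on a 2-dimensional subspace of R^8.
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 8))

We = rng.normal(scale=0.1, size=(8, 2))  # encoder weights: input -> latent
Wd = rng.normal(scale=0.1, size=(2, 8))  # decoder weights: latent -> output

def reconstruction_loss(We, Wd):
    Z = X @ We                      # encode into the latent space
    return np.mean((X - Z @ Wd) ** 2)

initial = reconstruction_loss(We, Wd)
lr = 0.2
for _ in range(500):
    Z = X @ We
    G = 2 * (Z @ Wd - X) / X.size   # gradient of the loss w.r.t. the output
    gWd = Z.T @ G                   # decoder gradient
    gWe = X.T @ (G @ Wd.T)          # encoder gradient (chain rule)
    Wd -= lr * gWd                  # the two parts are trained jointly
    We -= lr * gWe

final = reconstruction_loss(We, Wd)
```

Because the toy data is exactly low-rank, the latent space can capture it well and the loss falls substantially below its initial value.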
In natural language processing, unsupervised (or self-supervised) techniques such as Word2Vec and FastText have shown that features trained simply to predict the surrounding context are useful for a range of downstream tasks, such as question answering and machine translation, where they may be fed into a convolutional or recurrent network as input features. Word2Vec is a low-rank factorization of the matrix of nearby word co-occurrences. The implicit language model is to guess a missing word based on the immediate context.
For autoencoders trained on image datasets with the mean squared error metric, the output is typically blurry. For an autoencoder network to reconstruct a sharp image of a furry animal, it must capture the location of every visible hair, even though the reconstruction would look fine with the hairs arranged slightly differently. Overly precise information about the location of fine details that form part of a larger pattern is unlikely to be of any interest for downstream tasks. This problem has driven research into variational autoencoders and GANs.
1.2 Encoder-decoder networks for image to image transformations
Encoder-decoder networks downsample the input to a low-resolution representation and then upsample it back to full resolution. Downsampling can be performed by max-pooling or strided convolution, and upsampling can be performed by unpooling or transpose convolutions. Shortcut connections linking hidden layers at the same spatial scale in the encoder and decoder networks improve accuracy.
1.3 Spatially-sparse input in d dimensions
The success of two-dimensional convolutional networks operating on dense 2D images has spurred interest in higher dimensional machine learning. Larger 3D datasets have been released recently, such as ShapeNet (https://www.shapenet.org/) and ScanNet (http://www.scan-net.org/). However, higher dimensional ConvNets have not yet become as widely used as their 2D counterparts. Limiting factors have included:
High computational overhead in terms of floating-point operations (FLOPs) and memory.
Restricted software support in popular machine learning software packages.
However, a flip side to the curse of dimensionality is that in higher dimensional settings, sparsity becomes more likely.
A pen drawing the letter ‘z’ in a grid visits only a small fraction of the locations; handwriting datasets might be 5-10% occupied.
A bounding cuboid around the Eiffel Tower is 99.98% empty space (air) and just 0.02% iron.
A space-time path in a hyper-cube visits just 0.0004% of the lattice sites.
Sparse data can be represented using either point clouds or sparse tensors. In the case of tensors, recall that d-dimensional ConvNets typically operate on (d+1)- or (d+2)-dimensional dense tensors; the extra dimensions represent the feature channels and possibly the batch size. For example, the input to an AlexNet 2D ConvNet is a tensor of size 3×224×224, where 3 is the number of input channels of an RGB image and 224×224 is the spatial size.
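To make the sparse-tensor representation concrete, here is a hedged sketch (our own illustrative code, not SSCN's API) of a spatially-sparse 3D tensor stored as a pair of arrays: integer coordinates of the active sites, and one contiguous feature vector per active site.

```python
import numpy as np

# A sparse 3D tensor with 2 feature channels, stored as:
#   coords   - integer locations of the active sites, shape (n_active, 3)
#   features - one feature vector per active site, shape (n_active, 2)
coords = np.array([[0, 0, 0],
                   [0, 1, 2],
                   [3, 3, 3]])
features = np.array([[1.0, 2.0],
                     [3.0, 4.0],
                     [5.0, 6.0]])

def densify(coords, features, spatial_size):
    """Expand to the equivalent dense tensor (only feasible for tiny sizes)."""
    c = features.shape[1]
    dense = np.zeros(spatial_size + (c,))
    for xyz, f in zip(coords, features):
        dense[tuple(xyz)] = f
    return dense

dense = densify(coords, features, (4, 4, 4))
occupancy = len(coords) / 4 ** 3    # 3 active sites out of 64, ~4.7%
```

Keeping the feature vectors in one contiguous array is what allows the convolutions below to run efficiently on the active sites alone.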
For 3D and 4D objects, the most intuitive form of sparsity is spatial/spatio-temporal sparsity: each location in space or space-time is either:
active, in which case the value of each of the feature channels at that location is typically non-zero, or
inactive, with all of the feature channels taking value zero.
This regularity in the sparsity structure means that the vectors of feature channels at active locations can be represented by contiguous tensors, which is important for efficient computation. To exploit this spatial sparsity, a variety of sparse convolutional networks have been developed:
Permutohedral lattice CNNs operate on sets of points with floating-point coordinates. Convolutions are implemented by projecting onto a permutohedral lattice and back, using splat-convolve-slice operations at each level of the network.
SparseConvNets are mathematically identical to regular dense ConvNets, but implemented using sparse data structures. A ground state hidden vector was used at each level of the network to describe the hidden vectors that could see no input. A drawback of SparseConvNets is that deep stacks of size-3 stride-1 convolutions quickly reduce the level of sparsity due to the blurring effect of convolutions.
OctNets provided an alternative form of sparsity. Empty portions of the input are amalgamated into dyadic cubes that then share a hidden state vector.
Kd-Networks  implements a type of graph convolution on the Kd-tree of point clouds.
To allow computational resources to be focused on the active regions, both SparseConvNets and OctNets have been modified to remove the hidden vectors corresponding to empty regions:
Submanifold SparseConvNets (SSCN) discard the ground state hidden vector, and introduce a parsimonious convolution operation that is restricted to operating only on already active sites, hence eliminating the blurring problem.
In both cases, stride-1 convolutions are performed in a sparsity preserving manner, while the stride-2 convolutions used for downsampling by a factor of two are greedy.
In terms of implementation, at each layer SSCN uses a hash table to store the set of active locations; it can be compiled to support any positive integer dimension. O-CNN uses a hybrid mix of octrees and hash tables to store the spatial structure, so it is hard-coded to operate in three dimensions, but could in principle be extended to support other dimensions using higher-dimensional analogues of octrees. The networks we introduce in the next section could be defined in either of the two frameworks. We use SSCN as it already supports dimensions 2, 3 and 4.
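The hash-table approach can be sketched as follows. This is an illustrative, dimension-agnostic implementation of a size-3 stride-1 submanifold convolution; the names are our own, not SSCN's actual API.

```python
import numpy as np

def submanifold_conv(coords, features, weights):
    """Submanifold sparse convolution, size-3 stride-1, in any dimension d.

    coords:   list of active integer locations (tuples of length d)
    features: dict mapping location -> feature vector, shape (c_in,)
    weights:  dict mapping kernel offset (tuple in {-1,0,1}^d) -> (c_in, c_out) matrix
    Outputs are produced only at already-active sites, so the sparsity
    pattern is preserved exactly.
    """
    active = set(coords)                 # hash table of active locations
    out = {}
    for site in coords:
        acc = None
        for offset, W in weights.items():
            nbr = tuple(s + o for s, o in zip(site, offset))
            if nbr in active:            # O(1) lookup; inactive sites contribute nothing
                contrib = features[nbr] @ W
                acc = contrib if acc is None else acc + contrib
        out[site] = acc
    return out

# Example: two adjacent active sites in 2D, identity kernel at two offsets.
feats = {(0, 0): np.array([1.0, 0.0]),
         (0, 1): np.array([0.0, 1.0]),
         (5, 5): np.array([2.0, 2.0])}
kernel = {(0, 0): np.eye(2), (0, 1): np.eye(2)}
out = submanifold_conv(list(feats), feats, kernel)
```

Because outputs are computed only at already-active sites, there is no blurring of the sparsity pattern, in contrast to the greedy convolutions described above.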
2 Sparse Autoencoders
SSCNs contain three types of layers:
Sparse convolution (SC) layers have m input channels, n output channels, filter size f and stride s. They operate greedily: if any site in the receptive field of an output site is active, then the output is active. We set f = s = 2 for downsampling by a factor of two.
Submanifold sparse convolutions always have stride 1. They preserve the sparsity structure, only being applied at sites already active in the input. Deep SSCNs are primarily composed of stacks of SSC convolutions, sometimes with skip connections added to produce simple ResNet style blocks .
Deconvolution layers restore the sparsity pattern from a matching SC. They can be used to build U-Nets for image-to-image mapping problems such as semantic segmentation.
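The greedy activity rule of a downsampling sparse convolution with filter size and stride both equal to two (so receptive fields tile the input) can be sketched on the sparsity pattern alone; this is our own illustrative code, not SSCN's implementation.

```python
def downsample_active_sites(active):
    """Greedy stride-2 downsampling of a sparsity pattern: an output site is
    active iff any of the 2^d input sites in its non-overlapping receptive
    field (filter size = stride = 2) is active."""
    return {tuple(c // 2 for c in site) for site in active}

# 2D example: four scattered active sites fall into three output cells,
# because (0,0) and (1,1) share the output cell (0,0).
active = {(0, 0), (1, 1), (2, 5), (7, 7)}
coarse = downsample_active_sites(active)
```

Note that the rule depends only on integer division of the coordinates, which is why the sparsity patterns at every level of the encoder are determined by the input alone.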
To these, we add two new layers.
Transpose convolutions (TC) will be used for upsampling. Given kernel size f, stride s, and input size n, the output size is f + s·(n − 1). Upsampling is greedy: if an input location is active, all of the corresponding output locations are active.
Sparsification layers convert some active spatial locations to inactive ones. During training, they function like deconvolution layers, restoring the sparsity pattern to match the sparsity pattern at the same scale in the encoder. During testing, if the first feature channel is positive, the site remains active and the feature channels pass unchanged from input to output. If the first feature channel is less than or equal to zero, the site becomes inactive. In the special case of there being only one feature channel, this is equivalent to a ReLU activation function.
In the dense case, transpose convolutions are also known as fractional stride convolutions or ‘deconvolutions’. With f = s = 2, a TC operation corresponds to replacing each active site with a cube of size 2^d. Technically this preserves the level of sparsity. However, this is misleading; the volume has grown by a factor of 2^d, but sparse sets of points typically have a fractal dimension of less than d, so we should expect greater sparsity. Looking at subfigures (c), (b) and (a) in Figure 1, we see that the set of active sites corresponding to a 1-d circle in 3D space should become much more sparse as the scale increases. Hence the need for ‘sparsify’ layers.
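The greedy upsampling rule, with filter size and stride both equal to two so that each active site becomes a block of 2^d output sites, can be sketched as follows (illustrative code with our own names).

```python
from itertools import product

def upsample_active_sites(active, d):
    """Greedy transpose convolution (filter size = stride = 2) on the
    sparsity pattern: each active site at x becomes the 2^d block of
    output sites 2*x + {0,1}^d."""
    out = set()
    for site in active:
        for offset in product((0, 1), repeat=d):
            out.add(tuple(2 * c + o for c, o in zip(site, offset)))
    return out

# In 3D each active site yields 2^3 = 8 children, so the raw occupancy
# fraction is unchanged; a subsequent 'sparsify' layer is needed to
# delete the children that should in fact be inactive.
coarse = {(0, 0, 0), (1, 2, 3)}
fine = upsample_active_sites(coarse, d=3)
```
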
The framework is similar to Generative OctNets [28, 31], especially if TC and sparsifier layers are used back-to-back. However, separating the layers allows us to place trainable layers between upscaling and sparsifying. In Figure 1 we show an autoencoder operating on input size 16^d.
The encoder alternates between blocks of one or more SSC layers and downsampling SC layers. Each SC layer reduces the spatial size by a factor of 2. Extra layers can be added to handle larger input. Once the spatial size is 4^d, a final SC layer can be used to reduce the spatial size to a trivial 1^d.
The sparsity patterns at different layers of the encoder are entirely determined by the sparsity pattern in the input; they are independent of the encoder’s trainable parameters.
The decoder uses sequences of (i) a TC layer to upsample, (ii) an SSC layer to propagate information spatially, (iii) a sparsify layer to increase sparsity, and (iv) an SSC layer to propagate information again before the next TC layer.
The spatial scales in the decoder, 1, 4, 8, 16, are the inverse of the scales in the encoder. During training, the sparsity pattern in the decoder after each sparsify layer is taken from the corresponding level of the encoder. During testing, the sparsify layer keeps input locations where the first feature channel is positive, and deletes the rest.
2.3 Hierarchical training loss
To train the autoencoder, we define a loss that looks at the output features (unless the output is monochrome), and also each Sparsify layer. During training, the output sparsity pattern matches the input sparsity pattern. The first term in our loss is the mean squared error of the reconstruction compared to the input over the set of input/output active spatial locations.
For each sparsifier layer, let P (‘positive’) denote the set of active sites in the corresponding layer of the encoder, and let N (‘negative’) denote the set of inactive sites in the encoder. Let x denote the first feature channel of the sparsifier layer input. The sparsification loss for that sparsifier level penalizes sites in P where x fails to be positive, and sites in N where x fails to be negative.
This loss encourages the decoder to learn to iteratively reproduce the sparsity pattern from the input. False positives, where a site is absent in the encoder but active in the decoder, can be corrected in later sparsification layers. However, false negatives, sites incorrectly turned off during decoding, cannot be corrected.
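One plausible instantiation of such a loss is a hinge penalty on the first feature channel x; this particular form is our assumption, chosen for illustration, and is not necessarily the exact loss used here.

```python
import numpy as np

def sparsification_loss(x, is_positive, margin=1.0):
    """Hinge-style sparsification loss (an assumed form, for illustration).

    x:           first feature channel of the sparsifier input, shape (n_sites,)
    is_positive: True for sites active in the encoder at this scale (set P),
                 False for inactive ones (set N).
    Sites in P are penalised when x fails to be positive (they would be
    wrongly deleted); sites in N are penalised when x fails to be negative.
    """
    x = np.asarray(x, dtype=float)
    pos = np.asarray(is_positive, dtype=bool)
    loss_p = np.maximum(0.0, margin - x[pos])    # want x > 0 on P
    loss_n = np.maximum(0.0, margin + x[~pos])   # want x < 0 on N
    return (loss_p.sum() + loss_n.sum()) / x.size

# A cleanly separated pattern incurs zero loss.
zero = sparsification_loss([2.0, 3.0, -2.0], [True, True, False])
```

Any loss of this shape pushes the decoder to keep the P sites active and delete the N sites, matching the test-time rule of the sparsify layers.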
2.4 Classifiers and NonConvNet spatial classifiers
To make use of the latent space representations learnt by the sparse autoencoder, we need to be able to use the output of the encoder—the latent space—as input for downstream tasks such as classification and segmentation.
For classification, we can use a linear layer followed by the softmax function. However, as the set of interesting classes may not be linearly separable in the latent space, we will also try training multilayer perceptrons (MLPs) as classifiers; they will be fully connected neural networks with two hidden layers.
For segmentation, for each active point in the input, we want to produce a segmentation decision. However, for the results of the experiments to be meaningful, the classifier must not be allowed to base its decision on the input sparsity pattern, or else it could just ignore the latent space entirely and learn from the input from scratch.
To prevent this kind of cheating, we consider a ‘non-convolutional’ decoder network, see the NonConvNet table in Figure 1. It is implemented as a sparse ConvNet, but using only a sequence of deconvolutions. There is no overlap of the receptive fields, so given the latent vector, the segmentation decision at any input location is independent of the set of active input sites.
Compared to more typical decoder networks, the NonConvNet has some advantages. It is a small shallow network so it is quick to compute. It is easy to calculate the output at a particular location, without calculating the full output, e.g. ‘Is there a wall here?’ It is memory efficient in the sense that you can calculate the output without storing the autoencoder’s input. However, compared to other segmentation networks, such as U-Nets (see Section 3.1), the lack of shortcut connections between the input and output will tend to limit accuracy when performing fine-grained segmentation; depending on the application this may be considered an acceptable trade-off.
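The single-location query property can be sketched with one deconvolution level for simplicity; all names and shapes here are hypothetical, and with only one level the prediction depends solely on the location's offset within its block.

```python
import numpy as np

rng = np.random.default_rng(0)
c, n_classes, upscale = 4, 3, 16

# One weight matrix per cell of the deconvolution block. The receptive
# fields do not overlap, so no two output sites share inputs: each site's
# prediction is a function of the latent vector alone.
weights = {(i, j, k): rng.normal(size=(c, n_classes))
           for i in range(upscale)
           for j in range(upscale)
           for k in range(upscale)}

def nonconvnet_query(latent, location):
    """Segmentation logits at one 3D location, e.g. 'is there a wall here?',
    computed without materialising the full output volume."""
    offset = tuple(x % upscale for x in location)
    return latent @ weights[offset]

latent = rng.normal(size=c)
logits = nonconvnet_query(latent, (5, 9, 2))
```

With a single level the output is periodic across blocks; a real NonConvNet stacks several such layers, but the per-location independence illustrated here is the point.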
2.5 Arbitrary sized inputs
The autoencoder in Figure 1 is designed to take input of a given size and reduce it to a dimensionless latent vector with trivial spatial size 1^d. The network can be expanded to take larger inputs by adding extra SC/TC layers to the encoder/decoder, respectively. However, for large inputs such as scans of whole buildings, it is unrealistic to expect a single latent vector to capture all the information needed to reconstruct the extended scene.
A fixed size autoencoder could be applied to patches of the scene, to create a spatial ensemble of latent vectors. An advantage of this approach is that it is easy to update your ‘memory’ when you revisit a location and find that the environment has changed.
Alternatively, and this is the approach taken here, one can build autoencoders that take arbitrary sized input and downsample by a fixed factor, by adding extra stride-2 SC/TC convolutions to the encoder/decoder and removing the final SC/TC convolutions that produce the trivial spatial size. The latent space then has a non-trivial spatial size.
When the latent space has a non-trivial spatial size, we will allow the segmentation classifier to consist of (a) an SSCN network operating on the latent space, followed by (b) a NonConvNet classifier. Storing just the latent space, or the output of (a), it is possible to evaluate the classifier at any input location.
3 Experiments
Our first experiments are with 2D handwriting datasets. In 2D, sparsity is less important than in 3D or 4D, as the sparsity ratio will generally be lower. However, it is interesting to look at datasets that are relatively large compared to typical 3D/4D datasets, and to see if the autoencoders can capture fine detail. We then look at two 3D segmentation datasets, and a 4D segmentation problem.
3.1 Baselines
To assess the utility of the latent representations for other tasks, we will consider supervised and unsupervised baselines. We will pick networks with similar computational cost to the encoder+classifier pairs. Fully supervised methods are marked as such in the results tables.
As a simple baseline, we take a randomly initialized copy of the encoder. To burn in the batch norm population statistics, we perform 100 forward passes on the training data, but no actual training.
We also train another copy of the encoder fully supervised for the test task.
U-Nets have been applied to dense and sparse data to obtain state-of-the-art results for segmentation problems. As they are trained fully supervised, with shortcut connections allowing segmentation decisions to be made with access to fine grained input detail, they provide an effective upper bound on the accuracy of unsupervised learning methods trained on the same number of examples. See Figure 2.
- Shape Context
Shape context features provide a simple summary of the local environment by performing pooling over a variety of scales. Let c denote the number of input feature channels. In parallel, the input is downscaled by average pooling at a range of scales. At each scale, at each active location, we gather the feature vectors from the neighboring spatial locations and concatenate them. Unpooling the results from the different scales and concatenating them produces a multi-scale feature vector at each active spatial location in the input.
For segmentation problems, the representation at each input-level spatial location is fed into a multilayer perceptron (MLP) with 2 hidden layers to predict the voxel class.
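The multi-scale pooling can be sketched for a dense, single-channel 2D input as follows; this is our own simplified illustration (the actual features are computed sparsely, and in higher dimensions).

```python
import numpy as np

def shape_context_features(x, scales=(1, 2, 4)):
    """Multi-scale pooled features for a dense 2D map with c = 1 channel.

    For each scale: average-pool by that factor, gather the 3x3
    neighbourhood around every pooled cell, then unpool (repeat) back to
    input resolution. Concatenating over scales gives 9 * len(scales)
    feature channels per input location.
    """
    h, w = x.shape
    outputs = []
    for s in scales:
        # Block average pooling by a factor of s (h, w divisible by s).
        pooled = x.reshape(h // s, s, w // s, s).mean(axis=(1, 3))
        padded = np.pad(pooled, 1)      # zero-pad the border
        # Gather the 3x3 neighbourhood of each pooled cell: 9 channels.
        neigh = np.stack([padded[i:i + h // s, j:j + w // s]
                          for i in range(3) for j in range(3)], axis=-1)
        # Unpool by nearest-neighbour repetition back to (h, w).
        outputs.append(neigh.repeat(s, axis=0).repeat(s, axis=1))
    return np.concatenate(outputs, axis=-1)

features = shape_context_features(np.ones((8, 8)))
```

Each location thus sees a coarse-to-fine summary of its surroundings, with no trainable parameters before the MLP.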
3.2 Handwriting in 2D space
The PenDigits and Assamese handwriting datasets (https://archive.ics.uci.edu/ml/datasets) contain samples of handwritten characters, each stored as a collection of paths in 2D space. The PenDigits dataset has samples of the digits 0-9, with a total of 7494 training samples and 3498 test samples, see Figure 3. The Assamese handwriting dataset has 45 samples of 183 characters; we split the samples of each character into training and test sets.
We scale the input to a fixed size and apply random affine transformations to the training data. For each dataset, we build 6 networks. Each network consists of an encoder network (c.f. Figure 1) and, on top of that, either a linear classifier or a 2-hidden-layer MLP. Each encoder is either (i) randomly initialized, (ii) trained with full supervision, or (iii) trained unsupervised as part of a sparse autoencoder. The classifier is always trained with supervision. Results are in Table 1.
In the fully supervised case, the choice of classifier is unimportant; the encoder is already adapted to the character classes. The untrained encoder does significantly better than chance, especially with the MLP classifier. The encoder trained unsupervised as part of a sparse autoencoder does even better, performing only slightly worse than the fully supervised encoder on the PenDigits dataset.
3.3 ShapeNet 3D models
ShapeNet (https://shapenet.cs.stanford.edu/iccv17/) is a dataset for semantic segmentation. There are 16 categories of object: airplane, bag, chair, etc. Each category is segmented into between 2 and 6 part types; see Figure 4. Across all object categories, the dataset contains a total of 50 different object part classes. Each object is represented as a 3D point cloud obtained by sampling points uniformly from the surface of an underlying CAD model. Each point cloud contains on the order of a few thousand points. We split the labeled data to obtain a training set with 6,955 examples and a validation set with 7,052 examples.
To make the reconstruction and segmentation problems more challenging, we randomly rotate the objects in 3D; if airplanes always point along the same axis, finding the nose is rather easy, and you are limited to only ever flying in one direction! Also, rather than treating the 16 object categories as separate tasks, we combine them. We train the autoencoder on all categories. For the segmentation task, we test classification and segmentation ability simultaneously by treating the dataset as a 50 class segmentation problem (bag handle, plane wing, …), and report the average intersection-over-union score (IOU). We rendered the shapes at two different scales, with diameters 15 and 50.
At scale 15, we use a sparse autoencoder to produce a latent representation with trivial spatial size. The baseline methods are shape context with an MLP of size 64, a U-Net, a randomly initialized encoder, and an encoder+NonConvNet pair trained end to end. See Table 2.
3.4 Motion capture walking wire frames
We selected 1179 motion capture sequences for which we could extract complete and consistent sets of keypoints, and used them to construct a simple wireframe model for the actors. The data could be represented as a simple time series, with the keypoint coordinates as features, but this discards much of the 3D information. Instead we render the skeletons as 1+1 dimensional surfaces in 3+1 dimensional space-time (with one feature channel to indicate skeleton/not-skeleton). The model has no prior knowledge of how the skeleton is joined up or moves.
We split the dataset into 912 training sequences and 267 test sequences. The method could in principle also be applied to motion capture data with multiple figures without modification to the sparse networks, but to simplify the data preparation, we restricted to the case of individual people.
We rendered samples of 64 frames (30 frames/s) in a space-time cube, and downscaled by a fixed factor. Baseline methods are 4D shape context features, a U-Net, a randomly initialized network, and a fully supervised encoder+NonConvNet network.
For this experiment, we increased the number of features per encoder level linearly, e.g. 32, 64, 96, 128, 160, rather than by powers of 2, as indicated in Table 3.
3.5 ScanNet room scenes
For training we randomly rotate the scenes in the horizontal plane, and apply random affine data augmentation. We voxelize the points at a fixed grid resolution in centimeters, and use autoencoders to learn a latent space downsized by a fixed factor. For this experiment, we replaced the SSC blocks in Figure 1 with 2 simple ResNet blocks, as indicated in Table 4. Our results are calculated using 3-fold testing.
The fully supervised U-Net baseline is roughly on par with state-of-the-art methods. The unsupervised encoder compares respectably to some of the fairly recent fully supervised methods.
We repeated the supervised learning using only 10% of the labelled scenes, see Figure 5. The gap between our method and the fully supervised U-Net narrows. The unsupervised representation outperforms an equivalent network trained fully supervised on the reduced training set.
Cited results were calculated on a different test set.
4 Conclusion
We have introduced a new framework for building spatially-sparse autoencoder networks in 2D, 3D and 4D. We have also introduced a number of segmentation benchmark tasks to measure the quality of the latent space representations generated by the autoencoders. Other possible uses include reinforcement learning tasks related to navigation in 3D environments and embodied Q&A (https://embodiedqa.org/).
References
-  ScanNet benchmark challenge. http://kaldir.vc.in.tum.de/scannet_benchmark/. Accessed: 2018-11-16.
-  Serge Belongie, Jitendra Malik, and Jan Puzicha. Shape Matching and Object Recognition using Shape Contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002.
-  Piotr Bojanowski and Armand Joulin. Unsupervised learning by predicting noise. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pages 517–526. PMLR, 2017.
-  Mathilde Caron, Piotr Bojanowski, Armand Joulin, and Matthijs Douze. Deep clustering for unsupervised learning of visual features. CoRR, abs/1807.05520, 2018.
-  Angela Dai and Matthias Nießner. 3dmv: Joint 3d-multi-view prediction for 3d semantic scene segmentation. CoRR, abs/1803.10409, 2018.
-  Alexey Dosovitskiy, Jost Tobias Springenberg, Martin A. Riedmiller, and Thomas Brox. Discriminative unsupervised feature learning with convolutional neural networks. CoRR, abs/1406.6909, 2014.
-  Martin Engelcke, Dushyant Rao, Dominic Zeng Wang, Chi Hay Tong, and Ingmar Posner. Vote3Deep: Fast Object Detection in 3D Point Clouds using Efficient Convolutional Neural Networks. IEEE International Conference on Robotics and Automation, 2017.
-  Benjamin Graham. Sparse 3D Convolutional Neural Networks. British Machine Vision Conference, 2015.
-  Benjamin Graham, Martin Engelcke, and Laurens van der Maaten. 3D Semantic Segmentation with Submanifold SparseConvNets. 2017. http://arxiv.org/abs/1711.10275.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity Mappings in Deep Residual Networks. European Conference on Computer Vision, 2016.
-  Daniel Holden, Jun Saito, Taku Komura, and Thomas Joyce. Learning motion manifolds with convolutional autoencoders. In SIGGRAPH Asia Technical Briefs, pages 18:1–18:4. ACM, 2015.
-  Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431. Association for Computational Linguistics, April 2017.
-  Martin Kiefel, Varun Jampani, and Peter V. Gehler. Permutohedral lattice cnns. ICLR, 2015.
-  Roman Klokov and Victor Lempitsky. Escape from Cells: Deep Kd-Networks for The Recognition of 3D Point Cloud Models. arXiv preprint arXiv:1704.01222, 2017.
-  Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic Denoyer, and Marc’Aurelio Ranzato. Fader Networks: Manipulating Images by Sliding Attributes. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5967–5976. Curran Associates, Inc., 2017.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.
-  Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully Convolutional Networks for Semantic Segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
-  Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013.
-  Mehdi Noroozi and Paolo Favaro. Unsupervised learning of visual representations by solving jigsaw puzzles. CoRR, abs/1603.09246, 2016.
-  Hao Pan, Shilin Liu, Yang Liu, and Xin Tong. Convolutional neural networks on 3d surfaces using parallel frames. CoRR, abs/1808.04952, 2018.
-  Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. arXiv preprint arXiv:1612.00593, 2016.
-  Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.
-  Gernot Riegler, Ali Osman Ulusoys, and Andreas Geiger. Octnet: Learning Deep 3D Representations at High Resolutions. arXiv preprint arXiv:1611.05009, 2016.
-  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015.
-  David E. Rumelhart, Geoffrey E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart, J. L. McClelland, and the PDP research group., editors, Parallel distributed processing: Explorations in the microstructure of cognition, Volume 1: Foundations. MIT Press, 1986.
-  Andrew M. Saxe, Pang Wei Koh, Zhenghao Chen, Maneesh Bhand, Bipin Suresh, and Andrew Y. Ng. On random weights and unsupervised feature learning. In Lise Getoor and Tobias Scheffer, editors, Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, pages 1089–1096. Omnipress, 2011.
-  K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
-  Maxim Tatarchenko, Alexey Dosovitskiy, and Thomas Brox. Octree Generating Networks: Efficient Convolutional Architectures for High-resolution 3D Outputs. 2017.
-  Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, and Manohar Paluri. Learning spatiotemporal features with 3d convolutional networks. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4489–4497. IEEE, 2015.
-  Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. O-CNN: Octree-based convolutional neural networks for 3D shape analysis. ACM Transactions on Graphics, 36(4), July 2017.
-  Peng-Shuai Wang, Yang Liu, Yu-Xiao Guo, Chun-Yu Sun, and Xin Tong. Adaptive O-CNN: A Patch-based Deep Representation of 3D Shapes. ACM Transactions on Graphics (SIGGRAPH Asia), 37(6), 2018.
-  Yi Wu, Yuxin Wu, Georgia Gkioxari, and Yuandong Tian. Building generalizable agents with a realistic and rich 3d environment. arXiv preprint arXiv:1801.02209, 2018.
-  Matthew D. Zeiler, Dilip Krishnan, Graham W. Taylor, and Rob Fergus. Deconvolutional Networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2010.
-  Matthew D. Zeiler, Graham W. Taylor, and Rob Fergus. Adaptive deconvolutional networks for mid and high level feature learning. In Dimitris N. Metaxas, Long Quan, Alberto Sanfeliu, and Luc J. Van Gool, editors, IEEE International Conference on Computer Vision, ICCV 2011, Barcelona, Spain, November 6-13, 2011, pages 2018–2025. IEEE Computer Society, 2011.
-  Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.