Unsupervised learning with sparse space-and-time autoencoders

11/26/2018 ∙ by Benjamin Graham, et al. ∙ Facebook

We use spatially-sparse two, three and four dimensional convolutional autoencoder networks to model sparse structures in 2D space, 3D space, and 3+1=4 dimensional space-time. We evaluate the resulting latent spaces by testing their usefulness for downstream tasks. Applications are to handwriting recognition in 2D, segmentation for parts in 3D objects, segmentation for objects in 3D scenes, and body-part segmentation for 4D wire-frame models generated from motion capture data.




1 Introduction

Convolutional networks were initially developed for supervised learning. They are used in deep learning to classify two-dimensional spatial information such as handwriting samples and photographs [16]. In the one-dimensional setting, they have been applied to temporal data such as audio recordings of speech and music, and to writing encoded at either the character level or the word level. In the three-dimensional setting, applications have included medical scans, object detection for self-driving cars, and object recognition from RGB-D photos. Videos, with their two spatial dimensions and one time dimension, can also be seen as 2+1=3 dimensional objects for purposes of applying convolutional networks [29]. The movement of 3D objects happens in 3+1=4 dimensional space-time, but 4D ConvNets are relatively unexplored.

1.1 Autoencoder networks and unsupervised learning

Gathering labeled datasets is onerous, so unsupervised learning is an important research area [25]. Autoencoder networks encode the input into a latent space. They can be written in two parts,

    encoder: input → latent space,    decoder: latent space → output,

where the encoder and decoder networks are trained jointly to minimize the distance between the input and the output over some training set. Because no labels are required, this is a form of unsupervised learning. The latent space captures much of the information from the input, and so it can be used for downstream tasks. Convolutional autoencoder networks that combine downsampling and upsampling layers can be used to learn latent representations of spatial data [33, 34]. Unsupervised learning has also been done with 2D ConvNets in other ways, such as solving jigsaw puzzles made up of image fragments [19], and learning to predict the identity of images within a large unlabeled database [6, 3, 4].
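The two-part structure can be illustrated with a toy dense example. This is only a sketch under stated assumptions: the single-layer `encoder` and `decoder`, the `tanh` nonlinearity, and the layer sizes are hypothetical choices for illustration and are unrelated to the sparse convolutional networks discussed later.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, W_enc):
    """Map inputs of shape (batch, n_in) to latent codes of shape (batch, n_latent)."""
    return np.tanh(x @ W_enc)

def decoder(z, W_dec):
    """Map latent codes back to the input space, shape (batch, n_in)."""
    return z @ W_dec

def reconstruction_loss(x, W_enc, W_dec):
    """Mean squared distance between the input and its reconstruction."""
    x_hat = decoder(encoder(x, W_enc), W_dec)
    return float(np.mean((x - x_hat) ** 2))

# Hypothetical sizes: 16 input features compressed to a 4-dimensional latent space.
x = rng.normal(size=(8, 16))
W_enc = rng.normal(scale=0.1, size=(16, 4))
W_dec = rng.normal(scale=0.1, size=(4, 16))
latent = encoder(x, W_enc)
loss = reconstruction_loss(x, W_enc, W_dec)
```

Training would adjust `W_enc` and `W_dec` jointly to reduce `loss`; afterwards only `encoder` is needed to produce features for downstream tasks.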

In natural language processing, unsupervised (or self-supervised) techniques such as Word2Vec [18] and FastText [12] have shown that features trained simply to predict the environment are useful for a range of downstream tasks, such as question answering and machine translation, where they may be fed into a convolutional or recurrent network as input features. Word2Vec is a low-rank factorization of the matrix of nearby word co-occurrences. The implicit language model is to guess a missing word based on the immediate context.

For autoencoders trained on image datasets with the L2 metric, the output is typically blurry. For an autoencoder network to reconstruct a sharp image of a furry animal, it needs to capture the location of every visible hair, even though the reconstruction would look fine with the hairs arranged slightly differently. Overly precise information about the location of fine details that form part of a larger pattern is unlikely to be of any interest for downstream tasks. This problem has driven research into variational autoencoders and GANs.

1.2 Encoder-decoder networks for image to image transformations

Convolutional networks combining encoder and decoder components can also be used to perform image-to-image mapping operations such as segmentation [17] and image editing [35, 15]. Downsampling can be performed by max-pooling or strided convolution, and upsampling can be performed by unpooling or transpose convolutions. Shortcut connections linking hidden layers at the same spatial scale in the encoder and decoder networks improve accuracy.


1.3 Spatially-sparse input in d dimensions

The success of two-dimensional convolutional networks operating on dense 2D images has spurred interest in higher dimensional machine learning. Larger 3D datasets have been released recently, such as ShapeNet (https://www.shapenet.org/) and ScanNet (http://www.scan-net.org/). However, higher dimensional ConvNets have not yet become as widely used as their 2D counterparts. Limiting factors have included:

  • High computational overhead in terms of floating-point operations (FLOPs) and memory.

  • Restricted software support in popular machine learning software packages.

However, a flip side to the curse of dimensionality is that in higher dimensional settings, sparsity becomes more likely.


A pen drawing the letter ‘z’ in a grid might visit approximately of the locations, suggesting that handwriting might be 5-10% occupied.


A bounding cuboid around the Eiffel Tower is 99.98% empty space (air) and just 0.02% iron.


A space-time path in a hyper-cube of size visits just 0.0004% of the lattice sites.
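The way occupancy shrinks with dimension can be checked numerically. The sketch below is an illustration under an assumed model, not the calculation behind the figures quoted above: it measures the fraction of sites visited by a random monotone lattice path between opposite corners of a grid with `n` sites per side in `d` dimensions. The function name and the path model are hypothetical.

```python
import random

def lattice_path_occupancy(n, d, seed=0):
    """Fraction of an n^d grid visited by a monotone lattice path between opposite corners."""
    rng = random.Random(seed)
    pos = [0] * d
    visited = {tuple(pos)}
    # Each axis needs n - 1 unit steps; shuffling their order gives a random path.
    steps = [axis for axis in range(d) for _ in range(n - 1)]
    rng.shuffle(steps)
    for axis in steps:
        pos[axis] += 1
        visited.add(tuple(pos))
    return len(visited) / n ** d

# The path length grows linearly in d while the grid volume grows exponentially.
occupancy = {d: lattice_path_occupancy(100, d) for d in (2, 3, 4)}
```

Such a path visits d(n - 1) + 1 sites, so its occupancy falls roughly by a factor of n for each extra dimension.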

Sparse data can be represented using either point clouds or sparse tensors. In the case of tensors, recall that d-dimensional ConvNets typically operate on d+1 or d+2 dimensional dense tensors; the extra dimensions represent the feature channels and possibly the batch size. For example, the input to an AlexNet 2D ConvNet will be a tensor of size 3×224×224, where 3 is the number of input channels of an RGB image, and 224×224 is the spatial size.

For 3D and 4D objects, the most intuitive form of sparsity is spatial/spatio-temporal sparsity: each location in space or space-time is either:

  • active, in which case the value of each of the feature channels at that location is typically non-zero, or

  • inactive, with all of the feature channels taking value zero.

This regularity in the sparsity structure means that the vectors of feature channels at active locations can be represented by contiguous tensors, which is important for efficient computation. To exploit this spatial sparsity, a variety of sparse convolutional networks have been developed:

  • Permutohedral lattice CNNs [13] operate on sets of points with floating-point coordinates. Convolutions are implemented by projecting onto a permutohedral lattice and back using splat-convolve-slice operations at each level of the network.

  • SparseConvNets [8] are mathematically identical to regular dense ConvNets, but implemented using sparse data structures. A ground state hidden vector is used at each level of the network to describe the hidden vectors that can see no input. A drawback of SparseConvNets is that deep stacks of size-3 stride-1 convolutions [27] quickly reduce the level of sparsity due to the blurring effect of convolutions.

  • OctNets [23] provide an alternative form of sparsity. Empty portions of the input are amalgamated into dyadic cubes that then share a hidden state vector.

  • Vote3Deep [7] uses dense tensors, but it is sparse computationally, and uses a loss function during training to promote sparsity.

  • Kd-Networks [14] implement a type of graph convolution on the Kd-tree of point clouds.

  • PointNet [21] treats coordinates as floating-point features for a fully connected neural network.

To allow computational resources to be more focused on the active regions, SparseConvNets and OctNets have both been modified to remove the hidden vectors corresponding to empty regions:

  • Submanifold SparseConvNets (SSCN) [9] discard the ground state hidden vector, and introduce a parsimonious convolution operation that is restricted to operate only on already active sites, hence eliminating the blurring problem.

  • Octree-based Convolutional Neural Networks (O-CNN) [30] remove the hidden vectors for empty OctTree cells.

In both cases, stride-1 convolutions are performed in a sparsity preserving manner, while stride-2 convolutions used for downsampling by a factor of 2 are greedy.

In terms of implementation, at each layer SSCN uses a hash table to store the set of active locations; it can be compiled to support any positive integer dimension. O-CNN uses a hybrid mix of OctTrees and hash tables to store the spatial structure, so it is hard-coded to operate in three dimensions, but could in principle be extended to support other dimensions using 2^d-trees. The networks we introduce in the next section could be defined in either of the two frameworks. We use SSCN as it already supports dimensions 2, 3 and 4.
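The combination of a hash table of active locations and a contiguous feature tensor can be sketched in a few lines. This is a minimal illustration assuming a plain Python dictionary as the hash table; the helper names `dense_to_sparse` and `build_index` are hypothetical and do not correspond to the API of any existing library.

```python
import numpy as np

def dense_to_sparse(dense):
    """Split a dense (channels, *spatial) array into active coordinates and features."""
    active = np.any(dense != 0, axis=0)                 # spatial mask of active sites
    coords = np.argwhere(active)                        # (n_active, d) integer coordinates
    feats = np.ascontiguousarray(dense[:, active].T)    # (n_active, channels)
    return coords, feats

def build_index(coords):
    """Hash table from a coordinate tuple to its row in the feature matrix."""
    return {tuple(int(v) for v in c): row for row, c in enumerate(coords)}

# The same code works in any dimension; here a 4D example with 2 feature channels.
dense = np.zeros((2, 4, 4, 4, 4))
dense[:, 1, 2, 3, 0] = [1.0, -1.0]
dense[:, 3, 3, 3, 3] = [0.5, 2.0]
coords, feats = dense_to_sparse(dense)
index = build_index(coords)
```

Only the two active sites are stored, rather than all 256 locations of the dense grid, and a feature vector is retrieved with `feats[index[site]]`.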

2 Sparse Autoencoders

SSCNs contain three types of layers:

  • SC: Sparse convolution layers are parameterized by the number of input channels, the number of output channels, the filter size and the stride. They operate greedily: if any site in the receptive field of an output site is active, then the output is active. We use stride 2 for downsampling by a factor of two.

  • SSC: Submanifold sparse convolutions always have stride 1. They preserve the sparsity structure, only being applied at sites already active in the input. Deep SSCNs are primarily composed of stacks of SSC convolutions, sometimes with skip connections added to produce simple ResNet style blocks [10].

  • DC: Deconvolution layers restore the sparsity pattern from a matching SC. They can be used to build U-Nets for image-to-image mapping problems such as semantic segmentation.
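The effect of these layers on the set of active sites can be sketched without any learned weights. The code below is an illustration under a simplifying assumption: the filter size of each SC layer equals its stride, so receptive fields do not overlap. The function names are hypothetical.

```python
def sc_active_sites(active, stride=2):
    """Greedy rule: an output site is active if any input site in its receptive field is.

    Assumes filter size == stride, so each input site maps to exactly one output site.
    """
    return {tuple(c // stride for c in site) for site in active}

def ssc_active_sites(active):
    """Submanifold rule: the output sparsity pattern equals the input pattern."""
    return set(active)

# Active sites in a 16x16 grid, downsampled 16 -> 8 -> 4 -> 1.
level16 = {(0, 0), (0, 1), (5, 7), (15, 15)}
level8 = sc_active_sites(ssc_active_sites(level16))
level4 = sc_active_sites(ssc_active_sites(level8))
level1 = sc_active_sites(level4, stride=4)
```

The patterns at every level follow from the input pattern alone; no trainable parameter appears in the computation.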

To these, we add two new layers.


  • TC: Transpose convolutions will be used for upsampling. The output size is determined by the kernel size, the stride, and the input size. Upsampling is greedy: if an input location is active, all of the corresponding output locations are active.

  • Sparsify: Sparsification layers convert some active spatial locations to inactive ones. During training, they function like deconvolution layers, restoring the sparsity pattern to match the sparsity pattern at the same scale in the encoder. During testing, if the first feature channel is positive, the site remains active and the feature channels pass unchanged from input to output. If the first feature channel is less than or equal to zero, the site becomes inactive. In the special case of there being only one feature channel, this is equivalent to a ReLU activation function.
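The two new layers can be sketched at the level of sparsity patterns. This is an illustration under stated assumptions: the output size uses the standard unpadded transpose-convolution formula (input size - 1) × stride + kernel size, which may differ from the convention intended above; the kernel size is taken equal to the stride; and the function names are hypothetical.

```python
from itertools import product

import numpy as np

def tc_output_size(n, kernel, stride):
    """Standard unpadded transpose-convolution output size (assumed convention)."""
    return (n - 1) * stride + kernel

def tc_active_sites(active, stride=2):
    """Greedy upsampling: every active input site activates a full block of outputs.

    Assumes kernel size == stride, so blocks from different sites do not overlap.
    """
    out = set()
    for site in active:
        for offset in product(range(stride), repeat=len(site)):
            out.add(tuple(stride * c + o for c, o in zip(site, offset)))
    return out

def sparsify_test_time(coords, feats):
    """Test-time rule: keep a site only if its first feature channel is positive."""
    keep = feats[:, 0] > 0
    return coords[keep], feats[keep]

upsampled = tc_active_sites({(1, 2)})
coords = np.array([[2, 4], [2, 5], [3, 4]])
feats = np.array([[0.5, -1.0], [-0.2, 3.0], [0.0, 1.0]])
kept_coords, kept_feats = sparsify_test_time(coords, feats)
```

In this example only the first site survives sparsification, and its feature vector passes through unchanged.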

In the dense case, transpose convolutions are also known as fractional stride convolutions [22] or ‘deconvolutions’. When the kernel size equals the stride, a TC operation corresponds to replacing each active site with a cube of active sites. Technically this preserves the level of sparsity. However, this is misleading; the volume has grown by the same factor as the number of active sites, but sparse sets of points typically have a fractal dimension lower than the dimension d of the ambient space, so we should expect greater sparsity. Looking at subfigures (c), (b) and (a) in Figure 1, we see that the set of active sites corresponding to a 1-d circle in 3D space should become much more sparse as the scale increases. Hence the need for ‘sparsify’ layers.

The framework is similar to Generative OctNets [28, 31], especially if TC and sparsifier layers are used back-to-back. However, separating the layers allows us to place trainable layers between upscaling and sparsifying. In Figure 1 we show an autoencoder operating on input with side length 16.

Encoder
Layer      Input   Output   Sparsity
SSC        16      16       a
SC         16      8        a → b
SSC        8       8        b
SC         8       4        b → c
SSC        4       4        c
SC         4       1        c → d

NonConvNet Spatial Classifier
Layer      Input   Output   Sparsity
DC         1       4        d → c
DC         4       8        c → b
DC         8       16       b → a

Decoder
Layer      Input   Output   Sparsity
TC         1       4        d → e
SSC        4       4        e
Sparsify   4       4        e → f
SSC        4       4        f
TC         4       8        f → g
SSC        8       8        g
Sparsify   8       8        g → h
SSC        8       8        h
TC         8       16       h → i
SSC        16      16       i
Sparsify   16      16       i → j
SSC        16      16       j

[Panels (a) to (j): active sites at each stage of the encoder and decoder.]

Figure 1: Top: Small sparse autoencoder architecture for inputs with side length 16. It can be expanded to process larger input, or modified to downsample the input by a fixed factor. Below: Illustration of the autoencoder operating on sparse input with side length 16. Input (a) is downscaled by strided convolutions to side lengths (b) 8, (c) 4, then (d) 1. During training, these patterns of active sites form the ground truth for a hierarchical loss function. At test time, reconstruction in the decoder is performed by alternating between ‘greedy’ sparse transpose convolutions and sparsification layers; (e) the side length increases to 4 and (f) some sites are deleted. This is repeated to take the side length to 8 (g-h) and finally back to 16 (i-j). True positives are shown in green, false positives in red, and false negatives in purple; true negatives are omitted. To create the figures, sparsification decisions were made randomly with 85% accuracy.

2.1 Encoder

The encoder alternates between blocks of one or more SSC layers, and downsampling SC layers. Each SC layer reduces the spatial size by a factor of 2. Extra layers can be added to handle larger input. Once the side length is 4, a final SC layer can be used to reduce the spatial size to a trivial 1.

The sparsity patterns at different layers of the encoder are entirely determined by the sparsity pattern in the input; they are independent of the encoder’s trainable parameters.

2.2 Decoder

The decoder uses sequences of (i) a TC layer to upsample, (ii) an SSC layer to propagate information spatially, (iii) a sparsify layer to increase sparsity, and (iv) an SSC layer to propagate information again before the next TC layer.

The spatial scales in the decoder, 1—4—8—16, are the inverse of the scales in the encoder. During training, the sparsity pattern in the decoder after each sparsify layer is taken from the corresponding level of the encoder. During testing, the sparsify layer keeps input locations where the first feature channel is positive, and deletes the rest.

2.3 Hierarchical training loss

To train the autoencoder, we define a loss that looks at the output features (unless the output is monochrome), and also each Sparsify layer. During training, the output sparsity pattern matches the input sparsity pattern. The first term in our loss is the mean squared error of the reconstruction compared to the input over the set of input/output active spatial locations.

For each sparsifier layer, let P (‘positive’) denote the set of active sites in the corresponding layer of the encoder, and let N (‘negative’) denote the set of inactive sites in the encoder. The sparsification loss for that sparsifier level is defined in terms of the first feature channel of the sparsifier layer input, evaluated over the sites in P and N.

This loss encourages the decoder to learn to iteratively reproduce the sparsity pattern from the input. False positives, where a site is absent in the encoder but active in the decoder, can be corrected in later sparsification layers. However, false negatives, sites incorrectly turned off during decoding, cannot be corrected.
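One plausible way to realize a loss of this kind, offered as an assumption rather than as the definition used in this paper, is a binary cross-entropy on the first feature channel, treated as a logit. It is consistent with the test-time rule that a positive first channel keeps a site active. The function name and the averaging over all candidate sites are hypothetical choices.

```python
import numpy as np

def sparsification_loss(first_channel, is_positive):
    """Binary cross-entropy with logits over candidate sites (a stand-in loss).

    first_channel: first feature channel at each candidate site, used as a logit.
    is_positive:   True where the matching encoder layer is active, False otherwise.
    """
    x = np.asarray(first_channel, dtype=float)
    t = np.asarray(is_positive, dtype=bool)
    # softplus(-x) penalizes positive sites with low logits,
    # softplus(x) penalizes negative sites with high logits.
    per_site = np.where(t, np.logaddexp(0.0, -x), np.logaddexp(0.0, x))
    return float(per_site.mean())

good = sparsification_loss([5.0, -5.0], [True, False])   # confident and correct
bad = sparsification_loss([-5.0, 5.0], [True, False])    # confident and wrong
neutral = sparsification_loss([0.0, 0.0], [True, False])
```

Because a false negative cannot be repaired by later layers, one might weight the positive term more heavily; that design choice is not explored in this sketch.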

2.4 Classifiers and NonConvNet spatial classifiers

To make use of the latent space representations learnt by the sparse autoencoder, we need to be able to use the output of the encoder—the latent space—as input for downstream tasks such as classification and segmentation.

For classification, we can use a linear layer followed by the softmax function. However, as the set of interesting classes may not be linearly separable in the latent space, we will also try training multilayer perceptrons (MLPs) as classifiers; they will be fully connected neural networks with two hidden layers.

For segmentation, for each active point in the input, we want to produce a segmentation decision. However, for the results of the experiments to be meaningful, the classifier must not be allowed to base its decision on the input sparsity pattern, or else it could just ignore the latent space entirely and learn from the input from scratch.

To prevent this kind of cheating, we consider a ‘non-convolutional’ decoder network, see the NonConvNet table in Figure 1. It is implemented as a sparse ConvNet, but only using a sequence of deconvolutions. There is no overlap of the receptive fields, so given the latent vector, the segmentation decision at any input location is independent of the set of active input sites.
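The independence property can be made concrete with a small sketch that evaluates such a network at a single output location. It assumes deconvolutions whose kernel size equals their stride (4, 2, 2, taking side length 1 to 4 to 8 to 16 as in Figure 1), ReLU nonlinearities between levels, and hypothetical channel widths; none of these details are specified in the text above.

```python
import numpy as np

def make_nonconvnet(rng, d, channels=(8, 6, 4, 3), kernels=(4, 2, 2)):
    """One weight matrix per kernel offset at each level (hypothetical sizes)."""
    return [rng.normal(scale=0.5, size=(k,) * d + (c_out, c_in))
            for k, c_in, c_out in zip(kernels, channels[:-1], channels[1:])]

def evaluate_at(weights, kernels, latent, location):
    """Class scores at one location, computed from the latent vector alone."""
    # Split the location into one kernel offset per level, finest level first.
    offsets, rem = [], list(location)
    for k in reversed(kernels):
        offsets.append(tuple(r % k for r in rem))
        rem = [r // k for r in rem]
    assert all(r == 0 for r in rem), "location outside the output grid"
    offsets.reverse()  # coarsest level first, matching the order of the weights
    h = latent
    for i, (W, off) in enumerate(zip(weights, offsets)):
        h = W[off] @ h                  # non-overlapping deconvolution at this offset
        if i < len(weights) - 1:
            h = np.maximum(h, 0.0)      # ReLU between levels (assumed)
    return h

rng = np.random.default_rng(0)
kernels = (4, 2, 2)
weights = make_nonconvnet(rng, d=3, kernels=kernels)
latent = rng.normal(size=8)
scores = evaluate_at(weights, kernels, latent, (13, 0, 7))
```

The set of active input sites never appears as an argument, so the score at a location cannot depend on it, and a single location can be queried without computing the full output.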

Compared to more typical decoder networks, the NonConvNet has some advantages. It is a small shallow network so it is quick to compute. It is easy to calculate the output at a particular location, without calculating the full output, e.g. ‘Is there a wall here?’ It is memory efficient in the sense that you can calculate the output without storing the autoencoder’s input. However, compared to other segmentation networks, such as U-Nets (see Section 3.1), the lack of shortcut connections between the input and output will tend to limit accuracy when performing fine-grained segmentation; depending on the application this may be considered an acceptable trade-off.

2.5 Arbitrary sized inputs

The autoencoder in Figure 1 is designed to take input of a given side length, 16, and reduce it to a dimensionless latent vector with trivial spatial size 1. The network can be expanded to take larger inputs by adding extra SC/TC layers to the encoder/decoder, respectively. However, for large inputs such as scans of whole buildings, it is unrealistic to expect a single latent vector to capture all the information needed to reconstruct the extended scene.

A fixed size autoencoder could be applied to patches of the scene, to create a spatial ensemble of latent vectors. An advantage of this approach is that it is easy to update your ‘memory’ when you revisit a location and find that the environment has changed.

Alternatively, and this is the approach taken here, one can build autoencoders that take arbitrary sized input and downsample by a fixed factor, i.e. by , by adding extra SC/TC convolutions to the encoder/decoder, and removing the , SC/TC convolutions. The latent space then has spatial size .

When the latent space has a non-trivial spatial size, we will allow the segmentation classifier to consist of (a) an SSCN network operating on the latent space, followed by (b) a NonConvNet classifier. Storing just the latent space, or the output of (a), it is possible to evaluate the classifier at any input location.

3 Experiments

Our first experiments are with 2D handwriting datasets. In 2D, sparsity is less important than in 3D or 4D, as the sparsity ratio will generally be lower. However, it is interesting to look at datasets that are relatively large compared to typical 3D/4D datasets, and to see if the autoencoders can capture fine detail. We then look at two 3D segmentation datasets, and a 4D segmentation problem.

3.1 Baselines

To assess the utility of the latent representations for other tasks, we will consider supervised and unsupervised baselines. We will pick networks with similar computational cost to the encoder+classifier pairs. Methods trained fully supervised are marked with a .


Untrained

As a simple baseline, we take a randomly initialized copy of the encoder [26]. To burn in the batch norm population statistics, we perform 100 forward passes on the training data, but no actual training.

Trained

Another copy of the encoder, but trained fully supervised for the test task.

U-Net

U-Nets have been applied to dense [24] and sparse data [9] to obtain state-of-the-art results for segmentation problems. As they are trained fully supervised, with shortcut connections allowing segmentation decisions to be made with access to fine-grained input detail, they provide an effective upper bound on the accuracy of unsupervised learning methods trained on the same number of examples. See Figure 2.

Shape Context

Shape context features [2] provide a simple summary of the local environment by performing pooling over a variety of scales. Let denote the number of input feature channels. In parallel, the input is downscaled by average pooling by factors of . At each scale, at each active location, gather the feature vectors from neighboring spatial locations and concatenate them to produce feature channels. Unpooling the results from the different scales and concatenating them produces features at each active spatial location in the input.
For segmentation problems, the representation at each input-level spatial location is fed into a multilayer perceptron (MLP) with 2 hidden layers to predict the voxel class.

Figure 2: U-Net architecture for fully supervised segmentation. Blue blocks correspond to sparsity preserving SSC operations. The red blocks are stride-2 SC operations, and the green blocks are deconvolutions.

3.2 Handwriting in 2D space

The PenDigits and Assamese handwriting datasets (https://archive.ics.uci.edu/ml/datasets) contain samples of handwritten characters, each stored as a collection of paths in 2D space. The PenDigits dataset has samples of the digits 0 to 9, with a total of 7494 training samples and 3498 test samples; see Figure 3. The Assamese handwriting dataset has 45 samples of 183 characters; we split this into training characters and test characters.

We rescale the input to a fixed size, and apply random affine transformations to the training data. For each dataset, we build 6 networks. Each network consists of an encoder network (cf. Figure 1) and on top of that either a linear classifier, or a 2-hidden-layer MLP. Each encoder is either (i) randomly initialized, (ii) trained with full supervision, or (iii) trained unsupervised as part of a sparse autoencoder. The classifier is always trained with supervision. Results are in Table 1.

In the fully supervised case, the choice of classifier is unimportant; the encoder is already adapted to the character classes. The untrained encoder does significantly better than chance, especially with the MLP classifier. The encoder trained unsupervised as part of a sparse autoencoder does even better, performing only slightly worse than the fully supervised encoder on the PenDigits dataset.

Figure 3: Handwritten digit (left) and reconstruction (right). The reconstruction seems to differ from the original by an elastic distortion. It is far from the original in pixel space but still quite readable.

Dataset    Encoder        Linear    MLP
Digits     Untrained      16.84     6.26
           Trained         1.14     0.89
           Unsupervised    2.80     1.26
Assamese   Untrained      68.43    44.51
           Trained         2.79     2.61
           Unsupervised   28.05    16.51

Table 1: Handwriting recognition test errors, %, for 10 and 183 class classification tasks. Within each column, the network architecture is the same, but trained differently. The classifier at the top of the network is either a linear layer or a 2-hidden-layer fully-connected neural network.

3.3 ShapeNet 3D models

Figure 4: ShapeNet segmented point clouds. There are 16 object categories, each with 2-6 part types, e.g. a plane has wings, body, engines and a tail.
Figure 5: A randomly-oriented ShapeNet chair rendered with diameter 50 (left), and the reconstruction from an autoencoder with downscaling (right). The chair’s style seems to have changed but location and pose are captured correctly.

ShapeNet (https://shapenet.cs.stanford.edu/iccv17/) is a dataset for semantic segmentation. There are 16 categories of object: airplane, bag, chair, etc. Each category is segmented into between 2 and 6 part types; see Figure 4. Across all object categories, the dataset contains a total of 50 different object part classes. Each object is represented as a 3D point cloud obtained by sampling points uniformly from the surface of an underlying CAD model. Each point cloud contains between and points. We split the labeled data to obtain a training set with 6,955 examples and a validation set with 7,052 examples.

To make the reconstruction and segmentation problems more challenging, we randomly rotate the objects in 3D; if airplanes always point along the same axis, finding the nose is rather easy, and you are limited to only ever flying in one direction! Also, rather than treating the 16 object categories as separate tasks, we combine them. We train the autoencoder on all categories. For the segmentation task, we test classification and segmentation ability simultaneously by treating the dataset as a 50 class segmentation problem (bag handle, plane wing, …), and report the average intersection-over-union score (IOU). We rendered the shapes at two different scales: diameter 15 in a grid of size and diameter 50 in a grid of size .
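An average IOU score of this kind can be computed as in the sketch below. The exact averaging convention used for the reported numbers is not stated, so the handling of classes that appear in neither the prediction nor the ground truth (they are skipped here) is an assumption, as is the function name.

```python
import numpy as np

def mean_iou(pred, target, num_classes):
    """Average per-class intersection-over-union for integer label arrays."""
    pred = np.asarray(pred)
    target = np.asarray(target)
    ious = []
    for c in range(num_classes):
        intersection = np.sum((pred == c) & (target == c))
        union = np.sum((pred == c) | (target == c))
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(intersection / union)
    return float(np.mean(ious))

# Four points, three possible classes; class 2 never occurs and is skipped.
score = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=3)
```

Here class 0 has IOU 1/2 and class 1 has IOU 2/3, so the average is 7/12.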

At scale 15, we have a sparse autoencoder with input size to produce a latent representation with trivial size . The baseline methods are shape context with an MLP of size 64, a U-Net, a randomly initialized encoder, and an encoder+NonConvNet pair trained end to end. See Table 2.

For scale 50, we trained autoencoders that downscale space by and . See Figure 5 and Table 2.

Scale   Method          IOU
15      Shape Context   0.134
        U-Net           0.590
        Untrained       0.161
        Trained         0.516
        Unsupervised    0.278
50      Shape Context   0.161
        U-Net           0.687
        Unsupervised    0.536
        Unsupervised    0.420

Table 2: ShapeNet segmentation results: average IOU over 50 classes. For scale 15, the latent space has trivial size 1. For scale 50, it is downscaled by a factor of or .

Figure 6: Skeleton wire frames from motion capture data: a person jumping and spinning. The 5 classes are left and right arms and legs, and the spine.

3.4 Motion capture walking wire frames

The CMU Graphics Lab Motion Capture Database MOCAP (http://mocap.cs.cmu.edu/) contains keypoint data for people walking, running, dancing and doing gymnastics; see Figure 6.

We selected 1179 motion capture sequences for which we could extract a complete and consistent set of keypoints, and used them to construct a simple wireframe model for the actors. The data can be represented as a simple time series, with the keypoint coordinates as features [11], but this discards much of the 3D information. Instead we render the skeletons as 1+1 dimensional surfaces in 3+1 dimensional space-time (with one feature channel to indicate skeleton/not-skeleton). The model has no prior knowledge of how the skeleton is joined up or moves.

We split the dataset into 912 training sequences and 267 test sequences. The method could in principle also be applied to motion capture data with multiple figures without modification to the sparse networks, but to simplify the data preparation, we restricted to the case of individual people.

We rendered samples of 64 frames (30 frames/s) in a cube of size , and downscaled by a factor of or . Baseline methods are 4D shape context features, a U-Net, a randomly initialized network, and a fully supervised encoder+NonConvNet network.

For this experiment, we increased the number of features per encoder level linearly: e.g. 32, 64, 96, 128, 160, rather than by powers of 2. This is denoted ‘’ in Table 3.

Model           IOU
Shape Context   0.718
U-Net           0.988
Untrained       0.701
Trained         0.913
Unsupervised    0.879
Unsupervised    0.808

Table 3: MOCAP 4D wireframe pose results with 5 classes.

Figure 7: ScanNet RGB test scans and reconstructions from downsampled latent space. The reconstructions capture much of the shape, but little of the color information.

3.5 ScanNet room scenes

The ScanNet dataset (http://www.scan-net.org/) has 1513 3D scans of scenes, segmented into 20 classes. We split the data into 1286 training samples and 227 test samples; see Figure 7.

For training we randomly rotate the scenes in the horizontal plane, and apply random affine data augmentation. We voxelize the points with grid resolution cm, and use autoencoders to learn a latent space on a scale downsized by a factor of , or . For this experiment, we replaced the SSC blocks in Figure 1 with 2 simple ResNet blocks. This is denoted by ‘’ in Table 4. Our results are calculated using 3-fold testing.

The fully supervised U-Net baseline is roughly on par with state-of-the-art methods [1]. The unsupervised encoder compares respectably to some of the fairly recent fully supervised methods.

We repeated the supervised learning using only 10% of the labelled scenes; see Table 5. The gap between the fully supervised U-Net and the unsupervised encoder narrows. The unsupervised representation outperforms an equivalent network trained fully supervised on the reduced training set.

Method               IOU
Shape Context        0.211
U-Net                0.703
3DMV [5]             0.484
SurfaceConvPF [20]   0.442
Mink34 [1]           0.679
Unsupervised         0.518
Unsupervised         0.414
Unsupervised         0.299

Table 4: ScanNet room segmentation results. Cited results were calculated on a different test set.

Method          IOU
Shape Context   0.172
U-Net           0.460
Trained         0.212
Unsupervised    0.295

Table 5: ScanNet using 10% of the training labels.

4 Conclusion

We have introduced a new framework for building spatially-sparse autoencoder networks in 2D, 3D and 4D. We have also introduced a number of segmentation benchmark tasks to measure the quality of the latent space representations generated by the autoencoders. Other possible uses include reinforcement learning tasks related to navigation in 3D environments [32] and embodied Q&A (https://embodiedqa.org/).