Unsupervised Visual Representation Learning with Increasing Object Shape Bias

(Very early draft) Traditional supervised learning keeps pushing convolutional neural networks (CNNs) to state-of-the-art performance. However, the lack of large-scale annotated data remains a major problem because of its high cost, and even the ImageNet dataset is now overfitted by complex models. The success of unsupervised learning methods in natural language processing (NLP), represented by the BERT model, shows their great potential: they make an effectively unlimited supply of training samples possible, and their strong general-purpose generalization ability has directly changed the direction of NLP research. In this article, we propose a novel unsupervised learning method based on contrastive predictive coding. With it, we are able to train a model on any unannotated images and improve the model's performance to reach the state of the art at the same level of model complexity. Besides that, since the number of training images can be expanded without limit, a universal large-scale pre-trained computer vision model becomes possible in the future.




1 Introduction

Deep learning models, characterized by a large number of parameters, strong domain adaptation and great generalization ability, have reached state-of-the-art performance in nearly every computer vision task and have dominated this area for almost ten years. Beyond that, new model architecture designs and ever larger parameter counts keep improving models' performance, but behind this, increasingly powerful computing hardware and large-scale annotated data sets are the biggest driving forces. Nowadays, powerful computing resources are easy to obtain, but high-quality annotated data sets are not. Even ImageNet has started to be overfitted by large-scale convolutional neural models, and model performance has not improved much in the last two years.

The two main strategies to solve this problem are: (1) discovering a better model architecture design and (2) improving the model's feature extraction ability with more training data, obtained from extra data sources or data augmentation. For strategy 1, mainstream research focuses on minimizing the size of the model while retaining similar performance; the common methods are knowledge distillation (using a pre-trained large model to teach a small model) and EfficientNet (scaling the model to best fit a specific data set). For strategy 2, the training images are modified with computer vision techniques to expand the original training data, so that the model gains better generalization ability. Both strategies helped models reach the best performance on ImageNet in 2019, but they do not solve the lack of large-scale annotated data for training large models, and specialized training strategies make it hard for the model to overcome the domain adaptation problem and serve as a universal pre-trained model.

To overcome this challenge, semi-supervised learning, which is also widely used in the knowledge distillation field, relies on machine annotation by a large model or on web supervision to expand the training data set to billion scale. However, the drawback of these approaches is that the generated annotations are noisy and limited to specific categories, which limits the room for improving the model's performance. To avoid this situation, unsupervised learning has been proposed and is being actively studied by the research community. Compared with previous research, the learning objective in unsupervised learning is not determined by the image's annotation; training focuses instead on universal feature detection and extraction. The evaluation criterion for training is generated from the training image itself.

The most common unsupervised learning strategy is to predict missing pieces of contextual information, such as words in a sentence or pixels in an image, so it is also referred to as representation learning. One of the oldest strategies in this research field, predictive coding, originates from signal data compression. Inspired by this, the large NLP pre-training model BERT, which makes predictions from neighboring words, has proved successful in practice. For computer vision, since an object has its own unique texture and shape characteristics, it is reasonable to assume that a random pixel or image patch depends strongly on its neighbors as well as on shared high-level latent information. Recent research has shown that this yields stronger feature extraction ability than normal supervised training on the same data set. However, current research ignores the image object's own structure, texture and location information. We address these key challenges by introducing a new contrastive predictive method and a special data augmentation.

Our first contribution is to reform the learning mechanism of the contrastive predictive method. Given an image, the model is forced to pay more attention to the object's shape and profile, other than

Our second contribution is to introduce a Transformer model as the autoregression model that supervises the training of the computer vision model.

Our third contribution is to use neural transfer models as data augmentation to regularize the model's learning direction.

Through our unsupervised training method, the model shows better domain adaptation ability and performance than a model with regular supervised training. This also hints that a universal pre-trained large computer vision model is potentially achievable.

2 Related Work

2.1 Contrastive Predictive Coding

Figure 1: Overview of contrastive predictive coding applied to sequence data.

Contrastive predictive coding, as shown in Figure 1, is an unsupervised learning method whose primary objective is to learn high-level information by predicting the representation of future or missing parts of sequential data. It assumes that pieces of information (context, image patches or pixels) at nearby locations are related and can be predicted from each other. A good representation of context or of an image should also be able to reconstruct the input data and predict similar data. In other words, the model should filter out useless low-level details from visual perception or context. Similar approaches such as Word2Vec and BERT both provide strong performance and are widely used in many tasks.

Contrastive predictive coding normally includes two parts: a contrastive loss and predictive coding. Contrastive loss functions are widely used in object detection tasks and domain adaptation; they are normally based on triplet losses and use a max-margin method to separate positive examples from negative examples. Through this loss function, useful visual feature information can be distinguished from low-level texture features. Inspired by the neurons of the brain, predictive coding is a unifying framework for understanding redundancy reduction and efficient coding in the nervous system. In recent research, it has been used in pixel recurrent neural networks and video generation. Its mechanism is to predict the future from known information and from predictions the model itself has already made. Since the future or missing information is included or implied in the training data, this constructs a perfect supervised learning situation under unsupervised learning.
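The max-margin triplet loss mentioned above can be sketched in a few lines. This is only an illustration of the general technique, not the loss used in this paper; the function name `triplet_margin_loss`, the Euclidean distance and the default margin of 1.0 are assumptions made for the example.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Max-margin triplet loss averaged over a batch of embeddings.

    Each argument has shape (batch, dim). The loss is zero once the
    negative is at least `margin` farther from the anchor than the
    positive is. Distance metric and margin are illustrative choices.
    """
    anchor = np.asarray(anchor, dtype=float)
    positive = np.asarray(positive, dtype=float)
    negative = np.asarray(negative, dtype=float)
    d_pos = np.linalg.norm(anchor - positive, axis=-1)  # anchor-positive distance
    d_neg = np.linalg.norm(anchor - negative, axis=-1)  # anchor-negative distance
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))
```

When the negative is already well separated from the anchor, the hinge clips the loss to zero, so only triplets that violate the margin contribute to training.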

2.2 Object Shape Bias Increment

Convolutional neural networks (CNNs) have been used successfully in countless computer vision tasks. The most common belief about how a CNN recognizes visual objects is that it relies on learned representations of the objects' shapes. However, recent research suggests that an object's texture is much more decisive than its shape (the detection of an object's shape is itself partly based on texture). This subverts our perception of CNNs. It also explains, to some extent, why CNNs lack the generalization ability they should have in domain adaptation problems and why a pre-trained model is not that necessary in some cases. In contrast, the pre-trained BERT model reached state-of-the-art results on a wide range of NLP tasks as soon as it was released. Moreover, additional training that increases object shape learning does considerably improve a model's classification performance.

2.3 Neural Transfer

In 2015, the development of the Neural-Style algorithm extended the limit of what a CNN can learn from a picture. By combining a content loss and a style loss, a CNN can learn a representation of the target image's artistic style. In this way, the CNN can reproduce the learned image style and apply it to other images.
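As a rough sketch of how a content loss and a style loss are usually combined in neural style transfer, the example below compares feature maps directly for content and compares their Gram matrices for style. The feature maps are assumed to come from some CNN layer, which is not modeled here; the function names, the Gram normalization and the default weights are illustrative assumptions rather than settings from this paper.

```python
import numpy as np

def gram_matrix(features):
    """Channel-by-channel correlations of a (channels, height, width) feature map.

    Summing over spatial positions discards layout, which is why the
    Gram matrix captures texture/style rather than shape.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    return flat @ flat.T / (c * h * w)  # normalization is one common convention

def content_loss(generated, content):
    """Mean squared error between feature maps (keeps spatial layout)."""
    return float(np.mean((generated - content) ** 2))

def style_loss(generated, style):
    """Mean squared error between Gram matrices (ignores spatial layout)."""
    return float(np.mean((gram_matrix(generated) - gram_matrix(style)) ** 2))

def total_loss(generated, content, style, content_weight=1.0, style_weight=1000.0):
    """Weighted sum of the two losses; the weights are hypothetical defaults."""
    return (content_weight * content_loss(generated, content)
            + style_weight * style_loss(generated, style))
```

Because the Gram matrix sums over all spatial positions, mirroring a feature map changes its content loss but not its style loss, which is the property that lets texture be varied while shape is held fixed.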

2.4 Transformer

In 2017, Google's paper “Attention Is All You Need” introduced the Transformer, a new kind of neural network layer based entirely on the attention mechanism. Models based on the Transformer layer have since proved superior in performance on many NLP problems. In addition, the Transformer's good parallelism resolves the problem that traditional large NLP models are hard to train in parallel. This broke the dominance of LSTMs in NLP applications.
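The core operation of a Transformer layer is scaled dot-product attention, sketched below for a single head without masking or learned projections. The helper names are chosen for this example only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """Single-head attention.

    q: (n_queries, d_k), k: (n_keys, d_k), v: (n_keys, d_v).
    Returns the attended values (n_queries, d_v) and the attention
    weights (n_queries, n_keys).
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # similarity of every query to every key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights
```

Every query attends to every key in one matrix product, which is what makes the layer easy to parallelize compared with a recurrent model that processes positions one after another.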

3 The Design of Unsupervised Learning

Our aim is to design an unsupervised learning method with which we can train any large computer vision model on unlabeled training data. In Section 3.1, we describe how images are processed into self-generated image sequences. In Section 3.2, we introduce neural transfer as data augmentation for increasing object shape bias. In Section 3.3, we describe our improved contrastive predictive coding method. In Section 3.4, we describe the design of the cost function. Finally, in Section 3.5, we describe the complete improved contrastive predictive coding architecture.

3.1 Image Processing

Figure 2: The generation of the image patch grid. Each image patch overlaps its neighboring patch by half of its area. This forms a 7x7 grid of image patches.
Figure 3: The generation of the image patch grid. Each image patch overlaps its neighboring patch by half of its area. This forms a 7x7 grid of image patches.

Every image is first resized to 224x224. Then, following both the row and column directions as shown in Figure 3, with a step of 28 pixels, a total of 7x7 overlapping image patches are cropped (in the row direction the overlap is 56x28; in the column direction it is 28x56) and form a new 7x7 image grid. Then each element of the image grid is padded to 224x224 with a constant value (Section 3.3 uses the value 0 for this padding).
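The cropping and padding steps can be sketched as follows. The patch size of 56 pixels is not stated explicitly in the text; it is inferred here from the 28-pixel step, the half-area overlap and the 7x7 grid, since (224 - 56) / 28 + 1 = 7. Centering the patch on the padded canvas is also an assumption, and the function names are hypothetical.

```python
import numpy as np

IMAGE_SIZE = 224
PATCH_SIZE = 56  # assumed: inferred from the 28-pixel step and half-area overlap
STRIDE = 28

def extract_patch_grid(image):
    """Crop a (224, 224, C) image into a (7, 7, 56, 56, C) grid of overlapping patches."""
    if image.shape[:2] != (IMAGE_SIZE, IMAGE_SIZE):
        raise ValueError("image must be resized to 224x224 first")
    n = (IMAGE_SIZE - PATCH_SIZE) // STRIDE + 1  # 7 patches per direction
    rows = []
    for r in range(n):
        row = [image[r * STRIDE:r * STRIDE + PATCH_SIZE,
                     c * STRIDE:c * STRIDE + PATCH_SIZE]
               for c in range(n)]
        rows.append(np.stack(row))
    return np.stack(rows)

def pad_patch(patch, size=IMAGE_SIZE, value=0):
    """Pad an (h, w, C) patch to (size, size, C) with a constant value.

    The patch is centered on the canvas; the placement is an assumption.
    """
    h, w = patch.shape[:2]
    top, left = (size - h) // 2, (size - w) // 2
    pad_width = ((top, size - h - top), (left, size - w - left), (0, 0))
    return np.pad(patch, pad_width, mode="constant", constant_values=value)
```

With these settings, horizontally adjacent patches share a 56x28 strip and vertically adjacent patches share a 28x56 strip, matching the overlap sizes given above.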


3.2 Data Augmentation

Figure 4: The design of the final unsupervised training architecture.

To force the computer vision model to learn a more complex representation of object shape, texture information should be treated as negative information and separated from shape information. For this purpose, hand-picked texture samples are chosen as the target image styles learned by neural transfer models. Using these models, images with different textures can be generated from the original images. All these extra images are processed with the same method introduced in Section 3.1.

3.3 Improved Contrastive Prediction Coding Method

Figure 5: Illustration of autoregression training. The blue part is used as known information to predict the orange part surrounding it. This is repeated by moving right or down one image patch at a time, as shown on the left of the figure; the right of the figure shows the opposite direction.

Traditional contrastive prediction methods extract information and make predictions in sequential order. However, the structure of image information differs from that of text and audio signals, whose location information is strongly ordered; image information instead has a strong aggregating property. So, to make the image patch sequence carry the location information of the image, we design the image sequence generation mechanism as shown in Figure 5: we always choose a 3x3 block of image patches as a training block, and the two layers of image patches around the training block are the target image patches to predict. Each image patch chosen for training or as a target is padded with value 0 to 224x224, then processed by the computer vision model and a mean pooling layer, and output as a tensor. We thus obtain a sequence of tensors for training, which we feed to the autoregression model, composed of Transformer layers, to predict the tensor sequence of the target (orange) part. This forces the model to learn the object's profile when the object is partly covered by the training block.
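One possible reading of this selection rule is sketched below: for a 3x3 training block placed at a given grid position, the targets are all patches within two grid steps of the block, clipped to the 7x7 grid. The clipping at the grid border, the row-major ordering and the function name `block_and_targets` are assumptions; the text does not specify them.

```python
def block_and_targets(top, left, grid_size=7, block=3, layers=2):
    """Return (context, targets) as lists of (row, col) grid coordinates.

    context: the block x block training patches whose top-left corner
             is at (top, left).
    targets: the patches in the `layers` rings around the block,
             clipped to the grid (assumed behavior at the border).
    """
    if not (0 <= top <= grid_size - block and 0 <= left <= grid_size - block):
        raise ValueError("training block must lie inside the grid")
    context = [(r, c)
               for r in range(top, top + block)
               for c in range(left, left + block)]
    context_set = set(context)
    row_range = range(max(0, top - layers), min(grid_size, top + block + layers))
    col_range = range(max(0, left - layers), min(grid_size, left + block + layers))
    targets = [(r, c) for r in row_range for c in col_range
               if (r, c) not in context_set]
    return context, targets
```

Under this reading, a block in the center of the grid has all 40 remaining patches as targets, while a block in a corner has only 16, because its surrounding rings are cut off by the image border.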

3.3.1 Contrastive Learning on Same Image

So if an image patch grid is generated with size , and we set the stride , perception and anchor, the training sequence will consist of image patches in total from the image grid. The target sequence will be which also meets that . With the computer vision model and mean pooling, the training sequence is transformed into a sequence of latent representations for , and likewise the target sequence for .

Given the shapes of the sequences, an autoregression function can be defined. Fed with the transformed training sequence, it generates a prediction sequence with the same length and size as the target sequence. The learning process is regularized by

3.3.2 Contrastive Learning On Different Image

Figure 6: Illustration of contrastive learning on different images.

3.4 The Design of the Cost Function

The feature extraction ability of the computer vision model is trained by evaluating the quality of the autoregression model's predictions. In essence, training fits a conditional probability model that predicts the representations of the target image patches from the learned representations of the sampled image patches. Also, because the contrastive learning processes on images with different textures and with the same texture run synchronously, the final cost function is a combination of multiple cost functions. The base component of the cost function is a cross-entropy loss with softmax as the probability prediction function, summed over the locations of the prediction and target sequences:
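A minimal sketch of a softmax cross-entropy contrastive loss of this kind is given here, in the style of the InfoNCE loss from the original contrastive predictive coding work. It is not the paper's exact formula. It assumes that the score between a prediction and a target is their dot product and that, for each location, the targets at the other locations serve as negatives; both choices and the function name are assumptions.

```python
import numpy as np

def contrastive_cross_entropy(predictions, targets):
    """Softmax cross-entropy summed over sequence locations.

    predictions, targets: arrays of shape (n_locations, dim).
    For location i, targets[i] is the positive and every other
    target is treated as a negative (an assumed sampling scheme).
    """
    logits = predictions @ targets.T                      # (n, n) similarity scores
    logits = logits - logits.max(axis=1, keepdims=True)   # for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.sum(np.diag(log_probs)))             # sum over locations
```

An uninformative predictor that scores all candidates equally pays log(n) per location, so the loss falls below n·log(n) only when predictions start to single out their own targets.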


Because multiple learning processes happen at the same time on multiple images, the final cost function is likewise composed of multiple cost functions:


Where represents the cost function of contrastive learning on the same image, and means the cost function of learning on the original image and the image with the th kind of texture. The values do not need to sum to 1; the ratio between and the sum of controls whether the model learns more about an object's texture or its shape.
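The weighted combination described here can be illustrated with a small helper. The weight names and default values below are hypothetical, since the original symbols are not available in this draft; the sketch only shows a same-image term plus one weighted term per texture, with weights that need not sum to 1.

```python
def combined_cost(same_image_loss, texture_losses, same_weight=1.0, texture_weights=None):
    """Weighted sum of the same-image loss and the per-texture losses.

    texture_losses: one loss value per texture-transferred image.
    texture_weights: one weight per texture (defaults to 1.0 each).
    The weights are hypothetical hyperparameters and need not sum to 1.
    """
    if texture_weights is None:
        texture_weights = [1.0] * len(texture_losses)
    if len(texture_weights) != len(texture_losses):
        raise ValueError("need exactly one weight per texture loss")
    texture_term = sum(w * l for w, l in zip(texture_weights, texture_losses))
    return same_weight * same_image_loss + texture_term
```

Raising the texture weights relative to `same_weight` puts more emphasis on the cross-texture terms, which is the ratio the paragraph above refers to.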

3.5 Final Unsupervised Learning Architecture

Working on the details.

As shown in Figure 4, for every training image, we use five neural transfer models, introduced in Section 3.2, to generate another five images with different textures.

4 Experiments
