Deep learning models, characterized by large numbers of parameters, strong domain adaptation and good generalization ability, have reached state-of-the-art performance in nearly every computer vision task and have dominated the field for almost ten years. New model architectures and ever larger parameter counts keep improving performance, but the biggest driving forces behind this progress are increasingly powerful computing hardware and large-scale annotated data sets. Today, computing power is easy to obtain, whereas high-quality annotated data sets are not. Large convolutional neural networks have even begun to overfit ImageNet, and model performance has improved little over the past two years.
There are two main strategies for addressing this problem: (1) designing a better model architecture and (2) improving the model’s feature extraction ability with more training data, obtained through extra data sources or data augmentation. For strategy 1, mainstream research focuses on reducing model size while retaining similar performance; common methods are knowledge distillation (using a large pre-trained model to teach a small model) and EfficientNet (scaling the model to best fit a specific data set). For strategy 2, the training images are modified with computer vision techniques to expand the original training set, so that the model generalizes better. Both strategies produced the best-performing models on ImageNet in 2019, but neither solves the lack of large-scale annotated data for training large models, and their specialized training procedures make it hard for the resulting models to overcome the domain adaptation problem and serve as universal pre-trained models.
To overcome this challenge, semi-supervised learning, which is also widely used in knowledge distillation, relies on machine annotation by a large model or on web supervision to expand the training set to billion scale. The drawback of these approaches is that the generated annotations are noisy and limited to specific categories, which caps the achievable improvement. To avoid this, unsupervised learning has been proposed and is being actively studied by the research community. In contrast to the approaches above, the learning objective in unsupervised learning is not determined by image annotations; training instead focuses on universal feature detection and extraction, and the training signal is generated from the training image itself.
The most common unsupervised learning strategy is to predict missing pieces of contextual information, such as words in a sentence or pixels in an image; this is also referred to as representation learning. One of the oldest strategies in this field, predictive coding, originates from signal compression. Inspired by it, the large NLP pre-training model BERT, which predicts a word from its neighboring words, has proved successful in practice. In computer vision, since each object has its own texture and shape characteristics, it is reasonable to assume that a random pixel or image patch depends strongly on its neighbors and on shared high-level latent information. Recent research has shown that this approach yields stronger feature extraction than ordinary supervised training on the same data set. However, current research ignores the structure, texture and location information specific to the objects in an image. We address these key challenges by introducing a new contrastive predictive method and a specialized data augmentation.
Our first contribution is to reform the learning mechanism of the contrastive predictive method. Given an image, the model is forced to pay more attention to the object’s shape and profile, other than
Our second contribution is to introduce a transformer as the autoregressive model that supervises the training of the computer vision model.
Our third contribution is to use neural transfer models as data augmentation to regularize the model’s learning direction.
With our unsupervised training method, the model shows better domain adaptation ability and performance than a model trained with regular supervision. This also hints that a universal pre-trained large computer vision model is potentially achievable.
2 Related Work
2.1 Contrastive Predictive Coding
Contrastive Predictive Coding, as shown in Figure 1, is an unsupervised learning method whose primary objective is to learn high-level information by predicting the representation of future or missing parts of sequential data. It assumes that pieces of information (text context, image patches or pixels) at nearby locations are related and can be predicted from one another. A good representation of a text or an image should also be able to reconstruct the input data and predict similar data. In other words, the model should filter out useless low-level details from the visual or textual input. Similar approaches such as Word2Vec and BERT both provide strong performance and are widely used in many tasks.
Contrastive predictive coding normally has two parts: a contrastive loss and predictive coding. Contrastive loss functions are widely used in object detection and domain adaptation; they are typically based on triplet losses and use a max-margin method to separate positive examples from negative examples. Through this loss function, useful visual features can be distinguished from low-level texture features. Inspired by neurons in the brain, predictive coding is a unifying framework for understanding redundancy reduction and efficient coding in the nervous system. In recent research it has been used in pixel recurrent neural networks and video generation. Its mechanism is to predict the future from known information and from predictions the model itself has already made. Since the future or missing information is contained or implied in the training data, this constructs a perfect supervised learning setting within unsupervised learning.
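As an illustration of the max-margin idea described above, the following is a minimal sketch of a triplet loss in NumPy. The function name, the Euclidean distance and the margin value are assumptions chosen for the example; they are not taken from the method in this paper.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Max-margin triplet loss averaged over a batch of embeddings.

    All inputs have shape (batch, dim). The margin value is an
    illustrative assumption.
    """
    # Euclidean distance from each anchor to its positive and negative.
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    # A triplet contributes loss only while the negative is not at least
    # `margin` farther from the anchor than the positive.
    return float(np.mean(np.maximum(0.0, d_pos - d_neg + margin)))

anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.0, 1.0]])
near_negative = np.array([[0.0, 1.5]])  # violates the margin
far_negative = np.array([[0.0, 5.0]])   # satisfies the margin

violating = triplet_margin_loss(anchor, positive, near_negative)
satisfied = triplet_margin_loss(anchor, positive, far_negative)
```

Once the negative lies farther from the anchor than the positive by at least the margin, the triplet stops contributing to the loss, which is what separates positive from negative examples.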
2.2 Object Shape Bias Increment
Convolutional Neural Networks (CNNs) have been used successfully in countless computer vision tasks. The common belief is that a CNN recognizes objects through learned representations of their shapes. However, recent research indicates that an object’s texture is much more decisive than its shape (the shape cue is itself partly based on texture). This overturns our understanding of CNNs. It also explains, to some extent, why CNNs lack the generalization ability they should have in domain adaptation problems and why a pre-trained model is not always necessary. In contrast, the pre-trained BERT model reached state of the art on a wide range of NLP tasks as soon as it was released. Additional training that increases object shape learning does considerably improve a model’s classification performance.
2.3 Neural Transfer
In 2015, the development of the Neural-Style algorithm extended the limits of what a CNN can learn from an image. By combining content loss and style loss functions, a CNN can learn a representation of a target image’s artistic style. In this way, the CNN can reproduce the learned style and apply it to other images.
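The combination of content loss and style loss can be sketched as follows. This is a simplified illustration, not the exact formulation of the Neural-Style algorithm: random arrays stand in for CNN feature maps, and the function names, the Gram normalization and the loss weights are assumptions.

```python
import numpy as np

def gram_matrix(features):
    """Channel-by-channel correlations of a (channels, height, width) feature map."""
    c, h, w = features.shape
    flat = features.reshape(c, h * w)
    # The normalization constant is an assumption; implementations differ.
    return flat @ flat.T / (c * h * w)

def content_loss(generated, content):
    """Mean squared difference between feature maps (position-sensitive)."""
    return float(np.mean((generated - content) ** 2))

def style_loss(generated, style):
    """Mean squared difference between Gram matrices (position-insensitive)."""
    return float(np.mean((gram_matrix(generated) - gram_matrix(style)) ** 2))

def total_loss(generated, content, style, content_weight=1.0, style_weight=1000.0):
    # The weights are hypothetical and only illustrate the trade-off.
    return (content_weight * content_loss(generated, content)
            + style_weight * style_loss(generated, style))

rng = np.random.default_rng(0)
style_features = rng.normal(size=(4, 8, 8))
# Flipping the map vertically moves every feature to a new position
# but keeps the channel statistics unchanged.
flipped = style_features[:, ::-1, :]

style_only = style_loss(flipped, style_features)
content_only = content_loss(flipped, style_features)
```

The flipped feature map has essentially zero style loss but a non-zero content loss, which shows why the Gram-based term captures texture statistics rather than spatial layout.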
In 2017, Google’s paper “Attention Is All You Need” introduced the Transformer, a new kind of neural layer based entirely on the attention mechanism. Models built from transformer layers have proved superior in performance on many NLP problems. In addition, the transformer’s good parallelism resolves the difficulty of training traditional large NLP models in parallel. This broke the dominance of LSTMs in NLP applications.
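The core operation of a transformer layer, scaled dot-product attention, can be sketched in a few lines of NumPy. This is a single-head illustration without masking or learned projections; the array sizes are arbitrary assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q: (n_queries, d_k), k: (n_keys, d_k), v: (n_keys, d_v)."""
    d_k = q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k).
    scores = q @ k.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)
    # Each output row is a weighted average of the value rows.
    return weights @ v, weights

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))
k = rng.normal(size=(5, 8))
v = rng.normal(size=(5, 16))
output, weights = scaled_dot_product_attention(q, k, v)
```

Because all queries are processed in one matrix product, no step depends on the result of a previous position, which is the source of the parallelism mentioned above.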
3 The Design of Unsupervised Learning
Our aim is to design an unsupervised learning method that can train any large computer vision model on unlabeled training data. In Section 3.1, we describe how images are processed and how image sequences are generated. In Section 3.2, we introduce neural transfer as data augmentation for increasing object shape bias. In Section 3.3, we describe our improved contrastive prediction coding method. In Section 3.4, we describe the design of the cost function. Finally, in Section 3.5, we describe the complete improved contrastive prediction coding architecture.
3.1 Image Processing
Every image is first resized to 224x224. Then, moving along both the row and column directions as shown in Figure 3 with a step of 28 pixels, a total of 7x7 overlapping image patches are cropped (the overlap between adjacent patches is 56x28 in the row direction and 28x56 in the column direction) and arranged into a new 7x7 image grid. Each element of the image grid is then padded to 224x224 with value 0.
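The cropping and padding steps can be sketched as below. The patch size of 56x56 is inferred from the stated overlap and from the fact that 56 + 6 x 28 = 224 yields exactly seven patches per axis; placing the patch at the center of the padded canvas is an assumption, as are all function names. The image is assumed to be already resized to 224x224.

```python
import numpy as np

IMAGE_SIZE = 224
PATCH_SIZE = 56  # inferred, not stated explicitly in the text
STRIDE = 28

def extract_patch_grid(image):
    """Crop a (224, 224, C) image into a (7, 7, 56, 56, C) grid of overlapping patches."""
    n = (IMAGE_SIZE - PATCH_SIZE) // STRIDE + 1
    channels = image.shape[2]
    grid = np.empty((n, n, PATCH_SIZE, PATCH_SIZE, channels), dtype=image.dtype)
    for r in range(n):
        for c in range(n):
            top, left = r * STRIDE, c * STRIDE
            grid[r, c] = image[top:top + PATCH_SIZE, left:left + PATCH_SIZE]
    return grid

def pad_patch(patch, size=IMAGE_SIZE):
    """Zero-pad a patch to (size, size, C); centering is an assumption."""
    canvas = np.zeros((size, size, patch.shape[2]), dtype=patch.dtype)
    offset = (size - patch.shape[0]) // 2
    canvas[offset:offset + patch.shape[0], offset:offset + patch.shape[1]] = patch
    return canvas

# A synthetic image stands in for a resized training image.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
grid = extract_patch_grid(image)
padded = pad_patch(grid[0, 0])
```

With these values, horizontally adjacent patches share a 28-pixel-wide strip, so each interior pixel appears in several patches.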
3.2 Data Augmentation
To force the computer vision model to learn a more complex representation of object shape, texture information should be treated as negative information and separated from shape information. For this purpose, hand-picked texture samples are chosen as the target image styles learned by neural transfer models. Using these models, images with different textures can be generated from the original images. All of these extra images are processed with the same method introduced in Section 3.1.
3.3 Improved Contrastive Prediction Coding Method
Traditional contrastive prediction methods extract information and make predictions in sequential order. However, the information structure of an image differs from that of text and audio signals, whose location information is strongly ordered; image information instead has a strong aggregating property. So, to make the image patch sequence carry image location information, we design the image sequence generation mechanism as shown in Figure 6: we always choose a 3x3 block of image patches as the training block, and the two layers of image patches surrounding the training block are the target patches to predict. Each image patch chosen for training or as a target is padded with value 0 to 224x224, then processed by the computer vision model and a mean pooling layer, and output as a tensor. This gives a sequence of tensors for training, which is fed to the autoregressive model, composed of transformer layers, to predict the tensor sequence of the orange part. This forces the model to learn the object’s profile when the object is partly covered by the training block.
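One way to enumerate the training block and its surrounding target patches on the 7x7 grid is sketched below. The function name, the use of the block’s top-left corner as the anchor, the clipping at the grid border and the row-major ordering of the sequences are all assumptions made for illustration.

```python
def split_context_and_targets(anchor_row, anchor_col, grid_size=7, block=3, rings=2):
    """Return (context, targets) as lists of (row, col) grid positions.

    context: the block x block training patches whose top-left corner is
             the anchor.
    targets: patches in the `rings` layers around the block, clipped to
             the grid.
    """
    limit = grid_size - block
    if not (0 <= anchor_row <= limit and 0 <= anchor_col <= limit):
        raise ValueError("training block must lie inside the grid")

    context = [(r, c)
               for r in range(anchor_row, anchor_row + block)
               for c in range(anchor_col, anchor_col + block)]
    context_set = set(context)

    row_range = range(max(0, anchor_row - rings),
                      min(grid_size, anchor_row + block + rings))
    col_range = range(max(0, anchor_col - rings),
                      min(grid_size, anchor_col + block + rings))
    targets = [(r, c) for r in row_range for c in col_range
               if (r, c) not in context_set]
    return context, targets

# A centered block sees two full layers of targets around it.
center_context, center_targets = split_context_and_targets(2, 2)
# A corner block only has targets on two of its sides.
corner_context, corner_targets = split_context_and_targets(0, 0)
```

For the centered block, the nine training patches and the 40 target patches together cover the whole 7x7 grid; for a corner block, the clipped layers leave 16 targets.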
3.3.1 Contrastive Learning on Same Image
Suppose an image patch grid is generated with size
, and the stride, perception and anchor are set; the training sequence then consists of image patches from the image grid. The target sequence will be which also satisfies . With the computer vision model and mean pooling, the training sequence is transformed into a sequence of latent representations for , and so is the target sequence for .
Given the shapes of the sequences, an autoregressive function can be defined. When fed the transformed training sequence, it generates a prediction sequence with the same length and size as the target sequence. The learning process is regularized by
3.3.2 Contrastive Learning On Different Image
3.4 The Design of the Cost Function
The feature extraction ability of the computer vision model is trained by evaluating the quality of the autoregressive model’s predictions. In essence, training fits a conditional probability model that predicts the representations of the target image patches from the learned representations of the sampled image patches. Because the contrastive learning processes on images with different and identical textures run simultaneously, the final cost function is a combination of multiple cost functions. Its base component is a cross-entropy loss with softmax as the probability prediction function, summed over the locations of the prediction and target sequences:
Because multiple learning processes run at the same time on multiple images, the final cost function is likewise composed of multiple cost functions:
Where represents the cost function of contrastive learning on the same image, and is the cost function of learning on the original image and the image with the th kind of texture. The values need not sum to 1; the ratio between and the sum of controls whether the model learns more about the object’s texture or its shape.
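The two ingredients described in this section, a softmax cross-entropy summed over sequence locations and a weighted combination of several such losses, can be sketched as follows. This is an illustration under stated assumptions rather than the exact cost function: dot-product scores are assumed, the targets at the other locations of the same sequence serve as negatives, and the function names and weight values are hypothetical.

```python
import numpy as np

def contrastive_cross_entropy(predictions, targets):
    """Softmax cross-entropy summed over sequence locations.

    predictions, targets: (num_locations, dim). For location i, target i
    is the positive and the remaining targets are the negatives
    (an assumption about how negatives are chosen).
    """
    logits = predictions @ targets.T
    # Log-softmax over candidate targets, stabilized by the row maximum.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # The diagonal holds the log-probability of each correct target.
    return float(-np.trace(log_probs))

def combined_loss(same_image_loss, texture_losses, same_weight, texture_weights):
    """Weighted sum of the same-image loss and the per-texture losses."""
    if len(texture_losses) != len(texture_weights):
        raise ValueError("one weight is required per texture loss")
    weighted = sum(w * l for w, l in zip(texture_weights, texture_losses))
    return same_weight * same_image_loss + weighted

targets = np.eye(4) * 10.0
aligned = contrastive_cross_entropy(targets, targets)
# Shift the predictions by one location so that every prediction matches
# the wrong target.
shuffled = contrastive_cross_entropy(np.roll(targets, 1, axis=0), targets)
total = combined_loss(1.0, [2.0, 3.0], same_weight=0.5, texture_weights=[0.25, 0.25])
```

The weights are not required to sum to 1; raising the same-image weight relative to the texture weights shifts the emphasis of training between the two kinds of contrastive learning.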
3.5 Final Unsupervised Learning Architecture
Working on the details.
As shown in Figure 4, for every training image we use five neural transfer models, introduced in Section 3.2, to generate another five images with different textures.