Comparison Detector: A novel object detection method for small dataset

by   Zhihong Tang, et al.

Although object detection has shown great success when the training set is sufficient, it generalizes poorly in the small dataset scenario. However, in some application scenarios, especially medicine, only a small dataset is available. In this paper, we propose the Comparison detector, which maintains end-to-end training and testing and surpasses the state-of-the-art two-stage object detection model on the small dataset. Inspired by one/few-shot learning, we replace the parametric classifier in the feature pyramid network (FPN) with a comparison classifier in a non-parametric or semi-parametric manner. In effect, a stronger inductive bias is added to the model to simplify the problem and reduce the dependence on data. The performance of our model is evaluated on the cervical cancer pathology test set. When trained on the small dataset, it achieves an mAP of 26.3% and an AR of 35.7%. When trained on the medium dataset, the Comparison detector achieves the same mAP as the current state-of-the-art model and improves AR by 4 points. Our method is promising for the development of object detection in the small dataset scenario.



1 Introduction

Because deep neural networks can learn robust features that are closely related to the task, they have achieved great success in many fields, such as speech recognition [1, 2], natural language processing [3], robotic control [4, 5], and art [6, 7], and especially in various image tasks. ResNet [8] has become a standard model for feature extraction in image research due to its strong performance, which exceeds human performance on the ImageNet [9] dataset. Faster R-CNN with Feature Pyramid Networks (FPN) [10, 11] achieves top precision on object detection benchmarks. Recently, expanding the application field of deep neural networks has become a new trend. Information extraction from medical images and the diagnosis of pathological images have become some of the most popular research domains [12, 13, 14, 15]. It is well known that sufficient data is needed for deep neural networks to generalize well. However, the collection of medical images is restricted by law, and some positive samples in pathological images are rare. Furthermore, collecting them is labor-intensive and time-consuming. It is therefore necessary to explore deep neural networks that generalize strongly and perform better on small datasets. In image recognition, the task of learning from one or a few images is called one/few-shot learning [16, 17, 18, 19]. Most methods are based on the idea of comparison.

In this work, we bring the idea of comparison into object detection for small datasets and propose the Comparison detector to alleviate the over-fitting that often occurs in modern object detection models, while still achieving better performance as the training sample size increases. Specifically, we choose the state-of-the-art object detection method, Faster R-CNN with FPN [10, 11], as our baseline model and replace the original parametric classifier with a non-parametric or semi-parametric one based on comparison with the reference images of each category. Instead of manually choosing the reference images of the background by heuristic rules, we propose to learn them from the data. We also investigate several important factors, including the generation of the prototype representations of categories and the design of the head model. The model maintains the end-to-end paradigm in both the training and testing stages.

We evaluate the proposed Comparison detector on the test set of the cervical cancer pathological image dataset (CCD). When trained on the full training set (see Section 4.1 for details), our Comparison detector achieves almost the same result as the baseline model, with a mean Average Precision (mAP) of 52.3% on the CCD test set, and improves Average Recall (AR) by nearly 4 points over the baseline model. When trained on the small dataset (see Section 4.1 for details), the Comparison detector performs clearly better than the baseline model: it achieves an mAP of 26.3% and an AR of 35.7%, whereas the baseline model reaches only an mAP of 6.6% and an AR of 12.9%. The experimental results show that our method alleviates the over-fitting problem of deep neural networks on small data and generalizes better in object detection.

2 Related work

2.1 One/few shot learning

One/few-shot learning is the task of recognizing images from a small number of samples of each new class [16, 17, 18, 19, 20, 21]. To the best of our knowledge, the methods can be divided into three types. The first type is based on a metric function. It converts the reasoning problem into a judgment problem, which greatly reduces the complexity. The general practice is to extract the features of the query image and the support set images through a projection function, such as a convolutional neural network (CNN). A metric function is then used to measure the similarity of the features. Finally, the label of the query image is inferred from the similarity. Recent work such as Siamese networks [16], Matching Networks [17], Prototypical Networks [18], and Relation Network [19] follows this approach. It should be pointed out that the categories in the training and test datasets are different in one/few-shot learning but the same in our model setting.

The second type uses a memory mechanism. Inspired by Neural Turing Machines [22], Santoro [20] builds a new module in the network called external memory, which uses the Least Recently Used Access principle to update the memory with the input features so that information about a new class can be quickly absorbed by the network. Using content-based addressing, it reads the relevant content for classification.

The third type learns how to update the parameters. The authors of [21] consider optimization to be the key to learning and propose to learn a good optimizer to improve generalization on small datasets. The model updates the classifier with an optimization learner on the training dataset, which yields a good initialization, and then updates the optimization learner using the support set data in the test dataset. Updating the classifier in this way gives the model good generalization.

All three types of method can learn new knowledge from samples rapidly, but the first is simpler and easier to implement. We therefore use the idea of the first type to solve the problem of object detection in the small dataset scenario.
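The metric-based approach can be illustrated with a minimal sketch. It assumes a prototype-style variant (one mean embedding per class) and a squared Euclidean metric; the `embed` function is a hypothetical stand-in for the CNN projection function and simply flattens each image.

```python
import numpy as np

def embed(images):
    # Hypothetical stand-in for the CNN projection function: flatten each image.
    return images.reshape(len(images), -1).astype(float)

def classify_by_comparison(query, support, support_labels):
    """Label each query image with the class of its nearest class prototype."""
    q = embed(query)
    s = embed(support)
    classes = np.unique(support_labels)
    # One prototype per class: the mean embedding of that class's support images.
    prototypes = np.stack([s[support_labels == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance between every query and every prototype.
    dists = ((q[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    # The inferred label is the class with the smallest distance.
    return classes[dists.argmin(axis=1)]
```

No classifier weights are learned here; the label follows entirely from the comparison, which is the property the Comparison detector carries over to object detection.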

2.2 R-CNN

Since R-CNN [23] achieved good performance in the object detection task by using deep neural networks, many research teams have worked to improve it. Fast R-CNN [24] adds bounding box regression learning to the model to achieve multi-task learning. Faster R-CNN [10] integrates object proposal generation into the model, enabling end-to-end training and inference. FPN [11] exploits the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids, which makes the model robust to objects of different sizes. Mask R-CNN [25] uses RoIAlign instead of ROI pooling to maintain the equivariance of features and further improve the performance of the object detection model. At present, Faster R-CNN with FPN is the state-of-the-art and fundamental model for the object detection task. Much later work is based on it, such as Deformable Convolutional Networks [26] and Light-Head R-CNN [27]. Similarly, we use it as the fundamental and baseline model in this work.

2.3 Object detection with few examples

We survey object detection in small dataset scenarios. Most existing methods use iterative or semi-supervised approaches to exploit unannotated data. For example, Dong [28] trains the model on a few annotated images and selects high-confidence objects in the unannotated images as pseudo-annotations. Model fusion further enhances performance. Notably, unlike an ensemble model, the fusion is embedded in the learning process: through the objective function, each model is constrained to make the same prediction for the same objects. Yang [29] presents an iterative method that enhances performance by learning video-specific features with a basic object detector. Wang [30] chooses a set of object detectors based on a small sample of the new task and then fine-tunes them. However, these object detectors are trained on a large-scale object detection dataset. To constrain the distance between the prototype representations of different categories, RepMet [31] adds a hinge-like loss to the network, denoted the ensemble loss. Notably, this loss requires the distance to exceed a threshold. The authors first use a trained object detector to generate ROI features and then use an extra model to obtain the features of the support set images. The extra model is trained in a one/few-shot learning manner with embedding and classification losses. All of these methods require a related large-scale dataset. The problem they address is a lack of annotated data, not a lack of data. In contrast, we aim to strengthen the generalization of the model in the small dataset scenario.

3 Comparison Detector

3.1 Basic Architecture

State-of-the-art models over-fit severely on small datasets. Generally, adding an inductive bias to the network acts as a regularization term that helps avoid over-fitting. Ideally, it improves the search for solutions without substantially diminishing performance and helps find solutions that generalize in a desirable way [32]. For example, the inductive bias of the CNN is the local correlation of the data, which is consistent with the way information is distributed in images. Therefore, CNNs generally outperform fully connected neural networks in the image domain, and the regularization of the model is enhanced.

Generally speaking, a non-parametric method introduces a stronger inductive bias into the model and mitigates over-fitting to some extent. We therefore introduce the idea of comparison into our model and replace the parametric classifier of FPN [11] with a comparison classifier. In effect, the model adds the inductive bias that, in the embedding space, the distance between samples of the same category is smaller than the distance between samples of different categories. It is designed to solve the generalization problem of learning from a small amount of data, not from sufficient data with few annotations. The framework of the proposed Comparison detector is shown in Fig. 1, and we describe it in three stages.

Figure 1: Comparison detector. The green module is shared by the object detection images and the reference images. The input of the red module is only the features of the object detection images, and the input of the blue module is the features of the reference images. (a): The overall structure of the Comparison detector. It is made up of the backbone network, the feature pyramid network, and the head. The green circle represents the head. The blue module generates the prototype representations of the categories. (b): The head of the Comparison detector. Classification and bounding box regression are calculated independently. The number of categories includes the background.

At the first stage, as shown in Fig. 1(a), the features of the reference images and the object images are computed by the backbone network with FPN, without using extra models to encode the reference images. Assume that there are $K$ samples per category and $L$ levels of pyramid features in the reference images. Let $f_c^l$ be the $c$-th category's prototype representation of the $l$-th level pyramid features, which can be computed by averaging as follows

$$f_c^l = \frac{1}{K} \sum_{k=1}^{K} F^l(r_{c,k}) \qquad (1)$$

where $F^l$ and $r_{c,k}$ denote the function that computes the $l$-th level of the feature pyramid and the $k$-th reference image of class $c$, respectively. At the same time, the feature of the object image is generated by

$$g_i^l = F^l(x_i) \qquad (2)$$

where $x_i$ indicates the $i$-th object image. It should be pointed out that, unlike in one/few-shot learning, the categories in the training and test sets are the same in our setting.

The second stage is to generate the prototype representation of each category from the pyramid features of the reference images. We need to find a map function $M$ that takes the pyramid features of all levels of a category as input and computes the final prototype representation $p_c$ for class $c$

$$p_c = M(f_c^1, \dots, f_c^L) \qquad (3)$$

The third stage is the design of the head model, which consists of a comparison classifier and a bounding box regressor (Fig. 1(b)). Let $d$ be a metric function that computes the distance between the proposal feature $z$ and the prototype representation $p_c$ of category $c$. It is important to note that $z$ and $p_c$ have the same size. Each proposal's classification and bounding box regression can be obtained by

$$\hat{y} = \arg\min_{c} d(z, p_c) \qquad (4)$$

$$\hat{b} = R(z) \qquad (5)$$

where $R$ denotes the box regression function. The rest of the model is the same as the Faster R-CNN with FPN model [11].
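The first and third stages can be sketched as follows. This is an illustrative sketch only: the array layouts, the function names `level_prototypes` and `comparison_head`, and the mean squared difference used as the metric are assumptions, and `regress_box` is a hypothetical placeholder for the box regression function.

```python
import numpy as np

def level_prototypes(ref_features):
    """Stage 1: average reference features per category at each pyramid level.

    ref_features[l] has shape (C, K, H_l, W_l, D): C categories with K
    reference samples each at level l. Returns one (C, H_l, W_l, D) array
    per level.
    """
    return [f.mean(axis=1) for f in ref_features]

def comparison_head(roi_feature, prototypes, regress_box):
    """Stage 3: classify a proposal by comparison and regress its box.

    roi_feature: (H, W, D) proposal feature.
    prototypes:  (C, H, W, D) final category prototypes, same size as the ROI.
    regress_box: hypothetical callable mapping the ROI feature to box offsets.
    """
    # Assumed metric: mean squared difference to each category prototype.
    dists = ((prototypes - roi_feature) ** 2).mean(axis=(1, 2, 3))
    label = int(dists.argmin())
    # In this sketch the regressor sees only the ROI feature.
    return label, regress_box(roi_feature)
```

The second stage, which maps the per-level prototypes to one final prototype per category, is left out here because it depends on the chosen map function.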

Figure 2: Module for generating the prototype representation of the background. The background feature is concatenated at the start of the array.

3.2 Learning the reference background

Many negative proposals are generated by the RPN, so R-CNN [23] adds a background category to represent them. In our Comparison detector, we need to select a number of reference images for each category, and therefore we also need to choose reference images for the background category. Because the background is overwhelmingly diverse, selecting background references is very difficult. Notice that a region is proposed because it has a certain similarity to the object categories, even though it does not belong to any of them. It can therefore be inferred that, in most cases, its features are a combination of the features of different categories. So we propose to learn the background prototype by combining the prototype representations of all the categories in the reference samples, as shown in Fig. 2.
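One way to picture the learned background is the sketch below. The exact form of the combination is not specified here, so a softmax-weighted sum with hypothetical learnable weights `mix_logits` is assumed purely for illustration; only the placement of the background at the start of the array follows Fig. 2.

```python
import numpy as np

def add_background_prototype(prototypes, mix_logits):
    """Build a background prototype from the category prototypes.

    prototypes: (C, H, W, D) prototype representations of the C object classes.
    mix_logits: (C,) hypothetical learnable mixing weights (an assumption).
    Returns a (C + 1, H, W, D) array with the background at index 0.
    """
    # Softmax turns the learnable logits into positive weights that sum to 1.
    w = np.exp(mix_logits - mix_logits.max())
    w = w / w.sum()
    # Weighted combination over the category axis gives an (H, W, D) feature.
    background = np.tensordot(w, prototypes, axes=1)
    # The background is concatenated at the start of the array.
    return np.concatenate([background[None], prototypes], axis=0)
```

Because the background is computed from the other prototypes, gradients from background proposals also flow into the category prototypes when such a module is trained.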

3.3 Generating prototype representations of categories

As shown in Eq. 4, the Comparison detector uses a metric function to measure the distance, or dissimilarity, between the prototype representation of each category and the features of a proposal, and then obtains the label of the proposal from the dissimilarity. The features of a proposal may come from any of the four pyramid levels, and the prototype representation of a category is produced according to Eq. 3. For simplicity, we directly resize each level of the feature pyramid generated from the reference images to a fixed size and then calculate the prototype representation by a mean operation, i.e.

$$p_c = \frac{1}{L} \sum_{l=1}^{L} \phi(f_c^l, s) \qquad (6)$$

where $f_c^l$ is the $l$-th level pyramid feature of category $c$, $L$ is the total number of feature pyramid levels, $\phi$ is the resize function, and $s$ is the size of the final features. In other words, the pyramid features of a category at different levels are resized to a fixed size and then simply averaged to obtain the prototype representation.
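The resize-and-average step can be sketched as follows. Nearest-neighbour resizing and the default output size of 7 are assumptions made for illustration; the text does not state which resize function or size is used.

```python
import numpy as np

def resize_nearest(feature, size):
    """Resize an (H, W, D) feature map to (size, size, D) by nearest neighbour."""
    h, w = feature.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return feature[rows][:, cols]

def category_prototype(level_features, size=7):
    """Average one category's pyramid features after resizing to a fixed size.

    level_features: list of (H_l, W_l, D) arrays, one per pyramid level.
    size: side length of the final feature (hypothetical default).
    """
    resized = [resize_nearest(f, size) for f in level_features]
    return np.mean(resized, axis=0)
```

Since every level contributes equally to the mean, the resulting prototype mixes coarse and fine pyramid levels, which is what allows it to be compared with proposals drawn from any level.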

Figure 3: (a): The head of the baseline model. (b): The shared module in our experiments.

3.4 The head for classification and regression

As shown in Fig. 3(a), the head of the baseline model first transforms the proposal feature; one branch is then used for classification, and another is used to predict the offset of the bounding box. For our Comparison detector, the introduction of the reference images requires us to reorganize the head. There are two choices, depending on whether the reference images are involved in the box regression branch. One is to use the reference prototypes only for classification, as shown in Fig. 1(b). Unlike in the baseline model, the comparison classifier and the bounding box regressor in the head of the Comparison detector are then independent, and the bounding box regressor uses only the ROI features to predict the offset of the bounding box.

The other choice is to use the reference prototypes for both classification and regression, as shown in Fig. 3(b).

We call this method the shared module. Both choices achieve good performance in our experiments, but the independent module performs slightly better.

Figure 4: t-SNE visualization of objects. (a): Visualization before learning. (b): Visualization after learning. In our experiments, the features of the objects are reduced to 3D space; the plots show the space from one viewing angle.

3.5 Reference images sampling

In our Comparison detector, we also need to choose some objects of each category as reference images. We first randomly select about 150 instances of each category from the training set. The shortest side of each of these instances is greater than 16 pixels. This gives a total of 1560 instances, from which we select suitable instances as our reference images.

There are three schemes. In the fixed mode, we randomly choose 3 instances of each category (this number is limited by GPU memory) as the reference images. In the random mode, we randomly select 5 candidates of each category from those objects; the model then randomly selects three of the five candidates as templates during training but uses all five in testing. In the last scheme, we map the 1560 objects to the feature space through the baseline model to obtain the features of each object, and then use t-SNE (t-distributed Stochastic Neighbor Embedding) [33] to reduce the feature dimension (Fig. 4). Finally, we select representative objects in the 3D space as our reference images.
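The fixed and random modes can be sketched with the standard library alone. The function names and the dictionary layout (category name mapped to a list of instance identifiers) are hypothetical choices for this illustration.

```python
import random

def sample_candidates(instances_by_class, per_class, seed=0):
    """Randomly pick `per_class` candidate instances for every category.

    With per_class=3 this corresponds to the fixed mode; with per_class=5 it
    produces the candidate pool used by the random mode.
    """
    rng = random.Random(seed)
    return {c: rng.sample(items, per_class) for c, items in instances_by_class.items()}

def references_for_step(candidates, training, rng, train_k=3):
    """Random mode: draw 3 of the 5 candidates in training, use all 5 in testing."""
    if training:
        return {c: rng.sample(items, train_k) for c, items in candidates.items()}
    return {c: list(items) for c, items in candidates.items()}
```

In the fixed mode the result of `sample_candidates` is reused unchanged at every iteration, whereas the random mode calls `references_for_step` each time references are needed.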

4 Experiment and Result

4.1 Materials and experiments

Cervical cancer is one of the most common gynecological malignant tumors. The main screening method for cervical cancer is liquid-based cytology. We digitize the slides to form the cervical cancer dataset (CCD). This dataset has 11 categories: ascus (ASC-US), asch (ASC-H), lsil (low-grade squamous intraepithelial lesion), hsil (high-grade squamous intraepithelial lesion), scc (squamous-cell carcinoma), agc (atypical glandular cells), trich, cand (candida), flora, herps, and actin (actinomyces); the last five classes are microorganisms. These categories are highly important for the examination of cervical cancer. We use deep-learning-based object detection to classify and locate these objects, and in this way assess the possibility of cervical cancer. We divide the dataset into a training set, which contains 6667 images, and a test set, which contains 419 images. We randomly chose about 762 images from the training set to form a small dataset. The category counts for each dataset are shown in Fig. 5.

Figure 5: The number of categories on different datasets.

In medical images, annotators tend to apply a higher threshold when labeling objects because the objects are hard to discriminate. At the same time, multiple nearby objects of the same category are often marked as one, so mAP alone cannot reflect the performance of the model well. Therefore, we evaluate the model on the CCD test set using mAP, with AR as a supplement. If the mAP does not decrease and the AR improves, we consider the performance improved. We re-scale the reference images so that their size matches the input of the pre-trained model. In all experiments, we use ResNet50 pre-trained on ImageNet as the backbone network. The initial learning rate is 0.001 and is decreased by a factor of 10 at the 35th and 50th epochs. Training is stopped after 60 epochs, and the other parameters are the same as in FPN [11]. All experiments are first trained on the full training set to ensure performance on sufficient data. In our setting, the reference images are the same in every training iteration to stabilize training, and the same reference images are used in the test stage. A summary of results can be found in Table 1.

Model   mAP    AR
A       34.1   53.3
B       32.7   50.8
C       41.0   51.3
D       38.9   49.8
E       37.7   51.1
F       38.8   52.3
G       43.5   58.9
H       43.7   60.7
Table 1: All experiments are trained on the full training set, and the performance is evaluated on the test set. The comparison classifier of all models directly adopts the L2-distance. The reference samples are the same in all experiments and are produced by the fixed mode before the experiment.
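The learning-rate schedule described in Section 4.1 can be written as a small step function. This sketch assumes zero-based epoch indices, so the rate first drops at epoch index 35; the function name and argument names are illustrative.

```python
def learning_rate(epoch, base_lr=0.001, milestones=(35, 50), factor=0.1):
    """Step schedule: divide the learning rate by 10 at each milestone epoch."""
    # Count how many milestones have been reached by this epoch.
    drops = sum(epoch >= m for m in milestones)
    return base_lr * factor ** drops

# Training stops after 60 epochs, so the schedule covers epochs 0 to 59.
schedule = [learning_rate(e) for e in range(60)]
```

Under this assumption the rate is 0.001 for the first 35 epochs, about 0.0001 for the next 15, and about 0.00001 for the last 10.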

4.2 Reference background

We first evaluate our scheme for learning the background reference. The experiment shows that our method is feasible (see model A in Table 1). Note that because the prototype representation of the background category is learned from the prototype representations of the other categories, gradient propagation through the background affects the optimization of the other prototype representations. To determine whether this effect is beneficial, we stop gradient propagation at the fork position in Fig. 2. The performance of the model then declines to an mAP of 33.0% and an AR of 52.6%, which shows that the effect is beneficial. Instead of generating the background prototype, we can also remove it. However, the experiment shows that it is very important for optimization: without it, the model fails to converge.

4.3 Prototype representations of categories

In our approach, as shown in Eq. 6, we use all pyramid features to generate the prototype representation of each category. Another choice is to use only the last-level pyramid feature as the category prototype. As the results of model A versus model B and model C versus model D in Table 1 show, our method is better, because it combines features from multiple pyramid levels, which not only have rich semantics but also account for objects of different sizes. We also tried fusing the different levels of pyramid features with an LSTM [34], but the speed is greatly reduced.

4.4 Head model

As mentioned before, in the independent module the box regression function is the same as in the baseline model, because experiments show that removing one layer makes the result worse. The results show that the shared module (model C) achieves a much higher mAP than the independent module (model A). Furthermore, we drop the bounding box refinement operation from the head. As shown in Table 1, model E surprisingly achieves a higher mAP than model A. Combined with [15], we infer that classification should be more important than bounding box regression in our model. We therefore add a weight coefficient to the classification loss of the model's head and select its value by tuning. The results of models F, G, and H in Table 1 confirm our analysis. Comparing model G and model H, we find that they differ not only in whether classification and bounding box regression are independent, but also in that the comparison classifier of model G is semi-parametric. After changing the comparison classifier of model H to be semi-parametric (Fig. 1(b)), the results show that it is better than model G.
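A weighted head loss of this kind can be sketched as follows. Several details are assumptions for illustration: the classification loss is taken to be a cross-entropy over the softmax of negative distances, the box loss is a smooth L1 loss, and the default `cls_weight=2.0` is a hypothetical value, not the one selected in the experiments.

```python
import numpy as np

def comparison_cross_entropy(dists, label):
    """Cross-entropy where smaller distance means higher class score (assumed)."""
    logits = -dists
    logits = logits - logits.max()  # subtract the max for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[label]

def smooth_l1(pred, target):
    """Smooth L1 box loss: quadratic near zero, linear for large errors."""
    diff = np.abs(pred - target)
    return np.where(diff < 1.0, 0.5 * diff ** 2, diff - 0.5).sum()

def head_loss(dists, label, box_pred, box_target, cls_weight=2.0):
    """Head loss with a weight coefficient on the classification term."""
    cls = comparison_cross_entropy(dists, label)
    box = smooth_l1(box_pred, box_target)
    return cls_weight * cls + box
```

Raising `cls_weight` makes classification errors dominate the gradient, which matches the inference that classification matters more than box regression in this model.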

Comparator   L2-distance   L2-distance + parameters   concat + parameters
mAP          34.1 (43.7)   38.2 (44.5)                40.7 (42.5)
AR           53.3 (60.7)   56.8 (61.6)                49.1 (58.1)
Table 2: Different comparison classifiers.

Method   fixed mode   random mode   t-SNE
mAP      44.5         42.8          45.3
AR       61.6         61.0          62.8
Table 3: Different ways of selecting the reference images.

4.5 Optimizing comparison classifier

There are three comparison classifiers. The first measures similarity directly with the L2-distance, using an averaging function to reduce the tensor. The second is a parameterized L2-distance. For the third, similar to [19], we let the model learn the metric function instead of designing it manually. According to the results in Table 2, our model ultimately adopts the parameterized L2-distance. The results obtained with the weighted classification loss are shown in brackets. Combined with the results in Table 1, this trick consistently improves performance in our model, so we adopt it in all subsequent experiments.
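The first two comparators can be sketched as below. The exact formulas are assumptions: the plain variant is taken to average the squared element-wise difference over the tensor, and the parameterized variant is taken to scale each channel by a hypothetical learnable weight vector `channel_weights` before averaging.

```python
import numpy as np

def l2_comparator(roi, prototype):
    """Plain comparator: mean squared difference over the whole tensor."""
    return ((roi - prototype) ** 2).mean()

def parameterized_l2_comparator(roi, prototype, channel_weights):
    """Semi-parametric comparator with per-channel weights (an assumption).

    roi, prototype: (H, W, D) features of equal size.
    channel_weights: (D,) hypothetical learnable weights, broadcast over H and W.
    """
    return (((roi - prototype) ** 2) * channel_weights).mean()
```

With all weights equal to one the parameterized form reduces to the plain L2 comparator, so the learnable weights only add the freedom to emphasize some channels over others.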

4.6 Reference images sampling

To eliminate the effect of randomness, we ran three experiments with model A, each randomly selecting 3 objects. The difference between the results is less than 0.5 points in mAP, which shows the robustness of the Comparison detector. To save time, all other experiments report the result of a single run. As shown in Table 3, the t-SNE scheme performs best. In the t-SNE experiment, the hyper-parameters are 30 for perplexity, 1 for learning rate, and 10 for label supervision.

5 Discussion

As shown in Table 4, the Comparison detector has almost the same mAP as the baseline model when trained on the full training set, but improves the AR by nearly 4 points. We therefore consider it better than the state-of-the-art model. Because of the annotation issues described in Section 4.1, some correct predictions may be counted as false positives, which explains why the AR increases while the mAP does not. Notably, the baseline model does not use the balance loss trick, because it degrades the baseline to an mAP of 43% and an AR of 56.1%. When trained on the small dataset, the Comparison detector is clearly superior to the baseline model. It achieves an mAP of 26.3% on the CCD test set compared to 6.6% for the baseline, which indicates that our method alleviates the over-fitting problem to some extent. The model size and time efficiency are similar to those of the baseline. The prototype representation in this model is generated from reference images, but in fact it can be generated in other ways, such as by external memory. In future work, we expect to find a better solution for generating prototype representations.

Method               Dataset   AR     mAP    ascu   asch   lsil   hsil   scc    agc    trich  cand   flora  herps  actin
Baseline             small     12.9   6.6    11.0   2.0    23.7   21.6   0.0    3.5    0.0    11.5   0.0    0.0    0.0
Comparison detector  small     35.7   26.3   10.5   1.7    42.8   32.3   0.8    40.5   37.5   24.1   6.9    45.0   46.6
Baseline             full      58.9   45.2   27.2   6.7    41.7   35.3   18.6   57.3   46.7   72.2   57.3   83.0   51.4
Comparison detector  full      62.8   45.3   29.1   7.8    43.0   37.8   19.3   56.2   50.4   62.2   59.4   64.4   68.3
Table 4: The results of learning on datasets of different sizes.

6 Conclusion

In this work, we focus on the generalization of object detection models that learn from small datasets. Our research targets small datasets, not big datasets with few annotations. Building on the state-of-the-art object detection model, we propose the Comparison detector and solve several important problems affecting generalization. It improves the mAP by 19.7 points when trained on the small dataset, and it has comparable performance to the baseline model on datasets of other scales. These results show that introducing the idea of comparison into the object detection model is effective. Our method alleviates the over-fitting problem and improves the performance of the model. Our research not only advances object detection in the small dataset scenario, but also provides a good foundation for further work.


Appendix A Appendix

A.1 Gallery

Figure 6: Results of Comparison detector