Because deep neural networks can learn robust features closely related to the task, they have achieved great success in many fields, such as speech recognition [1, 2, 3], robotic control [4, 5], and art [6, 7], and especially in various image tasks. ResNet
has become a standard model for feature extraction in image research due to its overwhelming performance, which surpasses humans on the ImageNet dataset. Faster R-CNN with Feature Pyramid Networks (FPN) [10, 11] achieves top precision on object detection benchmarks. Recently, expanding the application fields of deep neural networks has become a new trend. Information extraction from medical images and the diagnosis of pathological images have become one of the most popular research domains [12, 13, 14, 15]. It is well known that sufficient data is needed for deep neural networks to obtain good generalization performance. However, the collection of medical images is limited by law, some positive samples in pathological images are rare, and annotation is labor-intensive and time-consuming. It is therefore necessary to explore deep neural networks that generalize well and perform better on small datasets. In image recognition, the task of learning from one or a few images is called one/few-shot learning [16, 17, 18, 19]. Most methods are based on the idea of comparison.
In this work, we migrate the idea of comparison into object detection for small datasets and propose the Comparison detector to alleviate the over-fitting problem that often occurs in modern object detection models, while still achieving better performance as the training sample size increases. Specifically, we choose the state-of-the-art object detection method, Faster R-CNN with FPN [10, 11]
, as our baseline model and replace the original parametric classifier with a non-parametric or semi-parametric one based on comparison with the reference images of each category. Instead of manually choosing the reference images of the background by heuristic rules, we propose to learn them from the data. We also investigate several important factors, including how to generate the prototype representations of categories and the design of the head model. The model maintains the end-to-end paradigm in both the training and testing stages.
We evaluate the performance of the proposed Comparison detector on the test set of our pathological image dataset of cervical cancer (CCD). When the model is learned from the full training dataset (see Section 4.1 for details), our Comparison detector achieves almost the same result as the baseline, with a mean Average Precision (mAP) of 52.3% on the CCD test set, and improves Average Recall (AR) by nearly 4 points over the baseline model. When the model is learned from the small dataset (see Section 4.1 for details), the performance of the Comparison detector is clearly better than that of the baseline model: the Comparison detector attains an mAP of 26.3% and an AR of 35.7%, whereas the baseline model only attains an mAP of 6.6% and an AR of 12.9%. These experimental results show that our method alleviates the over-fitting problem of deep neural networks on small data and generalizes better in object detection.
2 Related work
2.1 One/few shot learning
To the best of our knowledge, existing approaches can be divided into three types. The first type is based on a metric function. It converts the reasoning problem into a judgment problem, which greatly reduces complexity. The general practice is to extract the features of the query image and the support-set images through a projection function, such as a convolutional neural network (CNN). Then a metric function is used to measure the similarity of the features. Finally, the label of the query image is inferred from this similarity. Recent related work such as Siamese networks, Matching Networks, Prototypical Networks, and Relation Networks follows this approach. It should be pointed out that the categories in the training and test datasets are different in one/few-shot learning, but the same in our model setting.
The second type uses a memory mechanism. Inspired by Neural Turing Machines, Santoro et al. build a new module in the network called an external memory, which uses the principle of Least Recently Used Access to update the memory with input features, so that information about new classes can be quickly absorbed by the network. Using content-based addressing, it reads the relevant content for classification.
The third type learns the way parameters are updated. Ravi and Larochelle argue that optimization is key to learning, and propose to learn a good optimizer to improve generalization on small datasets. The model updates the classifier with a learned optimizer on the training dataset to obtain a good initialization, then updates the learned optimizer using the support-set data of the test dataset; updating the classifier in this way gives the model good generalization.
All three methods can learn new knowledge from samples rapidly, but the first is simpler and easier to implement. We therefore use the idea of the first method to solve the object detection problem in the small-dataset scenario.
2.2 Object detection

Since R-CNN achieved good performance on the object detection task using deep neural networks, many research teams have devoted effort to improving it. Fast R-CNN adds bounding-box regression to the model to achieve multi-task learning. Faster R-CNN integrates object proposals into the model, enabling end-to-end training and inference. FPN exploits the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids, which makes the model robust to objects of different sizes. Mask R-CNN uses RoIAlign instead of RoI pooling to maintain the equivariance of features and further improves the performance of object detection. At present, Faster R-CNN with FPN is the state-of-the-art and fundamental model for object detection, and much later work builds on it, such as Deformable Convolutional Networks and Light-Head R-CNN. Similarly, we use it as the fundamental and baseline model in this work.
2.3 Object detection with few examples
We investigated the research on object detection in small-dataset scenarios. Most existing methods use iterative or semi-supervised approaches to exploit unannotated data. For example, Dong et al. use a few annotated images to train the model, and objects predicted with high confidence on the unannotated images are selected as pseudo-annotations; model fusion further enhances performance. Notably, unlike an ensemble model, the fusion is embedded in the learning process: through the optimization objective, each model is constrained to make the same prediction for the same objects. Yang et al. present an iterative method that enhances performance by learning video-specific features through a basic object detector. Wang and Hebert choose a set of object detectors based on a small sample of the new task and then fine-tune them; however, these object detectors are trained on a large-scale object detection dataset. To constrain the distance between the prototype representations of different categories, RepMet adds a hinge-like loss to the network, denoted the ensemble loss, which requires the distance to exceed a threshold. The authors first use a trained object detector to generate ROI features, then an extra model is used to obtain the features of the support-set images; this extra model is trained in a one/few-shot learning manner with embedding and classification losses. In summary, all of these methods require related large-scale datasets. The problem they address is a lack of annotated data, not a lack of data, whereas we try to reinforce the generalization of the model in the small-dataset scenario.
3 Comparison Detector
3.1 Basic Architecture
State-of-the-art models suffer from serious over-fitting on small datasets. Generally, adding an inductive bias to the network can serve as a regularizer to avoid over-fitting. Ideally, it both improves the search for solutions without substantially diminishing performance and helps find solutions that generalize in a desirable way. For example, the inductive bias in a CNN is the local correlation of the data, which is consistent with the distribution characteristics of image information. Therefore, the performance of CNNs in the image domain generally exceeds that of fully connected neural networks, and the regularization of the model is enhanced.
Generally speaking, non-parametric methods introduce a stronger inductive bias into the model and mitigate the over-fitting issue to some extent. We therefore introduce the idea of comparison into our model and replace the parametric classifier of FPN with a comparison classifier. In effect, the model adds the inductive bias that, in the embedding space, the distance between samples of the same category is smaller than the distance between samples of different categories. It is designed to solve the problem of generalizing from a small amount of data, not that of sufficient data with few annotations. The framework of the proposed Comparison detector is shown in Fig. 1, which we describe in three stages.
In the first stage, as shown in Fig. 1(a), the features of the reference images and the object images are computed by the backbone network with FPN, without using extra models to encode the reference images. Assume there are $n$ samples per category and $L$ levels of pyramid features for the reference images. Let $p_c^l$ be the prototype representation of the $c$-th category at the $l$-th pyramid level, which can be computed by an average operation as follows:

$$p_c^l = \frac{1}{n} \sum_{i=1}^{n} F^l(x_c^i), \quad (1)$$
where $F^l$ and $x_c^i$ denote the function computing the $l$-th level of the feature pyramid and the $i$-th reference image of class $c$, respectively. At the same time, the feature of the object image is generated by

$$f_j^l = F^l(I_j), \quad (2)$$
where $I_j$ indicates the $j$-th object image. It should be pointed out that, unlike in one/few-shot learning, the categories in the training and test sets are the same in our setting.
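The averaging of reference features into per-level category prototypes described above can be sketched in NumPy. This is a toy illustration with made-up shapes and category names, not the actual implementation; the backbone-with-FPN feature extraction is assumed to have already produced the feature arrays.

```python
import numpy as np

def level_prototypes(ref_features):
    """Average reference-sample features into one prototype per category.

    ref_features maps category -> array of shape (n, L, H, W, C):
    n reference samples, each with L pyramid levels of HxWxC features.
    Returns category -> array of shape (L, H, W, C).
    """
    return {c: feats.mean(axis=0) for c, feats in ref_features.items()}

# Toy example: 2 categories, n=3 samples, L=4 levels, 7x7x8 features.
rng = np.random.default_rng(0)
refs = {c: rng.normal(size=(3, 4, 7, 7, 8)) for c in ("ascus", "lsil")}
protos = level_prototypes(refs)
```

In practice the feature maps at different pyramid levels have different spatial sizes; the fixed shapes here only serve to illustrate the averaging itself.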
The second stage generates the prototype representation of each category from the pyramid features of the reference images. We need a mapping function $\Phi$ that takes all levels of pyramid features of a category as input and computes the final prototype representation $p_c$ for class $c$:

$$p_c = \Phi(p_c^1, p_c^2, \ldots, p_c^L). \quad (3)$$
The third stage is the design of the head model, consisting of a comparison classifier and a bounding box regressor (Fig. 1(b)). Let $M$ be a metric function that computes the distance between a proposal feature $r_j$ and the prototype representation $p_c$ of a category. It is important to note that $r_j$ and $p_c$ have the same size. The classification and bounding box regression of each proposal can be obtained by

$$\hat{y}_j = \arg\min_{c} M(r_j, p_c), \quad (4)$$
$$\hat{t}_j = R(r_j), \quad (5)$$
where $R$ denotes the box regression function. The rest of the model is the same as the Faster R-CNN with FPN model.
3.2 Learning the reference background
The RPN generates many negative proposals, so R-CNN adds a background category to represent them. In our Comparison detector, we need to select a number of reference images for each category, and therefore we also need to choose reference images for the background category. Due to its overwhelming diversity, selecting background references is very difficult. Notice that a region being selected as a proposal indicates that it has a certain similarity with the object categories, yet it does not belong to any object class. It can therefore be inferred that its features are, in most cases, a combination of those of different categories. We thus propose to learn the background prototype by combining the prototype representations of all categories in the reference samples, as shown in Fig. 2.
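One plausible way to parameterize the learned combination of category prototypes is a softmax-weighted sum with learnable logits. This is our illustrative guess at the combination, not the paper's exact formulation:

```python
import numpy as np

def background_prototype(protos, weights):
    """Combine the category prototypes into a background prototype.

    protos: array of shape (C, ...) holding C category prototypes.
    weights: learnable logits of shape (C,), softmax-normalized so the
    background prototype stays in the same feature space as the categories.
    """
    w = np.exp(weights - weights.max())
    w = w / w.sum()
    return np.tensordot(w, protos, axes=(0, 0))
```

With uniform logits this reduces to the plain mean of the category prototypes; during training, gradients through the weights (and through the prototypes themselves, unless stopped) shape the learned background.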
3.3 Generating prototype representations of categories
As shown in Eq. 4, the Comparison detector uses the metric function to measure the distance, or dissimilarity, between the prototype representations of the categories and the features of a proposal, then obtains the label of the proposal based on this dissimilarity. The features of a proposal may come from any of the four pyramid levels, and the prototype representation of a category is produced according to Eq. 3. For simplicity, we directly resize each pyramid feature generated from the reference images to a fixed size, and then calculate the prototype representation by a mean operation, i.e.

$$p_c = \frac{1}{L} \sum_{l=1}^{L} \phi(p_c^l, s), \quad (6)$$
where $L$ is the total number of feature pyramid levels, $\phi$ is the resize function, and $s$ is the size of the final features. The pyramid features of a category at different levels are resized to a fixed size, and the prototype representation is then obtained by simply averaging them.
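The resize-and-average prototype generation can be sketched as follows. The nearest-neighbour resize is a stand-in, since the actual resize operator is not specified here:

```python
import numpy as np

def resize_nn(x, size):
    """Nearest-neighbour resize of an (H, W, C) feature map to (size, size, C)."""
    h, w, _ = x.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return x[rows][:, cols]

def prototype(level_feats, size):
    """Resize each pyramid level to a fixed size, then average across levels."""
    return np.mean([resize_nn(f, size) for f in level_feats], axis=0)
```

The alternative discussed later, using only the last pyramid level, simply skips the averaging and resizes a single level.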
3.4 The head for classification and regression
As shown in Fig. 3(a), the head of the baseline model first transforms the proposal features; then one branch is used for classification and another to predict the offsets of the bounding box. For our Comparison detector, due to the introduction of the reference images, we need to re-organize the head. There are two choices, according to whether the reference images are involved in the box regression branch. One is that the reference prototypes are used only for classification, as shown in Fig. 1(b). Unlike in the baseline model, the comparison classifier and the bounding box regressor in the head of the Comparison detector are then independent, and the bounding box regressor uses only the ROI features to predict the offsets of the bounding box. It is equivalent to

$$\hat{t}_j = R(r_j), \quad (7)$$

where $R$ is the same as the box regression function of the baseline model. The other choice is to use the reference prototypes for both classification and regression, as shown in Fig. 3(b), which means

$$\hat{t}_{j,c} = R(r_j, p_c). \quad (8)$$
We call this method the shared module. Both achieve good performance in our experiments, but the independent module performs slightly better.
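The two head designs can be contrasted in a minimal NumPy sketch for a single proposal. The squared-L2 comparison scores, single linear regressors, and feature sizes are illustrative assumptions, not the paper's architecture:

```python
import numpy as np

def compare_scores(roi, protos):
    """Comparison classifier: negative squared L2 distance to each prototype."""
    return -((roi[None, :] - protos) ** 2).sum(axis=1)

def independent_head(roi, protos, Wr, br):
    """Independent module: regression sees only the ROI feature."""
    return compare_scores(roi, protos), roi @ Wr + br

def shared_head(roi, protos, Wr, br):
    """Shared module: regression conditions on each category prototype too."""
    joint = np.concatenate([np.tile(roi, (len(protos), 1)), protos], axis=1)
    return compare_scores(roi, protos), joint @ Wr + br
```

The design difference shows up in the regression output: the independent head predicts one set of box deltas per proposal, whereas the shared head predicts one set per (proposal, category) pair.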
3.5 Reference images sampling
In our Comparison detector, we also need to choose some objects of each category as reference images. We first randomly select about 150 instances of each category from the training sets, keeping only instances whose shortest side is greater than 16 pixels. This gives a total of 1560 instances, from which we select suitable instances as our reference images.
There are three schemes. In the fixed mode, we randomly choose 3 instances of each category (this number is limited by GPU memory) as the reference images. The second, random mode, randomly selects 5 candidates per category from these objects; during training, the model randomly selects three of the five candidates as templates, but uses all five in testing. In the last method, we map the 1560 objects to the feature space through the baseline model to obtain the features of each object, then use t-SNE (t-distributed Stochastic Neighbor Embedding) for dimension reduction (Fig. 4). Finally, we select representative objects in the 3D space as our reference images.
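The fixed-mode sampling amounts to drawing a small per-category subset without replacement; a trivial sketch, where `pool` (a hypothetical name) maps each category to its candidate instance ids:

```python
import numpy as np

def sample_references(pool, k, rng):
    """Fixed mode: pick k reference instance ids per category at random."""
    return {c: sorted(rng.choice(ids, size=k, replace=False))
            for c, ids in pool.items()}
```

Random mode would redraw a size-3 subset from 5 candidates each training iteration, and the t-SNE scheme replaces the random draw with a manual pick of representative points in the embedded space.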
4 Experiment and Result
4.1 Materials and experiments
Cervical cancer is one of the most common gynecological malignant tumors. The main screening method for cervical cancer is liquid-based cytology. The slides are digitized to form our cervical cancer dataset (CCD). This dataset has 11 categories: ascus (ASC-US), asch (ASC-H), lsil (low-grade squamous intraepithelial lesion), hsil (high-grade squamous intraepithelial lesion), scc (squamous-cell carcinoma), agc (atypical glandular cells), trich, cand (candida), flora, herps, and actin (actinomyces), where the last five classes are microorganisms. They are highly important for the examination of cervical cancer. We use deep-learning-based object detection to classify and locate these objects; in this way, the possibility of cervical cancer can be diagnosed. We divide the dataset into a training set containing 6667 images and a test set containing 419 images. We randomly chose about 762 images from the training set to form a small dataset. The number of objects in each category of each dataset is shown in Fig. 5.
In medical images, annotators are prone to use a higher threshold when labeling objects due to their low discrimination. At the same time, multiple nearby objects of the same category may be marked as one, so the performance of the model cannot be well reflected by mAP alone. Therefore, the performance of the model is evaluated on the CCD test set using mAP, with AR as a supplement. If the mAP does not decrease and the AR improves, the performance has surely improved. We re-scale the reference images to a fixed size consistent with the pre-trained model. In all experiments, we use ResNet50 as the backbone network with an ImageNet pre-trained model. The initial learning rate is 0.001, decreased by a factor of 10 at the 35th and 50th epochs; training stops after 60 epochs, and the other parameters are the same as in FPN. All ablation experiments are first trained on the full dataset to guarantee performance with sufficient data. In our setting, the reference images are kept the same across training iterations for the stability of model training, and likewise in the test stage. A summary of results can be found in Table 1.
4.2 Reference background
We first evaluate our scheme for learning the background reference. Experiments show that our method is feasible (see model A in Table 1). It should be noted that, because the prototype representation of the background category is learned from the prototype representations of the other categories, gradient propagation will affect the optimization of the other prototype representations. To determine whether this effect is beneficial, we stop gradient propagation at the fork position in Fig. 2. The performance of the model declines, with an mAP of 33.0% and an AR of 52.6%, showing that the effect is beneficial. Besides generating the background prototype, we could also remove it, but experiments show it is very important for optimization: without it, the model fails to converge.
4.3 Prototype representations of categories
In our approach, as shown in Eq. 6, we use all pyramid features to generate the prototype representations of categories. Another choice is to use only the last pyramid level as the category prototype. As shown by the results of model A versus model B and model C versus model D in Table 1, our method is better, because it combines features from multiple pyramid levels, which not only carry rich semantics but also account for objects of different sizes. We also tried fusing the different pyramid levels with an LSTM, but the speed is greatly reduced.
4.4 Head model
As mentioned before, in the independent module the box regression function is the same as in the baseline model, because experiments found that removing one layer worsens the results. The results show that the shared module (model C) performs much better than the independent module (model A). Furthermore, we drop the operation of refining the bounding box from the head. As shown in Table 1, surprisingly, model E is better than model A. From this, we infer that classification should be weighted more heavily than bounding box regression in our model. We therefore add a weight coefficient to the classification loss of the model's head and select its value by fine-tuning. The results of models F, G, and H in Table 1 confirm our analysis. By analyzing models G and H, we find that the difference between them is not only that classification and box regression are independent, but also that the comparison classifier of model G is semi-parametric. After changing the comparison classifier of model H to be semi-parametric (Fig. 1(b)), the results show that it is better than model G.
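The re-weighted head loss can be sketched as follows. The L1 box loss and the numeric weight values are stand-ins; the coefficient actually chosen by fine-tuning is not reproduced here:

```python
import numpy as np

def softmax_ce(logits, label):
    """Cross-entropy of a softmax over class logits."""
    z = logits - logits.max()
    logp = z - np.log(np.exp(z).sum())
    return -logp[label]

def head_loss(logits, label, box_pred, box_target, cls_weight):
    """Head loss with an up-weighted classification term (cls_weight > 1)."""
    l_box = np.abs(box_pred - box_target).sum()  # L1 box loss as a stand-in
    return cls_weight * softmax_ce(logits, label) + l_box
```

Setting `cls_weight` above 1 shifts the optimization toward the comparison classifier, matching the inference that classification matters more than box regression in this model.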
Table 2: comparators evaluated: L2-distance, L2-distance + parameters, concat + parameters.
Table 3: reference sampling methods evaluated: fixed mode, random mode, t-SNE.
4.5 Optimizing comparison classifier
There are three comparison classifiers. The first measures similarity directly using the L2-distance, i.e.

$$M(r_j, p_c) = \mathrm{avg}\big((r_j - p_c)^2\big),$$
where $\mathrm{avg}$ represents the averaging function over the tensor. The second is a parameterized L2-distance, in which learnable parameters are applied to the squared differences. Third, similar to Relation Network, we also try to let the model learn the metric function (concatenation followed by parameters) instead of designing it manually. According to the results in Table 2, our model ultimately adopts the parameterized L2-distance. With the classification loss weight of Section 4.4 applied, the results are shown in brackets. Combined with the results shown in Table 1, this trick universally improves performance in our model, so we adopt it in all the following experiments.
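The three comparators can be sketched in NumPy as follows; the exact parameterizations (per-dimension weights for the parameterized L2, a small ReLU MLP for the learned metric) are illustrative assumptions:

```python
import numpy as np

def l2_score(r, p):
    """Plain L2 comparator (negated so larger means more similar)."""
    return -np.mean((r - p) ** 2)

def param_l2_score(r, p, w):
    """Parameterized L2: a learnable per-dimension weight w."""
    return -np.mean(w * (r - p) ** 2)

def learned_score(r, p, W1, b1, W2, b2):
    """Concat + parameters: a small MLP learns the metric."""
    h = np.maximum(np.concatenate([r, p]) @ W1 + b1, 0.0)
    return float(h @ W2 + b2)
```

With all weights set to one, the parameterized L2 reduces to the plain L2, so it can only do better after training; the fully learned metric trades this inductive bias for flexibility.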
4.6 Reference images sampling
To eliminate the effect of randomness, we ran three experiments with model A, each randomly selecting 3 objects. The difference between the results is less than 0.5 mAP points, which shows the robustness of the Comparison detector. To save time, all other experiments use a single run as the performance of the model. The performance of t-SNE sampling is better, as shown in Table 3. In the t-SNE experiment, the hyper-parameters are a perplexity of 30, a learning rate of 1, and 10 for label supervision.
As shown in Table 4, the Comparison detector has almost the same mAP as the baseline model when trained on the full dataset, but improves the AR by nearly 4 points. We consider this better than the state-of-the-art model: due to the special annotation situation described in Section 4.1, some correct predictions may be counted as false positives, so the AR increases while the mAP does not. Notably, the baseline model does not use the trick of balanced loss, because it degrades performance to an mAP of 43% and an AR of 56.1%. When trained on the small dataset, the Comparison detector is completely superior to the baseline model: it achieves a top result on the CCD test set with an mAP of 26.3% compared to 6.6%, which indicates that our method alleviates the over-fitting problem to some extent. In terms of model size and time efficiency, it is similar to the baseline. The prototype representations in this model are generated from reference images, but in fact they could be generated in any way, such as with an external memory. In future work, we expect a better solution for generating the prototype representations.
In this work, we focus on the generalization of object detection models learned on small datasets. Our research addresses small datasets, not large datasets with few annotations. Based on the state-of-the-art object detection model, we propose the Comparison detector and solve several important problems affecting generalization. It improves the mAP by 19.7 points when trained on the small dataset, and it also has comparable performance to the baseline model on larger datasets. These results show that introducing the idea of comparison into an object detection model is effective: our method alleviates the problem of over-fitting and improves the performance of the model. Our research not only takes object detection a big step forward in the small-dataset scenario, but also provides a good foundation for further work.
-  D. Amodei, S. Ananthanarayanan, R. Anubhai, J. Bai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, Q. Cheng, G. Chen, et al., Deep speech 2: End-to-end speech recognition in english and mandarin, in: International Conference on Machine Learning, 2016, pp. 173–182.
-  W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, G. Zweig, Achieving human parity in conversational speech recognition, arXiv preprint arXiv:1610.05256.
-  Y. Kim, Y. Jernite, D. Sontag, A. M. Rush, Character-aware neural language models., in: AAAI, 2016, pp. 2741–2749.
-  S. Levine, C. Finn, T. Darrell, P. Abbeel, End-to-end training of deep visuomotor policies, The Journal of Machine Learning Research 17 (1) (2016) 1334–1373.
-  A. A. Rusu, M. Vecerik, T. Rothörl, N. Heess, R. Pascanu, R. Hadsell, Sim-to-real robot learning from pixels with progressive nets, arXiv preprint arXiv:1610.04286.
-  L. A. Gatys, A. S. Ecker, M. Bethge, A neural algorithm of artistic style, arXiv preprint arXiv:1508.06576.
-  K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, Imagenet: A large-scale hierarchical image database, in: Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, Ieee, 2009, pp. 248–255.
-  S. Ren, K. He, R. Girshick, J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, in: Advances in neural information processing systems, 2015, pp. 91–99.
-  T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, S. J. Belongie, Feature pyramid networks for object detection, in: CVPR, Vol. 1, 2017, p. 4.
-  F. Liu, L. Yang, A novel cell detection method using deep convolutional neural network and maximum-weight independent set, in: Deep Learning and Convolutional Neural Networks for Medical Image Computing, Springer, 2017, pp. 63–72.
-  V. Gulshan, L. Peng, M. Coram, M. C. Stumpe, D. Wu, A. Narayanaswamy, S. Venugopalan, K. Widner, T. Madams, J. Cuadros, et al., Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs, Jama 316 (22) (2016) 2402–2410.
-  B. Kong, X. Wang, Z. Li, Q. Song, S. Zhang, Cancer metastasis detection via spatially structured deep network, in: International Conference on Information Processing in Medical Imaging, Springer, 2017, pp. 236–248.
-  Y. Liang, Z. Tang, M. Yan, J. Liu, Object detection based on deep learning for urine sediment examination, Biocybernetics and Biomedical Engineering.
-  G. Koch, R. Zemel, R. Salakhutdinov, Siamese neural networks for one-shot image recognition, in: ICML Deep Learning Workshop, Vol. 2, 2015.
-  O. Vinyals, C. Blundell, T. Lillicrap, D. Wierstra, et al., Matching networks for one shot learning, in: Advances in Neural Information Processing Systems, 2016, pp. 3630–3638.
-  J. Snell, K. Swersky, R. Zemel, Prototypical networks for few-shot learning, in: Advances in Neural Information Processing Systems, 2017, pp. 4077–4087.
-  F. S. Y. Yang, L. Zhang, T. Xiang, P. H. Torr, T. M. Hospedales, Learning to compare: Relation network for few-shot learning, in: Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 2018.
-  A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, T. Lillicrap, One-shot learning with memory-augmented neural networks, arXiv preprint arXiv:1605.06065.
-  S. Ravi, H. Larochelle, Optimization as a model for few-shot learning.
-  A. Graves, G. Wayne, I. Danihelka, Neural turing machines, arXiv preprint arXiv:1410.5401.
-  R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
-  R. Girshick, Fast r-cnn, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
-  K. He, G. Gkioxari, P. Dollár, R. Girshick, Mask r-cnn, in: Computer Vision (ICCV), 2017 IEEE International Conference on, IEEE, 2017, pp. 2980–2988.
-  J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, Y. Wei, Deformable convolutional networks, CoRR, abs/1703.06211 1 (2) (2017) 3.
-  Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, J. Sun, Light-head r-cnn: In defense of two-stage object detector, arXiv preprint arXiv:1711.07264.
-  X. Dong, L. Zheng, F. Ma, Y. Yang, D. Meng, Few-example object detection with model communication, arXiv preprint arXiv:1706.08249.
-  Y. Yang, G. Shu, M. Shah, Semi-supervised learning of feature hierarchies for object detection in a video, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 1650–1657.
-  Y.-X. Wang, M. Hebert, Model recommendation: Generating object detectors from few samples, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1619–1628.
-  E. Schwartz, L. Karlinsky, J. Shtok, S. Harary, M. Marder, S. Pankanti, R. Feris, A. Kumar, R. Giryes, A. M. Bronstein, Repmet: Representative-based metric learning for classification and one-shot object detection, arXiv preprint arXiv:1806.04728.
-  P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner, et al., Relational inductive biases, deep learning, and graph networks, arXiv preprint arXiv:1806.01261.
-  L. van der Maaten, G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research 9 (Nov) (2008) 2579–2605.
-  S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (8) (1997) 1735–1780.