Modern object detectors can rarely achieve short training time, fast inference speed, and high accuracy at the same time. To strike a balance among them, we propose the Training-Time-Friendly Network (TTFNet). In this work, we start with light-head, single-stage, and anchor-free designs, which enable fast inference speed. Then, we focus on shortening training time. We notice that producing more high-quality samples plays a similar role to increasing the batch size, which helps enlarge the learning rate and accelerate the training process. To this end, we introduce a novel method using Gaussian kernels to produce training samples. Moreover, we design the initiative sample weights for better information utilization. Experiments on MS COCO show that our TTFNet has great advantages in balancing training time, inference speed, and accuracy. It reduces training time by more than seven times compared to previous real-time detectors while maintaining state-of-the-art performance. In addition, our super-fast versions of TTFNet-18 and TTFNet-53 can outperform SSD300 and YOLOv3 using less than one-tenth of their training time, respectively. Code has been made available at <https://github.com/ZJULearning/ttfnet>.
Accuracy, inference speed, and training time of object detectors have received wide attention and have been continuously improved. However, little work strikes a good balance among them. Intuitively, detectors with faster inference speed should have a shorter training time. In fact, most real-time detectors require longer training time than non-real-time ones. High-accuracy detectors can thus be roughly classified into two types: those that suffer from slow inference speed, and those that require a large amount of training time.
The first type of networks [Ren et al.2015, Lin et al.2017b, Tian et al.2019] generally rely on the heavy detection head or complex post-processing. Although these designs are beneficial for accuracy improvement and fast convergence, they significantly slow down the inference speed. Therefore, this type of networks is typically not suitable for real-time applications.
To speed up the inference, researchers strive to simplify the detection head and post-processing while retaining accuracy [Liu et al.2016, Redmon and Farhadi2018]. In a recent study named CenterNet [Zhou, Wang, and Krähenbühl2019], the inference time is further shortened to almost the same as the time consumed by the backbone network. However, all these networks inevitably require long training time. This is because the simplification makes them difficult to train, so they depend heavily on data augmentation and long training schedules. For example, CenterNet needs 140 epochs of training on the public dataset MS COCO [Lin et al.2014]. In contrast, the first type of networks usually requires 12 epochs.
In this work, we focus on shortening the training time while retaining state-of-the-art real-time detection performance. A previous study [Goyal et al.2017] has reported that a larger learning rate can be adopted if the batch size is larger, and that the two follow a linear relationship under most conditions. We note that producing more high-quality training samples from annotated boxes is similar to enlarging the batch size, and we further verify it through experiments. Since the time spent on producing samples and calculating losses is negligible compared with that of feature extraction, we can attain faster convergence with essentially no additional overhead. In contrast, CenterNet, which merely focuses on the object center, loses the opportunity to use the information near the center to produce more training samples. This design is confirmed to be the cause of slow convergence according to our experiments.
To shorten the training time, we propose a novel method using Gaussian kernels to produce high-quality samples for training. It allows the network to make better use of the annotated boxes to produce more supervised signals, which provides the prerequisite for increasing the learning rate. Specifically, a sub-area around the object center is constructed via the kernel, and then dense training samples are extracted from this area. Besides, the Gaussian probabilities are treated as the weights of samples to emphasize those samples near the object center, and we further apply appropriate normalization to take advantage of more information provided by large boxes and retain the information given by small boxes. Our approach can alleviate the intractable ambiguity typically found in anchor-free detectors without the need for multi-scale features. Moreover, it does not require any offset predictions to aid in correcting the results, which is effective, unified, and intuitive.
Together with the light-head, single-stage, and anchor-free designs, this paper presents the first object detector that achieves a good balance between training time, inference speed, and accuracy. Our TTFNet reduces training time by more than seven times compared to CenterNet and other popular real-time detectors while retaining state-of-the-art performance. Besides, the super-fast versions of TTFNet-18 and TTFNet-53 achieve 25.9 AP / 112 FPS after only 1.8 hours and 32.9 AP / 55 FPS after 3.1 hours of training on 8 GTX 1080Ti. As far as we know, this is currently the shortest training time to reach these performances on MS COCO.
Our contributions can be summarized as follows:
We are the first to discuss and validate the similarity between the batch size and the number of high-quality samples produced by annotated boxes. Further, we experimentally verify the reason for the slow convergence of advanced real-time detector CenterNet.
We propose a novel method using Gaussian kernels to produce training samples for regression in anchor-free detectors. It outperforms previous designs, and it can alleviate the ambiguity that usually exists in anchor-free detectors.
Without bells and whistles, our detector can reduce training time by more than seven times compared to previous real-time detectors while keeping state-of-the-art real-time detection performance.
The proposed detector is friendly to researchers, especially for those with limited computing resources. Besides, it can be easily extended to 3D detection tasks, and it is suitable for training time-sensitive tasks such as Neural Architecture Search (NAS).
YOLO [Redmon et al.2016] and SSD [Liu et al.2016] achieve satisfying results and make the single-stage network gain attention. Since focal loss is proposed [Lin et al.2017b] to solve the imbalance between positive and negative examples, single-stage detectors are considered promising to achieve similar accuracy as two-stage ones. However, after that, the accuracy of single-stage ones stagnates for a long time until CornerNet [Law and Deng2018] is introduced. CornerNet is a keypoint-based single-stage detector, which outperforms a range of two-stage detectors in accuracy, opening a new door for the object detection task.
DenseBox [Huang et al.2015] is the first anchor-free detector, and then UnitBox [Yu et al.2016] upgrades DenseBox for better performance. YOLO is the first successful universal anchor-free detector. However, anchor-based methods [Ren et al.2015, Liu et al.2016] can achieve higher recalls, which offers more potential for performance improvement. Thus, YOLOv2 [Redmon and Farhadi2017] abandons the previous anchor-free design and adopts the anchor design. Yet, CornerNet brings the anchor-free designs back into spotlight. Recently proposed CenterNet [Duan et al.2019] reduces the false detection in CornerNet, which further improves the accuracy. Apart from corner-based anchor-free design, many anchor-free detectors relying on Feature Pyramid Network (FPN) [Lin et al.2017a] are proposed such as FCOS [Tian et al.2019] and FoveaBox [Kong et al.2019]. GARPN [Wang et al.2019a] and FSAF [Zhu, He, and Savvides2019] also adopt the anchor-free design in their methods. On the contrary, CenterNet [Zhou, Wang, and Krähenbühl2019] does not rely on complicated decoding strategies or heavy head designs, which can outperform popular real-time detectors [Liu et al.2016, Redmon and Farhadi2018] while having faster inference speed.
We notice that producing more high-quality samples plays a similar role to increasing the batch size, as both provide more supervised signals for each training step. Reviewing the formulation of Stochastic Gradient Descent (SGD), the weight updating expression can be described as:

$$w_{t+1} = w_t - \eta \frac{1}{n} \sum_{x \in B} \nabla l(x, w_t) \tag{1}$$

where $w$ is the weight of the network, $B$ is a mini-batch sampled from the training set, $n = |B|$ is the mini-batch size, $\eta$ is the learning rate and $l(x, w)$ is the loss computed from the labeled image $x$.

As for object detection, the image $x$ may incorporate multiple annotated boxes, and these boxes will be encoded to samples $s \in S_x$. $m_x = |S_x|$ indicates the number of samples produced by all the boxes in image $x$. Therefore (1) can be formulated as:

$$w_{t+1} = w_t - \eta \frac{1}{n} \sum_{x \in B} \frac{1}{m_x} \sum_{s \in S_x} \nabla l(s, w_t) \tag{2}$$

Linear Scaling Rule is empirically found in [Goyal et al.2017]. It claims that the learning rate should be multiplied by $k$ if the batch size is multiplied by $k$, unless the network is rapidly changing, or a very large mini-batch is adopted. Namely, executing $k$ iterations with small mini-batches $B_j$ and learning rate $\eta$ is basically equivalent to executing $1$ iteration with the large mini-batch $\cup_{j \in [0, k)} B_j$ and learning rate $k\eta$, only if we can assume $\nabla l(x, w_t) \approx \nabla l(x, w_{t+j})$ for $x \in B_j$. This condition is usually met under large-scale, real-world data.

Treating sample $s$ as image $x$, we have the mini-batch size $\sum_{x \in B} m_x$. Although the samples $s \in S_x$ have a strong correlation, they are still able to contribute information with differences. We can qualitatively draw a similar conclusion: when the number of high-quality training samples in each mini-batch is multiplied by $k$, multiply the learning rate by $k'$, where $1 \le k' \le k$ and it is decided by the quality of the training samples.
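The equivalence behind the Linear Scaling Rule can be checked numerically with a toy sketch. It assumes the gradients do not change between iterations (the condition stated above) and that all small mini-batches have the same size; the function names and the gradient values are hypothetical and serve only as illustration.

```python
def sgd_step(w, grads, lr):
    """One SGD step on a scalar weight: w <- w - lr * mean(grads)."""
    return w - lr * sum(grads) / len(grads)


def k_small_steps(w, batches, lr):
    """Run one SGD step per small mini-batch, all with learning rate lr."""
    for grads in batches:
        w = sgd_step(w, grads, lr)
    return w


def one_large_step(w, batches, lr):
    """Merge the k mini-batches and take a single step with learning rate k * lr."""
    merged = [g for grads in batches for g in grads]
    return sgd_step(w, merged, len(batches) * lr)


# Per-sample gradients, assumed fixed across iterations (toy values).
batches = [[0.5, 1.5], [2.0, 0.0], [1.0, 3.0]]
w_small = k_small_steps(1.0, batches, lr=0.1)
w_large = one_large_step(1.0, batches, lr=0.1)
print(w_small, w_large)
```

Under these assumptions both procedures reach the same weight. In real training the gradients drift between steps, which is why the rule holds only approximately.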
CenterNet [Zhou, Wang, and Krähenbühl2019], which is several times faster than other popular detectors in inference, suffers from long training time. We note that it merely focuses on the object center and produces few samples during training. Besides, it relies on complex data augmentations, which can increase data diversity but will reduce sample quality to some extent. According to our conclusion above, its learning rate will be limited, and thus it cannot converge quickly.
To verify our speculation, we adopt ResNet-18 [He et al.2016] as the backbone and conduct experiments on CenterNet. Since its learning rate is relatively low, we increase the learning rate by 1.5 times. As shown in Figure 1, we observe a degradation problem: CenterNet-R18 with the 1.5x learning rate has lower AP during the whole training procedure. Further, to avoid the impact of the extreme data distribution produced by the data augmentation, we remove the augmentation and keep the increased learning rate. Although the network reaches higher precision in the first several epochs, it rapidly overfits. The results show that directly increasing the learning rate of CenterNet for faster convergence is infeasible. Besides, due to the inefficient use of annotated information, CenterNet has a strong dependence on complex data augmentation and a long training schedule.
In this work, we presume that a better strategy of producing training samples can address this dilemma, and thus, we propose our novel approach in the next section. More comprehensive experiments in our ablation study can further validate the superiority of our approach.
CenterNet divides object detection into two subtasks: center localization and size regression. For localization, it adopts the Gaussian kernel as in CornerNet to produce a heat-map, enabling the network to gradually produce higher activations near the object center during training. For regression, it defines the object center as a training sample and directly predicts the height and width of the object. It also predicts the offset to recover the discretization error caused by the output stride. Since the network can produce higher activations near the object center in inference, the time-consuming NMS can be replaced by other components with little overhead.
In order to eliminate the need for the NMS, we adopt a similar strategy for localization. Specifically, we further consider the aspect ratio of the box since the strategy of CenterNet that does not consider it is obviously sub-optimal.
As for regression, mainstream approaches treat pixels in the whole box [Tian et al.2019] or the sub-rectangle area of the box [Kong et al.2019] as training samples. We propose to treat all pixels in a Gaussian-area as training samples. Besides, weights calculated from the object size and the Gaussian probability are applied to these samples for better information utilization. Note that our approach does not require any other predictions to help correct the error, as shown in Figure 2, which makes it simpler and more effective.
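Given an image, our network separately predicts features $\hat{H} \in R^{N \times C \times \frac{H}{r} \times \frac{W}{r}}$ and $\hat{S} \in R^{N \times 4 \times \frac{H}{r} \times \frac{W}{r}}$. The former is used to indicate where the object center may exist, and the latter is used to attain the information related to the object size. $N$, $C$, $H$, $W$, $r$ are batch size, number of categories, the height and width of the input image, and output stride. We set $C = 80$ and $r = 4$ in our experiments, and we omit $N$ later for simplicity. Gaussian kernels are used in both localization and regression in our approach, and we define scalars $\alpha$ and $\beta$ to control the kernel size, respectively.
Given the $m$-th ground truth box, which belongs to category $c_m$, it is first linearly mapped to the feature-map scale. Then, the 2D Gaussian kernel $K_m(x, y) = \exp\left(-\frac{(x - x_0)^2}{2\sigma_x^2} - \frac{(y - y_0)^2}{2\sigma_y^2}\right)$ is adopted to produce $H_m \in R^{1 \times \frac{H}{r} \times \frac{W}{r}}$, where $\sigma_x = \frac{\alpha w}{6}$, $\sigma_y = \frac{\alpha h}{6}$. The produced $H_m$ is decided by the parameter $\alpha$, center location $(x_0, y_0)_m$, and box size $(h, w)_m$. Since the object center may be between pixels, we use $(\lfloor \frac{x}{r} \rfloor, \lfloor \frac{y}{r} \rfloor)$ to force the center to be in the pixel as in CenterNet. Then, we update the $c_m$-th channel in $H$ by applying element-wise maximum with $H_m$. $\alpha = 0.54$ is set in our network, and it's not carefully selected.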
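The construction of the localization target can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the standard deviations are taken as `alpha * size / 6`, the default `alpha=0.54` and all function names are placeholders, and the box is assumed to be already mapped to feature-map scale.

```python
import math


def gaussian_heatmap(feat_h, feat_w, center_x, center_y, box_w, box_h, alpha=0.54):
    """Heat-map (list of rows) for one box given on feature-map scale."""
    # Assumed scaling: the kernel follows the box aspect ratio.
    sigma_x = alpha * box_w / 6.0
    sigma_y = alpha * box_h / 6.0
    # Snap the (possibly fractional) center to a pixel.
    x0, y0 = math.floor(center_x), math.floor(center_y)
    heat = []
    for y in range(feat_h):
        row = []
        for x in range(feat_w):
            exponent = ((x - x0) ** 2) / (2 * sigma_x ** 2) \
                + ((y - y0) ** 2) / (2 * sigma_y ** 2)
            row.append(math.exp(-exponent))
        heat.append(row)
    return heat


def merge_into_channel(channel, heat):
    """Update a category channel with the element-wise maximum."""
    return [[max(a, b) for a, b in zip(row_c, row_h)]
            for row_c, row_h in zip(channel, heat)]


# A wide box (40 x 10 on feature-map scale) centered near (32.6, 20.3).
heat = gaussian_heatmap(64, 64, 32.6, 20.3, box_w=40, box_h=10)
channel = merge_into_channel([[0.0] * 64 for _ in range(64)], heat)
```

Because the kernel follows the box shape, the activation of a wide box decays more slowly along x than along y, which is the aspect-ratio awareness discussed above.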
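The peak of the Gaussian distribution, which is also the box center, is treated as the positive target while any other pixel is treated as a negative target. The punishments on those negative targets that correspond to larger distribution values will be lighter. We use the modified focal loss as in [Law and Deng2018, Zhou, Wang, and Krähenbühl2019].
Given the prediction $\hat{H}$ and localization target $H$, we have

$$L_{loc} = -\frac{1}{M} \sum_{x, y, c} \begin{cases} (1 - \hat{H}_{xyc})^{\alpha_f} \log(\hat{H}_{xyc}) & \text{if } H_{xyc} = 1 \\ (1 - H_{xyc})^{\beta_f} \hat{H}_{xyc}^{\alpha_f} \log(1 - \hat{H}_{xyc}) & \text{otherwise} \end{cases}$$

where $\alpha_f$ and $\beta_f$ are the hyper-parameters of the focal loss and its modified version, respectively, and $M$ is the number of annotated boxes.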
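As a rough illustration of the modified focal loss used in CornerNet and CenterNet, the sketch below works on flattened prediction and target lists. The exponents `alpha_f=2` and `beta_f=4` are the defaults commonly used with this loss and are assumptions here, as are the function name and the clamping constant `eps`.

```python
import math


def modified_focal_loss(pred, target, num_boxes, alpha_f=2.0, beta_f=4.0, eps=1e-12):
    """Penalty-reduced focal loss over flattened heat-map values."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        if t == 1.0:
            # Positive target: the Gaussian peak at the box center.
            total += (1.0 - p) ** alpha_f * math.log(p)
        else:
            # Negative target: the penalty shrinks as the target value approaches 1.
            total += (1.0 - t) ** beta_f * p ** alpha_f * math.log(1.0 - p)
    return -total / max(num_boxes, 1)


target = [1.0, 0.8, 0.0]
good = modified_focal_loss([0.99, 0.5, 0.01], target, num_boxes=1)
bad = modified_focal_loss([0.10, 0.5, 0.90], target, num_boxes=1)
print(good, bad)
```

A prediction that is confident at the peak and quiet elsewhere yields a much smaller loss than one that misses the peak and fires on background.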
Given the $m$-th ground truth box on the feature-map scale, another Gaussian kernel is adopted to produce $S_m \in R^{1 \times \frac{H}{r} \times \frac{W}{r}}$. The kernel size is controlled by $\beta$ as mentioned above. The non-zero part in $S_m$ is named the Gaussian-area $A_m$, as shown in Figure 3. Since $A_m$ is always inside the $m$-th box, it is also named the sub-area in the rest of the paper.
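Each pixel in the sub-area is treated as a training sample. Given pixel $(i, j)$ in the area $A_m$, the training target is defined as the distances from $(ir, jr)$ to the four sides of the $m$-th box, represented as a 4-dim vector $(w_l, h_t, w_r, h_b)^m_{ij}$. Therefore, the predicted box at $(i, j)$ can be represented as:

$$\hat{x}_1 = ir - \hat{w}_l s, \quad \hat{y}_1 = jr - \hat{h}_t s, \quad \hat{x}_2 = ir + \hat{w}_r s, \quad \hat{y}_2 = jr + \hat{h}_b s \tag{5}$$

where $r$ is the output stride as mentioned above, and $s$ is a fixed scalar used to enlarge the predicted results for easier optimization. $s = 16$ is set in our experiments. Note that the predicted box $(\hat{x}_1, \hat{y}_1, \hat{x}_2, \hat{y}_2)$ is on image scale rather than feature-map scale.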
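The decoding step can be sketched as follows. The defaults `r=4` and `s=16` and the function name are assumptions of this illustration; `i` indexes the x direction and `j` the y direction on the feature map.

```python
def decode_box(i, j, pred, r=4, s=16):
    """Decode the four predicted distances at feature-map pixel (i, j).

    pred is (w_l, h_t, w_r, h_b); the returned box (x1, y1, x2, y2)
    is on image scale.
    """
    w_l, h_t, w_r, h_b = pred
    cx, cy = i * r, j * r  # pixel position mapped back to image scale
    return (cx - w_l * s, cy - h_t * s, cx + w_r * s, cy + h_b * s)


print(decode_box(10, 5, (1.0, 0.5, 2.0, 1.0)))  # (24.0, 12.0, 72.0, 36.0)
```

Since the box is decoded directly from the pixel position, no separate offset prediction is needed to place it on the image.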
If a pixel is not contained in any sub-area, it will be ignored during training. If a pixel is contained in multiple sub-areas, i.e., it is an ambiguous sample, its training target corresponds to the object with the smaller area.
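One simple way to implement this assignment rule is to write the sub-areas from the largest box to the smallest, so that smaller boxes overwrite larger ones. The sketch below is a hypothetical illustration; representing each sub-area as a set of pixels is an assumption made for brevity.

```python
def assign_pixels(sub_areas, box_areas):
    """Map each pixel to the index of the box that supervises it.

    sub_areas: list of sets of (i, j) pixels, one set per ground truth box.
    box_areas: list of box areas in the same order.
    Pixels outside every sub-area are absent from the result (ignored).
    """
    owner = {}
    order = sorted(range(len(sub_areas)), key=lambda m: box_areas[m], reverse=True)
    for m in order:
        for pixel in sub_areas[m]:
            owner[pixel] = m  # later (smaller) boxes overwrite earlier (larger) ones
    return owner


sub_areas = [{(0, 0), (1, 1)}, {(1, 1), (2, 2)}]
owner = assign_pixels(sub_areas, box_areas=[100.0, 25.0])
print(owner[(1, 1)])  # 1: the ambiguous pixel goes to the smaller box
```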
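Given the prediction $\hat{S}$ and regression target $S$, we gather training targets $S' \in R^{N_{reg} \times 4}$ from $S$ and the corresponding prediction results $\hat{S}'$ from $\hat{S}$, where $N_{reg}$ stands for the number of regression samples. For all these samples, we decode the predicted boxes and corresponding ground truth boxes of samples as in (5), and GIoU [Rezatofighi et al.2019] is used for regression loss calculation.

$$L_{reg} = \frac{1}{N_{reg}} \sum_{(i, j) \in A_m} \mathrm{GIoU}(\hat{B}_{ij}, B_m) \times W_{ij}$$

where $\mathrm{GIoU}(\cdot, \cdot)$ denotes the GIoU loss between two boxes, $\hat{B}_{ij}$ stands for the decoded box $(\hat{x}_1, \hat{y}_1, \hat{x}_2, \hat{y}_2)_{ij}$ and $B_m = (x_1, y_1, x_2, y_2)_m$ is the corresponding $m$-th ground truth box on image scale. $W_{ij}$ is the sample weight, which is used to balance the loss contributed by each sample.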
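A minimal sketch of a weighted GIoU regression loss is given below. It assumes the common loss form `1 - GIoU` and normalization by the number of samples; the function names are placeholders, and boxes are `(x1, y1, x2, y2)` tuples on image scale.

```python
def giou(box_a, box_b):
    """Generalized IoU of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    area_a = max(ax2 - ax1, 0.0) * max(ay2 - ay1, 0.0)
    area_b = max(bx2 - bx1, 0.0) * max(by2 - by1, 0.0)
    inter_w = max(min(ax2, bx2) - max(ax1, bx1), 0.0)
    inter_h = max(min(ay2, by2) - max(ay1, by1), 0.0)
    inter = inter_w * inter_h
    union = area_a + area_b - inter
    # Smallest axis-aligned box enclosing both inputs.
    enclose = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    if union <= 0.0 or enclose <= 0.0:
        return 0.0  # degenerate boxes: treated as no overlap in this sketch
    return inter / union - (enclose - union) / enclose


def regression_loss(pred_boxes, gt_boxes, weights):
    """Weighted mean of (1 - GIoU) over all regression samples."""
    total = sum(w * (1.0 - giou(p, g))
                for p, g, w in zip(pred_boxes, gt_boxes, weights))
    return total / max(len(pred_boxes), 1)


print(giou((0, 0, 1, 1), (2, 0, 3, 1)))  # negative for disjoint boxes
```

Unlike plain IoU, GIoU stays informative for non-overlapping boxes, because the enclosing-box term still changes as the prediction moves toward the target.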
Due to the large scale variance of objects, large objects may produce thousands of samples, whereas small objects may only produce a few. After normalizing the loss contributed by all samples, the losses contributed by small objects become almost negligible, which harms the detection performance on small objects. Therefore, the sample weight $W_{ij}$ plays an important role in balancing losses. Suppose $(i, j)$ is inside the sub-area $A_m$ of the $m$-th ground truth box; we have:

$$W_{ij} = \begin{cases} \log(a_m) \times \frac{G_m(i, j)}{\sum_{(x, y) \in A_m} G_m(x, y)} & (i, j) \in A_m \\ 0 & (i, j) \notin A \end{cases}$$

where $G_m(i, j)$ is the Gaussian probability at $(i, j)$ and $a_m$ is the area of the $m$-th box. This scheme makes good use of the additional annotation information contained in large objects and preserves that of small objects. It also emphasizes the samples near the object center, reducing the effect of ambiguous samples.
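The weighting scheme can be sketched as below: Gaussian probabilities inside a sub-area are normalized to sum to one and then scaled by the logarithm of the box area. The function name and the dictionary representation of the sub-area are assumptions of this illustration.

```python
import math


def sample_weights(gaussian_probs, box_area):
    """Per-sample weights for one box.

    gaussian_probs: dict mapping each pixel (i, j) of the sub-area to its
    Gaussian probability. Pixels outside the sub-area get no weight.
    """
    total = sum(gaussian_probs.values())
    scale = math.log(box_area) / total
    return {pixel: g * scale for pixel, g in gaussian_probs.items()}


probs = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5}
weights = sample_weights(probs, box_area=100.0)
print(sum(weights.values()), math.log(100.0))
```

The weights of one box sum to the logarithm of its area, so a large box contributes more than a small one, but far less than its raw sample count would suggest.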
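The total loss $L$ is composed of the localization loss $L_{loc}$ and the regression loss $L_{reg}$, weighted by two scalars. Specifically, $L = w_{loc} L_{loc} + w_{reg} L_{reg}$, where $w_{loc} = 1.0$ and $w_{reg} = 5.0$ in our setting.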
The architecture of TTFNet is shown in Figure 2. We use ResNet and DarkNet [Redmon and Farhadi2018] as the backbone in our experiments. The features extracted by the backbone are up-sampled to 1/4 resolution of the original image. We use Modulated Deformable Convolution (MDCN) [Zhu et al.2019] and up-sample layers to achieve this, since the awareness of the object center requires a large receptive field; otherwise, plenty of false detections will occur.
Note that although our design is effective for large and medium-size objects, small objects may profit little from it because they inherently carry less annotated information. Thus, it is hard for the precision on small objects to go up rapidly during training, and there is no chance to achieve relatively high precision after a few training steps. To address this challenge, we build a slightly stronger feature extraction network by introducing high-resolution features from the low levels, which is conducive to small object detection. In particular, we introduce shortcut connections from shallow layers, and $3 \times 3$ convolution is used. The number of convolution layers is set to 3, 2, 1 for stage 2, 3, 4. Shortcut layers are followed by ReLU except for the last layer, and MDCN layers are followed by Batch Normalization (BN) [Ioffe and Szegedy2015] and ReLU.
The up-sampled features then separately go through two heads for different detection tasks. The localization head produces higher activations on those positions closer to the object center, while the regression head directly predicts the distance from those positions to the four sides of the box. Since the object center corresponds to the local maximum, we can safely suppress non-maximum values with the help of 2D max-pooling as in [Law and Deng2018, Zhou, Wang, and Krähenbühl2019]. Then we use the positions of the local maxima to gather regression results. Finally, the detection results can be attained.
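The peak extraction that replaces NMS can be illustrated with a pure-Python stand-in for max-pooling. The 3 x 3 window, the score threshold, and the function name are assumptions of this sketch; a real implementation would use a framework max-pooling operator on the whole tensor.

```python
def local_peaks(heat, threshold=0.1):
    """Return (x, y, score) for pixels that are maxima of their 3x3 window."""
    height, width = len(heat), len(heat[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            score = heat[y][x]
            if score < threshold:
                continue
            window = [heat[yy][xx]
                      for yy in range(max(y - 1, 0), min(y + 2, height))
                      for xx in range(max(x - 1, 0), min(x + 2, width))]
            if score >= max(window):  # pixel survives the max-pooling comparison
                peaks.append((x, y, score))
    return sorted(peaks, key=lambda p: p[2], reverse=True)


heat = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.2, 0.0],
    [0.1, 0.2, 0.1, 0.3],
    [0.0, 0.0, 0.3, 0.8],
]
print(local_peaks(heat))  # [(1, 1, 0.9), (3, 3, 0.8)]
```

The regression outputs at the returned positions would then be gathered and decoded into boxes.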
Our experiments are based on the challenging MS COCO 2017 benchmark. We use the Train split (115K images) for training and report the performances on Val split (5K images).
We use ResNet and DarkNet as the backbone for experiments. We resize the images to $512 \times 512$ and do not keep the aspect ratio. Only random flip is used for data augmentation in training. We use unfrozen BN but freeze all parameters in the stem and stage 1 of the backbone. For ResNet, the initial learning rate is 0.016, and the mini-batch size is 128. For DarkNet, the initial learning rate is 0.015, and the mini-batch size is 96. The learning rate is reduced by a factor of 10 at epochs 18 and 22, respectively. Our network is trained with SGD for 24 epochs, and for half as many epochs in the super-fast version. Weight decay and momentum are set to 0.0004 and 0.9, respectively. The weight decay for bias is set to 0, and the learning rate of the bias in the network is doubled. Warm-up is applied for the first 500 steps. We initialize our backbone networks with the weights pre-trained on ImageNet [Deng et al.2009]. Our experiments are based on the open source detection toolbox mmdetection [Kai Chen2019] with 8 GTX 1080Ti.
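The schedule described above can be sketched as a small function. The linear warm-up shape, the `warmup_ratio` start value, the zero-based epoch counting, and the `steps_per_epoch` value used in the example are all assumptions of this illustration; only the base learning rate, the 500 warm-up steps, and the decay epochs come from the text.

```python
def learning_rate(step, steps_per_epoch, base_lr=0.016, warmup_steps=500,
                  warmup_ratio=1.0 / 3.0, decay_epochs=(18, 22), gamma=0.1):
    """Step-decay schedule with linear warm-up (a sketch, not the exact recipe)."""
    epoch = step // steps_per_epoch
    # Multiply by gamma once for every decay epoch already reached.
    lr = base_lr * gamma ** sum(epoch >= e for e in decay_epochs)
    if step < warmup_steps:
        progress = step / warmup_steps
        lr *= warmup_ratio + (1.0 - warmup_ratio) * progress
    return lr


# Hypothetical 900 steps per epoch.
for step in (0, 500, 18 * 900, 22 * 900):
    print(step, learning_rate(step, steps_per_epoch=900))
```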
Super-fast TTFNet-53 is used in the experiments. The AP is reported on COCO 5k-val.
Each ground truth will produce multiple training samples, and it remains a problem of how to balance the weights of the losses contributed by each sample. Naturally treating all samples equally will lead to poor precision, especially for small objects detection as in Table 1. This is because the number of samples produced by large objects may be hundreds of times larger than that of small objects, making the losses contributed by small objects almost negligible.
Another naive method is to normalize the loss produced by each ground truth. Namely, if the $m$-th ground truth produces $n_m$ samples, all these samples have the same weight $\frac{1}{n_m}$. Still, it will lead to sub-optimal results. Small objects inherently contain a small amount of information but more noise, while large objects contain more highly relevant information with less noise. Paying equal attention to every ground truth during training is therefore unfair to large objects.
To utilize more information provided by large boxes and preserve the information of small boxes, we adopt the logarithm of the box area together with the normalized Gaussian probability as the sample weights. As shown in Table 1, our strategy handles the issues above well while retaining an intuitive and efficient form.
[Table 3. Columns: w/ Gaussian, w/ Aspect Ratio, AP, Ratio %.]
[Table 4. Columns: Stage 2, Stage 3, Stage 4, AP, FPS.]
|Method|Backbone|Size|FPS|Training time (h)|AP|AP50|AP75|APS|APM|APL|
|---|---|---|---|---|---|---|---|---|---|---|
|RetinaNet [Lin et al.2017b] *|R18-FPN|1330, 800|16.3|6.9|30.9|49.6|32.7|15.8|33.9|41.9|
|RetinaNet [Lin et al.2017b] *|R34-FPN|1330, 800|15.0|8.3|34.7|54.0|37.3|18.2|38.6|45.9|
|RetinaNet [Lin et al.2017b]|R50-FPN|1330, 800|12.0|11.0|35.8|55.4|38.2|19.5|39.7|46.6|
|FCOS [Tian et al.2019] *|R18-FPN|1330, 800|20.8|5.0|26.9|43.2|27.9|13.9|28.9|36.0|
|FCOS [Tian et al.2019] *|R34-FPN|1330, 800|16.3|6.0|32.2|49.5|34.0|17.2|35.2|42.1|
|FCOS [Tian et al.2019]|R50-FPN|1330, 800|15.0|7.8|36.6|55.8|38.9|20.8|40.3|48.0|
|SSD [Liu et al.2016]|VGG16|300, 300|44.0|21.4|25.7|43.9|26.2|6.9|27.7|42.6|
|SSD [Liu et al.2016]|VGG16|512, 512|28.4|36.1|29.3|49.2|30.8|11.8|34.1|44.7|
|YOLOv3 [Redmon and Farhadi2018]|D53|320, 320|55.7|26.4|28.2|-|-|-|-|-|
|YOLOv3 [Redmon and Farhadi2018]|D53|416, 416|46.1|31.6|31.0|-|-|-|-|-|
|YOLOv3 [Redmon and Farhadi2018]|D53|608, 608|30.3|66.7|33.0|57.9|34.4|18.3|25.4|41.9|
|CenterNet [Zhou, Wang, and Krähenbühl2019]|R18|512, 512|128.5|26.9|28.1|44.9|29.6|-|-|-|
|CenterNet [Zhou, Wang, and Krähenbühl2019]|R101|512, 512|44.7|49.3|34.6|53.0|36.9|-|-|-|
Sometimes multiple objects spatially overlap, and thus it is hard to define the regression targets in the overlapping area. This situation is called ambiguity in anchor-free designs. To alleviate it, previous works either place objects of different scales in different levels of the FPN [Tian et al.2019, Kong et al.2019], or simply produce a single training sample based on the ground truth box [Zhou, Wang, and Krähenbühl2019]. Since all pixels in the sub-area are treated as samples in TTFNet, we also face this problem. Given an annotated box, the relative size of the sub-area is defined by $\beta$. A larger $\beta$ indicates that more annotated information is available, but meanwhile, the ambiguity becomes more serious.
We use a more mundane form, i.e., a rectangle, as the sub-area to analyze the relationship between the precision and $\beta$. In particular, $\beta = 0$ means only the box center is treated as a regression sample, while $\beta = 1$ means all pixels in the rectangle box are treated as regression samples. We train a series of networks with $\beta$ changing from 0.01 to 0.9, and we try class-aware regression to find out whether it helps to reduce ambiguity and improve accuracy.
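The rectangular sub-area used in this analysis can be sketched as a box shrunk around its center. That the scalar scales the side lengths (rather than the area) is an assumption of this illustration, as is the function name.

```python
def rect_sub_area(box, beta):
    """Shrink a box (x1, y1, x2, y2) around its center by the factor beta."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    half_w = beta * (x2 - x1) / 2.0
    half_h = beta * (y2 - y1) / 2.0
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


box = (0, 0, 10, 20)
print(rect_sub_area(box, 0.0))  # collapses to the box center
print(rect_sub_area(box, 0.5))  # half the side lengths
print(rect_sub_area(box, 1.0))  # the whole box
```

With a factor of 0 only the center remains, and with a factor of 1 the sub-area coincides with the box, matching the two extremes described above.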
As shown in Table 2, for both class-agnostic and class-aware regression, the AP first rises and then falls as $\beta$ increases. The rise indicates that the annotated information near the object center also matters: the AP at a moderate $\beta$ is much higher than that at the smallest $\beta$. Therefore, the strategy of CenterNet that merely considers the object center is sub-optimal. The fall, in turn, indicates that the ambiguity has a negative impact. Besides, samples that are too far away from the object center are harmful, since class-aware regression with a large $\beta$ also suffers accuracy degradation even though its ambiguity is not too serious. Class-aware regression can reduce the ambiguity by five times, but its AP is consistently lower. We notice that class-aware networks converge much more slowly than class-agnostic ones during the initial training phase. Adopting other initialization methods does not help either. Thus we believe that this is due to the difficulty of optimization caused by the redundant parameters in the last layer.
We use Gaussian kernel to produce sub-area for training samples, and we compare this approach with the mundane rectangle sub-area. As shown in Table 3, the Gaussian sub-area can achieve better results, and the ambiguity decreases. Note that most ambiguous samples have lower sample weights, as shown in 3, which can further reduce the negative impact of ambiguity.
CenterNet adopts the same strategy as CornerNet to produce the heat-map without considering the aspect ratio of the box. We compare it with ours, as in Table 3. Whether we adopt the Gaussian sub-area or not, considering the box ratio can improve precision.
We introduce the shortcut connection for achieving higher precision. The number of convolution layers is set to 3, 2, 1 for stage 2, 3, 4. We also try other combinations, and the precision is listed in Table 4. Although deeper shortcuts can improve precision, they also lead to slower inference speed. Finally, we choose the combination of 3, 2, 1, and it is not carefully selected.
To verify the similarity between the batch size and the number of high-quality samples, we set a group of $\beta$ values, and adopt different learning rates and training schedules in this experiment. Note that a larger $\beta$ indicates more training samples, but also leads to more serious ambiguity, which is harmful to the training.
As shown in Table 5, we can observe that a larger $\beta$ allows a larger learning rate and better performance. Although more samples are accompanied by more ambiguous samples, increasing $\beta$ does help to increase the learning rate. Besides, the trend is more noticeable when $\beta$ is smaller since there are fewer ambiguous samples. In other words, having more high-quality samples is like enlarging the batch size, which helps to increase the learning rate further.
Our TTFNet adopts ResNet-18/34 and DarkNet-53 as the backbone, and they are marked as TTFNet-18/34/53. The results, as listed in Table 6, include AP, training time, and inference speed. Our network can be more than seven times faster than other real-time detectors in training time while achieving state-of-the-art results with real-time inference speed. Compared with SSD300, our super-fast TTFNet-18 can achieve slightly higher precision, but our training time is ten times less, and the inference is more than two times faster. As for YOLOv3, our TTFNet-53 can achieve 2 points higher precision in just one-tenth training time, and it’s almost two times faster than YOLOv3 in inference. The super-fast TTFNet-53 can reach the precision of YOLOv3 in just one-twentieth training time.
As for recently proposed anchor-free detectors, our TTFNet shows great advantages. FCOS can achieve high precision without requiring long training time, but its slow inference speed seriously limits its mobile application. We list the performance of advanced RetinaNet and FCOS with lighter backbones such as ResNet18 and ResNet34. Unfortunately, they cannot achieve comparable performance due to the heavy head design. As for the real-time detector CenterNet, it has very fast inference speed and high precision, but it requires long training time. Our TTFNet needs only one-seventh of the training time of CenterNet, and it is superior in balancing training time, inference speed, and accuracy.
We empirically show that more high-quality samples help enlarge the learning rate, and we propose a novel method of using Gaussian kernels to produce training samples. It is an elegant and effective solution for balancing training time, inference speed, and accuracy, and it opens up more possibilities for searching better architectures for specific detection tasks with the help of NAS [Zoph and Le2017, Zoph et al.2018, Ghiasi, Lin, and Le2019, Wang et al.2019b, Howard et al.2019].
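[Ioffe and Szegedy2015] Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, 448–456.
[Zoph and Le2017] Zoph, B., and Le, Q. V. 2017. Neural architecture search with reinforcement learning. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.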