TOOD: Task-aligned One-stage Object Detection, ICCV2021 Oral
One-stage object detection is commonly implemented by optimizing two sub-tasks: object classification and localization, using heads with two parallel branches, which might lead to a certain level of spatial misalignment in predictions between the two tasks. In this work, we propose a Task-aligned One-stage Object Detection (TOOD) that explicitly aligns the two tasks in a learning-based manner. First, we design a novel Task-aligned Head (T-Head) which offers a better balance between learning task-interactive and task-specific features, as well as a greater flexibility to learn the alignment via a task-aligned predictor. Second, we propose Task Alignment Learning (TAL) to explicitly pull closer (or even unify) the optimal anchors for the two tasks during training via a designed sample assignment scheme and a task-aligned loss. Extensive experiments are conducted on MS-COCO, where TOOD achieves a 51.1 AP at single-model single-scale testing. This surpasses the recent one-stage detectors by a large margin, such as ATSS (47.7 AP), GFL (48.2 AP), and PAA (49.0 AP), with fewer parameters and FLOPs. Qualitative results also demonstrate the effectiveness of TOOD for better aligning the tasks of object classification and localization. Code is available at https://github.com/fcjian/TOOD.
Object detection aims to localize and recognize objects of interest from natural images, and is a fundamental yet challenging task in computer vision. It is commonly formulated as a multi-task learning problem by jointly optimizing object classification and localization [5, 6, 15, 21, 26, 31]. The classification task is designed to learn discriminative features that focus on the key or salient parts of an object, while the localization task works on precisely locating the whole object with its boundaries. Due to the divergence of learning mechanisms for classification and localization, the spatial distributions of the features learned by the two tasks can be different, causing a certain level of misalignment when predictions are made by two separate branches.
[Figure 1: 'Result', 'Score', and 'IoU' illustrations referenced below.]
Recent one-stage object detectors attempted to predict consistent outputs of the two separate tasks by focusing on the center of an object [3, 9, 26, 29]. They assume that an anchor (i.e., an anchor-point for an anchor-free detector, or an anchor-box for an anchor-based detector) at the center of the object is likely to give more accurate predictions for both classification and localization. For example, the recent FCOS and ATSS both use a centerness branch to enhance classification scores predicted from the anchors near the center of the object, and assign larger weights to the localization loss for the corresponding anchors. Besides, FoveaBox regards the anchors inside a predefined central region of the object as positive samples. Such heuristic designs have achieved excellent results, but these methods might suffer from two limitations:
Recent one-stage detectors perform object classification and localization independently by using two separate branches in parallel (i.e., heads). Such a two-branch design might cause a lack of interaction between the two tasks, leading to inconsistent predictions. As shown in the 'Result' column in Figure 1, an ATSS detector recognizes an object of 'Dining table' (indicated by the anchor shown as a red patch), but localizes another object of 'Pizza' more accurately (red bounding box).
Most anchor-free detectors use a geometry-based assignment scheme to select anchor-points near the center of an object for both classification and localization [3, 9, 29], while anchor-based detectors often assign anchor-boxes by computing IoUs between the anchor boxes and ground truth [21, 22, 29]. However, the optimal anchors for classification and localization are often inconsistent, and may vary considerably depending on the shape and characteristics of the objects. The widely used sample assignment schemes are task-agnostic, and thus make it difficult to produce an accurate yet consistent prediction for the two tasks, as demonstrated in the 'Score' and 'IoU' distributions of ATSS in Figure 1. The 'Result' column also illustrates that the spatial location of the best localization anchor (green patch) may not be at the center of the object, and is not well aligned with the best classification anchor (red patch). As a result, a precise bounding box may be suppressed by a less accurate one during Non-Maximum Suppression (NMS).
To address such limitations, we propose a Task-aligned One-stage Object Detection (TOOD) that aims to align the two tasks more accurately by designing a new head structure with an alignment-oriented learning approach:
In contrast to the conventional head in one-stage object detection where classification and localization are implemented separately by using two branches in parallel, we design a Task-aligned head (T-head) to enhance an interaction between the two tasks. This allows the two tasks to work more collaboratively, which in turn aligns their predictions more accurately. T-head is conceptually simple: it computes task-interactive features, and makes predictions via a novel Task-Aligned Predictor (TAP). Then it aligns spatial distributions of the two predictions according to the learning signals provided by a task alignment learning, as described next.
To further overcome the misalignment problem, we propose Task Alignment Learning (TAL) to explicitly pull closer the optimal anchors for the two tasks. It is performed by designing a sample assignment scheme and a task-aligned loss. The sample assignment collects training samples (i.e., positives or negatives) by computing a degree of task-alignment at each anchor, while the task-aligned loss gradually unifies the best anchors for predicting both classification and localization during training. Therefore, at inference, a bounding box having the highest classification score jointly with the most precise localization can be preserved.
The proposed T-head and learning strategy can work collaboratively towards making predictions with high quality in both classification and localization. The main contributions of this work can be summarized as follows: (1) we design a new T-head to enhance the interaction between classification and localization while maintaining their characteristics, and further align the two tasks at the predictions; (2) we propose TAL to explicitly align the two tasks at the identified task-aligned anchors, as well as providing learning signals for the proposed predictor; (3) we conducted extensive experiments on MS-COCO, where our TOOD achieved a 51.1 AP, surpassing recent one-stage detectors such as ATSS, GFL and PAA by a large margin. Qualitative results further validate the effectiveness of our task-alignment approaches.
OverFeat  is one of the earliest CNN-based one-stage detectors. Afterward, YOLO  was developed to directly predict bounding boxes and classification scores, without an additional stage to generate region proposals. SSD  introduces anchors with multi-scale predictions from multi-layer convolutional features, and Focal loss  was proposed to address the problem of class imbalance for one-stage detectors like RetinaNet. Keypoint-based detection methods, such as [3, 10, 32], address the detection problem by identifying and grouping multiple key points of a bounding box. Recently, FCOS  and FoveaBox  were developed to locate objects of interest via anchor-points and point-to-boundary distances. Most mainstream one-stage detectors are composed of two FCN-based branches for classification and localization, which may lead to the misalignment between the two tasks. In this paper, we enhance the alignment between the two tasks via a new head structure and an alignment-oriented learning approach.
Most anchor-based detectors such as [21, 29], collect training samples by computing IoUs between proposals and ground truth, while an anchor-free detector regards the anchors inside the center region of an object as positive samples [3, 9, 26]. Recent studies attempted to train the detectors more effectively by collecting more informative training samples using output results. For example, FSAF  selects meaningful samples from feature pyramids based on the computed loss, and similarly, SAPD  provides a soft-selection version of FSAF by designing a meta-selection network. FreeAnchor  and MAL  identify the best anchor-box by computing the losses in an effort to improve the matching between anchors and objects. PAA 
adaptively separates the anchors into positive and negative samples by fitting a probability distribution to the anchor scores. Different from the positive/negative sample assignment, PISA re-weights the training samples according to the precision rank of the outputs. Noisy Anchor  assigns soft labels to the training samples, and re-weights the anchor-boxes using a cleanliness score to mitigate the noise incurred by binary labels. GFL  replaces the binary classification label with an IoU score to integrate the localization quality into classification. These excellent approaches inspired the current work to develop a new assignment mechanism from a task-alignment point of view.
Similar to recent one-stage detectors such as [13, 29], the proposed TOOD has an overall pipeline of ‘backbone-FPN-head’. Moreover, by considering efficiency and simplicity, TOOD uses a single anchor per location (same as ATSS ), where the ‘anchor’ means an anchor point for an anchor-free detector, or an anchor box for an anchor-based detector. As discussed, existing one-stage detectors have limitations of task misalignment between classification and localization, due to the divergence of two tasks which are often implemented using two separate head branches. In this work, we propose to align the two tasks more explicitly using a designed Task-aligned head (T-head) with a new Task Alignment Learning (TAL). As illustrated in Figure 2, T-head and TAL can work collaboratively to improve the alignment of two tasks. Specifically, T-head first makes predictions for the classification and localization on the FPN features. Then TAL computes task alignment signals based on a new task alignment metric which measures the degree of alignment between the two predictions. Lastly, T-head automatically adjusts its classification probabilities and localization predictions using learning signals computed from TAL during back propagation.
Our goal is to design an efficient head structure to improve the conventional design of the head in one-stage detectors (as shown in Figure 3(a)). In this work, we achieve this by considering two aspects: (1) increasing the interaction between the two tasks, and (2) enhancing the detector ability of learning the alignment. The proposed T-head is shown in Figure 3(b), where it has a simple feature extractor with two Task-Aligned Predictors (TAP).
To enhance the interaction between classification and localization, we use a feature extractor to learn a stack of task-interactive features from multiple convolutional layers, as shown by the blue part in Figure 3(b). This design not only facilitates the task interaction, but also provides multi-level features with multi-scale effective receptive fields for the two tasks. Formally, let X^{fpn} ∈ R^{H×W×C} denote the FPN features, where H, W and C indicate height, width and the number of channels, respectively. The feature extractor uses N consecutive conv layers with activation functions to compute the task-interactive features:

X^{inter}_k = δ(conv_k(X^{fpn})) for k = 1, and X^{inter}_k = δ(conv_k(X^{inter}_{k−1})) for k > 1, with k ∈ {1, 2, ..., N},   (1)

where conv_k and δ refer to the k-th conv layer and an activation function, respectively. Thus we extract rich multi-scale features from the FPN features using a single branch in the head. Then, the computed task-interactive features are fed into two TAPs for aligning classification and localization.
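The recurrence above can be sketched in a few lines of Python. As a hedge: the scalar "features" and the stand-in `convs` (simple scaling functions) are illustrative assumptions for readability; a real head applies 3×3 conv layers to H×W×C tensors.

```python
def relu(x):
    """Element-wise ReLU on a scalar stand-in for a feature map (the delta above)."""
    return max(0.0, x)

def interactive_features(x_fpn, convs):
    """Apply N consecutive conv+activation layers, keeping every intermediate map.

    x_fpn : scalar stand-in for the FPN feature map X^fpn
    convs : list of N callables, one per conv layer
    Returns the list [X^inter_1, ..., X^inter_N].
    """
    feats = []
    x = x_fpn
    for conv in convs:  # X_k = relu(conv_k(X_{k-1})), with X_0 = X^fpn
        x = relu(conv(x))
        feats.append(x)
    return feats

# Hypothetical stand-in "convs": scale by a fixed weight.
convs = [lambda x, w=w: w * x for w in (0.5, -1.0, 2.0)]
feats = interactive_features(1.0, convs)  # -> [0.5, 0.0, 0.0]
```

Note that every intermediate level is retained, which is what gives the two tasks access to multi-scale receptive fields.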
We perform both object classification and localization on the computed task-interactive features, where the two tasks can well perceive the state of each other. However, due to the single-branch design, the task-interactive features inevitably introduce a certain level of feature conflict between the two different tasks, as also discussed in [25, 27]. Intuitively, the tasks of object classification and localization have different targets, and thus focus on different types of features (e.g., different levels or receptive fields). Consequently, we propose a layer attention mechanism to encourage task decomposition by dynamically computing such task-specific features at the layer level. As shown in Figure 3(c), the task-specific features are computed separately for each task of classification or localization:
X^{task}_k = w_k · X^{inter}_k,  ∀k ∈ {1, 2, ..., N},   (2)

where w_k is the k-th element of the learned layer attention w ∈ R^N. w is computed from the cross-layer task-interactive features, and is able to capture the dependencies between layers:

w = σ(fc_2(δ(fc_1(x^{inter})))),   (3)

where fc_1 and fc_2 refer to two fully-connected layers, σ is a sigmoid function, and x^{inter} is obtained by applying average pooling to X^{inter}, the concatenated features of X^{inter}_k. Finally, the results of classification or localization are predicted from each X^{task}:

Z^{task} = conv_2(δ(conv_1(X^{task}))),   (4)

where X^{task} is the concatenated features of X^{task}_k, and conv_1 is a 1×1 conv layer for dimension reduction. Z^{task} is then converted into dense classification scores P using a sigmoid function, or into object bounding boxes B with a distance-to-bbox conversion as applied in [26, 29].
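A minimal sketch of the layer attention and task decomposition, under the assumption that per-layer features are reduced to scalars and that `fc1`/`fc2` are hypothetical stand-in linear maps rather than the paper's learned layers:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def layer_attention(inter_feats, fc1, fc2):
    """Compute w = sigmoid(fc2(relu(fc1(avgpool(X^inter))))), then
    the task-specific features X^task_k = w_k * X^inter_k."""
    pooled = sum(inter_feats) / len(inter_feats)  # average pooling -> x^inter
    hidden = max(0.0, fc1(pooled))                # delta(fc1(.)) with ReLU
    w = [sigmoid(z) for z in fc2(hidden)]         # one attention weight per layer
    return [wk * xk for wk, xk in zip(w, inter_feats)]

# Hypothetical stand-in fully-connected layers:
fc1 = lambda x: 2.0 * x
fc2 = lambda h: [h, -h]  # maps the hidden value to one logit per layer
task_feats = layer_attention([1.0, 3.0], fc1, fc2)
```

The key design choice is that the attention re-weights whole layers, so classification and localization can each emphasize the receptive fields that suit them without duplicating the feature tower.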
At the prediction step, we further align the two tasks explicitly by adjusting the spatial distributions of the two predictions, P and B. Different from previous works using a centerness branch or an IoU branch, which can only adjust the classification prediction based on either classification features or localization features, we align the two predictions by considering both tasks jointly using the computed task-interactive features. Notably, we perform the alignment method separately on the two tasks. As shown in Figure 3(c), we use a spatial probability map M ∈ R^{H×W×1} to adjust the classification prediction:

P^{align} = √(P × M),   (5)

where M is computed from the interactive features, allowing it to learn a degree of consistency between the two tasks at each spatial location.
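The classification alignment is simply the geometric mean of the raw classification score and the learned spatial probability, so a location must score well on both to keep a high aligned score. A one-line sketch per spatial location:

```python
import math

def align_classification(p, m):
    """P_align = sqrt(P * M) for one spatial location; p and m are in [0, 1]."""
    return math.sqrt(p * m)

# A confident score at a spatially misaligned location is pulled down:
align_classification(0.9, 0.1)  # -> 0.3
```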
Meanwhile, to make an alignment on the localization prediction, we further learn spatial offset maps O ∈ R^{H×W×8} from the interactive features, which are used to adjust the predicted bounding box at each location. Specifically, the learned spatial offsets enable the most aligned anchor point to identify the best boundary predictions around it:

B^{align}(i, j, c) = B(i + O(i, j, 2×c), j + O(i, j, 2×c + 1), c),   (6)

where an index (i, j, c) denotes the (i, j)-th spatial location at the c-th channel in a tensor. Eq. (6) is implemented by bilinear interpolation, and its computational overhead is negligible due to the very small channel dimension of B. Notably, the offsets are learned independently for each channel, which means each boundary of the object has its own learned offset. This allows for a more accurate prediction of the four boundaries, because each of them can individually learn from the most precise anchor point near it. Therefore, our method not only aligns the two tasks, but also improves the localization accuracy by identifying a precise anchor point for each side.
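A runnable sketch of the offset-based boundary alignment, under the assumption that a single boundary channel of B is stored as a nested-list H×W map; the real head applies an independent (dy, dx) offset map to each of the four boundary channels.

```python
def bilinear_sample(b, y, x):
    """Bilinearly interpolate the 2-D map b at fractional (y, x), clamping to the border."""
    h, w = len(b), len(b[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    fy, fx = y - y0, x - x0
    top = b[y0][x0] * (1 - fx) + b[y0][x1] * fx
    bot = b[y1][x0] * (1 - fx) + b[y1][x1] * fx
    return top * (1 - fy) + bot * fy

def align_boundary(b, offsets):
    """B_align(i, j) = B(i + dy(i, j), j + dx(i, j)) for one boundary channel,
    where offsets[i][j] = (dy, dx) is the learned per-channel offset."""
    return [[bilinear_sample(b, i + offsets[i][j][0], j + offsets[i][j][1])
             for j in range(len(b[0]))] for i in range(len(b))]
```

Because the offset is fractional, each anchor point can borrow the boundary prediction from the most precise neighboring point rather than being restricted to its own cell.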
The alignment maps M and O are learned automatically from the stack of interactive features:

M = σ(conv_2(δ(conv_1(X^{inter})))),   (7)
O = conv_4(δ(conv_3(X^{inter}))),   (8)

where conv_1 and conv_3 are two 1×1 conv layers for dimension reduction. The learning of M and O is performed by using the proposed Task Alignment Learning (TAL), which is described next. Notice that our T-head is an independent module and can work well without TAL. It can be readily applied to various one-stage object detectors in a plug-and-play manner to improve detection performance.
We further introduce Task Alignment Learning (TAL), which guides our T-head to make task-aligned predictions. TAL differs from previous methods [1, 7, 8, 11, 13, 30] in two aspects. First, it is designed from a task-alignment point of view. Second, it considers both anchor assignment and weighting simultaneously. It comprises a sample assignment strategy and new losses designed specifically for aligning the two tasks.
To cope with NMS, the anchor assignment for a training instance should satisfy the following rules: (1) a well-aligned anchor should be able to predict a high classification score with a precise localization jointly; (2) a misaligned anchor should have a low classification score and be suppressed subsequently. With the two objectives, we design a new anchor alignment metric to explicitly measure the degree of task-alignment at the anchor level. The alignment metric is integrated into the sample assignment and loss functions to dynamically refine the predictions at each anchor.
Considering that a classification score and an IoU between the predicted bounding box and the ground truth indicate the quality of the predictions by the two tasks, we measure the degree of task-alignment using a high-order combination of the classification score and the IoU. To be specific, we design the following metric to compute the anchor-level alignment for each instance:

t = s^α × u^β,   (9)

where s and u denote a classification score and an IoU value, respectively, and α and β are used to control the impact of the two tasks in the anchor alignment metric. Notably, t plays a critical role in the joint optimization of the two tasks towards the goal of task-alignment. It encourages the networks to dynamically focus on high-quality (i.e., task-aligned) anchors from the perspective of joint optimization.
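The anchor alignment metric t = s^α × u^β is a one-liner; the defaults below (α = 1, β = 6) should be treated as adjustable hyper-parameters rather than fixed constants.

```python
def alignment_metric(s, u, alpha=1.0, beta=6.0):
    """Anchor-level alignment t = s^alpha * u^beta.

    s : classification score in [0, 1]
    u : IoU between predicted box and ground truth, in [0, 1]
    """
    return (s ** alpha) * (u ** beta)
```

With a large β, an anchor with a confident score but a sloppy box receives a tiny t, which is exactly the misalignment the metric is meant to penalize.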
As discussed in [29, 30], training sample assignment is crucial to the training of object detectors. To improve the alignment of the two tasks, we focus on the task-aligned anchors, and adopt a simple assignment rule to select the training samples: for each instance, we select the m anchors having the largest t values as positive samples, while using the remaining anchors as negative ones. Again, the training is performed by computing new loss functions designed specifically for aligning the tasks of classification and localization.
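The assignment rule above can be sketched as:

```python
def assign_samples(t_values, m):
    """For one instance, take the m anchors with the largest alignment
    metric t as positives and the rest as negatives.

    t_values : list of per-anchor alignment metrics
    Returns (positive_indices, negative_indices).
    """
    order = sorted(range(len(t_values)), key=lambda i: t_values[i], reverse=True)
    pos = set(order[:m])
    neg = [i for i in range(len(t_values)) if i not in pos]
    return sorted(pos), neg
```

Because t depends on the current predictions, the set of positives adapts over training, unlike fixed geometry-based assignment.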
| Method | Head | Head/full Params (M) | Head/full FLOPs (G) | AP | AP50 | AP75 |
|---|---|---|---|---|---|---|
| FoveaBox | Parallel head | 4.92/36.20 | 104.87/206.28 | 37.3 | 56.2 | 39.7 |
| FCOS w/ imprv | Parallel head | 4.92/32.02 | 104.91/200.50 | 38.6 | 57.2 | 41.7 |
| ATSS (anchor-based) | Parallel head | 4.92/32.07 | 104.87/205.21 | 39.3 | 57.5 | 42.8 |
| ATSS (anchor-free) | Parallel head | 4.92/32.07 | 104.87/205.21 | 39.2 | 57.4 | 42.2 |
To explicitly increase classification scores for the aligned anchors, and at the same time, reduce the scores of the misaligned ones (i.e., having a small t), we use t to replace the binary label of a positive anchor during training. However, we found that the network cannot converge when the labels (i.e., t) of the positive anchors become small with the increase of α and β. Therefore, we use a normalized t, namely t̂, to replace the binary label of the positive anchor, where t̂ is normalized by the following two properties: (1) to ensure effective learning of hard instances (which usually have a small t for all corresponding positive anchors); and (2) to preserve the rank between instances based on the precision of the predicted bounding boxes. Thus, we adopt a simple instance-level normalization to adjust the scale of t̂: the maximum of t̂ is set equal to the largest IoU value (u_max) within each instance. Then the Binary Cross Entropy (BCE) computed on the positive anchors for the classification task can be rewritten as:

L_{cls_pos} = Σ_{i=1}^{N_pos} BCE(s_i, t̂_i),   (10)
where i denotes the i-th anchor from the N_pos positive anchors corresponding to one instance. Following [13], we employ a focal loss for classification to mitigate the imbalance between the negative and positive samples during training. The focal loss computed on the positive anchors can be reformulated based on Eq. (10), and the final loss function for the classification task is defined as follows:

L_cls = Σ_{i=1}^{N_pos} |t̂_i − s_i|^γ BCE(s_i, t̂_i) + Σ_{j=1}^{N_neg} s_j^γ BCE(s_j, 0),   (11)

where j denotes the j-th anchor from the N_neg negative anchors, and γ is the focusing parameter.
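A sketch of the classification loss, under the assumption that `t_hat` is the normalized alignment metric used as the soft label of each positive anchor (negatives keep the label 0):

```python
import math

def bce(s, y, eps=1e-9):
    """Binary cross entropy between score s and (possibly soft) label y."""
    return -(y * math.log(s + eps) + (1 - y) * math.log(1 - s + eps))

def task_aligned_cls_loss(pos, neg, gamma=2.0):
    """pos : list of (score, t_hat) pairs for positive anchors
    neg : list of scores for negative anchors
    gamma : focal focusing parameter."""
    loss = sum(abs(t - s) ** gamma * bce(s, t) for s, t in pos)
    loss += sum(s ** gamma * bce(s, 0.0) for s in neg)
    return loss
```

The |t̂ − s|^γ weight vanishes once a positive anchor's score matches its soft label, so gradient effort concentrates on anchors whose score and alignment still disagree.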
A bounding box predicted by a well-aligned anchor (i.e., having a large t) usually has both a high classification score and a precise localization, and such a bounding box is more likely to be preserved during NMS. In addition, t can be applied to select high-quality bounding boxes by weighting the loss more carefully during training. As discussed in previous work, learning from high-quality bounding boxes is beneficial to the performance of a model, while low-quality ones often have a negative impact on training by producing a large amount of less informative and redundant signals to update the model. In our case, we apply the t value to measure the quality of a bounding box. Thus, we improve the task alignment and regression precision by focusing on the well-aligned anchors (with a large t), while reducing the impact of the misaligned anchors (with a small t) during bounding box regression. Similar to the classification objective, we re-weight the loss of bounding box regression computed for each anchor based on t̂, and the GIoU loss can be reformulated as follows:

L_reg = Σ_{i=1}^{N_pos} t̂_i L_{GIoU}(b_i, b̄_i),   (12)

where b_i and b̄_i denote the i-th predicted bounding box and the corresponding ground-truth box. The total training loss for TAL is the sum of L_cls and L_reg.
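A sketch of the t̂-weighted regression loss; to keep the example short, a plain IoU loss (1 − IoU) stands in for the GIoU loss, and boxes are (x1, y1, x2, y2) tuples.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def task_aligned_reg_loss(pred_boxes, gt_boxes, t_hat):
    """Weight each positive anchor's box loss by its normalized metric t_hat."""
    return sum(t * (1.0 - iou(p, g))
               for p, g, t in zip(pred_boxes, gt_boxes, t_hat))
```

Anchors with a small t̂ contribute almost nothing, so imprecise, misaligned boxes cannot dominate the regression gradients.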
All experiments are implemented on the large-scale detection benchmark MS-COCO. Following the standard practice [14, 15], we use the train2017 set (115K images) for training and the val2017 set (5K images) for validation in our ablation study. We report our main results on the test-dev set for comparison with the state-of-the-art detectors. The performance is measured by the standard COCO Average Precision (AP) metrics.
We implement TOOD with ImageNet pre-trained backbones (including ResNet and ResNeXt-101-64×4d variants). Similar to ATSS, TOOD tiles one anchor per location. Unless specified, we report experimental results of an anchor-free TOOD (an anchor-based TOOD can achieve similar performance, as shown in Table 3). The number of interactive layers N is set to 6 so that T-head has a similar number of parameters to the conventional parallel head, and the focusing parameter γ is set to 2 as used in [13, 15]. More implementation and training details are presented in the Supplementary Material (SM).
For the ablation study, we use the ResNet-50 backbone and train the model for 12 epochs unless specified. The performances are reported on the COCO validation set.
We compare our T-head with the conventional parallel head in Table 1. T-head can be adopted in various one-stage detectors in a plug-and-play manner, and consistently outperforms the conventional head by 0.7 to 1.9 AP, with fewer parameters and FLOPs. This validates the effectiveness of our design, and demonstrates that T-head can work more efficiently with higher performance, by introducing task interaction and prediction alignment.
| Method | Assignment | Weight | AP | AP50 | AP75 |
|---|---|---|---|---|---|
| Center sampling | fixed | fixed | 37.3 | 56.2 | 39.3 |
| PAA + IoU pred. | adaptive | fixed | 40.9 | 59.4 | 43.9 |
| TAL + TAP | adaptive | adaptive | 42.5 | 60.3 | 46.4 |
To demonstrate the effectiveness of TAL, we compare TAL with other learning methods using different sample assignment methods, as shown in Table 2. Training sample assignment can be divided into the fixed assignment and adaptive assignment according to whether it is a learning-based method. Different from the existing assignment methods, TAL adaptively assigns both positive and negative anchors, and at the same time, computes the weights of positive anchors more carefully, resulting in higher performance. To compare with PAA (+IoU pred.) which has an additional prediction structure, we integrate TAP into TAL, resulting in a higher AP of 42.5. More discussions on the differences between TAL and previous methods are presented in SM.
We evaluate the performance of the complete TOOD (T-head + TAL). As shown in Table 3, the anchor-free TOOD and anchor-based TOOD achieve similar performance, i.e., 42.5 AP and 42.4 AP, respectively. Compared with ATSS, TOOD improves the performance by 3.2 AP. In particular, the improvement at the stricter AP75 metric is significant, with TOOD yielding 3.8 points higher AP75. This validates that aligning the two tasks can improve the detection performance. Notably, TOOD brings a higher improvement (+3.3 AP) than the sum of the individual improvements by T-head + ATSS (+1.9 AP) and Parallel head + TAL (+1.1 AP), as shown in Table 6. It suggests that T-head and TAL strongly compensate for each other.
We first investigate the performance using different values of α and β for TAL, which control the influence of the classification confidence and localization precision on the anchor alignment metric via t. Through a coarse search shown in Table 4, we adopt α = 1 and β = 6 for our TAL. We then conduct several experiments to study the robustness of the hyper-parameter m, which is used to select the positive anchors. Using different values of m in {5, 9, 13, 17, 21}, we achieve results in a range of 42.0–42.5 AP, which suggests the performance is insensitive to m. Thus, we adopt m = 13 in all our experiments.
| Method | Venue | Backbone | AP | AP50 | AP75 | APS | APM | APL |
|---|---|---|---|---|---|---|---|---|
| FCOS w/ imprv | ICCV19 | ResNet-101 | 43.0 | 61.7 | 46.3 | 26.0 | 46.8 | 55.0 |
| Noisy Anchor | CVPR20 | ResNet-101 | 41.8 | 61.1 | 44.9 | 23.4 | 44.9 | 52.9 |
| Method | AP | PCC (top-50) | IoU (top-10) | #Correct boxes | #Redundant boxes | #Error boxes |
|---|---|---|---|---|---|---|
| Parallel head + ATSS | 39.2 | 0.408 | 0.637 | 30,261 | 25,428 | 92,677 |
| T-head + ATSS | 41.1 | 0.440 | 0.644 | 30,601 | 21,838 | 79,189 |
| Parallel head + TAL | 40.3 | 0.415 | 0.643 | 30,506 | 15,927 | 72,320 |
| T-head + TAL | 42.5 | 0.452 | 0.661 | 30,734 | 15,242 | 69,013 |
We compare our TOOD with other one-stage detectors on the COCO test-dev in Table 5. The models are trained with scale jitter (480–800) and the 2× learning schedule (24 epochs), following recent methods. For a fair comparison, we report results of a single model at a single testing scale. With ResNet-101 and ResNeXt-101-64×4d, TOOD achieves 46.7 AP and 48.3 AP, outperforming recent one-stage detectors such as ATSS (by about 3 AP) and GFL (by about 2 AP). Furthermore, with ResNet-101-DCN and ResNeXt-101-64×4d-DCN, TOOD brings a larger improvement compared to other detectors. For example, it obtains an improvement of 2.8 AP (48.3 → 51.1 AP) while ATSS has a 2.1 AP (45.6 → 47.7 AP) improvement. This validates that TOOD can cooperate with Deformable Convolutional Networks (DCN) more efficiently, by adaptively adjusting the spatial distribution of the learned features for task-alignment. Note that in TOOD, DCN is applied to the first two layers of the head tower. As shown in Table 5, TOOD achieves a new state-of-the-art result of 51.1 AP in one-stage object detection.
We quantitatively analyze the effect of the proposed methods on the alignment of the two tasks. Without NMS, we calculate a Pearson Correlation Coefficient (PCC) between the rankings of classification and localization by selecting the top-50 confident predictions for each instance, and a mean IoU of the top-10 confident predictions, averaged over instances. As shown in Table 6, the mean PCC and IoU are improved by using T-head and TAL. Meanwhile, with NMS, the number of correct boxes (IoU ≥ 0.5) increases while the numbers of redundant boxes (IoU ≥ 0.5) and error boxes (0.1 < IoU < 0.5) decrease substantially when applying T-head and TAL. These statistics suggest that TOOD is more compatible with NMS, preserving more correct boxes and suppressing the redundant/error boxes significantly. Finally, detection performance is improved by 3.3 AP in total. Several detection examples are illustrated in Figure 4.
In this work, we illustrate the misalignment between classification and localization in existing one-stage detectors, and propose TOOD to align the two tasks. In particular, we design a task-aligned head to enhance the interaction between the two tasks and improve its ability to learn the alignment. Furthermore, a new task alignment learning strategy is developed by introducing a sample assignment scheme and new loss functions, both of which are computed via an anchor alignment metric. With these improvements, TOOD achieves a 51.1 AP on MS-COCO, surpassing the state-of-the-art one-stage detectors by a large margin.
Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 11632–11641.