1 Introduction
Object detection is one of the most widely studied tasks in computer vision, with applications to many other vision tasks such as object tracking, instance segmentation, and image captioning. Object detection architectures fall into two categories: single-shot detectors [6, 3] and two-stage detectors [4]. Two-stage detectors leverage a region proposal network to find a fixed number of object candidates; a second network then predicts a score for each candidate and refines its bounding box.
Single-shot detectors can in turn be split into two categories: anchor-based detectors [6, 7] and keypoint-based detectors [3, 2]. Anchor-based detectors place many anchor boxes over the image and then predict offsets and classes for each anchor. The best-known anchor-based architecture is RetinaNet [6], which proposed the focal loss function to correct for the class imbalance between positive and negative anchor boxes. The highest performing anchor-based detector is FSAF [7], which ensembles an anchor-based output with an anchor-free output head to further improve performance.
Keypoint-based detectors, on the other hand, predict top-left and bottom-right corner heatmaps and match corners together using feature embeddings. The original keypoint-based detector is CornerNet [3], which leverages a special corner pooling layer to accurately detect objects of different sizes. Since then, CenterNet [2] substantially improved the CornerNet architecture by predicting object centers along with corners.
Detecting objects at different scales is a major challenge for object detection. One of the biggest advances in scale-aware architectures is Feature Pyramid Networks (FPNs), introduced by Lin et al. [5]. FPNs were designed to be scale invariant by having multiple layers with different receptive fields, so that objects are mapped to layers with the relevant receptive fields: small objects are mapped to earlier layers in the pyramid, and larger objects to later layers. Since the size of objects relative to the downsampling of their layer is kept nearly uniform across pyramid layers, a single output subnetwork can be shared across all layers. Although FPNs provide an elegant way to handle objects of different sizes, they offer no solution for objects of different aspect ratios. A tall tower, a giraffe, or a knife introduces a design difficulty for FPNs: should one map these objects to layers according to their width or their height? Assigning the object to a layer according to its larger dimension results in loss of information along the smaller dimension due to aggressive downsampling, and vice versa. To solve this, we introduce Matrix Networks (xNets), a new scale and aspect ratio aware CNN architecture. xNets, as shown in Fig. 2, have several matrix layers, each handling objects of a specific size and aspect ratio. xNets assign objects of different sizes and aspect ratios to layers such that object sizes within their assigned layers are close to uniform. This allows a square output convolution kernel to gather information equally well about objects of all aspect ratios and scales. xNets can be applied to any backbone, similar to FPNs; we denote this by appending an "X" to the backbone name, e.g., ResNet50X.
As an application, we use xNets for keypoint-based object detection. While keypoint-based single-shot detectors are the current state-of-the-art [2], they have two limitations that stem from using a single output layer. First, they require very large, computationally expensive backbones and special pooling layers for the model to converge. Second, they have difficulty accurately matching top-left and bottom-right corners. To address these limitations, we introduce the keypoint matrixnet (KP-xNet) architecture, which leverages xNets to achieve state-of-the-art results with ResNet50, ResNet101, and ResNeXt101 backbones. We detect corners for objects of different sizes and aspect ratios using different matrix layers, and we simplify the matching process by removing the embedding layer entirely and regressing the object centers directly. We show that KP-xNet outperforms all existing single-shot detectors, achieving 47.8% mAP on the MS COCO benchmark.
The rest of the paper is structured as follows: Section 2 formalizes the idea of MatrixNets, while Section 3 discusses keypoint based object detection, and our method for applying MatrixNets for keypoint based object detection. Section 4 covers experiments, results, and comparisons, and finally Section 5 is the conclusion.
2 Matrix Nets
Matrix nets (xNets), as shown in Fig. 2, model objects of different sizes and aspect ratios using a matrix of layers, where each entry in the matrix represents a layer l_{i,j} with a width downsampling of 2^(j-1) and a height downsampling of 2^(i-1) with respect to the top-left layer l_{1,1} in the matrix. The diagonal layers are square layers of different sizes, equivalent to an FPN, while the off-diagonal layers are rectangular layers, unique to xNets. Layer l_{1,1} is the largest layer in size; every step to the right cuts the width of the layer in half, while every step down cuts the height in half. For example, layer l_{3,4} has half the width of layer l_{3,3} and the same height. Diagonal layers model objects with square-like aspect ratios, while off-diagonal layers model objects with more extreme aspect ratios. Layers close to the top-right or bottom-left corners of the matrix model objects with very high or very low aspect ratios. Such objects are very rare, so these layers can be pruned for efficiency.
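The layer matrix described above can be sketched in a few lines of Python; `build_layer_matrix` and its pruning threshold are hypothetical helpers for illustration, not code from the paper.

```python
# Sketch of the matrix layer layout (hypothetical helper). Each entry holds
# the (height, width) downsampling of layer l_{i,j}, 1-indexed by (row i,
# column j), relative to l_{1,1}; layers with extreme aspect ratios (the
# top-right and bottom-left corners of the matrix) are pruned.

def build_layer_matrix(n=5, max_ratio_steps=2):
    """Return {(i, j): (h_ds, w_ds)} for the kept layers."""
    layers = {}
    for i in range(1, n + 1):        # each step down halves the height
        for j in range(1, n + 1):    # each step right halves the width
            if abs(i - j) > max_ratio_steps:
                continue             # prune very tall / very wide layers
            layers[(i, j)] = (2 ** (i - 1), 2 ** (j - 1))
    return layers

layers = build_layer_matrix()
print(layers[(1, 1)])    # (1, 1): the largest layer
print(layers[(3, 4)])    # (4, 8): half the width of l_{3,3}, same height
print((1, 5) in layers)  # False: pruned extreme aspect ratio
```

With a 5x5 matrix and the pruning threshold above, 19 layers remain, which keeps the architecture cheap while still covering common aspect ratios.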
2.1 Layer Generation
Generating the matrix layers is a crucial step, since it impacts the number of model parameters: the more parameters, the more expressive the model but the harder the optimization problem, so we choose to introduce as few new parameters as possible. The diagonal layers can be obtained from different stages of the backbone or using a feature pyramid backbone [5]. The upper triangular layers are obtained by applying a series of shared 3x3 convolutions with stride 1x2 on the diagonal layers. Similarly, the lower triangular layers are obtained using shared 3x3 convolutions with stride 2x1. The parameters are shared across all downsampling convolutions to minimize the number of new parameters.
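A shape-level sketch of this generation scheme (the function name and the example sizes are illustrative assumptions): with padding 1, a shared 3x3 convolution of stride 1x2 halves the width while preserving the height, producing each layer from the one to its left.

```python
# With padding p and kernel k, a 2D convolution of stride (sh, sw) maps an
# input of spatial size (H, W) to ((H + 2p - k) // sh + 1, (W + 2p - k) // sw + 1).
# For k=3, p=1 this preserves a dimension at stride 1 and halves it at stride 2.

def conv3x3_out_shape(h, w, stride_h, stride_w, padding=1, kernel=3):
    out_h = (h + 2 * padding - kernel) // stride_h + 1
    out_w = (w + 2 * padding - kernel) // stride_w + 1
    return out_h, out_w

# Generate one row of the matrix from a 64x64 diagonal layer.
shape = (64, 64)
row = [shape]
for _ in range(2):  # two steps to the right, via stride-1x2 convolutions
    shape = conv3x3_out_shape(*shape, stride_h=1, stride_w=2)
    row.append(shape)
print(row)  # [(64, 64), (64, 32), (64, 16)]
```

The stride-2x1 case is symmetric and generates the layers below each diagonal layer.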
2.2 Layer Ranges
Each layer in the matrix models objects of certain widths and heights, so we need to define the range of widths and heights of the objects assigned to each layer. The ranges need to reflect the receptive fields of the feature vectors in the matrix layers: each step to the right in the matrix effectively doubles the receptive field in the horizontal dimension, and each step down doubles it in the vertical dimension. Hence, the range of widths or heights needs to double as we advance to the right or down in the matrix. Once the range for the first layer is defined, we can generate the ranges for the rest of the matrix layers using this rule. For example, if layer l_{1,1} covers heights [24px, 48px] and widths [24px, 48px], then layer l_{1,2} covers heights [24px, 48px] and widths [48px, 96px].
Objects on the boundaries of these ranges could destabilize training, since the layer assignment would change with a slight change in object size. To avoid this problem, we relax the layer boundaries by extending them in both directions: the lower end of each range is multiplied by a number less than one and the higher end by a number greater than one. In all our experiments, we use 0.8 and 1.3, respectively.
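The range-doubling rule and boundary relaxation above can be sketched as follows; `layer_range` and `layers_for_object` are hypothetical helpers, with the base [24px, 48px] range taken from Section 4 and the 0.8/1.3 relaxation factors from this section.

```python
# Ranges double per step right (widths) or down (heights); the relaxation
# factors widen each range so boundary objects map to more than one layer.

def layer_range(i, j, base=(24, 48), relax=(0.8, 1.3)):
    """Relaxed (height_range, width_range) for layer l_{i,j}, 1-indexed."""
    lo, hi = base
    h_range = (lo * 2 ** (i - 1) * relax[0], hi * 2 ** (i - 1) * relax[1])
    w_range = (lo * 2 ** (j - 1) * relax[0], hi * 2 ** (j - 1) * relax[1])
    return h_range, w_range

def layers_for_object(h, w, n=5):
    """All layers whose relaxed range contains an object of size (h, w)."""
    assigned = []
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            (h_lo, h_hi), (w_lo, w_hi) = layer_range(i, j)
            if h_lo <= h <= h_hi and w_lo <= w <= w_hi:
                assigned.append((i, j))
    return assigned

# A 40px-tall, 80px-wide object lands on off-diagonal (wide) layers.
print(layers_for_object(40, 80))  # [(1, 2), (1, 3), (2, 2), (2, 3)]
```

Note that, because of the relaxation, a single object can be assigned to several neighboring layers rather than exactly one.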
2.3 Advantages of Matrix Nets
The key advantage of Matrix Nets is that they allow a square convolutional kernel to accurately gather information about objects of different aspect ratios. In traditional object detection models such as RetinaNet, a square convolutional kernel is required to output boxes of different aspect ratios and scales; this is counterintuitive, since boxes of different aspect ratios and scales require different contexts. In Matrix Nets, the same square convolutional kernel can be used for detecting boxes of different scales and aspect ratios because the context changes with each matrix layer. Since object sizes are nearly uniform within their assigned layers, the dynamic range of the widths and heights is smaller than in other architectures such as FPNs; hence, regressing the heights and widths of objects becomes an easier optimization problem. Finally, Matrix Nets can be used as a backbone for any object detection architecture, anchor-based or keypoint-based, one-stage or two-stage.
3 Keypoint Based Object Detection
CornerNet [3] was proposed as an alternative to anchor-based detectors. CornerNet predicts a bounding box as a pair of corners: top-left and bottom-right. For each corner, it predicts heatmaps, offsets, and embeddings. Top-left and bottom-right corner candidates are extracted from the heatmaps, embeddings are used to group the top-left and bottom-right corners that belong to the same object, and offsets are used to refine the bounding boxes, producing tighter boxes. This approach has three main limitations.
(1) CornerNet handles objects of different sizes and aspect ratios using a single output layer. As a result, predicting corners for large objects is challenging, since the information about the object isn't always available at the corner location within the receptive field of regular convolutions. To solve this, CornerNet introduced the corner pooling layer, which applies a max operation along the horizontal and vertical dimensions: the top-left corner pooling layer scans everything to the right of and below each location to detect any presence of a corner. Although CornerNet shows experimentally that corner pooling stabilizes the model, max operations lose information. For example, if two objects share the same location for their top edge, only the object with the larger features contributes to the gradient, so we can expect false positive predictions caused by the corner pooling layers.
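For concreteness, here is a minimal sketch of top-left corner pooling as described in the CornerNet paper (a cumulative max to the right plus a cumulative max downward, summed); the helper names are ours, and the 2D-list implementation is illustrative rather than an efficient GPU kernel.

```python
# Top-left corner pooling: for each location, the horizontal branch takes the
# max over everything to its right and the vertical branch the max over
# everything below; the two pooled maps are summed. When two objects share an
# edge, only the max features survive -- the information loss discussed above.

def cummax_from_right(row):
    out, running = [0.0] * len(row), float("-inf")
    for k in range(len(row) - 1, -1, -1):
        running = max(running, row[k])
        out[k] = running
    return out

def top_left_corner_pool(feat):
    """feat: 2D list of activations; returns horizontal + vertical pooled map."""
    horiz = [cummax_from_right(row) for row in feat]
    cols = [list(c) for c in zip(*feat)]
    vert_cols = [cummax_from_right(c) for c in cols]        # max from bottom
    vert = [list(r) for r in zip(*vert_cols)]               # transpose back
    return [[h + v for h, v in zip(hr, vr)] for hr, vr in zip(horiz, vert)]

pooled = top_left_corner_pool([[1, 0, 0],
                               [0, 2, 0],
                               [0, 0, 3]])
print(pooled)  # [[2, 2, 3], [2, 4, 3], [3, 3, 6]]
```

In xNets this whole operation is unnecessary, since each layer's receptive field already covers the objects assigned to it.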
(2) Matching the top-left and bottom-right corners is done with feature embeddings, which causes two problems. First, the pairwise distances need to be optimized during training, so as the number of objects in an image increases, the number of pairs grows quadratically, which hurts the scalability of training for dense object detection. Second is the difficulty of learning the embeddings themselves. CornerNet tries to learn the embedding for each object corner conditioned on the appearance of the other corner of the object; if the object is large, the appearance of the two corners can be very different because of the distance between them, so the embeddings at the two corners can differ as well. Conversely, if there are multiple objects in the image with similar appearance, the embeddings of their corners will likely be similar. This is why CornerNet sometimes merges persons, or traffic lights, together.
(3) As a result of the previous two problems, CornerNet is forced to use the Hourglass104 backbone to achieve state-of-the-art performance. Hourglass104 has over 200M parameters and very slow, unstable training, requiring 10 GPUs with 12GB of memory to ensure a batch size large enough for stable convergence.
3.1 Keypoint Based Object Detection Using Matrix Nets
Fig. 3 shows our proposed architecture for keypoint-based object detection, KP-xNet. KP-xNet consists of four stages. (a-b) We use an xNet backbone as defined in Section 2. (c) Using a shared output subnetwork, for each matrix layer we predict the top-left and bottom-right corner heatmaps, corner offsets, and center predictions for the objects assigned to that layer. (d) We match corners within the same layer using the center predictions, and then combine the outputs of all layers with soft non-maximum suppression to produce the final output.
Corner Heatmaps
Using xNets ensures that the context required for objects within a layer is bounded by the receptive field of a single feature map in that layer. As a result, corner pooling is no longer needed, and regular convolutional layers can be used to predict the heatmaps for the top-left and bottom-right corners. Similar to CornerNet, we use the focal loss to deal with unbalanced classes.
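As a reference point, here is a minimal sketch of the focal loss from [6] in its binary form; CornerNet and this work use a penalty-reduced variant of it on the corner heatmaps, so treat this as the underlying idea rather than the exact training objective.

```python
import math

# Binary focal loss: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t).
# The (1 - p_t)^gamma factor down-weights easy examples, which matters on
# corner heatmaps where almost every location is an easy negative.

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """p: predicted probability of the positive class, y: 0/1 label."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# Easy negatives (the overwhelming majority of heatmap locations) contribute
# almost nothing, while hard negatives still generate a large loss.
print(focal_loss(0.05, 0))  # easy negative: tiny loss
print(focal_loss(0.95, 0))  # hard negative: much larger loss
```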
Corner Regression
Due to image downsampling, refining the corners is important for producing tighter bounding boxes. When a corner location is scaled down to a position in a layer, the fractional part of the position is lost; we therefore predict offsets so that we can scale the corner back up to the original image size without losing precision. We keep the offset values within a small fixed range, and we use the smooth L1 loss to optimize the parameters.
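A sketch of this offset scheme, assuming the CornerNet convention that the regression target is the fractional part lost when mapping an image coordinate into a layer of stride s (the helper names are hypothetical):

```python
# Mapping a corner at image coordinate (x, y) into a layer with stride s
# rounds it to an integer heatmap location; the discarded fractional part is
# kept as the offset regression target, so the corner can be recovered
# exactly at test time.

def offset_target(x, y, stride):
    fx, fy = x / stride, y / stride
    ix, iy = int(fx), int(fy)          # heatmap location of the corner
    return (ix, iy), (fx - ix, fy - iy)

def recover_corner(heatmap_xy, offsets, stride):
    """Invert the mapping: scale the heatmap location back up losslessly."""
    return ((heatmap_xy[0] + offsets[0]) * stride,
            (heatmap_xy[1] + offsets[1]) * stride)

loc, off = offset_target(100, 57, stride=8)
print(loc, off)                      # (12, 7) (0.5, 0.125)
print(recover_corner(loc, off, 8))   # (100.0, 57.0)
```

Without the offsets, the recovered corner would be off by up to stride-1 pixels, which is significant for the most downsampled matrix layers.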
Center Regression
Since matching is done within each individual matrix layer, the width and height of an object are guaranteed to fall within a certain range, and the center of the object can therefore be regressed easily because its range is small. In CornerNet, the dynamic range of the centers is large, and trying to regress centers in a single output layer would likely fail. Once the centers are obtained, the corners can be matched by comparing the regressed center to the actual center of the candidate pair of corners. During training, center regression scales linearly with the number of objects in the image, compared to quadratic growth in the case of learned embeddings. To optimize the parameters, we use the smooth L1 loss.
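The matching step can be sketched as follows (hypothetical helper; the tolerance and data layout are illustrative assumptions, not the paper's exact procedure):

```python
# Center-based corner matching within one matrix layer: a top-left /
# bottom-right pair is accepted when the center regressed at either corner is
# close to the geometric center of the candidate box. This replaces embedding
# distances entirely, and its training cost grows linearly in the number of
# objects rather than quadratically in the number of corner pairs.

def match_corners(top_lefts, bottom_rights, tol=0.5):
    """Each corner: ((x, y), (cx, cy)) with its regressed center prediction."""
    boxes = []
    for tl, tl_center in top_lefts:
        for br, br_center in bottom_rights:
            if br[0] <= tl[0] or br[1] <= tl[1]:
                continue                           # not a valid box
            gx, gy = (tl[0] + br[0]) / 2, (tl[1] + br[1]) / 2
            for cx, cy in (tl_center, br_center):
                if abs(cx - gx) <= tol and abs(cy - gy) <= tol:
                    boxes.append((tl, br))
                    break
    return boxes

tls = [((0, 0), (5.0, 5.0))]
brs = [((10, 10), (5.2, 4.9)), ((30, 10), (5.0, 5.0))]
print(match_corners(tls, brs))  # only the (0, 0)-(10, 10) pair matches
```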
KP-xNet solves problem (1) of CornerNet because the matrix layers each represent a different scale and aspect ratio rather than packing them all into a single layer; this also allows us to drop the corner pooling operation. Problem (2) is solved since we no longer predict embeddings and instead regress centers directly. By solving the first two problems of CornerNet, we will show in the experiments that we can achieve better results than CornerNet with a smaller network and fewer computational resources.
4 Experiments
We train all of our networks on a server with Titan XP GPUs. We use a batch size of 20, which requires 3 GPUs for ResNet50X and 4 GPUs for ResNet101X and ResNeXt101X. For our final ResNeXt101X experiment, we train on an AWS p3.16xlarge instance with 8 V100 GPUs to allow for a larger batch size of 55; this improves performance by 0.7% mAP. During training, we use crops of size 512x512, standard scale jitter of 0.6-1.5, and a custom cutout [1] implementation. For optimization, we use the Adam optimizer with an initial learning rate of 5e-5, cut by a factor of 10 after 60 epochs, training for a total of 80 epochs. For our matrix layer ranges, we set the range of layer l_{1,1} to be [24px, 48px] x [24px, 48px] and scale the rest as described in Section 2.2. At test time, we resize the image so that its longer side is 900 pixels. We trained our model on the MS COCO 'trainval35k' set (i.e., the 80K training images and 35K validation images) and tested on the 'test-dev2017' set. Using this setup, we achieve 41.7, 42.7, and 44.7 mAP single-scale, and 43.9, 44.8, and 47.8 mAP multi-scale, for the three backbones respectively.
4.1 Comparisons
Architecture | Backbone | mAP
CornerNet [3] | Hourglass104 | 40.8
CornerNet (multi-scale) [3] | Hourglass104 | 42.1
RetinaNet [6] | ResNeXt101-FPN | 40.8
FSAF [7] | ResNeXt101-FPN | 42.3
FSAF (multi-scale) [7] | ResNeXt101-FPN | 44.6
CenterNet [2] | Hourglass104 | 44.9
CenterNet (multi-scale) [2] | Hourglass104 | 47.0
KP-xNet (ours) | ResNeXt101X | 44.7
KP-xNet (multi-scale, ours) | ResNeXt101X | 47.8
As shown in Table 1, our final model, KP-xNet (multi-scale) with a ResNeXt101X backbone, achieves a higher mAP than the next-best model and over 5.7% mAP more than the original CornerNet architecture. The second-best architecture, CenterNet (multi-scale), is trained with a backbone twice as large, about 3x the training iterations, and 2x the GPU memory: CenterNet ran 480K training iterations on the Hourglass104 backbone, whereas our model converged in 180K training iterations on the smaller ResNeXt101X backbone. Even RetinaNet takes 150 epochs to converge, 1.8x more than our 80 epochs.
We also compare our model against other models with different backbones, matched by number of parameters. In Fig. 1, we show that KP-xNet outperforms all other architectures at all parameter counts (results are taken from each paper). We believe this is because KP-xNet uses a scale and aspect ratio aware architecture.
5 Conclusion
In this work, we introduced MatrixNets, a scale and aspect ratio aware architecture for object detection. We showed how to use MatrixNets to solve fundamental limitations of keypoint-based object detection. Our model achieves state-of-the-art accuracy among single-shot detectors on MS COCO.
References

[1] T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552.
[2] K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang, and Q. Tian (2019) CenterNet: object detection with keypoint triplets. arXiv preprint arXiv:1904.08189.
[3] H. Law and J. Deng (2018) CornerNet: detecting objects as paired keypoints. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 734-750.
[4] Y. Li, Y. Chen, N. Wang, and Z. Zhang (2019) Scale-aware trident networks for object detection. arXiv preprint arXiv:1901.01892.
[5] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117-2125.
[6] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár (2017) Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2980-2988.
[7] C. Zhu, Y. He, and M. Savvides (2019) Feature selective anchor-free module for single-shot object detection. arXiv preprint arXiv:1903.00621.