Image-Based Parking Space Occupancy Classification: Dataset and Baseline

07/26/2021 ∙ by Martin Marek, et al.

We introduce a new dataset for image-based parking space occupancy classification: ACPDS. Unlike in prior datasets, each image is taken from a unique view, systematically annotated, and the parking lots in the train, validation, and test sets are unique. We use this dataset to propose a simple baseline model for parking space occupancy classification, which achieves 98% accuracy on unseen parking lots, significantly outperforming existing models. We share our dataset, code, and trained models under the MIT license.

Code repository: parking-space-occupancy (the official repository for this paper).

1 Introduction

Live information about parking space occupancy can be used to navigate drivers effectively, reducing congestion and emissions and saving time.

There are two common approaches to monitor the occupancy of individual parking spaces: installing a sensor on every parking space or using a camera to monitor multiple parking spaces at once. In general, if a single camera can capture tens of parking spaces, the cost per parking space can be significantly lower compared to a parking sensor.

A camera can be used in two basic ways to monitor the occupancy of parking spaces. One way is to capture and process a live video feed, allowing for methods such as motion tracking [13, 4, 14]. The other way is to capture images at a longer interval and process them individually [6, 2, 1, 12]. In this study, we focus only on image-based models, for several reasons. First, an image can be captured using a long exposure time, allowing for good color reproduction even in low light; in contrast, capturing a video limits the camera’s exposure time to the inverse of the framerate, likely 1/10 of a second or shorter. Additionally, a burst of images can be merged to increase dynamic range, especially in direct sunlight [9]. Second, if inference is done off-camera, capturing images at longer intervals significantly decreases data flow. Third, if the camera feed temporarily failed, an image-based model could immediately recover, as it does not rely on history; in contrast, a video-based model might lack the necessary context for inference. Lastly, it is easier to capture and annotate a dataset composed of images, compared to videos.

In prior works on camera-based parking space occupancy classification, the authors generally train their model on a large but generic object-detection dataset [13, 4, 14] or on an application-specific dataset consisting of just a few parking lots [6, 2, 1]. To evaluate model generalization, the authors use a separate dataset consisting of previously-unseen parking lots, achieving 89% - 96% accuracy.

We introduce a challenging new dataset where each image is taken from a unique view, corresponding to over 11,000 unique parking space annotations – almost an order-of-magnitude more than the largest previously-published dataset [2]. Our dataset is the first publicly available dataset that directly tests model generalization, by using separate parking lots for the train, validation, and test sets. It is also the first dataset with a consistent annotation format, enabling new augmentations and pooling methods. We use this dataset to develop a simple baseline model that achieves over 98% accuracy on previously-unseen parking lots, significantly outperforming existing models.

Figure 1: A sample image from our dataset. Each parking space is annotated using a quadrilateral corresponding to the edges of the parking space.

2 Related work

Existing models can be split into two main groups: video-based and image-based.

2.1 Video-based models

Video-based models rely on a continuous video feed to perform inference.

Ke et al. [13] use a Single Shot MultiBox Detector (SSD) [18] together with a standard tracking algorithm [3] and match the detected objects to parking spaces. Their model achieves 95.6% accuracy in a parking garage not included in the training set; however, the authors adjusted model parameters to optimize performance for this garage.

Cai et al. [4] use Mask R-CNN [10] combined with a memory of features characterizing previously detected cars. They build their own dataset for evaluation and achieve 88% sensitivity (the authors do not state the accuracy of their model).

Li et al. [14] take a fundamentally different approach to locating vacant parking spaces. They utilize a surround-view camera system mounted on a car to detect vacant parking spaces. They build their own model and dataset and achieve a precision and recall of 98% and 92%, respectively (the authors do not state the accuracy of their model).

2.2 Image-based models

de Almeida et al. [6] introduce a dataset named PKLot, consisting of 12,417 images taken from 3 views of 2 parking lots. However, they suggest that the dataset be split into train and test sets based on the date of capture of each image, not based on the parking lot or parking space. As a result, both training and evaluating a model on this dataset will lead to an overestimated accuracy.

Amato et al. [2] introduce a new dataset of roughly 150,000 images captured from 9 views of a single parking lot containing 164 parking spaces. They train a classifier on this dataset and achieve 98% accuracy when evaluated on the same dataset. However, when they evaluate the model on the PKLot dataset to test its generalization performance, they only achieve 93.7% accuracy.

Acharya et al. [1] train a classifier on PKLot and evaluate it on their own dataset, achieving 96.6% accuracy. They likely achieve this relatively high accuracy because their evaluation dataset contains only 30 parking spaces and no occlusions.

Hsieh et al. [12] take a different approach to monitoring parking lot occupancy: they use a drone to detect and count vehicles. They build their own dataset of drone images, capturing 4 parking lots and nearly 90,000 cars, and annotate these images using bounding boxes. They use an object detector with a custom region proposal network to detect vehicles in images. The final output of their model is the vehicle locations and the total vehicle count; they do not classify the occupancy of individual parking spaces. Their model achieves a mean absolute error of 22 for the vehicle count.

3 Dataset

Figure 2: A sample of annotated images from our dataset. We collected images from different parking lots, under varied weather and lighting conditions, at different occupancy levels.

We introduce a new dataset for parking space occupancy classification: the Action-Camera Parking Dataset (ACPDS). The goal of our dataset is to improve and correctly test model performance on previously unseen parking lots. To this end, the dataset captures a variety of parking lots, each image is taken from a unique view and systematically annotated, and distinct parking lots are used for the train, validation, and test sets. As a result, the validation and test accuracy obtained on our dataset correspond to model performance on previously unseen parking lots.

3.1 Data collection

We mounted a GoPro Hero 6 action camera to a 12-meter telescoping pole and used a smartphone to view a live feed from the camera and control the shutter. This enabled us to walk around with the setup and capture each image from a unique view. We captured tens of different parking lots and streets, under various weather and lighting conditions (see Figure 2). However, we notably didn’t capture any images that include snow.

Each image is captured by the same camera in the “wide” field-of-view setting, at full (4000 x 3000) resolution. Moreover, each image is captured from a height of roughly 12 meters, corresponding to a common height of lamp posts. We consider this similarity to be crucial for practical applications. If a camera were installed on a lamp post, it would typically be at most 12 meters above the ground, resulting in the same view angles and the same levels of occlusion as in our dataset. In contrast, a dataset captured from high above the ground would not include significant occlusions, and any model trained on such a dataset would generalize poorly to a low installation height (with strong occlusions between individual parking spaces).

3.2 Annotation

We labeled each image in the same way, as illustrated in Figure 1, using Labelbox. To annotate one parking space, we drew a quadrilateral around it, ensuring that the edges of the parking space and the quadrilateral are aligned. In a few instances, however, when part of a parking space was cut off at the edge of an image, the shape of the visible parking space projection became a pentagon. For consistency, even in these instances, we labeled such parking spaces using a quadrilateral. This resulted in 2 of the 4 edges of the annotated quadrilateral not being aligned with the parking space. This is visible on the bottom side of each image in Figure 7.

We only labeled those parking spaces where we were confident about their coordinates and occupancy. Often, we needed to rely on the consistency of the parking lot layout: even when a parking space was fully occluded, we were still able to label its coordinates by knowing the coordinates of the surrounding parking spaces. Other times, when vehicles were heavily occluded, we needed to count them and compare their relative positions: only this allowed us to decide on the occupancy of a given parking space. Both of these challenges can be seen in Figure 1, especially on the left side of the image, where heavy occlusions are present.

We checked each annotation in the train set at least once; in the validation and test sets, we checked each annotation at least twice.

3.3 Dataset size

We captured 293 images containing 11,236 unique views of a parking space, spread across tens of different parking lots and streets. 5,376 of these parking spaces were occupied, corresponding to 48% of all parking spaces. We used 231, 35, and 27 images for the train, validation, and test sets, respectively. While this may sound like a low number of images, both the validation and test sets contain over 1,400 unique views of a parking space. The relatively high costs of capturing and annotating images using our methodology prevented us from building a larger dataset.

4 Models

Figure 3: We propose two models for parking space occupancy classification, both inspired by two-stage object detectors. We replace region proposals by the coordinates of parking spaces. The output of our models is an occupancy score for each parking space.

In addition to building a new dataset, we design and train two simple models on this dataset. We intend these models to function as a simple baseline for our dataset.

Our models are inspired by object detectors. However, regular object detectors cannot be applied directly to our dataset to classify parking space occupancy. Both single-stage and two-stage object detectors rely on non-maximum suppression with an arbitrary intersection-over-union (IoU) threshold to filter proposals/detections. There is no clear IoU threshold for our application: for unoccluded parking spaces, we would prefer a high IoU threshold such that we don’t double-count a single vehicle; on the other hand, to detect heavily occluded vehicles, the IoU threshold must be very low. Single-stage detectors are further constrained by using a set of default bounding boxes. Even if we solved the aforementioned limitations (which has been achieved by Carion et al. [5], for example), we would need to assign each detected vehicle to a specific parking space. This is not a trivial issue: reliably assigning occluded vehicles to specific parking spaces would require knowledge of the camera calibration. Lastly, using a regular object detector would be a missed opportunity to take into account our knowledge of the parking lot layout – before passing the parking lot image through our model, we already know which regions we are interested in, based on the location of each parking space.

To leverage our knowledge of parking space locations, we implement two custom models inspired by two two-stage object detectors (R-CNN [8] and Faster R-CNN FPN [21, 16]). In both of these models, we use the annotated parking space coordinates as the region proposals. We discuss the details below.

4.1 R-CNN

Our first proposed model takes inspiration from R-CNN [8], as well as prior models used for image-based parking space occupancy classification [6, 2, 1]. First, we pool image patches corresponding to each parking space directly from the image. Afterward, these patches are passed separately through a binary classifier (ResNet50 [11]). The output of the classifier is the occupancy score for each parking space. The model is illustrated in part (a) of Figure 3. We discuss our pooling technique in section 4.3.
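To make the classification step concrete, here is a minimal sketch of this architecture in PyTorch; the pooling of per-space patches (section 4.3) is assumed to happen upstream, and the weight-loading call follows the current torchvision API rather than our exact code.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class RCNNOccupancy(nn.Module):
    """Sketch: classify each pooled parking-space patch independently."""

    def __init__(self):
        super().__init__()
        self.backbone = resnet50(weights="IMAGENET1K_V1")  # torchvision >= 0.13 weight API
        # replace the 1000-way ImageNet head with a single occupancy logit
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 1)

    def forward(self, patches):
        # patches: (N, 3, S, S), one pooled patch per parking space
        return self.backbone(patches).squeeze(-1)  # (N,) occupancy logits

model = RCNNOccupancy()
patches = torch.rand(10, 3, 128, 128)       # 10 hypothetical parking spaces at 128 x 128
occupancy = torch.sigmoid(model(patches))   # per-space occupancy probabilities
```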

While models based on the R-CNN architecture are no longer considered efficient for general-purpose object detection, our specific application actually makes this architecture desirable. First, we capture images at (4000 x 3000) resolution. An image of this resolution is too large to be passed directly through a deep convolutional network, but we can easily pass small image patches through a ResNet50. In the original R-CNN model, the number of region proposals is large (2,000) and the pooling resolution of each patch is high (224 x 224). In our application, however, we pass at most around 100 image patches through the classifier, and as we show in section 5.2, we can get away with a patch resolution of just (128 x 128). Together, these differences yield a roughly 60-fold decrease in inference time compared to the original R-CNN model. We further discuss inference time in section 5.3.

A notable disadvantage of this model is that there can never be any information flow between pooled image patches. This might make it very difficult for the model to reason about occlusions.

4.2 Faster R-CNN FPN

Our second proposed model takes inspiration from Faster R-CNN FPN [21, 16]. First, we pass a resized image through a ResNet50 combined with a feature pyramid network [16]. Afterward, we pool features corresponding to each parking space from the feature pyramid and pass them separately through a classification head to obtain the final occupancy scores. The model is illustrated in part (b) of Figure 3. We use the same heuristic as Lin et al. [16] to decide which pyramid layers to pool from. We also use the same pooling resolution (7 x 7).

Unlike our first proposed model, this model cannot utilize the full resolution of our dataset. On the other hand, the architecture does allow for information flow between parking spaces, so we believe there is a better potential for the model to reason about occlusions.
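A minimal sketch of this variant using torchvision building blocks might look as follows; the COCO-pretrained backbone is taken from the stock Faster R-CNN model, MultiScaleRoIAlign stands in for the FPN level-selection heuristic, and the head dimensions are illustrative rather than our exact configuration.

```python
import torch
import torch.nn as nn
import torchvision
from torchvision.ops import MultiScaleRoIAlign

class FPNOccupancy(nn.Module):
    """Sketch: one backbone pass per image, then per-space feature pooling."""

    def __init__(self, pool_size=7):
        super().__init__()
        # ResNet50 + FPN backbone with COCO detection weights
        # (the `pretrained` argument is the older torchvision interface)
        self.backbone = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).backbone
        # MultiScaleRoIAlign applies the FPN level-assignment heuristic of Lin et al. [16]
        self.pool = MultiScaleRoIAlign(featmap_names=["0", "1", "2", "3"],
                                       output_size=pool_size, sampling_ratio=2)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256 * pool_size * pool_size, 1024), nn.ReLU(),
            nn.Linear(1024, 1),  # occupancy logit per parking space
        )

    def forward(self, image, boxes):
        # image: (1, 3, H, W) resized image; boxes: (N, 4) bounding squares, (x1, y1, x2, y2)
        features = self.backbone(image)                             # dict of pyramid levels
        pooled = self.pool(features, [boxes], [image.shape[-2:]])   # (N, 256, 7, 7)
        return self.head(pooled).squeeze(-1)                        # (N,) occupancy logits
```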

4.3 Pooling

Both of our proposed models rely on pooling features corresponding to each parking space. Here, we discuss two different methods to perform the pooling.

Since we annotate each parking space using a quadrilateral, our first proposed pooling method is to warp the pixels inside each quadrilateral and project them to an (S x S) patch directly. Our second proposed method is to consider a minimum bounding square around each parking space annotation and interpolate this square to an (S x S) patch. We illustrate the two methods in Figure 4.

A notable difference between these pooling methods is that one only pools information from within a parking space quadrilateral, while the other also pools information from the surrounding pixels. When the quadrilateral has an uneven aspect ratio (e.g., near an image edge) or it includes heavy occlusions, the context obtained from the surrounding pixels can be useful. This difference is especially relevant for the model based on R-CNN; the model based on Faster R-CNN FPN can utilize its backbone to obtain context from surrounding pixels either way.

We implement both of these pooling methods natively in PyTorch [20] with CUDA and TorchScript support. This allows for inference on GPU, CPU, and mobile.
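As an illustration, the sketch below pools a single parking space from an image tensor; the quadrilateral warp here is a simple bilinear interpolation between the four annotated corners rather than a full projective warp, and image-border handling is omitted.

```python
import torch
import torch.nn.functional as F

def pool_square(image, quad, size=128):
    """Crop the minimum bounding square around a quadrilateral annotation
    and resize it to (size, size)."""
    # image: (3, H, W); quad: (4, 2) corner coordinates in (x, y) pixel units
    x_min, y_min = quad.min(dim=0).values
    x_max, y_max = quad.max(dim=0).values
    cx, cy = (x_min + x_max) / 2, (y_min + y_max) / 2
    half = torch.maximum(x_max - x_min, y_max - y_min) / 2   # half-side of the square
    patch = image[:, int(cy - half):int(cy + half), int(cx - half):int(cx + half)]
    return F.interpolate(patch[None], size=(size, size), mode="bilinear",
                         align_corners=False)[0]

def pool_quad(image, quad, size=128):
    """Warp the quadrilateral itself to a (size, size) patch by bilinearly
    interpolating a sampling grid between its corners."""
    # quad ordered (top-left, top-right, bottom-right, bottom-left)
    t = torch.linspace(0, 1, size)[:, None]                  # (size, 1)
    top = quad[0] * (1 - t) + quad[1] * t                    # (size, 2)
    bottom = quad[3] * (1 - t) + quad[2] * t                 # (size, 2)
    s = torch.linspace(0, 1, size)[:, None, None]            # (size, 1, 1)
    grid = top[None] * (1 - s) + bottom[None] * s            # (size, size, 2)
    h, w = image.shape[-2:]
    grid = grid / grid.new_tensor([w - 1, h - 1]) * 2 - 1    # normalize to [-1, 1]
    return F.grid_sample(image[None], grid[None], align_corners=True)[0]
```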

Figure 4: We illustrate our two proposed methods to pool features from parking spaces. The parking space annotations are drawn in image (a), using 4 colored quadrilaterals. In method (a), we interpolate pixels from these quadrilaterals directly. In method (b), we consider a minimum bounding square around each parking space and interpolate pixels from these squares.

5 Evaluation

We train both of our models with different hyperparameter configurations on our dataset. We do not train our models on any other datasets. To the best of our knowledge, our dataset is currently the only publicly available dataset with precise annotations for each parking space edge. Other datasets typically draw squares [2] or quadrilaterals [6] around each parking space without any consistency; our models are not intended for this label format. Moreover, prior works used a separate dataset for training and evaluation to test model generalization. Since our dataset is already split into train, validation, and test sets by parking lots, we do not need a second dataset to test for generalization.

5.1 Training details

For consistency, both of our models use a pre-trained ResNet50 backbone. The model based on R-CNN uses weights from ImageNet [7] training, while the model based on Faster R-CNN FPN uses weights obtained on the COCO dataset [17]. The reason behind this difference is that the model based on R-CNN is essentially just a regular classifier, apt for ImageNet training, whereas training a model with a feature pyramid network requires the use of an object detection dataset (COCO). We obtained these weights from the torchvision package.

Following a widely adopted practice in object detection, our backbone batch normalization weights and statistics are frozen. We additionally freeze layers 1 and 2 in our backbone, to reduce overfitting to our small dataset.
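A sketch of this freezing scheme for a torchvision ResNet50 (whose residual stages are named layer1 through layer4) is shown below; note that torchvision's detection models achieve the same effect with FrozenBatchNorm2d layers instead of calling eval().

```python
import torch.nn as nn

def freeze_backbone(resnet):
    # freeze all batch-norm affine weights and running statistics
    for m in resnet.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.eval()  # must be re-applied after every model.train() call
            for p in m.parameters():
                p.requires_grad_(False)
    # additionally freeze the first two residual stages
    for stage in (resnet.layer1, resnet.layer2):
        for p in stage.parameters():
            p.requires_grad_(False)
```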

We train both of our models in each configuration using AdamW [19], with a higher learning rate for the first 50 epochs and a lower learning rate for an additional 50 epochs. We are aware that SGD with momentum is the standard optimizer for vision tasks and typically results in better generalization than adaptive methods [22]. We did not perform an exhaustive test of different optimizers and their respective parameters. However, from a few simple training runs, we found AdamW to result in a better validation loss than SGD with momentum. It is plausible that this difference would disappear if we performed a more exhaustive search for optimizer parameters.

We find that training our models with a high learning rate for the first 50 epochs and then dropping the learning rate results in a better validation loss than training with a constant learning rate throughout. This is consistent with published results [15]. Since we use relatively high-resolution images for training, we only use a mini-batch of size 1. This allows us to perform training on a single GPU with limited memory.
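The sketch below illustrates this setup; the concrete learning-rate values, the binary cross-entropy loss, and the model(image, rois) and train_loader interfaces are placeholders rather than our exact implementation.

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # placeholder learning rate
# drop the learning rate by 10x after the first 50 epochs
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50], gamma=0.1)

for epoch in range(100):
    for image, rois, labels in train_loader:  # mini-batch of one image and its parking spaces
        logits = model(image, rois)
        loss = F.binary_cross_entropy_with_logits(logits, labels.float())
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```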

We augment each training example using a random left-right flip, random rotation, and a random adjustment to brightness, contrast, saturation, and hue. We show an example of the augmentation in Figure 5. In order to rotate the image, it is important that our parking space annotations are in the form of precise quadrilaterals. This way, we can find the correct minimum bounding square for each parking space (as described in section 4.3) even after rotating our annotations. In contrast, if we labeled parking spaces using minimum bounding squares in the first place, and we rotated the image together with the annotations, these squares would no longer represent the minimum bounding squares for each parking space. The exact parameters of all the mentioned augmentations can be found in our code.
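The key geometric step is that annotations are rotated as quadrilaterals and only afterwards converted to bounding squares; a sketch of that step (with the rotation angle as an input, not our exact augmentation range) could look like this.

```python
import math
import torch

def rotate_quads(quads, angle_deg, center):
    """Rotate quadrilateral annotations around the image center so they stay
    aligned with a rotated image; bounding squares are then recomputed from
    the rotated corners (section 4.3)."""
    # quads: (N, 4, 2) parking-space corners in pixels; center: (2,) rotation center
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))
    rot = quads.new_tensor([[c, -s], [s, c]])
    return (quads - center) @ rot.T + center
```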

Figure 5: We augment each training example using a random left-right flip, random rotation, and a random adjustment to brightness, contrast, saturation, and hue.

In order to achieve fast training speeds, we cache the whole dataset in main memory, without any augmentations, and perform all augmentations on the GPU, before passing a mini-batch through the model. These optimizations enable us to train for a full epoch with (4000 x 3000) images in around 40 seconds, on an Nvidia RTX 2080 Ti GPU.

Architecture | Pooling | Resolution | Valid. accuracy [%] | Test accuracy [%]
Faster R-CNN FPN | square | 1440 | 98.58 ± 0.07 | 98.51 ± 0.10
Faster R-CNN FPN | square | 1100 | 98.54 ± 0.11 | 98.52 ± 0.11
Faster R-CNN FPN | square | 800 | 98.36 ± 0.07 | 98.31 ± 0.08
Faster R-CNN FPN | quadrilateral | 1440 | 98.34 ± 0.10 | 98.31 ± 0.09
Faster R-CNN FPN | quadrilateral | 1100 | 98.28 ± 0.14 | 98.00 ± 0.08
Faster R-CNN FPN | quadrilateral | 800 | 97.80 ± 0.07 | 97.97 ± 0.14
R-CNN | square | 256 | 98.11 ± 0.07 | 97.62 ± 0.07
R-CNN | square | 128 | 98.38 ± 0.06 | 97.97 ± 0.07
R-CNN | square | 64 | 98.00 ± 0.08 | 97.73 ± 0.13
R-CNN | quadrilateral | 256 | 96.39 ± 0.15 | 96.08 ± 0.12
R-CNN | quadrilateral | 128 | 96.27 ± 0.17 | 96.63 ± 0.15
R-CNN | quadrilateral | 64 | 95.87 ± 0.11 | 96.39 ± 0.17
Table 1: Accuracy for different model configurations. For R-CNN, the resolution column refers to the pooling resolution; for Faster R-CNN FPN, it refers to the input image resolution (size of the smaller image edge). The uncertainty estimate is the estimated standard error from 5 training runs.

5.2 Results

We have trained both of our models in each configuration 5 times, each time with random initialization. We use these 5 training runs to report the uncertainty in each of our experiments, using the mean accuracy and its estimated standard error.
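As a small worked example of this reporting, accuracies from 5 runs (illustrative numbers, not actual results) would be summarized as follows.

```python
import statistics

accuracies = [98.4, 98.6, 98.5, 98.7, 98.6]  # illustrative values from 5 training runs
mean = statistics.mean(accuracies)
sem = statistics.stdev(accuracies) / len(accuracies) ** 0.5  # estimated standard error
print(f"{mean:.2f} ± {sem:.2f}")
```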

For both models, we test both pooling methods from section 4.3. For the model based on R-CNN, we test 3 pooling resolutions: {64, 128, 256}. For the model based on Faster R-CNN FPN, we test 3 input resolutions, defined by the size of the smaller image edge: {800, 1100, 1440}. We do not alter any other hyperparameters, since our models are heavily based on existing, well-researched architectures.

We report the validation and test accuracy for each model configuration in Table 1. The configuration with the highest validation accuracy for each architecture is square pooling at resolution 1440 for Faster R-CNN FPN and square pooling at resolution 128 for R-CNN. We consider it important to select a model configuration based on the validation set accuracy and leave the test accuracy as an unbiased estimate of model generalization.

We observe that, in general, the Faster R-CNN FPN architecture performs better than the R-CNN architecture. We find square pooling to always be preferable to quadrilateral pooling, especially for the R-CNN architecture. For the R-CNN architecture, we find the highest accuracy at resolution (128 x 128). This is also the resolution we expected to perform the best: the smallest parking spaces in our dataset take up around (100 x 100) pixels and we want to minimize any patch upsampling; at the same time, we want to avoid discarding useful information by choosing too small a resolution (e.g., 64 x 64). For the Faster R-CNN FPN architecture, we found the highest resolution (1920 x 1440) to perform the best. This is still a significantly lower resolution than what our camera captured (4000 x 3000), but our GPU memory prevented us from testing a higher resolution. Moreover, using a large input resolution significantly increases inference time – we discuss this in section 5.3.

It is also interesting to observe that the Faster R-CNN FPN architecture generalizes better to the test dataset than the R-CNN architecture. We suspect this is because the test set systematically differs from the validation set (e.g., in the number and strength of occlusions).

We show predictions from the model with the highest validation accuracy for 3 challenging images in Figure 7. Each of the images in Figure 7 represents a failure of the model. In the top image, the model fails to reason about heavy occlusions caused by surrounding cars. It is not clear to us whether this problem would go away if we built a larger dataset or whether it would also require a new model architecture. On the other hand, the failure in the bottom image is simply caused by our dataset not including enough parking spaces occluded by trees; we are confident we could fix this by building a larger dataset or implementing a more efficient training method (e.g., using generated occlusions as a form of data augmentation). The middle image shows a large vehicle taking up multiple parking spaces; this is very rare in our dataset.

Figure 6: Inference time for various model configurations on an Intel Core i9-9900K CPU.
Figure 7: Predictions by our model based on Faster R-CNN FPN for 3 challenging images. Model predictions are shown by transparent quadrilaterals; where these predictions differ from the labels, the labeled occupancy is displayed by a large cross across the whole parking space. The top image shows heavy occlusions by cars; the middle image shows a large vehicle taking up multiple parking spaces; the bottom image shows an occlusion by trees.

5.3 Inference time

In Figure 6, we compare the inference time of our proposed models under different configurations, as we vary the number of parking spaces. We measure the inference time on an Intel Core i9-9900K CPU, to simulate CPU deployment. We observe that models based on the R-CNN architecture have an almost perfectly linear relationship between the number of parking spaces and the total inference time. In contrast, for models based on the Faster R-CNN FPN architecture, passing an image through the backbone is a very expensive operation, while pooling and passing features through a classification head takes negligible compute. As a result, these models have an almost perfectly constant inference time.
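A rough way to reproduce this kind of measurement on CPU is sketched below; example_image and example_rois are placeholders for a resized image and its parking-space boxes, and the model interface matches the sketches in section 4.

```python
import time
import torch

model.eval()
with torch.no_grad():
    for n_spaces in (10, 25, 50, 100):
        rois = example_rois[:n_spaces]                 # first n parking spaces
        start = time.perf_counter()
        model(example_image, rois)                     # single forward pass on CPU
        print(f"{n_spaces} spaces: {time.perf_counter() - start:.3f} s")
```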

5.4 Model comparison

We consider the R-CNN architecture preferable for practical deployment. While it does not perform as well as the Faster R-CNN FPN architecture on the test set, our dataset is intentionally very challenging; for unoccluded parking spaces, we would expect both of our models to achieve an accuracy over 99%. For real-world applications, the R-CNN architecture is the more flexible one. It should be possible to use a very high-resolution camera to capture hundreds of parking spaces at once; the model only cares that each pooled image patch is of high enough resolution. Conversely, it should also be possible to use a low-resolution camera to capture only a few parking spaces, as long as the resolution of each pooled image patch is high enough. In contrast, the Faster R-CNN FPN architecture requires a specific input resolution. It might generalize poorly to new resolutions, and it is constrained by memory and compute at very high resolutions.

6 Conclusion

We present a new dataset for parking space occupancy classification, ACPDS, where each image is taken from a unique view. Our dataset is split into train, validation, and test sets based on parking lots and has a consistent annotation format. As a result, our dataset allows for good generalization and tests generalization directly. We design and train a practical model on this dataset and achieve accuracy over 98%, significantly outperforming existing models. Our model is suitable for CPU deployment and can be used with images of different resolutions. The model has learned to tolerate moderate occlusions but fails under heavy occlusions. We intentionally made our dataset challenging by including heavily occluded parking spaces, so that future models can continue to be benchmarked on this dataset, and improve upon our results. Thanks to our streamlined data collection and annotation process, our dataset can also be extended by other researchers.

References

  • [1] D. Acharya, W. Yan, and K. Khoshelham (2018) Real-time image-based parking occupancy detection using deep learning. In Proceedings of the 5th Annual Conference of Research@Locate, Vol. 2087, pp. 33–40.
  • [2] G. Amato, F. Carrara, F. Falchi, C. Gennaro, C. Meghini, and C. Vairo (2017) Deep learning for decentralized parking lot occupancy detection. Expert Systems with Applications 72, pp. 327–334.
  • [3] A. Bewley, Z. Ge, L. Ott, F. Ramos, and B. Upcroft (2016) Simple online and realtime tracking. In 2016 IEEE International Conference on Image Processing (ICIP), pp. 3464–3468.
  • [4] B. Y. Cai, R. Alvarez, M. Sit, F. Duarte, and C. Ratti (2019) Deep learning-based video system for accurate and real-time parking measurement. IEEE Internet of Things Journal 6 (5), pp. 7693–7701.
  • [5] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko (2020) End-to-end object detection with transformers. In Computer Vision – ECCV 2020, pp. 213–229.
  • [6] P. R. L. de Almeida, L. S. Oliveira, A. S. Britto, E. J. Silva, and A. L. Koerich (2015) PKLot – a robust dataset for parking lot classification. Expert Systems with Applications 42 (11), pp. 4937–4949.
  • [7] J. Deng, W. Dong, R. Socher, L. Li, Kai Li, and Li Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
  • [8] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 580–587.
  • [9] S. W. Hasinoff, D. Sharlet, R. Geiss, A. Adams, J. T. Barron, F. Kainz, J. Chen, and M. Levoy (2016) Burst photography for high dynamic range and low-light imaging on mobile cameras. ACM Transactions on Graphics 35 (6).
  • [10] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2020) Mask R-CNN. IEEE Transactions on Pattern Analysis and Machine Intelligence 42 (2), pp. 386–397.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
  • [12] M. Hsieh, Y. Lin, and W. H. Hsu (2017) Drone-based object counting by spatially regularized regional proposal network. In 2017 IEEE International Conference on Computer Vision (ICCV), pp. 4165–4173.
  • [13] R. Ke, Y. Zhuang, Z. Pu, and Y. Wang (2020) A smart, efficient, and reliable parking surveillance system with edge artificial intelligence on IoT devices. IEEE Transactions on Intelligent Transportation Systems, pp. 1–13.
  • [14] L. Li, L. Zhang, X. Li, X. Liu, Y. Shen, and L. Xiong (2017) Vision-based parking-slot detection: a benchmark and a learning-based approach. In 2017 IEEE International Conference on Multimedia and Expo (ICME), pp. 649–654.
  • [15] Y. Li, C. Wei, and T. Ma (2019) Towards explaining the regularization effect of initial large learning rate in training neural networks. In Advances in Neural Information Processing Systems, Vol. 32.
  • [16] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 936–944.
  • [17] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In Computer Vision – ECCV 2014, pp. 740–755.
  • [18] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. C. Berg (2016) SSD: single shot multibox detector. In Computer Vision – ECCV 2016, pp. 21–37.
  • [19] I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In International Conference on Learning Representations.
  • [20] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. arXiv:1912.01703.
  • [21] S. Ren, K. He, R. Girshick, and J. Sun (2017) Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (6), pp. 1137–1149.
  • [22] P. Zhou, J. Feng, C. Ma, C. Xiong, S. Hoi, and W. E (2020) Towards theoretically understanding why SGD generalizes better than Adam in deep learning. In Advances in Neural Information Processing Systems.