The ALOS Dataset for Advert Localization in Outdoor Scenes

The rapid increase in the number of online videos gives marketing and advertising agents ample opportunities to reach their audience. One of the most widely used strategies is product placement, or embedded marketing, wherein new advertisements are integrated seamlessly into existing advertisements in videos. Such strategies require accurately localizing the position of the advert in the image frame, either manually during the video editing phase or by using machine learning frameworks. However, machine learning techniques and deep neural networks need a massive amount of data for training. In this paper, we propose and release the first large-scale dataset of advertisement billboards captured in outdoor scenes. We also benchmark several state-of-the-art semantic segmentation algorithms on our proposed dataset.



I Introduction

With the rapid growth of internet services and the increasing number of online viewers, there has been a massive increase in the number of online videos. Nowadays, marketing strategies involve product placement, wherein new adverts are seamlessly integrated into the original videos [1, 2]. This gives advertisement and marketing agencies the opportunity to reach a diverse audience. During the post-processing stage of the video, editors replace an existing advert object in the scene with a new advert.

With the advent of high computing power, there has been a massive advancement in the field of computer vision and image processing. Object recognition and object classification have become easier over the past decade, thanks to the release of large-scale datasets and their associated challenges. To train machine-learning algorithms, it is important to have a large-scale dataset [3], along with manually annotated class labels. Currently, there are no available datasets of billboard advertisement images with manually annotated labels of the advertisement position.

The main contributions of this paper are as follows: (a) we propose and release the first large-scale dataset of billboard images, along with high-quality annotations of billboard location, and (b) we provide a detailed and systematic evaluation of popular segmentation algorithms on our proposed dataset. In this paper, we restrict our domain to outdoor scene images, which are mostly captured using car dashboard cameras. We collaborated with Mapillary, a crowd-sourcing service for sharing geotagged photos, and are releasing a large dataset of outdoor scene images together with manually annotated labels. We believe that this dataset will help researchers build state-of-the-art algorithms for efficient advert detection.

Figure 1: Representative images of the ALOS dataset, along with manually annotated binary image maps.

II Dataset

II-A Dataset organization

We refer to our billboard dataset as the ALOS dataset, which stands for Advert Localization in Outdoor Scenes. The ALOS dataset contains images drawn from Mapillary's geotagged images. Figure 1 shows a few representative images, along with their manually annotated ground-truth labels.

Mapillary is a crowd-sourcing platform where users share geotagged images, mostly street-view style images from dashboard cameras. Its community can then edit, label and review the uploaded images. These images come from around the world and are captured with various types of camera. Mapillary offers licensable images, and the platform can filter them to those in which it has detected billboards. However, the definition of a billboard on the Mapillary platform is broader than the target of our work, so the filtered set also contains images of signage or shop fronts, which should be disregarded in our case. Therefore, it is necessary for us to annotate the images ourselves, ensuring that the four corners of each billboard are accurately labeled. Such high-quality annotated labels can subsequently be used in several use cases.

The dataset contains billboards, posters and screens that are potential candidates for advertisement placement. In this paper, we use the terms advert and billboard interchangeably to denote a candidate object for advertising. As a general rule, images selected for the dataset have good contrast, legible text and good image resolution (at least pixels). We restrict the shape of the billboard to any four-sided convex polygon. Moreover, the amount of perspective distortion should be minimal; a frontal viewpoint of the billboard is ideal. The maximum allowable perspective distortion is degrees. The billboard should also not be occluded to a great extent: it should be visible with little (or no) additional human effort, and the amount of occlusion should not be more than 10% of the billboard image. Finally, the billboard should not cover most of the image.

Based on these restrictions, we build a corpus of images that are suitable for the task of billboard localization in image frames. The total number of images in this dataset is .

Footnote 2: The download link of the dataset is available here:
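Two of the selection rules above, the convex four-sided shape and the 10% occlusion limit, can be checked mechanically once the four corners are annotated. The following is a minimal sketch under stated assumptions, not the procedure used to build ALOS: the function names `is_convex_quad` and `occlusion_within_limit` are hypothetical, corners are assumed to be given in boundary order as (x, y) pixel coordinates, and occlusion is assumed to be measured as a pixel count.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]


def is_convex_quad(corners: Sequence[Point]) -> bool:
    """Return True if four corners, taken in boundary order, form a strictly convex quadrilateral."""
    if len(corners) != 4:
        return False
    turns = []
    for i in range(4):
        ax, ay = corners[i]
        bx, by = corners[(i + 1) % 4]
        cx, cy = corners[(i + 2) % 4]
        # z-component of the cross product of consecutive edges AB and BC
        cross = (bx - ax) * (cy - by) - (by - ay) * (cx - bx)
        if cross == 0:
            return False  # three collinear corners: degenerate shape
        turns.append(cross > 0)
    # Convex only if every corner turns in the same direction.
    return all(turns) or not any(turns)


def occlusion_within_limit(occluded_pixels: int, billboard_pixels: int,
                           max_fraction: float = 0.10) -> bool:
    """Check the occlusion rule; the 10% default is the limit stated in the text."""
    if billboard_pixels <= 0:
        raise ValueError("billboard_pixels must be positive")
    return occluded_pixels / billboard_pixels <= max_fraction


if __name__ == "__main__":
    rectangle = [(0, 0), (4, 0), (4, 3), (0, 3)]
    print(is_convex_quad(rectangle), occlusion_within_limit(5, 100))
```

A self-intersecting ("bowtie") corner ordering is rejected as well, because its corners turn in mixed directions.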

II-B Dataset characteristics

During the curation of the ALOS dataset, we ensured that billboards with diverse characteristics are included. The billboards in the ALOS dataset cover varying proportions of the total image area. We refer to the fraction of the image area covered by billboards as the billboard coverage. The smallest billboard in the ALOS dataset covers of the image area, while the largest billboard covers of its image area. We show the distribution of the billboard coverage in Fig. 6(a). Similarly, the number of billboards in a single image varies significantly across the images of the ALOS dataset. Most of the images contain a single billboard (cf. Fig. 6(b)), while the largest number of billboards in a single image is . Finally, we also include images whose billboards are partially occluded (cf. Fig. 6(c)) or partially off-screen (cf. Fig. 6(d)). Figure 6 summarizes these distributions for the proposed dataset.

(a) Billboard coverage
(b) Number of billboards
(c) State of occlusion
(d) State of off-screen
Figure 6: We visualize the distributions w.r.t. (a) billboard coverage area in a single image, (b) number of billboards in a single image, (c) state of occlusion, and (d) state of off-screen.
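The two per-image statistics described above, billboard coverage and number of billboards, can be derived directly from a binary ground-truth map. The sketch below is illustrative only and assumes that the map is a list of rows with 1 for billboard pixels and 0 for background, and that each 4-connected foreground region corresponds to one billboard (touching billboards would be merged). The function names `billboard_coverage` and `count_billboards` are hypothetical.

```python
from collections import deque
from typing import List

Mask = List[List[int]]


def billboard_coverage(mask: Mask) -> float:
    """Fraction of image pixels labeled as billboard."""
    total = sum(len(row) for row in mask)
    foreground = sum(1 for row in mask for value in row if value)
    return foreground / total if total else 0.0


def count_billboards(mask: Mask) -> int:
    """Count 4-connected foreground regions with a breadth-first flood fill."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    regions = 0
    for r in range(height):
        for c in range(width):
            if not mask[r][c] or seen[r][c]:
                continue
            regions += 1  # found an unvisited billboard pixel: new region
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < height and 0 <= nx < width
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return regions


if __name__ == "__main__":
    example = [
        [0, 1, 1, 0, 0],
        [0, 1, 1, 0, 1],
        [0, 0, 0, 0, 1],
        [0, 0, 0, 0, 0],
    ]
    print(billboard_coverage(example), count_billboards(example))
```

Collecting these two values over all images gives the kind of histograms summarized in Fig. 6(a) and Fig. 6(b).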

III Benchmarking Experiments

We implement three deep-learning models for semantic segmentation on our dataset: Fully Convolutional Network (FCN) [4], Pyramid Scene Parsing Network (PSPNet) [5], and U-Net [6]. In the domain of visual computing, convolutional neural networks have shown promising results in dense semantic segmentation of images.

Long et al. have shown that a fully convolutional network trained end-to-end, pixel-by-pixel, can produce detailed segmentation maps of input images [4]. Recently, Zhao et al. proposed the pyramid scene parsing network [5], which produced the best results in the ImageNet scene parsing challenge 2016. It aggregates region-based context at several scales using its pyramid pooling module. We also benchmark the U-Net architecture [6] on our proposed dataset.
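To make the idea of region-based context aggregation more concrete, the following minimal sketch average-pools a single-channel feature map into grids of several sizes. It is a simplification, not the PSPNet implementation: the bin sizes 1, 2, 3 and 6 are taken from the PSPNet paper [5] rather than from our experiments, the function names are hypothetical, and the subsequent convolution, upsampling and concatenation steps of the full module are omitted.

```python
from typing import Dict, List, Sequence

FeatureMap = List[List[float]]


def adaptive_avg_pool(feature: FeatureMap, bins: int) -> FeatureMap:
    """Average-pool a 2-D map into a bins x bins grid of region means."""
    height, width = len(feature), len(feature[0])
    pooled = []
    for i in range(bins):
        # Window rows run from floor(i*H/bins) to ceil((i+1)*H/bins).
        r0 = (i * height) // bins
        r1 = -((-(i + 1) * height) // bins)
        row = []
        for j in range(bins):
            c0 = (j * width) // bins
            c1 = -((-(j + 1) * width) // bins)
            cells = [feature[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


def pyramid_pool(feature: FeatureMap,
                 bin_sizes: Sequence[int] = (1, 2, 3, 6)) -> Dict[int, FeatureMap]:
    """Pool the same map at several grid resolutions (coarse to fine)."""
    return {bins: adaptive_avg_pool(feature, bins) for bins in bin_sizes}


if __name__ == "__main__":
    # 6 x 6 toy feature map with values 0..35
    toy = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
    levels = pyramid_pool(toy)
    print(levels[1], levels[2])
```

The 1 x 1 level summarizes the whole image, while finer levels retain progressively more local context.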

III-A Subjective Evaluation

We train the FCN network on resized images of size of the ALOS dataset. We train it for iterations, with a batch size of images. A small batch size is necessary to fit the entire model in memory during training. We save the model at the end of training for evaluation purposes. We train the PSPNet model for epochs, with a batch size of images. As with the FCN network, we use a small batch size so that the entire network fits in memory during training. Finally, we train the U-Net model on the ALOS dataset. We use the Adam optimiser and train the model for steps. Figure 12 shows a few sample visual results of the benchmarking algorithms on the ALOS dataset. The results from the PSPNet model are poor, as the model fails to converge properly owing to its large network size. The FCN network, being the simplest among the benchmarked methods, generalizes well and works well in most cases.

(a) Input image
(b) Ground truth
(c) FCN
(d) PSPNet
(e) U-Net
Figure 12: Subjective evaluation of various semantic segmentation algorithms on the ALOS dataset. We show the (a) input image, (b) corresponding binary ground-truth image, and results obtained from (c) FCN, (d) PSPNet, and (e) U-Net.

III-B Objective Evaluation

In addition to the subjective evaluation, we also provide the average values of several objective metrics that are commonly used in the area of semantic segmentation [4]. Suppose $n_{ij}$ is the number of pixels of class $i$ that are predicted to belong to class $j$. We define $n_{cl}$ as the total number of classes in the task of semantic segmentation. The total number of pixels in class $i$ is defined as $t_i = \sum_j n_{ij}$. We compute several metrics for the objective evaluation of the different approaches. The pixel accuracy is defined by $\sum_i n_{ii} / \sum_i t_i$. The mean accuracy is defined as $\frac{1}{n_{cl}} \sum_i n_{ii} / t_i$. The mean intersection over union is defined by $\frac{1}{n_{cl}} \sum_i n_{ii} / \left(t_i + \sum_j n_{ji} - n_{ii}\right)$. Finally, the frequency weighted intersection over union is defined as $\left(\sum_k t_k\right)^{-1} \sum_i t_i n_{ii} / \left(t_i + \sum_j n_{ji} - n_{ii}\right)$.

We compute the average values of pixel accuracy, mean accuracy, mean intersection over union, and frequency weighted intersection over union, across all the images of the proposed dataset. Table I summarizes the results of the various algorithms.
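The four metrics can be computed from a per-image confusion matrix. The sketch below is one possible implementation under stated assumptions, not the evaluation code used for Table I: `confusion[i][j]` is assumed to hold the number of pixels of ground-truth class i predicted as class j, every class is assumed to occur at least once in the ground truth, and the function name `segmentation_metrics` is hypothetical.

```python
from typing import Dict, List


def segmentation_metrics(confusion: List[List[int]]) -> Dict[str, float]:
    """Pixel accuracy, mean accuracy, mean IoU and frequency weighted IoU."""
    n_cl = len(confusion)
    if any(len(row) != n_cl for row in confusion):
        raise ValueError("confusion matrix must be square")
    # Pixels per ground-truth class (row sums).
    truth = [sum(row) for row in confusion]
    if any(count == 0 for count in truth):
        raise ValueError("every class must occur in the ground truth")
    # Pixels per predicted class (column sums).
    predicted = [sum(confusion[i][j] for i in range(n_cl)) for j in range(n_cl)]
    correct = [confusion[i][i] for i in range(n_cl)]
    total = sum(truth)
    # Per-class intersection over union: correct / (truth + predicted - correct).
    iou = [correct[i] / (truth[i] + predicted[i] - correct[i]) for i in range(n_cl)]
    return {
        "pixel_accuracy": sum(correct) / total,
        "mean_accuracy": sum(correct[i] / truth[i] for i in range(n_cl)) / n_cl,
        "mean_iou": sum(iou) / n_cl,
        "frequency_weighted_iou": sum(truth[i] * iou[i] for i in range(n_cl)) / total,
    }


if __name__ == "__main__":
    # Two classes: index 0 = background, index 1 = billboard (toy numbers).
    toy_confusion = [[80, 10], [5, 5]]
    for name, value in segmentation_metrics(toy_confusion).items():
        print(f"{name}: {value:.3f}")
```

Averaging the returned values over all images would give one row of a results table for a single model.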

Method   Pixel Accuracy   Mean Accuracy   Mean IOU   Frequency Weighted IOU
FCN      0.962*           0.699           0.638*     0.937*
PSPNet   0.554            0.558           0.304      0.521
U-Net    0.721            0.814*          0.432      0.689
Table I: Benchmarking of the ALOS dataset with various deep-learning based segmentation algorithms. The best performance according to each metric is marked with an asterisk.

III-C Discussion on Benchmarking

These benchmarking results provide interesting insights. We observe that the light-weight FCN network performs best among the benchmarked methods on three of the four metrics. This is mainly because the larger networks fail to converge on our dataset. The results of PSPNet could be further improved by using batch normalization across multiple graphics processing units (GPUs). The U-Net approach performs best in terms of mean accuracy, but performs poorly according to the other important metrics of semantic segmentation. Therefore, we conclude that the overall scores could be further improved by proposing a bespoke, shallower neural network that is specifically tailored for advert localization in outdoor scenes.

IV Conclusions and Future Work

In this paper, we propose and release ALOS, a large-scale dataset of billboard/advert images, along with high-quality manually annotated ground-truth images. This dataset will be particularly useful for advertising and marketing agencies, for the purpose of product placement and embedded marketing. We also benchmark the performance of several popular deep-learning based segmentation algorithms on our proposed dataset. In the future, we plan to relax the restriction to outdoor scenes and propose a larger dataset that encompasses other domains, including indoor scenes and entertainment videos.


  • [1] A. Nautiyal, K. McCabe, M. Hossari, S. Dev, M. Nicholson, C. Conran, D. McKibben, J. Tang, X. Wei, and F. Pitié, “An advert creation system for next-gen publicity,” in Proc. European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML-PKDD), 2018.
  • [2] M. Hossari, S. Dev, M. Nicholson, K. McCabe, A. Nautiyal, C. Conran, J. Tang, W. Xu, and F. Pitié, “ADNet: A deep network for detecting adverts,” in Proc. 26th AIAI Irish Conference on Artificial Intelligence and Cognitive Science, 2018.
  • [3] S. Dev, M. Hossari, M. Nicholson, K. McCabe, A. Nautiyal, C. Conran, J. Tang, W. Xu, and F. Pitié, “The CASE dataset of candidate spaces for advert implantation,” in Proc. International Conference on Machine Vision Applications (MVA), 2019.
  • [4] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3431–3440.
  • [5] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, “Pyramid scene parsing network,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 2881–2890.
  • [6] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.