Region extraction based approach for cigarette usage classification using deep learning

03/23/2021 · by Anshul Pundhir, et al.

This paper proposes a novel approach to classify subjects' smoking behavior by extracting relevant regions from a given image using deep learning. After classification, we propose a conditional detection module based on Yolo-v3, which improves the model's performance and reduces its complexity. To the best of our knowledge, we are the first to work on this dataset, which contains a total of 2,400 images of smokers and non-smokers in equal numbers and in various environmental settings. We have evaluated the proposed approach's performance using quantitative and qualitative measures, which confirm its effectiveness in challenging situations. The proposed approach has achieved a classification accuracy of 96.74%.


1 Introduction

Figure 1: Schematic architecture of the proposed approach.

Today's fast-developing world has seen various technological innovations and financial advancements that serve society positively in many areas. Despite that, we still face many problems such as pollution, an increasing number of road accidents, and health issues such as lung cancer, respiratory diseases, and impaired eyesight. These hazards arise from various factors, of which the daily use of cigarettes is a prominent one. Medical advice is to avoid cigarette use, since it has adverse effects on our health, environment, and life span. Governments have also established rules against smoking in public areas, but some people break these laws when no authority is monitoring them. Unfortunately, one person's cigarette use adversely affects others' lives through pollution, health issues, and road accidents. There is thus an essential need for an automated system that can detect a person's smoking behavior. Such systems have a wide range of applications, such as automated smoke monitoring systems, cigarette censoring in videos, and reducing the number of road accidents caused by drivers' smoking behavior [e08].

Various research attempts have been made at cigarette usage detection and classification [r01, r13, e01, e02, e03, e07, x1, x2, x4]. The existing works in this context are either image-based or sensor-based. The image-based methods process image-related information such as the presence of smoke and the color of the smoking object [e08, r01, r13, e02, e07, r02]. On the other hand, the sensor-based methods deploy sensors to detect smoking behavior and process the data collected by them [e01, e03, e04, e05, e06]. The literature survey suggests that this problem needs further exploration, since most researchers have focused on individual-level surroundings. Moreover, the problem is challenging due to the tiny shape of a cigarette. Better cigarette usage detection and classification methods capable of addressing these challenges will contribute towards a safer and greener world. This motivated us to develop an approach that can effectively overcome the aforementioned challenges and accurately detect smoking behavior.

The proposed approach consists of a region extraction module, a classification module, and a conditionally active Yolo-v3 [r10] based real-time detection module. It provides a simple yet effective tool to judge subjects' smoking behavior by analyzing their visual information. The region extraction module refines the visual information by extracting face and hand proposals. The classification module processes these proposals for the final classification. Based on the classification result, the detection module performs cigarette detection.

The proposed approach has been evaluated on a recent dataset named 'Dataset containing smoking and not-smoking images (smoker vs. non-smoker)' [dataset], containing 2,400 images with a nearly equal number of smoker and non-smoker images, and achieved an accuracy of 96.74%. The approach's effectiveness is determined quantitatively by measuring accuracy, precision, and recall, and qualitatively by visualizing its performance in different challenging situations. The results show that the proposed approach can handle various challenges such as variability in hand and face postures, different illumination conditions, and the small visual difference between a smoker and a non-smoker in the larger scene. The code for the work presented in this paper is available at https://github.com/MIntelligence-Group/CigDetect.

The contributions of the paper are summarized as follows.

  • A novel region extraction based deep-learning approach has been proposed for cigarette usage classification. It is capable of handling challenging situations such as low brightness, low visibility of cigarettes, and varied hand gestures.

  • The incorporation of conditionally active detection saves computational cost and improves detection performance by reducing false positives.

  • The classification accuracy results obtained for various baseline models formulated during the ablation study verify the proposed approach’s effectiveness. The proposed approach has obtained better results than the state-of-the-art methods for similar problems of identifying small objects in images.

The rest of the paper is organized as follows. The proposed methodology is discussed in Section 2, and the experiments and results in Section 3. Finally, Section 4 concludes our findings with directions for further research.

2 Proposed Methodology

This section elaborates on the proposed methodology. The proposed method’s architecture has been shown in Fig. 1, and various components are discussed in the following sections.

2.1 Region Extraction Module

Due to the cigarette's small size, smoker and non-smoker subjects look very similar in the broader view. This module performs data preprocessing that addresses this challenge. As shown in Figs. 2 and 3, the input images are preprocessed using the Faced algorithm [r11] (for detecting face regions) and Yolo-v3 [r10] (trained by us for detecting hand regions) to extract the probable cigarette regions, i.e., face and hand. This improves the model's performance, since it processes only the relevant regions rather than the whole image. We found the Faced algorithm more convincing on this dataset than the 'Haar Cascade Classifier' [soo2014object] for extracting variable face poses. Further, we have fine-tuned the parameters of the Faced algorithm to improve its predictions.

For a given image $I$, we extract face proposals $F = \{f_1, f_2, \ldots, f_m\}$ and hand proposals $H = \{h_1, h_2, \ldots, h_n\}$, where $m$ and $n$ are the numbers of proposals extracted by the Faced algorithm and the trained Yolo hand detector, respectively. The Faced algorithm returns the bounding box for a face proposal as per Eq. 1, where $(x_c, y_c)$ corresponds to the coordinates of the box center, and $w$ and $h$ denote the width and height of the bounding box.

$B_{face} = (x_c, y_c, w, h)$   (1)

Likewise, the Yolo hand detector returns the bounding box for a hand proposal as per Eq. 2, where $(x_1, y_1)$ denotes the coordinates of the top-left corner and $(x_2, y_2)$ denotes the coordinates of the bottom-right corner.

$B_{hand} = (x_1, y_1, x_2, y_2)$   (2)

To improve the region's coverage for a cigarette, we adjust the Faced bounding boxes by shifting them vertically down and widening them, so that cigarette orientations around the lips are covered effectively in different cases. Similarly, we adjust our trained Yolo-v3 hand boxes to ensure that the proposed regions adequately cover the cigarette region. The adjustment of the Faced bounding box is performed as per Eq. 3, where $\delta_x$ and $\delta_y$ denote the horizontal and vertical shift, respectively.

$B'_{face} = (x_c, y_c + \delta_y, w + \delta_x, h)$   (3)

The adjustment of the Yolo hand detector's bounding box is performed as per Eq. 4, where $\delta_x$ and $\delta_y$ again denote the horizontal and vertical shift.

$B'_{hand} = (x_1 - \delta_x, y_1 - \delta_y, x_2 + \delta_x, y_2 + \delta_y)$   (4)
(a) Faced bounding box
(b) Adjusted bounding box
(c) Extracted region of interest
Figure 2: Adjusted bounding box in Faced algorithm.
(a) Original image
(b) Hand detected by trained Yolo
(c) Extracted region of interest
Figure 3: Hand detection and region cropped by trained Yolo.
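To make the box adjustments concrete, the following is a minimal Python sketch of Eqs. 3 and 4; the shift amounts, the clamping to image bounds, and the array-based crop are illustrative assumptions rather than the paper's exact parameters.

def adjust_faced_box(xc, yc, w, h, dx=0.15, dy=0.20):
    # Sketch of Eq. 3: shift the face box down and widen it so the
    # region around the lips is covered. The fractional shifts dx, dy
    # are illustrative assumptions, not the paper's tuned values.
    return xc, yc + dy * h, w * (1 + dx), h

def adjust_hand_box(x1, y1, x2, y2, dx=10, dy=10, img_w=None, img_h=None):
    # Sketch of Eq. 4: expand the corner-format hand box by a margin
    # so a held cigarette falls inside the cropped region.
    x1, y1, x2, y2 = x1 - dx, y1 - dy, x2 + dx, y2 + dy
    if img_w is not None:  # clamp to image bounds if known
        x1, x2 = max(0, x1), min(img_w, x2)
    if img_h is not None:
        y1, y2 = max(0, y1), min(img_h, y2)
    return x1, y1, x2, y2

def crop(image, x1, y1, x2, y2):
    # Extract a region of interest from an H x W x C image array.
    return image[int(y1):int(y2), int(x1):int(x2)]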

2.2 Classification Module

From the baseline experiments in Section 3.1.3, it was observed that instead of training a simple CNN from scratch, transfer learning can be used, which gives two-fold benefits for this problem: low-cost models and a remedy for the small dataset size. Here, we use an ensemble of Resnet-18 and Resnet-34 models with their pre-trained weights. We modify the Resnet architecture to classify two classes in the fully-connected layer, using a softmax classifier and the cross-entropy loss function. For any given image $I$, the model classifies the face and hand proposals as $c(f_1), \ldots, c(f_m)$ and $c(h_1), \ldots, c(h_n)$ respectively, where each prediction is either smoker (1) or non-smoker (0). The final category for the image is determined as per Eq. 5, where $\max$ takes the maximum over all predicted classification categories, i.e., the image is labeled as smoker if any proposal is classified as smoker.

$C(I) = \max\big(c(f_1), \ldots, c(f_m), c(h_1), \ldots, c(h_n)\big)$   (5)
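A minimal PyTorch sketch of this module is given below; the averaging of the two networks' softmax outputs is one plausible ensembling choice, since the paper does not spell out how the two Resnets are fused, and the helper names are ours.

import torch
import torch.nn as nn
from torchvision import models

class SmokerClassifier(nn.Module):
    # Two-class heads on pre-trained Resnet-18 and Resnet-34. Averaging
    # the two softmax outputs is an assumption; the paper does not
    # specify how the two networks are combined.
    def __init__(self):
        super().__init__()
        self.r18 = models.resnet18(pretrained=True)
        self.r34 = models.resnet34(pretrained=True)
        self.r18.fc = nn.Linear(self.r18.fc.in_features, 2)
        self.r34.fc = nn.Linear(self.r34.fc.in_features, 2)

    def forward(self, x):
        p18 = torch.softmax(self.r18(x), dim=1)
        p34 = torch.softmax(self.r34(x), dim=1)
        return (p18 + p34) / 2  # averaged class probabilities

def classify_image(model, proposals):
    # Eq. 5: the image is labeled smoker (1) if any face or hand
    # proposal is classified as smoker, else non-smoker (0).
    preds = [model(p.unsqueeze(0)).argmax(dim=1).item() for p in proposals]
    return max(preds, default=0)

For training with the cross-entropy loss, one would apply the loss to each network's raw logits rather than to the averaged probabilities.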

2.3 Detection Module

This module implements real-time cigarette detection using Yolo-v3 trained on cigarette images annotated with LabelImg [r12]. The module is conditionally active and performs detection only if the given image is classified with smoking behavior; this reduces false positives and improves performance significantly. The detection module is triggered only when the given image $I$ is classified as a smoker. Let $S$ denote the indices of the proposals classified as smoker. The cigarette detector then performs cigarette detection on the input image $I$. It is robust in various challenging situations (as shown in Fig. 5). If detection fails on the first attempt, it performs detection on the proposals indexed by $S$ to improve the results (as shown in Fig. 6). It overlays the proposals on the raw image to retain the smoker's identity, which helps in cases where the cigarette is in the hand. This idea is useful for cigarette monitoring systems.
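The conditional flow can be summarized in a short sketch that reuses classify_image from the previous listing; all callables here are assumed interfaces standing in for the paper's modules, not its actual API.

def analyze(image, extract_proposals, classifier, cigarette_detector):
    # Sketch of the conditional pipeline: the cigarette detector runs
    # only when the classifier flags a smoker, which saves computation
    # and avoids false positives on non-smoker images.
    proposals = extract_proposals(image)            # face + hand regions
    label = classify_image(classifier, proposals)   # Eq. 5
    if label == 0:                                  # non-smoker: detector never runs
        return "non-smoker", []
    boxes = cigarette_detector(image)               # first attempt: whole image
    if not boxes:                                   # fallback: detect on proposals
        boxes = [b for p in proposals for b in cigarette_detector(p)]
    return "smoker", boxes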

3 Experiments and Results

This section discusses the implementation details, evaluation metrics, and the results obtained during the experiments.

3.1 Implementation

3.1.1 Experimental Setup

We have trained our proposed model on an Nvidia RTX 2060 GPU with 1920 CUDA cores. The model has been tested on a machine with an Intel(R) Core(TM) i5-9300H CPU at 2.40 GHz, 16 GB RAM, and 64-bit Windows 10.

3.1.2 Dataset and Training

The proposed model has been trained and evaluated on the Mendeley smoker dataset [dataset], which, to the best of our knowledge, has never been used before. It contains 2,400 images of smokers and non-smokers in various poses and environmental settings. We have evaluated the proposed approach with an 80%-20% train-test split of the dataset.
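A minimal sketch of this split is given below, assuming the images are arranged in class folders; the folder name, input size, and transform are our assumptions, not the dataset's documented layout.

from torch.utils.data import random_split
from torchvision import datasets, transforms

tfm = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
data = datasets.ImageFolder("smoker_dataset/", transform=tfm)
n_train = int(0.8 * len(data))  # 1,920 of the 2,400 images
train_set, test_set = random_split(data, [n_train, len(data) - n_train])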

3.1.3 Ablation Study

An ablation study to decide the proposed approach's architecture was performed and is summarized in Table 1. The baselines are designed considering the challenges posed by the small dataset and the small cigarette size. Baseline1 and Baseline2 use simple convolutional architectures. The importance of region-of-interest (ROI) processing is shown by the accuracy of Baseline1, which uses raw images, compared to Baseline2, which uses ROIs. Baseline3, which uses raw images, is designed to evaluate the benefits of transfer learning on this problem. Its architecture contains an ensemble of Resnet-18 and Resnet-34 models with their pre-trained weights [ensemble] and, when fed with raw images, gives better accuracy than Baseline1 and Baseline2. Based on the ablation study's results, we conclude that both ROI processing and transfer learning benefit this problem. With these observations, we arrived at the proposed approach's architecture shown in Fig. 1.

Model Processing Strategy Accuracy
Baseline1 Raw input image 59%
Baseline2 Extracting ROIs 67%
Baseline3 Raw input image 90%
Proposed Approach Extracting ROIs 96.74%
Table 1: Summary of the Ablation Study.

3.2 Results & Evaluation

The proposed approach has obtained an accuracy of 96.74%. Its performance has been evaluated using the following quantitative and qualitative measures.

3.2.1 Quantitative Performance Measures

For our model's quantitative evaluation, we report its accuracy, precision, recall, and confusion-matrix counts in Table 2.

Metric Value
Precision 95%
Recall 98%
Accuracy 96.74%
True Positives 197
True Negatives 190
False Positives 10
False Negatives 3
Table 2: Quantitative performance measures.
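The reported precision, recall, and accuracy follow directly from the confusion-matrix counts in Table 2, as the short check below shows; the computed accuracy of 387/400 = 96.75% matches the reported 96.74% up to rounding.

# Reproducing Table 2's metrics from its confusion-matrix counts.
tp, tn, fp, fn = 197, 190, 10, 3

precision = tp / (tp + fp)                    # 197/207 ≈ 0.952
recall = tp / (tp + fn)                       # 197/200 ≈ 0.985
accuracy = (tp + tn) / (tp + tn + fp + fn)    # 387/400 = 0.9675
print(f"precision={precision:.2%} recall={recall:.2%} accuracy={accuracy:.2%}")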

3.2.2 Qualitative Performance Measures

The classification results are shown in Fig. 4, while the detection results in different challenging scenarios are shown in Figs. 5 and 6.

Figure 4: Classification on sample images.
(a) Dark background
(b) Side face pose
(c) Low visibility of cigarette
(d) Multi-subject environment
Figure 5: Cigarette detection in different settings.
(a) No cigarette detected
(b) Cigarette detected
(c) False detection
(d) False detection removed
Figure 6: Improvement in detection using proposed approach.

3.3 Comparison with the state-of-the-art approaches

As per the literature, no state-of-the-art (SOTA) approaches for cigarette usage analysis are available for the Mendeley smoker dataset [dataset]. Moreover, few datasets are available for this problem. We have compared the proposed approach to SOTA approaches for other problems with similar objectives and use-cases. The comparison shown in Table 3 affirms the proposed approach’s applicability for this problem.

Method Author Accuracy
CNN Based Ou et al. [sota3] 79.4%
Deep Learning Dhanwal et al. [e08] 89.9%
Wrist IMU Añazco et al. [e05] 91.38%
Faster-RCNN Lu et al. [e07] 92.1%
Proposed Approach  96.74%
Table 3: Comparison with state-of-the-art approaches.

4 Conclusion and Future Work

This paper has proposed a novel approach for smoking behavior classification and detection using deep learning. The proposed approach has obtained significant results in challenging situations on the Mendeley smoker dataset, which had not been used before this work. The approach can be extended to anomalous human activity recognition and the detection of other small objects. In the future, we aim to make the proposed approach more effective by including information from additional modalities such as videos and night-vision camera feeds.

Acknowledgements

This work was supported by the University Grants Commission (UGC) INDIA with grant number: 190510040512.

References