CamLoc: Pedestrian Location Detection from Pose Estimation on Resource-constrained Smart-cameras

Recent advancements in energy-efficient hardware technology are driving the exponential growth we are experiencing in the Internet of Things (IoT) space, with more pervasive computations being performed near data generation sources. A range of intelligent devices and applications performing local detection is emerging (activity recognition, fitness monitoring, etc.), bringing with them obvious advantages such as reducing detection latency for improved interaction with devices and safeguarding user data by keeping it on the device. Video processing holds utility for many emerging applications and for data labelling in the IoT space. However, performing this video processing with deep neural networks at the edge of the Internet is not trivial. In this paper we show that pedestrian location estimation using deep neural networks is achievable on fixed cameras with limited compute resources. Our approach uses pose estimation from key body points detection to extend the pedestrian skeleton when the whole body is not in the image (occluded by obstacles or partially outside of the frame), which achieves better location estimation performance (inference time and memory footprint) compared to fitting a bounding box over the pedestrian and scaling it. We collect a sizable dataset comprising over 2100 frames in videos from one and two surveillance cameras pointing at the scene from different angles, and annotate each frame with the exact position of the person in the image, in 42 different scenarios of activity and occlusion. We compare our pose estimation based location detection with a popular detection algorithm, YOLOv2, used for overlapping bounding-box generation. Our solution achieves faster inference time (15x speedup) at half the memory footprint, within the resource capabilities of embedded devices, which demonstrates that CamLoc is an efficient solution for location estimation in videos on smart-cameras.




I Introduction

The unprecedented expansion of Internet of Things (IoT) devices and their advancing capabilities offer a perspective on the trend in computing for years to come, with more of the computations previously reserved for the server side now migrating to the edge of the Internet on resource-constrained devices. While this is now possible due to technology advancement, other factors also contribute to this accelerating trend, such as shifting perception at social and political levels, with users becoming more aware of and concerned about their data privacy [1]. On the policy side, new legislation introduces heavy sanctions on companies for mishandling users' data [2], which pressures them to move more data processing closer to the user.

Fig. 1: Position estimation in the 2D-space in front of a camera indicated by the lower red dot using a bounding box approach (left) and our pose estimation approach (right). Our algorithm (CamLoc) achieves better detection accuracy with substantially lower compute resources.

Intelligent devices rely on sensors to understand the user's environment and context, to perform assistive actions and for user-device interactions. One of the richest information sensing modalities is vision. A wide range of applications across different fields rely on vision for environment perception (surveillance, robotics, and building automation and control). Knowing the exact location of a person in the environment in front of a camera is useful to many of these applications that provide location based services. However, current methods used in computer vision require heavy computations, which are typically performed on servers with abundant resources. Migrating this detection task locally to a surveillance camera or other adjacent devices with limited embedded compute resources is not easy. Typically, algorithms designed to perform this detection on resource-constrained devices accept a severe downgrade of detection accuracy or inference time.

In this paper we develop a pose estimation algorithm building on key-points detection [3], specifically designed to operate efficiently on embedded devices. To improve the location detection in different scenarios of occlusion, we extend the body frame determined from visible key body points to approximate the location of the remaining key-points not visible in the image, using the body posture estimated from the visible points. We compare this with the performance of a popular detector, YOLOv2 [4], pruned to recognize only people, which determines a bounding box overlapping a person in the image (as shown in Figure 1).

We collect a large dataset comprising images from cameras (over 2100 frames) annotated with the exact location of a pedestrian in the space in front of the camera. This dataset was designed to be particularly challenging for location detection, involving occlusions created by placing objects between the pedestrian and the camera to mask portions of the human body, in 42 different scenarios. As an extension to the single-camera case, we consider situations where the algorithm can be improved by having access to a second camera facing the same scene from a different angle. We show that multiple cameras help to improve detection accuracy, assuming communication between cameras over the local network. This dataset will be made publicly available for other researchers to develop new algorithms for this challenging problem.

We show that CamLoc, based on key body points detection, performs better than a simple baseline relying on a person detector built with the popular YOLOv2 system, in terms of both inference time and detection quality.

This paper makes the following contributions:

  • We design a visual system for pedestrian location estimation in the 2D-space in front of a camera, building on key body points and based on posture estimation from the available body points extending pedestrian skeleton to compensate for occlusions and body outside of frame. This can work both in single-camera and multi-camera conditions.

  • This approach is compared with a baseline based on the bounding box generated by an object detection algorithm (YOLOv2), limited to just person detection. In order to cope with occlusions, this approach is adapted to extend the bounding box such that it maintains the person's ratio between height and width.

  • We collect a substantial dataset annotated with pedestrian position in front of a camera to demonstrate the feasibility of our proposed technique. This includes both single camera and multi camera scenarios. We make this available for other researchers to propose new location estimation algorithms for smart-cameras, here:

  • Feasibility of running the two location estimation approaches on embedded devices is evaluated on Nvidia Jetson TX2 and Odroid XU4, which are two popular representatives of embedded devices for the IoT space. The deep neural networks that make up these algorithms are evaluated on their resource requirements (inference time and memory footprint).

The structure of this paper is as follows. Section II gives a brief motivation for estimating pedestrian location in 2D space. Section III presents the two detection approaches, followed by a presentation of our collected dataset in Section IV. Section V presents the experiments that evaluate our proposed solution. Section VI presents the related work. We finish with future work and conclusions (Section VII).

II Motivation

There is a wide range of scenarios that require accurate localisation, some of which are highlighted by Mautz in [5]: location based services in indoor environments, private homes e.g. Ambient Assistant Living (systems providing assistance to elderly people in their home for daily living activities), context detection and situational awareness, in museums (visitors tracking for surveillance and study of visitor behavior, location based user guiding and triggered context-aware information services), logistics and optimization (for the purpose of process optimization in complex systems, it is essential to have information about the location of assets and staff members), applications using augmented reality, and many other applications.

Most of the recent solutions in the area of indoor localisation that do not require specialized infrastructure use the sensors and WiFi cards in smartphones to determine user location. The solution proposed in this paper represents an alternative to these solutions from the infrastructure perspective, and it can be used in other scenarios as well, as it does not require the person being localized to carry or wear any device. As mentioned before, pedestrian localization is beneficial for tracking in surveillance and for the study of behaviour in museums, in shopping malls, at conferences, etc. These are attainable thanks to cameras already deployed for surveillance.

Two main aspects are to be considered for applications that make use of location data: low-latency interactions and data privacy. We address both of these issues by running the detection on the end devices, the cameras themselves with limited computation resources, rather than in the cloud. This is possible due to the efficiency and low resource utilization of the proposed deep neural network-based system (CamLoc). Surveillance is usually intended for forensic rather than preventive use, and so far we are not aware of other systems that perform detection on the cameras to offer real-time detection.

On resource-constrained end devices, such as IoT devices, it is desirable to optimize resource consumption. This motivates our choices in designing a system that can operate on single frames at low frame rates. The techniques proposed perform detection on each frame independently of other frames in order to cope with an adaptable frame rate, dropping frames to save energy, depending on application requirements.

III Location Detection Techniques

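Given a camera with a view of the floor, localisation can be performed by using a homographic transformation from the position of the feet in the camera perspective to the floor plane (perspective-to-plane transformation). This method is based on the 2D direct linear transformation (DLT) developed by Abdel-Aziz and Karara [6]. It requires that a set of 4 points be defined a priori for a particular camera, as shown in Figure 7. The homographic transformation is based on the following formulae, written here in the standard 2D DLT form, given the camera perspective coordinates $x$ and $y$ and the corresponding floor-plane coordinates $X$ and $Y$:

$$X = \frac{a x + b y + c}{g x + h y + 1}, \qquad Y = \frac{d x + e y + f}{g x + h y + 1}$$

The parameters $a, \dots, h$ can be calculated by writing the equations in matrix form, given the set of camera points $(x_i, y_i)$ and a predefined set of map points $(X_i, Y_i)$, $i = 1, \dots, 4$:

$$\begin{bmatrix} x_1 & y_1 & 1 & 0 & 0 & 0 & -x_1 X_1 & -y_1 X_1 \\ 0 & 0 & 0 & x_1 & y_1 & 1 & -x_1 Y_1 & -y_1 Y_1 \\ \vdots & & & & & & & \vdots \\ x_4 & y_4 & 1 & 0 & 0 & 0 & -x_4 X_4 & -y_4 X_4 \\ 0 & 0 & 0 & x_4 & y_4 & 1 & -x_4 Y_4 & -y_4 Y_4 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \\ d \\ e \\ f \\ g \\ h \end{bmatrix} = \begin{bmatrix} X_1 \\ Y_1 \\ \vdots \\ X_4 \\ Y_4 \end{bmatrix}$$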
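The perspective-to-plane mapping described above can be sketched in a few lines of NumPy. This is a minimal illustration of the standard 2D DLT with four point pairs, not the authors' implementation; the function names and the pixel coordinates of the grid corners are hypothetical, and the map coordinates simply reuse a 540 cm x 300 cm grid.

```python
import numpy as np

def fit_homography(camera_pts, map_pts):
    """Solve the 8 DLT parameters from 4 (camera, map) point pairs."""
    rows, rhs = [], []
    for (x, y), (X, Y) in zip(camera_pts, map_pts):
        # Two linear equations per point pair, one for X and one for Y.
        rows.append([x, y, 1, 0, 0, 0, -x * X, -y * X])
        rows.append([0, 0, 0, x, y, 1, -x * Y, -y * Y])
        rhs.extend([X, Y])
    params = np.linalg.solve(np.array(rows, dtype=float),
                             np.array(rhs, dtype=float))
    # Append the fixed ninth entry (1) to obtain the 3x3 homography matrix.
    return np.append(params, 1.0).reshape(3, 3)

def to_floor(H, point):
    """Project an image point (e.g. the feet position) onto the floor plane."""
    u, v, w = H @ np.array([point[0], point[1], 1.0])
    return u / w, v / w

# Hypothetical pixel coordinates of the four grid corners (illustrative only).
camera_pts = [(100, 400), (540, 400), (420, 200), (220, 200)]
# Matching floor coordinates in cm.
map_pts = [(0, 0), (540, 0), (540, 300), (0, 300)]

H = fit_homography(camera_pts, map_pts)
feet_on_floor = to_floor(H, (320, 400))
```

The four camera points must be in general position (no three collinear), otherwise the linear system is singular.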

The problem then is how to estimate the position of the feet of a person in an occluded environment. For this, two methods have been analyzed:

  • Estimation using bounding box detection

  • Estimation using pose information

III-1 Baseline Person Detection

Popular object detectors have good performance for the person detection task in terms of both accuracy and execution time ([4], [7]). For our method, the technique described in [4] was used, as it offers fast and accurate detections. It takes in a frame rescaled to the standard 224x224 resolution, and outputs the coordinates of the bounding boxes and the classes for each detected object in the frame. For our purposes, we are only interested in the person class. The feet position can be estimated as the midpoint between the two bottom vertices of the bounding box. However, in occluded environments this is problematic, as the method assumes the person is standing upright and is fully contained in the bounding box. When the lower body is occluded, the bounding box is shorter than the person and, since the camera views the scene at an angle, the feet position is erroneously estimated to be further from the camera than in reality.
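The baseline feet estimate can be illustrated with a short sketch. The detection record layout (`label`, `conf`, `box`) and the confidence threshold are assumptions made for this example and do not correspond to any particular detector's output format; boxes are `(x_min, y_min, x_max, y_max)` in image coordinates, with y growing downward.

```python
def person_boxes(detections, min_conf=0.5):
    """Keep only boxes labelled 'person' above a (hypothetical) confidence threshold."""
    return [d["box"] for d in detections
            if d["label"] == "person" and d["conf"] >= min_conf]

def feet_from_box(box):
    """Midpoint of the bottom edge of the box (y_max is the bottom in image coordinates)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, y_max)

# Illustrative detections for one frame.
detections = [
    {"label": "person", "conf": 0.91, "box": (10, 20, 50, 120)},
    {"label": "chair", "conf": 0.80, "box": (60, 70, 100, 130)},
    {"label": "person", "conf": 0.20, "box": (5, 5, 15, 30)},
]
feet = [feet_from_box(b) for b in person_boxes(detections)]
```

The resulting image point would then be passed through the homography to obtain floor coordinates.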

III-2 Body extension using pose estimation

This limitation motivated the use of pose information for inference. Since modern pose estimators ([3], [8]) are able to detect subsets of body parts, this information is used to extend the person's body in the occluded area based on known body proportions [9]. This leads to an estimated feet position close to the actual one. Pose estimator neural architectures first detect person joints and then connect them through belief maps to form body parts. Tome et al. [3] use a multi-stage convolutional neural network to output the pose information of a person. The network takes in a frame rescaled to 224x224 and outputs the human skeleton. As such, the methodology for extending the human body is as follows:

  • If the feet are found in the detections, take the point between them.

  • Else, perform linear regression on the midpoint between complementary body parts (i.e. right/left shoulder, right/left hip) and extend onto the regressed line accordingly, considering the detected joints (e.g. extend the body starting from the lowest joint detected).

Body extension is done along the line regressed from the detected joints, to account for natural body position (e.g. leaning against an object) and for possible lens distortions. However, when insufficient joints are detected, regression cannot be performed and the frame is skipped. The percentages of skipped frames are shown in Table IV.
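The two-step procedure above can be sketched as follows. This is an illustrative reading of the method under stated assumptions, not the authors' code: the joint names, the choice of joint pairs, and in particular the relative body proportions in `LEVELS` are hypothetical placeholders and are not the values from [9].

```python
import numpy as np

# Hypothetical positions of each joint pair along the shoulder-to-ankle axis
# (0 = shoulders, 1 = ankles). Placeholder values for illustration only.
LEVELS = [("shoulder", 0.0), ("hip", 0.40), ("knee", 0.70), ("ankle", 1.0)]

def midpoint(joints, part):
    """Midpoint of the left/right pair, or None if either joint is missing."""
    left, right = joints.get("left_" + part), joints.get("right_" + part)
    if left is None or right is None:
        return None
    return ((left[0] + right[0]) / 2.0, (left[1] + right[1]) / 2.0)

def estimate_feet(joints):
    """Return the feet position in image coordinates, or None to skip the frame."""
    feet = midpoint(joints, "ankle")
    if feet is not None:
        return feet  # feet detected: take the point between them
    found = [(frac, midpoint(joints, part)) for part, frac in LEVELS]
    found = [(frac, m) for frac, m in found if m is not None]
    if len(found) < 2:
        return None  # not enough joints to regress a body line
    xs = np.array([m[0] for _, m in found])
    ys = np.array([m[1] for _, m in found])
    if ys.max() - ys.min() < 1e-6:
        return None  # degenerate: midpoints share the same image row
    # Regress x on y because the body axis is close to vertical in the image.
    slope, intercept = np.polyfit(ys, xs, 1)
    (f_top, top), (f_low, low) = found[0], found[-1]
    pixels_per_unit = (low[1] - top[1]) / (f_low - f_top)
    # Extend from the lowest detected pair down to the ankle level.
    y_feet = low[1] + (1.0 - f_low) * pixels_per_unit
    return (slope * y_feet + intercept, y_feet)

# Example: only shoulders and hips are visible (person behind a table).
joints = {"left_shoulder": (90, 100), "right_shoulder": (110, 100),
          "left_hip": (95, 180), "right_hip": (105, 180)}
feet = estimate_feet(joints)
```

Regressing x on y rather than y on x avoids an ill-conditioned fit when the person is upright, which is the common case in surveillance footage.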

IV Human Position in Camera Frames Dataset

Scene Name | # scenarios | # frames
S1_Wide | 33 | 1929
S2_Narrow | 9 | 267
Total | 42 | 2196
TABLE I: Number of frames and scenarios in each scene.

As such, the location detection is performed by first getting the video frames from cameras, running them through the deep learning model (either person detection or pose estimation), post-processing (extending the body, estimating the feet and, using camera configurations, averaging detections based on camera distance) and then computing global coordinates.

Fig. 2: Block diagram of the location detection methodology.

To address this problem we start by collecting a large dataset of video images annotated with the exact location of a person moving in the 2D space in front of the camera. The collected dataset captures a single person in 2 different scenes from a total of 3 cameras. Each frame is annotated with the exact 2D position of the person in the scene. One of the scenes offers multiple points of view from 2 cameras simultaneously. Each scene comprises multiple localization scenarios, with varying levels of occlusion. A total of 42 scenarios are investigated across the 2 scenes. A breakdown of scenarios per scene is presented in Table I. Each scene has an artificial grid drawn on the floor, which is used for validation. Global positioning is given relative to the origin of the grid.

Fig. 3: Scenes and camera perspectives.

IV-A Scenes Description

The two scenes can be seen in Figure 3. Scenarios include obstacles at different distances and varying clothing.

IV-A1 Scene 1

(Wide Space) S1_Wide represents a wide open-space such as a wide hallway, a lobby or a large room. Two camera perspectives are available at a perpendicular angle. These can be seen in Figure 3 in images (a) and (b). The two cameras are positioned at 2.8 meters and 1.8 meters, respectively, from the ground. The grid is a 540 cm x 300 cm rectangle, evenly divided into squares of 60 cm in length.

IV-A2 Scene 2

(Narrow Space) S2_Narrow represents a narrow space, a typical hallway. The space reaches over 10 metres from the camera. The camera is at 2.5 meters from the ground, and the grid is a 225 cm x 1000 cm rectangle, divided into 75 cm x 90 cm rectangles.

IV-B Occlusions and Obstacles

The scenarios captured by the dataset can be grouped in 5 broad categories, described in Table II. Situations with various levels of occlusion that could arise in real life were considered. These include a person standing upright, sitting, and with various body parts occluded by obstacles. Sample images from each type of scenario are shown in Figure 4. In some extreme cases, the body is almost completely covered (see scenario type 5, Table Standing), raising problems for vision-based positioning algorithms.

Fig. 4: Sample images from each scenario type (1 to 5). Not all scenes contain every type of scenario.
Scenario Type | Description
1. Baseline | No occlusions present. This is the best case.
2. Table | A simple table, used for testing localisation when the person is sitting.
3. Table and Chair | A more complex variant of the previous type, where feet are not always visible.
4. Table Sideways | Used for occluding the lower part of the body.
5. Table Standing | Occluding most of the person, except the upper part of the body.
TABLE II: Descriptions of scenario types across the scenes.

IV-C Data Annotation

For the S1_Wide scene, the dataset offers the perspectives of two synchronized cameras. In this case, the ground truth annotation represents a combination of the annotations for the two camera perspectives: when the person is not visible on one of the cameras, the ground truth from the other camera represents the shared position. This approach is useful for situations when tracking the movement of people across video frames, including moving outside the coverage of one of the cameras. Otherwise, the position of the person is given by the midpoint between the annotations of the two perspectives. Figure 5 shows the annotation process.
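The rule for combining the two per-camera annotations into a single ground-truth position can be sketched as below. The function name and the use of `None` for "person not visible on this camera" are assumptions made for illustration.

```python
def combine_annotations(pos_cam_a, pos_cam_b):
    """Shared ground truth from two per-camera annotations in global (X, Y) coordinates.

    A position of None means the person is not visible on that camera.
    """
    if pos_cam_a is None:
        return pos_cam_b  # may also be None if neither camera sees the person
    if pos_cam_b is None:
        return pos_cam_a
    # Visible on both cameras: take the midpoint of the two annotations.
    return ((pos_cam_a[0] + pos_cam_b[0]) / 2.0,
            (pos_cam_a[1] + pos_cam_b[1]) / 2.0)

both_visible = combine_annotations((120.0, 60.0), (130.0, 70.0))
one_visible = combine_annotations(None, (130.0, 70.0))
```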

Fig. 5: Capture from the annotation tool, in the multi-camera scenario. Annotating the global location in this scenario requires annotating and combining both camera views. The red circles are the annotations on the transformed grid; the location of both feet are marked and then averaged to get the person location in one frame.

Separately annotating two frames from different perspectives leads to different global coordinates. This is due to the differences in the set of points that define the homography, different frame rates and differences in synchronization. In the case of the S1_Wide scene, which benefits from two camera perspectives, the localisation mismatch level is low, the average localisation mismatch for each axis being less than 20 cm. This is a good result considering that the distance between the person being tracked and the camera can reach over 6 m. The absolute differences in coordinates are presented in Figure 6.

Fig. 6: Localisation mismatch in multi-camera perspective, with Y axis representing depth and X axis representing camera plane (95-percentile).

IV-D Dataset processing

The videos from the surveillance cameras were preprocessed beforehand to remove the barrel lens distortion. This was achieved with the defish0r filter, applied through the ffmpeg command-line tool on Linux, which automatically corrects the distortion at the price of losing some information at the edges of the frame.

The frames were not scaled to a predefined set of dimensions. Instead, a configuration file is present for each scene, with the following information:

  • image height and width

  • camera height and X,Y coordinates, with their respective units of measurement

  • grid height and width, with units of measurement

  • the set of points to define the homography transformation

Generally, the origin of the plane coordinate system is the lower left corner of the grid, as viewed by the camera. This is not the case for the multi-camera scenes, where the origin was chosen to be the same for both cameras.

The dataset is offered as a set of frames from the gathered videos, with absolute X,Y coordinates annotations for each frame organized in .csv files.

Fig. 7: Grid with homography points defined. Image has lens distortion corrected. Capture taken from S1_Wide.

The dataset was collected in a realistic environment from surveillance cameras [10, 11] in an office building. It contains 2196 frames, and their distribution across the scenes is shown in Table I.

V Evaluation

Evaluation is performed by analyzing the errors in localisation with respect to the ground truth annotations. Error is calculated as the Euclidean distance between the global ground truth coordinates $(x_{gt}, y_{gt})$ and the predicted coordinates $(x_{pred}, y_{pred})$:

$$e = \sqrt{(x_{gt} - x_{pred})^2 + (y_{gt} - y_{pred})^2}$$

Considering that the predictors (object detectors / pose estimators) might not offer confident enough predictions for every frame, the percentage of missing predictions is also taken into account.
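The two evaluation quantities described above, mean localisation error and percentage of missing predictions, can be computed as in the following sketch. The function names are illustrative, and a missing prediction is represented by `None`, which is an assumption of this example.

```python
import math

def localisation_error(truth, pred):
    """Euclidean distance between ground-truth and predicted (X, Y) positions."""
    return math.hypot(truth[0] - pred[0], truth[1] - pred[1])

def evaluate(truths, preds):
    """Return (mean error over predicted frames, percentage of missing predictions)."""
    errors = [localisation_error(t, p)
              for t, p in zip(truths, preds) if p is not None]
    missing_pct = 100.0 * (len(preds) - len(errors)) / len(preds)
    mean_error = sum(errors) / len(errors) if errors else float("nan")
    return mean_error, missing_pct

# Four frames in cm; the second frame has no confident prediction.
truths = [(0, 0), (0, 0), (0, 0), (0, 0)]
preds = [(3, 4), None, (6, 8), (0, 0)]
mean_error, missing_pct = evaluate(truths, preds)
```

Reporting the two numbers together matters because a predictor could otherwise lower its mean error simply by refusing to predict on difficult frames.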

Fig. 8: Projection error with varying distance from camera

Due to the way localisation is performed, by projecting a detection from the camera perspective onto the floor, the localisation error should have a positive correlation to the distance between the person and the camera. Using the properties of similar triangles, as shown in Figure 8, maintaining the same camera height and varying the distance, leads to the following assertion:

This is to say that the relative increase of the localisation error is proportional to the relative increase of the person’s distance to the camera. The error is also dependent on the predictor feet accuracy from the camera perspective. As such, with better predictor accuracy, the increase in localisation error is smaller.
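One way to make the similar-triangles argument concrete is sketched below. The symbols and the error model are assumptions introduced for illustration: $h$ is the camera height, $d$ the true horizontal distance from the camera to the feet, and $\delta$ the height above the floor at which the predictor places the feet (for instance because the lower body is occluded). The ray from the camera through the predicted point reaches the floor at distance $d + e$, where $e$ is the localisation error.

```latex
% Large triangle: camera, its foot on the floor, and the projected point.
% Small triangle: predicted feet point, true feet point, and the projected point.
\frac{e}{\delta} = \frac{d + e}{h}
\quad\Longrightarrow\quad
e = \frac{\delta}{h - \delta}\, d

% For a fixed camera height h and a fixed predictor offset \delta,
% two distances d_1 and d_2 therefore give
\frac{e_2}{e_1} = \frac{d_2}{d_1}
```

Under these assumptions the error scales linearly with the distance, and a smaller $\delta$ (a more accurate feet estimate) reduces the slope $\delta / (h - \delta)$.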

The fact that position errors increase with the distance from the camera shows that the methods presented in this paper are better suited for environments with good camera coverage.

V-A Detection in Single Images

The person detection was performed using a pretrained YOLOv2 [4] object detector, with all classes discarded except the person class. The YOLOv2 detector was trained on the VOC dataset [12], and is one of the most performant algorithms for object detection both in terms of accuracy and resource consumption. YOLO’s custom Darknet backbone architecture is one of the reasons for its speed, along with the fact that it belongs to the single-shot class of object detectors.

A rough estimate of the feet position is given by the centre of the lower edge of the bounding box, as shown in Figure 1. In the case of a person standing without any body parts occluded, the bounding box method has good accuracy in estimating the feet position. It is only a rough estimate because it does not work well in occluded environments, where the lower part of the body is missing (e.g. a person standing behind a table or a chair).

The optimization we use for the bounding box detection method consists of extending the bounding box to meet a particular aspect ratio - the aspect ratio of the bounding box when the person is standing. This is problematic since the mentioned aspect ratio depends on the camera height, the person’s distance to the camera, and the person’s orientation towards the camera, all of which cannot be known a priori.
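The bounding box extension can be illustrated with a short sketch. Extending the box downward (keeping its top edge and width fixed) and the value of `standing_ratio` are assumptions made for this example; as noted above, the appropriate ratio depends on the camera setup and is not known a priori.

```python
def extend_box(box, standing_ratio):
    """Extend a box downward until height / width reaches standing_ratio.

    box is (x_min, y_min, x_max, y_max) in image coordinates (y grows downward).
    standing_ratio is a hypothetical height-to-width ratio of a standing person.
    """
    x_min, y_min, x_max, y_max = box
    target_height = standing_ratio * (x_max - x_min)
    if y_max - y_min >= target_height:
        return box  # already at least as tall as a standing person
    return (x_min, y_min, x_max, y_min + target_height)

def feet_from_box(box):
    """Centre of the lower edge of the box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, y_max)

# A person behind a table: only the upper body is inside the detected box.
occluded_box = (100, 50, 140, 110)
extended_box = extend_box(occluded_box, standing_ratio=3.0)
feet = feet_from_box(extended_box)
```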

Pose estimation was performed using the technique described by Tome et al. in [3]. The backbone architecture used is MobileNet [13], which was chosen for its good trade-off between speed and accuracy. It makes use of depthwise separable convolutions for faster inference times. Lightweight neural architectures such as this one are becoming prevalent in the space of mobile applications. The network was trained on the COCO dataset [14].
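The saving from depthwise separable convolutions can be illustrated by counting multiplications for a single layer. The layer shape used below is an arbitrary example and is not taken from the network evaluated in this paper; the cost formulas are the standard ones for a k x k convolution with stride 1.

```python
def standard_conv_mults(k, c_in, c_out, out_h, out_w):
    """Multiplications of a standard k x k convolution."""
    return k * k * c_in * c_out * out_h * out_w

def separable_conv_mults(k, c_in, c_out, out_h, out_w):
    """Multiplications of a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in * out_h * out_w
    pointwise = c_in * c_out * out_h * out_w
    return depthwise + pointwise

# Illustrative layer: 3x3 kernel, 64 -> 128 channels, 56x56 output.
standard = standard_conv_mults(3, 64, 128, 56, 56)
separable = separable_conv_mults(3, 64, 128, 56, 56)
reduction = standard / separable
```

For this example layer the separable form needs roughly 8.4 times fewer multiplications, consistent with the general ratio of 1 / (1/c_out + 1/k^2).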

Pose information offers a more accurate estimation of the feet position, even in the cases when the feet detections are missing (see Figure 14). The body position can be inferred from just a few body parts detected by using known body proportions. This is invariant to the camera position, since that information is contained in the estimated body proportion.

Fig. 9: Baseline scenario type error CDF.
Fig. 10: Table scenario type error CDF.
Fig. 11: Table and chair scenario type error CDF.
Fig. 12: Table standing scenario type error CDF.
Fig. 13: Table sideways scenario type error CDF.
Scene | Pose estimation mean error (cm) | Bounding box mean error (cm) | Mean error difference (%)
S1_Wide Cam1 | 36.26 | 41.99 | 13.6
S1_Wide Cam4 | 53.58 | 60.99 | 12.1
S2_Narrow | 45.27 | 48.37 | 6.4
TABLE III: Mean error value for both techniques (pose estimation and bounding box), using three cameras in two scenes.

Table III shows descriptive statistics of each scene, analyzed from a single camera perspective. In most situations, pose estimation has lower errors, and lower standard deviation compared to bounding box. Error cumulative distribution functions for each of the scenario types are presented in Figures 9, 10, 11, 12, 13.

In these figures it can be observed that in the case of both methods, the position error is the lowest in the Baseline scenario (CDF in Figure 9 and scenario in the first row of images of Figure 4), where no occlusions occur. In this scenario, bounding box shows a slightly better performance than pose estimation. This is also the case in the Table scenario (CDF in Figure 10 and scenario in the second row of images of Figure 4) where the occlusions of the person are still minimal. However, when the occlusions are more significant (CDFs in Figures 11, 12 and 13 and scenarios in the third, fourth and fifth rows of Figure 4), pose estimation outperforms bounding box. In Figures 9 through 13, it can also be noticed that in most cases, the lowest localisation errors are obtained by Cam1 in the S1_Wide scene, most probably due to the position of the camera closer to the monitored scene.

Pose estimation fares poorly in cases such as the one presented in Figure 14, where the body position is ambiguous: no leg joints are visible, so the body is interpreted as being upright. This case is representative of most of the bad predictions, which occur when almost all body parts are missing and the body is interpreted as being upright, or in a position different from the actual one.

Bounding box detections also suffer from occlusion, but there is no real way to adjust the prediction: Figure 15 shows a case where bounding box fares poorly, while the body extension technique of the pose estimation method handles the occlusion well.

Fig. 14: Edge case when localisation with pose estimation fails to estimate the feet position.
Fig. 15: Edge case when localisation with pose information works better than localisation with bounding box detections. The red circle represents the feet position offered by the bounding box detections. The black circle represents the feet position of the extended body, which accounts for occlusions and missing body parts.

V-B Performance in Multi-Camera

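Considering the S1_Wide scene, where positioning can be inferred from two different cameras at the same time, localisation could be improved by merging locations from both cameras using distance-weighted averaging. Writing $P_1$ and $P_2$ for the positions inferred by the two cameras and $d_1$ and $d_2$ for the corresponding distances between the person and each camera:

$$P = \begin{cases} P_1 & \text{if only the first camera provides a prediction} \\ P_2 & \text{if only the second camera provides a prediction} \\ \dfrac{P_1 / d_1 + P_2 / d_2}{1 / d_1 + 1 / d_2} & \text{if both cameras provide a prediction} \end{cases}$$

The function takes into account the positions from both cameras and the distances to each camera. As such, when one of the cameras misses the prediction for a frame, the other camera supplies the position. If both cameras have inferred a position for the current frame, a weighted average of the two is computed using the inverse of their respective distances. This way, the position provided by the camera that is further away is penalized. This is motivated by the fact that localisation errors increase with the distance from the camera, as shown in Figure 16 for pose estimation and Figure 17 for bounding box.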
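The fusion rule described in this paragraph can be sketched as follows. The function name and the use of `None` for a missed prediction are assumptions of this example; distances are assumed to be strictly positive.

```python
def fuse_positions(pos1, dist1, pos2, dist2):
    """Merge two per-camera global positions with inverse-distance weights.

    pos1 / pos2 are (X, Y) tuples or None when that camera missed the frame.
    dist1 / dist2 are the person-to-camera distances (must be > 0).
    """
    if pos1 is None:
        return pos2  # may also be None if both cameras missed the frame
    if pos2 is None:
        return pos1
    w1, w2 = 1.0 / dist1, 1.0 / dist2
    total = w1 + w2
    return ((w1 * pos1[0] + w2 * pos2[0]) / total,
            (w1 * pos1[1] + w2 * pos2[1]) / total)

# The first camera is twice as close, so its estimate gets twice the weight.
fused = fuse_positions((100.0, 100.0), 200.0, (160.0, 100.0), 400.0)
only_second = fuse_positions(None, 200.0, (160.0, 100.0), 400.0)
```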

Fig. 16: Errors vs distance for pose estimation. Strong positive correlation.
Fig. 17: Errors vs distance for bounding box. Correlation is not as strong as in the case of pose estimation.
Fig. 18: CDF: multi-camera compared to individual cameras.
Cameras | Pose estimation mean error (cm) | Bounding box mean error (cm) | Pose estimation missing predictions (%) | Bounding box missing predictions (%)
Cam1+Cam4 | 38.62 | 39.52 | 0.34 | 0.0
Cam1 | 36.26 | 41.99 | 9.18 | 2.48
Cam4 | 53.58 | 60.99 | 4.47 | 0.33
TABLE IV: Performance results of multi-camera compared to individual cameras.

The performance of the multi-camera approach can be seen both in the CDF presented in Figure 18 and in Table IV. Since the procedure takes into account distances from both cameras, it can be noticed that errors smooth out. An important benefit of having multiple cameras is the significant improvement of the prediction ratios (fewer missed predictions), as shown in the right-hand columns of Table IV.

V-C Resources footprint

(a) Jetson TX2
(b) Odroid XU4
Fig. 19: Experimented on resource-constrained devices.

We assess the performance of our proposed pose based localization system on two devices common to the embedded computing space, NVidia Jetson TX2 and Odroid XU4. The Jetson TX2 is a development platform with one integrated 256-core NVIDIA Pascal GPU, with 8GB of memory and a quad-core ARM Cortex-A57 CPU. The Odroid XU4 is an ARM big.LITTLE architecture, with four A15 and four A7 cores and 1.9GB of memory.

Table V shows the inference time with a batch size of one (one image at a time) and the memory footprint, achievable at run-time on real hardware. The difference in memory footprint between the Odroid and the Jetson TX2 is due to the internal libraries used for convolution computations by each of the two devices. The Jetson relies on cuDNN, a highly optimized computation library for NVidia GPUs, which maximizes speed to the detriment of memory footprint. The Odroid, with its ARM processor, uses OpenBLAS, also a highly optimized matrix multiplication library, but agnostic to hardware profiles and so balancing run-time memory and latency. Both of these exceed the baseline performance of the bounding box implementation on the Jetson CPU, in terms of both memory footprint and inference time. Due to memory constraints we were unable to run the baseline (bounding box) on the Odroid devices. Admittedly, the bounding box implementation based on YOLO is bulkier than necessary, since the detector can handle multiple classes, here filtered just for the person class. A slimmer implementation of a person detector would yield different performance.

CamLoc on the Jetson TX2 achieves a frame rate of just above 6 frames per second. Even though this is not enough to be classified as real-time performance for the human eye, it is still remarkably responsive for most applications that require location estimation for interactive services.

Device | Inference time (s) | Performance (FPS) | Memory (MB)
Jetson TX2 (baseline) | 2.6 | 0.33 | 1520
Jetson TX2 (pose) | 0.16 | 6.25 | 620
Odroid (pose) | 0.45 | 2.22 | 210
TABLE V: Performance of the baseline (bounding box) and pose estimation based localisation on the NVidia Jetson TX2; only pose estimation could be run on the Odroid XU4.

VI Related Work

VI-A Object Detection

One of the first methods to use deep convolutional networks in the context of object detection was R–CNN [15]. The approach was to extract region proposals using selective search and classify each of them using an SVM. As such, bounding boxes are generated together with their predicted class. Its main drawback was being slow, due to each region being processed independently. Fast R–CNN [16] tries to reduce the execution time and memory usage by implementing region of interest pooling, more specifically Spatial Pyramid Pooling [17], to share computations. A final advancement to this method is Faster R–CNN [7], which uses an “attention” model to propose regions through its Region Proposal Network. However, even with the optimizations brought by Faster R–CNN, detection is not done in real time: Faster R–CNN registers 5 FPS using VGG net [18] with a mAP of 76.4. A faster but slightly less accurate approach is offered by YOLO [19], and its significantly more accurate successor, YOLOv2 [4]. The YOLOv2 network operates in real time, at 67 FPS using the Darknet-19 architecture with a mAP of 76.8. It frames detection as a regression problem, and predicts bounding boxes and class probabilities in a single evaluation. Since it removes the need for a detection pipeline (as in Fast/Faster R–CNN), the system can be optimized as a whole. RetinaNet [20], also part of the larger class of single-stage detectors, introduces the focal loss function, which significantly increases accuracy. This loss was designed to down-weight well-classified cases while emphasizing hard ones. Most of the time, two-stage detectors like Fast R–CNN tend to perform better accuracy-wise than single-stage detectors. This is due to single-stage detectors using a fixed grid of boxes, rather than generated box proposals. Still, RetinaNet has better performance on the COCO dataset.
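
The effect of the focal loss can be illustrated numerically. The sketch below assumes the standard form from [20], FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t), where p_t is the probability the model assigns to the true class. The value gamma = 2 and the simplification alpha_t = 1 are illustrative choices made here, and the function names are hypothetical.

```python
import math


def cross_entropy(p_t: float) -> float:
    """Standard cross-entropy for the probability assigned to the true class."""
    return -math.log(p_t)


def focal_loss(p_t: float, gamma: float = 2.0, alpha_t: float = 1.0) -> float:
    """Focal loss: cross-entropy scaled by the modulating factor (1 - p_t)^gamma.

    gamma = 0 recovers plain cross-entropy; alpha_t = 1 disables class
    weighting (an assumption made to keep the example small).
    """
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Well-classified (0.9), uncertain (0.5) and hard (0.1) examples.
for p_t in (0.9, 0.5, 0.1):
    ce = cross_entropy(p_t)
    fl = focal_loss(p_t)
    print(f"p_t={p_t}: CE={ce:.4f}  FL={fl:.4f}  FL/CE={fl / ce:.2f}")
```

With gamma = 2 the loss of an example classified with p_t = 0.9 is scaled by (1 - 0.9)^2 = 0.01, while a hard example with p_t = 0.1 keeps 0.81 of its cross-entropy loss, so hard cases dominate the training signal.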


VI-B Pose Estimation

Early approaches to estimating the pose of people [8, 21] used direct mappings, HOG or SIFT to build the pose from silhouettes. Nowadays deep learning approaches are ubiquitous, benefiting from a large body of available datasets [22, 23, 24]. One of the most successful proposed approaches is DeepCut [25], which initially detects people in the scene, and subsequently estimates their body pose. This approach uses a convolutional neural network for hypothesizing body parts and then performs non-maximum suppression on the part candidates. An improvement to this method was later published in the form of DeeperCut [26], which improved the body part detectors. Another approach [27] uses a processing pipeline to first detect people in images and then estimate the pose. If the confidence of the detector is low, pose estimation is skipped. Keypoints are predicted using heatmap regression with a fully convolutional ResNet. The system is trained only on COCO data, achieving state-of-the-art results at the time. Tome et al. [3] propose an approach to detect 3D human pose. This method uses a 6-stage processing pipeline to “lift” 2D poses, using a combination of belief maps provided by convolutional 2D joint predictors and projected pose belief maps.
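
Heatmap regression produces one score map per body keypoint, and a common way to read a keypoint from such a map is to take its highest-scoring cell. The following is a minimal sketch of that decoding step only, not the pipeline of any of the cited systems; the function name, the nested-list heatmap representation and the 0.3 confidence threshold are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

Heatmap = List[List[float]]  # rows of per-pixel scores for one keypoint


def decode_keypoint(
    heatmap: Heatmap, min_confidence: float = 0.3
) -> Optional[Tuple[int, int, float]]:
    """Return (x, y, score) of the highest-scoring cell, or None if the
    peak is below min_confidence (a hypothetical threshold)."""
    best_x, best_y, best_score = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score > best_score:
                best_x, best_y, best_score = x, y, score
    if best_score < min_confidence:
        return None
    return best_x, best_y, best_score


# Toy 3x4 heatmap with a clear peak at column 2, row 1.
toy_heatmap = [
    [0.01, 0.05, 0.10, 0.02],
    [0.03, 0.20, 0.85, 0.10],
    [0.00, 0.04, 0.15, 0.01],
]
print(decode_keypoint(toy_heatmap))

# A map with no confident peak yields no keypoint.
print(decode_keypoint([[0.05, 0.10], [0.02, 0.08]]))
```

Returning None for weak peaks lets a downstream step treat the keypoint as not visible, for example when the body part is occluded or outside the frame.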

VI-C Vision-based Indoor Localisation

Although different classifications of the existing indoor localisation solutions have been offered throughout the literature [28, 29, 30, 31, 27, 32], a simpler classification divides them into solutions that require specialised infrastructure and solutions that make use of widely available infrastructure (such as wireless access points or surveillance cameras in buildings and inertial sensors in mobile devices).

Even though the majority of the existing indoor localisation solutions that make use of widely available infrastructure use smartphone sensors and WiFi to estimate the location, there has also been research into positioning systems by means of computer vision. These systems do not require users to carry special tags, enabling applications in circumstances where carrying or wearing a tag is not viable (e.g. Ambient Assisted Living scenarios where the typical users are not well-versed in technology [33]).

Mautz et al. have published a survey of optical indoor positioning systems [34]. The paper describes different systems and classifies them based on the reference used to determine the location of users in a scene, such as images, projected patterns and coded markers. Tsai et al. [35] propose a system that extracts foreground objects from surveillance camera video using a background model. The system does not perform user identification, only positioning. Several existing systems use RGB-D sensors for human positioning: Munaro et al. [36] and Saputra et al. [37] offer scalable multi-camera solutions for people tracking; Duque et al. [38] present a system for people localisation in complex indoor environments that works by combining WiFi positioning systems with depth maps; and Viola et al. [39] propose a system that detects and identifies people, even if occluded by others, using an algorithm for creating free-viewpoint video of interacting people using hand-held Kinect cameras. Nakano et al. [40] present the potential applications for their proposed Kinect Positioning System, an indoor positioning system that uses Kinect without an infrared photophore.

Most of the positioning systems by means of computer vision use depth cameras, which cannot be considered part of the widely available infrastructure in large built environments. The localisation solution that we have proposed in this paper makes use of typical surveillance cameras, with which most large buildings are already equipped.

VII Future Work and Conclusions

Although the scenarios explored in this paper are complex in terms of the amount of human body occlusion and different postures, real-world applications may come with other forms of complexity. In future work we will explore scenarios with multiple people in the scene, which may impact the performance of body key-points detection and will require user identification.

The trend of performing more computations on IoT devices for local intelligence is likely to continue, with computer vision enabling a large class of applications that will migrate from the cloud to the edge. Here we show that our system, CamLoc, based on human pose estimation, can perform location estimation on single images from a fixed camera more efficiently than YOLOv2, both in accuracy and in hardware resource footprint. Our annotated dataset also includes a multiple-camera perspective of the same scene, which contributes to improving detection when the cameras are used in coordination. Our results show that such computer vision systems can operate efficiently on embedded devices, opening the opportunity for complex interactive applications in user environments, assisted by smart-cameras that perform detections in user proximity.


  • [1] Jennifer Golbeck and Matthew Louis Mauriello. User perception of facebook app data access: A comparison of methods and privacy concerns. Future Internet, 8, 2016.
  • [2] EU Commission. 2018 reform of EU data protection rules, 2018. [Online; accessed 20-December-2018].
  • [3] Denis Tome, Christopher Russell, and Lourdes Agapito. Lifting from the deep: Convolutional 3d pose estimation from a single image. CVPR 2017 Proceedings, pages 2500–2509, 2017.
  • [4] Joseph Redmon and Ali Farhadi. Yolo9000: better, faster, stronger. arXiv preprint, 2017.
  • [5] Rainer Mautz. Indoor positioning technologies. 2012.
  • [6] Y. I. Abdel-Aziz and H. M. Karara. Direct linear transformation into object space coordinates in close-range photogrammetry. 1971.
  • [7] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [8] Ankur Agarwal and Bill Triggs. Recovering 3d human pose from monocular images. IEEE transactions on pattern analysis and machine intelligence, 28(1):44–58, 2006.
  • [9] G. Livshits, A. Roset, K. Yakovenko, S. Trofimov, and E. Kobyliansky. Genetics of human body size and shape: body proportions and indices. Annals of Human Biology, 29(3):271–289, 2002.
  • [10] VIVOTEK Inc. VIVOTEK FD816B-HT Fixed Dome Camera.
  • [11] VIVOTEK Inc. VIVOTEK IB8338-H Bullet Network Camera.
  • [12] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results.
  • [13] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • [14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [15] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
  • [16] Ross Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
  • [17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. In European conference on computer vision, pages 346–361. Springer, 2014.
  • [18] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [19] Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 779–788, 2016.
  • [20] Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. Focal loss for dense object detection. IEEE transactions on pattern analysis and machine intelligence, 2018.
  • [21] Ahmed Elgammal and Chan-Su Lee. Inferring 3d body pose from silhouettes using activity manifold learning. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, volume 2, pages II–II. IEEE, 2004.
  • [22] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In Proceedings of the IEEE Conference on computer Vision and Pattern Recognition, pages 3686–3693, 2014.
  • [23] U Iqbal, A Milan, M Andriluka, E Ensafutdinov, L Pishchulin, J Gall, and B Schiele. PoseTrack: A benchmark for human pose estimation and tracking. arXiv preprint arXiv:1710.10000, 2017.
  • [24] Catalin Ionescu, Dragos Papava, Vlad Olaru, and Cristian Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE transactions on pattern analysis and machine intelligence, 36(7):1325–1339, 2014.
  • [25] Leonid Pishchulin, Eldar Insafutdinov, Siyu Tang, Bjoern Andres, Mykhaylo Andriluka, Peter V Gehler, and Bernt Schiele. Deepcut: Joint subset partition and labeling for multi person pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4929–4937, 2016.
  • [26] Eldar Insafutdinov, Leonid Pishchulin, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision, pages 34–50. Springer, 2016.
  • [27] Chenshu Wu, Zheng Yang, Yunhao Liu, and Wei Xi. Will: Wireless indoor localization without site survey. IEEE Transactions on Parallel and Distributed Systems, 24(4):839–848, 2013.
  • [28] Krishna Chintalapudi, Anand Padmanabha Iyer, and Venkata N Padmanabhan. Indoor localization without the pain. In Proceedings of the sixteenth annual international conference on Mobile computing and networking, pages 173–184. ACM, 2010.
  • [29] Zhuoling Xiao, Hongkai Wen, Andrew Markham, and Niki Trigoni. Lightweight map matching for indoor localisation using conditional random fields. In Proceedings of the 13th international symposium on Information processing in sensor networks, pages 131–142. IEEE Press, 2014.
  • [30] Zheng Yang, Chenshu Wu, and Yunhao Liu. Locating in fingerprint space: wireless indoor localization with little human intervention. In Proceedings of the 18th annual international conference on Mobile computing and networking, pages 269–280. ACM, 2012.
  • [31] Hui Liu, Houshang Darabi, Pat Banerjee, and Jing Liu. Survey of wireless indoor positioning techniques and systems. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 37(6):1067–1080, 2007.
  • [32] Anshul Rai, Krishna Kant Chintalapudi, Venkata N Padmanabhan, and Rijurekha Sen. Zee: Zero-effort crowdsourcing for indoor localization. In Proceedings of the 18th annual international conference on Mobile computing and networking, pages 293–304. ACM, 2012.
  • [33] Andreas Braun and Tim Dutz. Low-cost indoor localization using cameras–evaluating ambitrack and its applications in ambient assisted living. Journal of Ambient Intelligence and Smart Environments, 8(3):243–258, 2016.
  • [34] Rainer Mautz and Sebastian Tilch. Survey of optical indoor positioning systems. In Indoor Positioning and Indoor Navigation (IPIN), 2011 International Conference on, pages 1–7. IEEE, 2011.
  • [35] Tsung-Han Tsai, Chih-Hao Chang, and Shih-Wei Chen. Vision based indoor positioning for intelligent buildings. In Intelligent Green Building and Smart Grid (IGBSG), 2016 2nd International Conference on, pages 1–4. IEEE, 2016.
  • [36] Matteo Munaro, Filippo Basso, and Emanuele Menegatti. Openptrack: Open source multi-camera calibration and people tracking for rgb-d camera networks. Robotics and Autonomous Systems, 75:525–538, 2016.
  • [37] Muhamad Risqi Utama Saputra, Guntur Dharma Putra, Paulus Insap Santosa, et al. Indoor human tracking application using multiple depth-cameras. In Advanced Computer Science and Information Systems (ICACSIS), 2012 International Conference on, pages 307–312. IEEE, 2012.
  • [38] Jaime Duque Domingo, Carlos Cerrada, Enrique Valero, and Jose A Cerrada. An improved indoor positioning system using rgb-d cameras and wireless networks for use in complex environments. Sensors, 17(10):2391, 2017.
  • [39] Paul Viola and Michael J Jones. Robust real-time face detection. International journal of computer vision, 57(2):137–154, 2004.
  • [40] Yoshiaki Nakano, Katsunobu Izutsu, Kiyoshi Tajitsu, Katsutoshi Kai, and Takeo Tatsumi. Kinect positioning system (kps) and its potential applications. In International Conference on Indoor Positioning and Indoor Navigation, volume 13, page 15th, 2012.