With the increasingly ubiquitous deployment of smart surveillance cameras throughout cities, where the majority of the population lives, privacy issues are coming into focus [4, 8]. Privacy defines the boundaries that limit access to an individual’s private information and body. Today, we live in an information society where vast quantities of data about us are gathered and analyzed through automated processes and cameras. Many private attributes and much personal information about individuals are collected by closed-circuit television (CCTV) cameras and streamed to remote cloud servers and viewing stations with no privacy protection mechanism enforced.
Initially, these surveillance cameras were deployed for public safety purposes and to provide concrete evidence for forensic analysis [5, 37]. Users range from public safety authorities and law enforcement agents to home owners. When unprotected video is transmitted through a communication network, it may be subject to attacks. As a consequence, the large amount of data collected by the cameras could be intercepted and abused by adversaries. For example, a man-in-the-middle who can view the raw frames constitutes a breach of privacy [13, 15, 21]. This has made the public more concerned and has prompted calls for change in the way video surveillance works [10, 19].
Specifically, the practice of mass surveillance can have a profound effect on how minors understand privacy later in their lives. Children usually learn through experience; hence, they should grow up in an environment where privacy is practiced if they are to learn what privacy is and how it works. Besides, many argue that the right experience of privacy is very important to a child’s future success and good decision-making in setting correct safety measures and social privacy boundaries. As a result, today’s pervasive surveillance systems must have means to protect children’s privacy.
Privacy protection is an active research area amid the rise of the Internet of Things (IoT), where a huge number of sensors and low-powered processors are going to be connected to the network with minimal or no security measures. One of the more important aspects of this research is to protect the identity of people in case the data is compromised. Any effort to address the privacy problems in a surveillance system must include techniques for identifying private attributes in images and for protecting these attributes [11, 22].
Private attributes such as faces are detected using machine learning or deep learning networks. Following detection, these private attributes are scrambled using appropriate cryptographic schemes. These schemes ensure that video streams cannot be intercepted and abused by unauthorized people while being transmitted from the cameras to the cloud servers and viewing centers. Among the various privacy-preserving requirements, protecting the identity and faces of minor children is essential to every family in order to protect minors from attackers or abusers [2, 18, 29].
In this work, we propose a novel Minor Privacy protection solution using Real-time video processing at the Edge (MiPRE). In MiPRE, the smart cameras check the video with a Deep Neural Network (DNN) to detect children’s faces, and then a lightweight blurring algorithm is called to scramble the faces before the video is transmitted through the network to the consumer or the storage drive. Therefore, the MiPRE scheme protects the privacy of minor children by securely denaturing their faces. In this paper, we present the face detection and recognition model of the MiPRE system, which categorizes the tested images into adults and children. More specifically, the face detection method employs the Multitask Convolutional Neural Network (MTCNN) to detect and align the faces. Face recognition is realized using the FaceNet model [27], designed by a group at Google, which is employed to extract the feature embedding of children’s faces.
The rest of this paper is organized as follows. Section II discusses several methods for children detection as well as historical efforts in the detection and recognition of human faces. Section III presents the system architecture of our MiPRE scheme and discusses its function blocks in detail, including the multi-step face detection pipeline and children recognition. Section IV reports the model training process and the performance of the children detection. Finally, Section V concludes this paper.
II Related Work
With the development of machine learning, computers are becoming more widely used for recognition tasks, which reduces manual workload while maintaining a high recognition rate [1, 6]. In recent years, the community has also witnessed the migration of powerful machine learning algorithms to IoT environments through the development of lightweight solutions [23, 24, 36, 37].
In the field of face recognition, research is mainly focused on two aspects, namely authentication [9, 34] and recognition [28, 38]. In either case, a well-known method is the top-down approach, where the face rectangle is first detected, features are then extracted from the face, and finally a comparison is made. Within face recognition, human age recognition is a well-studied issue. Classifiers are trained to detect the age of the subject or to predict the facial appearance in a certain age group. Building on state-of-the-art architectures, we present a decentralized method for children detection that achieves the highest accuracy among the methods compared in Section IV.
Face recognition-based age recognition is the process of extracting age-related facial features and creating an age classification model [3, 41]. This model is then used to evaluate the age range of a given person and to categorize that person into an age group. Building such a model from faces is possible because human aging and the changes it brings are not controlled by human willpower. Aging is a complex process that is related to people’s health, the status of their living environment, and other factors.
Although research on face recognition started early, there are few studies on the establishment of children classification models. Today’s top-performing face recognition techniques are based on deep convolutional neural networks. Facebook’s DeepFace [30] and Google’s FaceNet [27] architectures are among the most accurate. DeepFace uses 6 conv. layers followed by two FC layers that are used to detect and map a face in 3-D space and to map 67 fiducial points on the face. The FaceNet approach is to recognize faces that belong to the same person using an architecture that is invariant to illumination and pose. MTCNN and FaceNet are employed in our model to reach better results compared with state-of-the-art techniques.
III MiPRE: Minor Privacy Protection at the Edge
Figure 1 presents the architecture of our MiPRE system. It consists of three major function blocks: (1) face detection using a multi-step pipeline model, (2) face recognition, which separates the faces of children from those of adults based on the extracted features, and (3) face scrambling to protect children’s privacy. Each module is implemented in a Docker container, which promises scalability and faster updates to parts of the system using a microservices architecture [20, 25]. The design rationales and technical details of face detection and face recognition are presented in the following sections. The face scrambling for privacy protection is beyond the scope of this paper; interested readers may find the complete description of our MiPRE scheme in [12].
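The chaining of the three function blocks can be illustrated with a minimal Python sketch. It is not the MiPRE implementation: the names `process_frame` and `mask_region`, the box convention, and the toy nested-list frame are assumptions made for illustration only, and the placeholder scrambler simply zeroes a region rather than applying the actual scrambling algorithm.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2), hypothetical pixel convention
Frame = List[List[int]]           # toy grayscale frame: rows of pixel values


def process_frame(
    frame: Frame,
    detect_faces: Callable[[Frame], List[Box]],
    is_child: Callable[[Frame, Box], bool],
    scramble: Callable[[Frame, Box], Frame],
) -> Frame:
    """Chain the three function blocks: detect, recognize, scramble."""
    protected = [row[:] for row in frame]   # never modify the raw frame in place
    for box in detect_faces(frame):
        if is_child(frame, box):            # adult faces are left untouched
            protected = scramble(protected, box)
    return protected


def mask_region(frame: Frame, box: Box) -> Frame:
    """Placeholder scrambler that zeroes a region (NOT the MiPRE algorithm)."""
    x1, y1, x2, y2 = box
    out = [row[:] for row in frame]
    for y in range(y1, y2):
        for x in range(x1, x2):
            out[y][x] = 0
    return out


# Toy run: two detected boxes, only the first is labeled as a child.
frame = [[9] * 4 for _ in range(4)]
boxes = [(0, 0, 2, 2), (2, 2, 4, 4)]
protected = process_frame(
    frame,
    detect_faces=lambda f: boxes,
    is_child=lambda f, b: b == boxes[0],
    scramble=mask_region,
)
```

Passing the stages as callables mirrors the idea that each block can be updated or replaced independently, as with the containerized modules.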
III-A Face Detection
While there are many face detection methods, such as Dlib or OpenFace face detection, MTCNN (Multitask Convolutional Neural Networks) is adopted in this work for two reasons. On one hand, it achieves a high detection accuracy; on the other hand, the FaceNet model already provides an MTCNN interface to detect faces.
Basically, MTCNN is a deep learning model for face detection based on a multi-task cascaded Convolutional Neural Network (CNN). It exploits the inherent correlation between detection and alignment to boost up its performance. In particular, to predict face and landmark locations in a coarse-to-fine manner, the framework used in this paper leverages a cascaded architecture with three stages of carefully designed deep conv. networks [30, 40].
Given an input image, an image pyramid is built by re-scaling the image to different scales through bilinear interpolation. This step ensures scale invariance. Figure 2 shows an example. In MTCNN, three cascaded stages follow the scaling step:
P-Net: It is a fully convolutional network (FCN). The feature map obtained by forward propagation is a 32-dimensional feature vector at each position. It is used to determine whether or not each grid cell of 12 × 12 pixels contains a face. If a grid cell contains a human face, the Bounding Box of the face is regressed, and the Bounding Box corresponding to that area in the original image is then obtained. The Bounding Box with the highest score is retained by a Non-Maximum Suppression (NMS) step, and all other Bounding Boxes with an excessively large overlapping area are removed.
R-Net: It is a simple CNN stage that refines the P-Net candidates. A Bounding Box produced by the P-Net step may or may not contain a human face. Each candidate box is cropped from the input and re-scaled to 24 × 24 using bilinear interpolation, and the result is used as the input to the R-Net to determine whether a human face exists. If a human face is contained, the Bounding Box is regressed, which is also followed by the NMS step.
O-Net: O-Net is for higher accuracy. Similar to the previous stage (R-Net), the resulting Bounding Box area is re-scaled, this time to 48 × 48. It is then given to the O-Net stage, which yields the Bounding Box detection with the highest confidence and the facial landmark extraction.
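The image pyramid and the NMS step described above can be sketched in a few lines of Python. This is a simplified illustration rather than the code used in MiPRE: the function names are hypothetical, the minimum face size of 20 pixels and the scale factor of 0.709 are assumed defaults, and the IoU threshold of 0.5 is chosen arbitrarily.

```python
def pyramid_scales(height, width, min_face=20, net_input=12, factor=0.709):
    """Scales for the image pyramid (min_face and factor are assumed defaults).

    The first scale maps a face of min_face pixels onto the net_input window;
    scales shrink by `factor` until the shorter side drops below net_input.
    """
    scale = net_input / min_face
    min_side = min(height, width) * scale
    scales = []
    while min_side >= net_input:
        scales.append(scale)
        scale *= factor
        min_side *= factor
    return scales


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep  # indices of retained boxes, highest score first
```

For example, with boxes `(0, 0, 10, 10)`, `(1, 1, 11, 11)` and `(20, 20, 30, 30)` scored 0.9, 0.8 and 0.7, the second box overlaps the first with an IoU of about 0.68 and is suppressed, while the third is kept.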
Figure 3 presents the architecture of the layers used in each stage of the cascaded MTCNN model. Each stage uses different sizes of conv. filters and a different number of layers to produce the same classes of results. The outputs fall into three categories. The face classification result is presented as the first set of outputs using two neurons, which indicate the presence of a face and its score. Another part of the output is the bounding box regression, where four neurons give the upper-left and lower-right corners of the bounding box. Facial landmark localization regresses the positions of five points: left eye, right eye, nose, left mouth corner, and right mouth corner.
During the training phase, the three networks use the landmark positions as supervised signals to guide learning. In the prediction phase, however, P-Net and R-Net only conduct face detection and do not output landmark positions because the landmarks are inaccurate at these stages. The landmark positions are obtained only from the O-Net. Bounding box and landmark coordinate outputs are normalized relative to the input image.
As mentioned above, MTCNN achieves three tasks, namely face classification, bounding box regression, and facial landmark localization. Thus, the loss function of the algorithm also has three parts. Due to limited space, only the key points are highlighted here; readers interested in more details are referred to [40].
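A cross-entropy loss function is employed for face classification, as shown in Eq. (1):
$$L_i^{det} = -\left( y_i^{det} \log(p_i) + (1 - y_i^{det}) \log(1 - p_i) \right) \qquad (1)$$
where $y_i^{det} \in \{0, 1\}$ is the ground truth for sample $x_i$ and $p_i$ is the network output for the face detection.
Next is the bounding box regression loss, where the Euclidean distance loss function is employed, as seen in Eq. (2):
$$L_i^{box} = \left\| \hat{y}_i^{box} - y_i^{box} \right\|_2^2 \qquad (2)$$
where $\hat{y}_i^{box}$ is the regressed box and $y_i^{box}$ is the ground-truth box.
Lastly, the same regression loss is used for the landmarks of each sample, as given in Eq. (3):
$$L_i^{landmark} = \left\| \hat{y}_i^{landmark} - y_i^{landmark} \right\|_2^2 \qquad (3)$$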
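As a numerical illustration of the three loss terms, the following Python sketch evaluates a per-sample cross-entropy loss and a squared Euclidean loss. The function names are hypothetical, the cross-entropy is written in its standard binary form, and the clamping constant `eps` is an assumption added only to avoid taking the logarithm of zero.

```python
import math


def face_cls_loss(p, y, eps=1e-12):
    """Binary cross-entropy for one sample.

    p: predicted probability that the sample is a face.
    y: ground-truth label, 1 for face and 0 for non-face.
    """
    p = min(max(p, eps), 1.0 - eps)   # clamp so that log() stays finite
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def squared_l2_loss(pred, target):
    """Squared Euclidean distance, usable for the 4 box values or the
    10 landmark values (5 points with x and y each)."""
    return sum((a - b) ** 2 for a, b in zip(pred, target))
```

A confident correct prediction gives a small classification loss, for example `face_cls_loss(0.99, 1)` is about 0.01, whereas `face_cls_loss(0.5, 1)` equals `log(2)`.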
III-B Children Faces Recognition
There are several ways to compare the similarity of two images. The Euclidean distance metric is one of the most widely used because it is easy to implement and computationally inexpensive. Given a feature map extracted from a face, this metric measures the similarity between that feature set and a known set. This idea is the backbone of this section: the faces extracted in the face detection step are fed to FaceNet, and the resulting feature map is compared with datasets of known positive and negative images of children’s faces. A similarity threshold is then picked to give a final label to the face.
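The comparison against known positive and negative sets can be sketched as follows. The exact decision rule is not spelled out in the text, so the nearest-reference rule below, the function names, and the toy 2-D vectors standing in for face embeddings are all assumptions for illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def label_face(embedding, child_refs, adult_refs, threshold):
    """Label a face by its distance to the reference embeddings.

    Assumed rule: the face is labeled 'child' when its nearest child
    reference lies within `threshold` and is closer than the nearest
    adult reference; otherwise it is labeled 'adult'.
    """
    d_child = min(euclidean(embedding, r) for r in child_refs)
    d_adult = min(euclidean(embedding, r) for r in adult_refs)
    return "child" if d_child <= threshold and d_child < d_adult else "adult"


# Toy 2-D vectors stand in for real face embeddings.
child_refs = [[0.0, 0.0], [0.2, 0.1]]
adult_refs = [[3.0, 4.0]]
print(label_face([0.1, 0.0], child_refs, adult_refs, threshold=1.0))
print(label_face([3.0, 3.9], child_refs, adult_refs, threshold=1.0))
```

Raising the threshold labels more faces as children, at the price of more adults being mislabeled, which is the trade-off the threshold selection has to balance.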
FaceNet is a universal system that can be used for face authentication, recognition, and clustering. FaceNet’s approach is to learn a mapping from images to a Euclidean space through CNNs, in which spatial distance is directly related to the similarity of pictures. Different images of the same person have a small spatial distance, and images of different people have a larger distance. Once the mapping is determined, the related face recognition tasks become simple [27].
Existing DNN-based face recognition models typically use an FC classification layer. An intermediate FC layer after the conv. layers, or the last conv. layer, provides a vector representation of the face image, and the FC classifier layer is then placed on top of this vector. The disadvantages of such methods are indirectness and inefficiency. In contrast, FaceNet directly uses a triplet-based loss function derived from Large Margin Nearest Neighbor (LMNN) to train the neural network, and the network directly outputs a 128-dimensional embedding vector. The triplets we selected contain two matching face thumbnails and one non-matching face thumbnail. The goal of the loss function is to distinguish positive and negative classes by distance boundaries, as shown in Fig. 4.
The purpose of the model is to embed the 2-D face image $x$ into a Euclidean space with $d$ dimensions, where $f(x) \in \mathbb{R}^d$. In this vector space, the anchor image of a face ($x_i^a$) is close to other images with the same facial characteristics ($x_i^p$) and far from faces with different characteristics ($x_i^n$). As illustrated by Fig. 5, the training process migrates the network’s behavior pattern from the left side to the right side.
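To reach this goal, a triplet loss function is calculated from a triplet of three pictures. The triplet is composed of Anchor (A), Positive (P), and Negative (N) images. Any image can be used as the base point (A); images that have the same facial characteristics are its (P), and images that do not share the same characteristics are considered its (N). The triplet loss minimizes the distance between an anchor and a positive, both of which have the same identity, and maximizes the distance between the anchor and a negative image. Thus, the loss function can be formulated as:
$$L = \sum_{i} \left[ \left\| f(x_i^a) - f(x_i^p) \right\|_2^2 - \left\| f(x_i^a) - f(x_i^n) \right\|_2^2 + \alpha \right]_+ \qquad (4)$$
where $f(x)$ is the embedding of image $x$; $x_i^a$, $x_i^p$, and $x_i^n$ are the anchor, positive, and negative images of the $i$-th triplet; $[\cdot]_+$ denotes $\max(\cdot, 0)$; and $\alpha$ is the safe boundary (margin) between the positive and the negative.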
Theoretically speaking, the most informative triplets for training are the hardest ones, i.e., those with the largest distance between (A) and (P) and the smallest distance between (A) and (N). However, in practice, selecting only these triplets can drive training into a poor local minimum, so that the global solution is not reached. A remedy to this problem is to select all positive image pairs in a mini-batch, which makes the training process more stable. For the selection of (N), on the other hand, as long as Eq. (5) is met the network is going to be trained.
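A minimal Python sketch of the triplet loss is given below for illustration. The function names and the toy 2-D embeddings are hypothetical, and the margin of 0.2 is an assumed value rather than one reported in this paper.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embeddings."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss for a single (A, P, N) triplet.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin` (in squared distance).
    """
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


def batch_triplet_loss(triplets, margin=0.2):
    """Sum of the per-triplet losses over a mini-batch of (A, P, N) tuples."""
    return sum(triplet_loss(a, p, n, margin) for a, p, n in triplets)


# An easy triplet contributes no loss; a violated one is penalized.
easy = ([0.0, 0.0], [0.1, 0.0], [1.0, 0.0])
hard = ([0.0, 0.0], [1.0, 0.0], [0.1, 0.0])
print(triplet_loss(*easy), triplet_loss(*hard))
```

Because easy triplets contribute zero loss, they produce no gradient, which is why the choice of triplets within each mini-batch matters for training.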
Training: to train this network, we used 10,000 images of children’s faces and 10,000 images of adult faces. The children’s ages range from 6 to 14 years. For positive selection, we selected 100 images of children’s faces and 100 images of adult faces. The images are fed through the MTCNN to obtain a Bounding Box around the face. The face rectangle is then fed to FaceNet for the Euclidean distance calculation. Figure 6 presents the data flow of our model for minors’ face detection.
In Fig. 6, the positive and negative datasets against which a test image is compared are prepared beforehand. This feature set is called the Embedding dataset and holds the feature maps of all 200 aforementioned images for comparison. The total inference time is thus divided into two parts: face detection time and feature comparison time using the FaceNet network. More details are presented in Section IV.
IV Experimental Results
IV-A Experimental Setup
The multi-level MiPRE architecture is tested on an x86-based CPU. The model is to be executed on edge-server-grade hardware, which is more powerful than low-powered edge devices. In this context, we consider a laptop or PC to be edge server grade and devices such as the Raspberry Pi to be edge modules. Specifically, the MiPRE model is tested on an AMD Ryzen 7 2700X processor with a 3.7 GHz base clock and 8 cores. The system has two 8 GB memory modules and runs Windows 10. During inference, we observed an average CPU utilization of 18%, which is acceptable considering the need to connect several edge nodes to an edge server. On the other hand, the memory utilization of the process is higher, at about 10 GB on average. Although this memory usage is not surprising given that several stages of CNN models are loaded, it should be considered when deploying the model.
IV-B Accuracy of Face Recognition
We compared the accuracy of our MiPRE model with state-of-the-art age-based models for facial analysis. Table I shows the comparison. The approach reported by Otto et al. divides faces into multiple components and uses their changes as features to indicate the age of the subject face. Levi et al. use a CNN to classify the primary subjects in an image by gender and age. Meanwhile, Teixeira et al. have proposed a rule-based method that divides the image into sections and implements privacy measures based on rule sets. The last method we compared with is that of Du et al., which extracts facial features and detects the age of each. As shown in Table I, our MiPRE scheme achieves better accuracy than all of these reported efforts.
|Method|Accuracy|
|Otto et al.|81.27 %|
|Levi et al.|84.7 %|
|Teixeira et al.|91.14 %|
|Du et al.|79.24 %|
Table II shows the detection performance of the multi-stage model proposed in MiPRE. The model has a miss detection rate of 7.9% on the testing set, as 158 out of 2000 images were mis-categorized. Meanwhile, the detection rate of 92.1% is among the best.
|Miss detection|Miss detection rate|True detection|Detection rate|
|158|7.9 %|1842|92.1 %|
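The rates follow directly from the counts reported for the testing set (158 mis-categorized images out of 2000). A short Python check, with a hypothetical helper name, makes the arithmetic explicit:

```python
def detection_summary(total, missed):
    """Return (true detections, miss rate in %, detection rate in %)."""
    detected = total - missed
    return detected, 100.0 * missed / total, 100.0 * detected / total


detected, miss_rate, detection_rate = detection_summary(total=2000, missed=158)
print(f"{detected} true detections, {miss_rate:.1f}% missed, {detection_rate:.1f}% detected")
```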
Figure 7 shows the ROC curve, which plots the detection rate, shown as the True Positive Rate, versus the False Positive Rate. This curve gives intuitive insight into the best threshold to set for child detection. A bigger area under this curve means that the system performs better, with a higher true positive rate and a lower false positive rate. During implementation, for example, if a true positive rate of 0.7 is needed, then a false positive rate of 0.2 is to be expected.
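For readers who wish to reproduce such a curve from their own scores, the sketch below computes ROC points and the area under the curve. It is an illustration under stated assumptions: the function names are hypothetical, a higher score is taken to mean "more likely a child", both classes must be present, and the four toy samples are not data from this paper.

```python
def roc_points(scores, labels, thresholds):
    """Return one (false positive rate, true positive rate) pair per threshold.

    labels: 1 for child, 0 for adult. A sample is predicted 'child' when
    its score is at or above the threshold. Both classes must be present.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in thresholds:
        tp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))


# Toy scores: two children scored high, two adults scored low.
scores = [0.9, 0.8, 0.3, 0.1]
labels = [1, 1, 0, 0]
points = roc_points(scores, labels, thresholds=[1.1, 0.5, 0.0])
print(points, auc(points))
```

Sweeping the threshold from high to low moves along the curve from the origin toward the upper-right corner, so an operating point is chosen by fixing the acceptable false positive rate and reading off the corresponding threshold.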
V Conclusions
As the number of surveillance cameras increases, families are more concerned about the privacy of their members and their personal data. Child protection is a vital role of parents, and it is important to minimize unauthorized appearances of minor children in video. In particular, the face is one of the most powerful human identifying attributes, and scrambling it can effectively anonymize individuals. In this work, a novel lightweight minor privacy protection scheme named MiPRE is proposed. Leveraging a multi-stage DNN-based face recognition approach to detect children in the video and a lightweight chaos-based face scrambling algorithm, the MiPRE scheme ensures de-identification of minors at the edge of the network, before the video is streamed to the Internet. The MiPRE scheme is tested on a platform consisting of a smart camera and an edge server, and the experimental results verify that the MiPRE scheme meets the design goal. It achieves a high accuracy in children face recognition.
Our ongoing effort mainly focuses on identifying other private attributes that have significant impacts on preserving children’s privacy, which will allow us to extend the coverage of the MiPRE scheme. To meet the additional computing demands raised by features other than faces, we will continue investigating lightweight machine learning algorithms that fit the next version of the MiPRE scheme into edge environments.
-  B. Amos, B. Ludwiczuk, M. Satyanarayanan et al., “Openface: A general-purpose face recognition library with mobile applications,” CMU School of Computer Science, vol. 6, p. 2, 2016.
-  I. R. Berson and M. J. Berson, “Children and their digital dossiers: Lessons in privacy rights in the digital age.” International Journal of Social Education, vol. 21, no. 1, pp. 135–147, 2006.
-  S. Bhattacharya and M. Gupta, “A survey on: Facial emotion recognition invariant to pose, illumination and age,” in 2019 Second International Conference on Advanced Computational and Communication Paradigms (ICACCP). IEEE, 2019, pp. 1–6.
-  A. Cavallaro, “Privacy in video surveillance [in the spotlight],” IEEE Signal Processing Magazine, vol. 2, no. 24, pp. 168–166, 2007.
-  N. Chen, Y. Chen, E. Blasch, H. Ling, Y. You, and X. Ye, “Enabling smart urban surveillance at the edge,” in 2017 IEEE International Conference on Smart Cloud (SmartCloud). IEEE, 2017, pp. 109–119.
-  D. Cox and N. Pinto, “Beyond simple features: A large-scale feature search approach to unconstrained face recognition,” in Face and Gesture 2011. IEEE, 2011, pp. 8–15.
-  J.-X. Du, C.-M. Zhai, and Y.-Q. Ye, “Face aging simulation based on nmf algorithm with sparseness constraints,” in International Conference on Intelligent Computing. Springer, 2011, pp. 516–522.
-  F. Dufaux, “Video scrambling for privacy protection in video surveillance: recent results and validation framework,” in Mobile Multimedia/Image Processing, Security, and Applications 2011, vol. 8063. International Society for Optics and Photonics, 2011, p. 806302.
-  M. E. Fathy, V. M. Patel, and R. Chellappa, “Face-based active authentication on mobile devices,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 1687–1691.
-  A. Fitwi, Y. Chen, and N. Zhou, “An agent-administrator-based security mechanism for distributed sensors and drones for smart grid monitoring,” in Signal Processing, Sensor/Information Fusion, and Target Recognition XXVIII, vol. 11018. International Society for Optics and Photonics, 2019, p. 110180L.
-  A. Fitwi, Y. Chen, and S. Zhu, “No peeking through my windows: Conserving privacy in personal drones,” arXiv preprint arXiv:1908.09935, 2019.
-  A. Fitwi, M. Yuan, S. Y. Nikouei, and Y. Chen, “Minor privacy protection by real-time children identification and face scrambling at the edge,” submitted to EAI Endorsed Transactions on Security and Safety, 2020.
-  U. Hessler, “Museum camera films merkel’s apartment in security breach,” https://www.dw.com/en/museum-camera-films-merkels-apartment-in-security-breach/a-1945643, 2006 (accessed on April 2, 2019).
-  E. Jose, M. Greeshma, M. H. TP, and M. Supriya, “Face recognition based surveillance system using facenet and mtcnn on jetson tx2,” in 2019 5th International Conference on Advanced Computing & Communication Systems (ICACCS). IEEE, 2019, pp. 608–613.
-  V. Kumar and J. Svensson, Promoting social change and democracy through information technology. IGI Global, 2015.
-  G. Levi and T. Hassner, “Age and gender classification using convolutional neural networks,” in
-  Y. Li, G. Wang, L. Nie, Q. Wang, and W. Tan, “Distance metric optimization driven convolutional neural network for age invariant face recognition,” Pattern Recognition, vol. 75, pp. 51–62, 2018.
-  M. O. Lwin, A. J. Stanaland, and A. D. Miyazaki, “Protecting children’s privacy online: How parental mediation strategies affect website safeguard effectiveness,” Journal of Retailing, vol. 84, no. 2, pp. 205–217, 2008.
-  D. Lyon, “Surveillance, power and everyday life,” in Emerging digital spaces in contemporary society. Springer, 2010, pp. 107–120.
-  D. Nagothu, R. Xu, S. Y. Nikouei, and Y. Chen, “A microservice-enabled architecture for smart surveillance using blockchain technology,” in 2018 IEEE International Smart Cities Conference (ISC2). IEEE, 2018, pp. 1–4.
-  E. M. Newton, L. Sweeney, and B. Malin, “Preserving privacy by de-identifying face images,” IEEE transactions on Knowledge and Data Engineering, vol. 17, no. 2, pp. 232–243, 2005.
-  S. Y. Nikouei, Y. Chen, A. Aved, and E. Blasch, “I-vise: Interactive video surveillance as an edge service using unsupervised feature queries,” arXiv preprint arXiv:2003.04169, 2020.
-  S. Y. Nikouei, Y. Chen, S. Song, R. Xu, B.-Y. Choi, and T. Faughnan, “Real-time human detection as an edge service enabled by a lightweight cnn,” in the IEEE International Conference on Edge Computing. IEEE, 2018.
-  ——, “Smart surveillance as an edge network service: From harr-cascade, svm to a lightweight cnn,” in 2018 ieee 4th international conference on collaboration and internet computing (cic). IEEE, 2018, pp. 256–265.
-  S. Y. Nikouei, R. Xu, Y. Chen, A. Aved, and E. Blasch, “Decentralized smart surveillance through microservices platform,” in Sensors and Systems for Space Applications XII, vol. 11017. International Society for Optics and Photonics, 2019, p. 110170K.
-  C. Otto, H. Han, and A. Jain, “How does aging affect facial components?” in European Conference on Computer Vision. Springer, 2012, pp. 189–198.
-  F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 815–823.
-  M. Sharif, S. Bhagavatula, L. Bauer, and M. K. Reiter, “Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition,” in Proceedings of the 2016 acm sigsac conference on computer and communications security, 2016, pp. 1528–1540.
-  B. Shmueli and A. Blecher-Prigat, “Privacy for children,” Colum. Hum. Rts. L. Rev., vol. 42, p. 759, 2010.
-  Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to human-level performance in face verification,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1701–1708.
-  E. Taylor, “I spy with my little eye: the use of cctv in schools and the impact on privacy,” The Sociological Review, vol. 58, no. 3, pp. 381–405, 2010.
-  L. Teixeira, F. Maffra, K. Lelu, A. Al-Obaidi, and A. Badii, “A rule-based methodology and assessment for context-aware privacy,” in 2014 IEEE 6th International Conference on Awareness Science and Technology (iCAST). IEEE, 2014, pp. 1–6.
-  T. Uiboupin, P. Rasti, G. Anbarjafari, and H. Demirel, “Facial image super resolution using sparse representation for improving face recognition in surveillance monitoring,” in 2016 24th Signal Processing and Communication Application Conference (SIU). IEEE, 2016, pp. 437–440.
-  E. Vazquez-Fernandez and D. Gonzalez-Jimenez, “Face recognition for authentication on mobile devices,” Image and Vision Computing, vol. 55, pp. 31–33, 2016.
-  S. Waters, “The effects of mass surveillance on journalists’ relations with confidential sources: A constant comparative study,” Digital Journalism, vol. 6, no. 10, pp. 1294–1313, 2018.
-  R. Wu, B. Liu, Y. Chen, E. Blasch, H. Ling, and G. Chen, “A container-based elastic cloud architecture for pseudo real-time exploitation of wide area motion imagery (wami) stream,” Journal of Signal Processing Systems, vol. 88, no. 2, pp. 219–231, 2017.
-  R. Xu, S. Y. Nikouei, Y. Chen, S. Song, A. Polunchenko, C. Deng, and T. Faughnan, “Real-time human object tracking for smart surveillance at the edge,” in the IEEE International Conference on Communications (ICC), Selected Areas in Communications Symposium Smart Cities Track. IEEE, 2018.
-  M. A. Yaman, A. Subasi, and F. Rattay, “Comparison of random subspace and voting ensemble machine learning methods for face recognition,” Symmetry, vol. 10, no. 11, p. 651, 2018.
-  Y. Yang, L. Wu, G. Yin, L. Li, and H. Zhao, “A survey on security and privacy issues in internet-of-things,” IEEE Internet of Things Journal, vol. 4, no. 5, pp. 1250–1258, 2017.
-  K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters, vol. 23, no. 10, pp. 1499–1503, 2016.
-  H. Zhou and K.-M. Lam, “Age-invariant face recognition based on identity inference from appearance age,” Pattern recognition, vol. 76, pp. 191–202, 2018.