Privacy-sensitive Objects Pixelation for Live Video Streaming

01/03/2021 ∙ by Jizhe Zhou, et al. ∙ University of Macau

With the prevalence of live video streaming, establishing an online pixelation method for privacy-sensitive objects has become urgent. Because of the inaccurate detection of privacy-sensitive objects, simply migrating the tracking-by-detection structure into an online form incurs problems in target initialization, drifting, and over-pixelation. To cope with this inevitable but impactful detection issue, we propose a novel Privacy-sensitive Objects Pixelation (PsOP) framework for automatic personal privacy filtering during live video streaming. Leveraging pre-trained detection networks, our PsOP is extendable to any potential privacy-sensitive object pixelation. Employing embedding networks and the proposed Positioned Incremental Affinity Propagation (PIAP) clustering algorithm as the backbone, our PsOP unifies the pixelation of discriminating and indiscriminating pixelation objects through trajectory generation. In addition to boosting pixelation accuracy, experiments on the streaming video data we built show that the proposed PsOP can significantly reduce the over-pixelation ratio in privacy-sensitive object pixelation.


1. Introduction

(a) Established offline pixelation pipeline.
(b) The proposed PsOP pipeline.
Figure 1. Current pixelation methods vs. PsOP under the same scene. The leftmost column shows a clip of a streaming scene. Potential privacy-sensitive objects are labeled with circles and rectangles, and only the circled objects shall be pixelated. Tracklets rendered in gray in (a) require manual association. The colors of the clusters in (b) correspond to the objects of the same colors. Pixelation results are shown in the rightmost clip.

Live video streaming has never been as popular as in the past two or three years. Live video streaming, especially outdoors, presents audiences with unprepared, unrehearsed filming of reality and ramps up the sense of immersion, unpredictability, or nervousness. Real scenes are recorded and then instantly broadcast to audiences, mostly through streamers’ phone cameras and the mobile network. Streaming activities have grown in symbiosis with the popularization of smartphones, 5G networks, and cheaper communication costs (stewart2016up). The primary hosts of live streaming activities have shifted from TV stations and news bureaus to ordinary individuals. Every rose has its thorn. Without imposed censorship or filters before broadcasting, the pervasiveness of live streaming today arouses unprecedented violations in the field of personal privacy protection. Strict laws or conservative religions even regard those violations as crimes (faklaris2016legal).

Actually, apart from the indifference of streamers and the absence of imposed censorship, the handcrafted pixelation process is the main reason for privacy infringements amid video streaming. As a giant live-streaming hosting service provider, YouTube is already aware of such urgent needs and published its offline auto-pixelation tool in the latest YouTube Creator Studio. As far as we know, current studies, including YouTube Studio and Microsoft Azure, mainly focus on processing offline videos, while leaving pixelation methods for online live video streaming underexplored. To ensure privacy rights and the sound development of the streaming industry, proper pixelation methods shall be interpolated between the filming and broadcasting of live video streaming. Transparency in terms of user experience is also indispensable to keep the pixelation practical. Therefore, in this paper, we devote ourselves to establishing a pixelation method that performs automatic personal privacy filtering during unconstrained streaming activities.

Most of the few existing offline pixelation methods adopt a similar tracking-by-detection structure that places privacy-sensitive object detectors ahead of multi-object trackers. In such a structure, a Multi-Object Tracker (MOT) aims at continuously locating an arbitrary target in a video given a bounding box in the target’s initial frame, while the detector is responsible for providing that bounding box. As shown in Figure 1 (a), this structure intuitively follows the manual pixelation pipeline and works well on fine-shot videos. Once a sensitive object is spotted by the detector, it is tracked by the tracker until it vanishes. This action sequence loops over for every detection. Allocating Gaussian or other filters on the trajectories yields the final pixelation. Essentially, the object trajectories determine the final pixelation results.

In the tracking-by-detection structure, trajectories are the joint effort of trackers and detectors, yet in fact trackers and detectors operate independently. Compared with conventional fine-shot videos, live streaming videos involve very few shot changes but shaky cameras. Recorded mostly by hand-held cameras, a shaky or jerky camera implies frequent camera shakes, abrupt alterations of streaming conditions, and noisy backgrounds. Under such conditions, trackers still struggle with scattered or drifted tracklets in the long term. Even if long-term tracking accuracy were somehow resolved, the tracking-by-detection structure is also prone to corruption, as its performance decisively depends on detection accuracy. Trackers are unable to function well with poor initializations. However, due to inadequate training samples and the lack of comprehension of video contexts, false positives and false negatives produced by image-based detectors spring up in streaming videos. Consequently, besides efficiency issues, blinking, drifting, missing, or excessive mosaics are allocated everywhere when migrating existing methods to pixelate live videos. Hence, with unacceptable mosaics mainly attributable to inaccurate detections, current methods are incapable of handling live video streaming.

The other major drawback of the current pixelation workflow is the over-pixelation problem. Introduced by the tracking algorithms and inherited by current tracking-by-detection pixelation methods, the over-pixelation problem occurs when trackers insist on generating puzzling and unnecessary mosaics for unidentifiable privacy-sensitive objects. Heavy or full occlusion and massive motion blur are the common reasons that objects become unidentifiable. Tracking algorithms are designed to construct seamless or long-enough trajectories for objects in order to avoid frequent ID switches. Unlike tracking, while blocking privacy-sensitive information from leaking, pixelation tasks are dedicated to preserving as much of the original content for the audience as possible. Over-pixelation is therefore an intrinsic problem when migrating tracking algorithms to pixelation tasks.

To address the aforementioned issues, our Privacy-sensitive Objects Pixelation (PsOP) introduces a brand-new framework for generating reliable pixelation with real-time efficiency in live video streaming. To yield reliable trajectories for pixelation, PsOP foremost copes with the inaccurate detections caused by deep-network insufficiency and the lack of comprehension of video contexts. PsOP divides the potential privacy-sensitive objects into context-irrelevant Indiscriminating Pixelation Objects (IPOs) and context-relevant Discriminating Pixelation Objects (DPOs) to alleviate the contexts’ effects on detection. As their sensitiveness barely changes across scenes, IPOs, including erotic images, trademarks, phone numbers, car plate numbers, etc., can be handled through well-established sensitive-object detection algorithms (yu2016iprivacy). Conversely, DPOs, mainly made up of faces and texts, cannot be dealt with by detection networks alone. When the domains of discriminating and indiscriminating pixelation objects overlap (e.g., a phone number is also a kind of text), IPOs are prioritized in claiming the overlapped bounding box for tighter privacy protection. As depicted in Figure 1 (b), detected IPOs are smoothed and then pixelated.

The remaining Discriminating Pixelation Objects (DPOs) are the primary concern of PsOP. The sensitiveness of instances belonging to DPOs changes according to scenes or contexts. The varying sensitiveness of DPOs, along with the inbuilt network insufficiency, cannot simply be solved through training on video data. Such training assumes that the detection, recognition, and video semantic segmentation tasks share a unified network structure and demands labor-intensive video data labeling. Considering that learning-based feature vectors are in fact the central element for associating instances into trajectories, as shown in Figure 1 (b), PsOP abnegates the tracking-by-detection structure and employs a detection, embedding, and clustering procedure to generate trajectories under inaccurate detections.

Specifically, for every class of DPOs, pre-trained detection and embedding networks are leveraged to yield feature vectors for instances of that class. Subsequently, we propose the Positioned Incremental Affinity Propagation (PIAP) algorithm for clustering under inaccurate detections and embeddings. In classic Affinity Propagation (AP), affinities are the distances between feature vectors. A series of messages about affinities, denoted as availabilities and responsibilities, are propagated among vectors to reach a final consensus on the clustering results. For PsOP, since an individual instance of a DPO cannot appear at different positions within a frame, the detected instances positioned in the same frame serve as negative samples with minimal affinities. These minimized affinities revise the consensus through the propagation of availabilities and responsibilities. The preference matrix is also computed based on availabilities and responsibilities. A weak preference indicates that the detection is relatively isolated in the feature space and is thereby excluded as an outlier through propagation. Besides, as a single instance owns similar expanding availability and responsibility matrices in adjacent frames, positioned affinities are propagated in an incremental way. A run of PIAP iterates until the consensus on the clustering results is reached for all vectors. Linking the detections within the same cluster forms the trajectory of an instance. Compared with classic AP, while retaining the ability to cluster under an ill-defined cluster number, PIAP solves the noise-sensitivity and time-consumption problems of AP by introducing positioned affinities and incremental propagation. Moreover, unlike gradually fitting trackers, the detection- and embedding-based PIAP simultaneously solves the inherent over-pixelation problem.

In order to evaluate the performance of the proposed PsOP and PIAP, we crawled raw streaming videos from the YouTube Live and Facebook Live platforms. These raw video contents were inspected and selected in detail to ensure dataset diversity. Tested on the live streaming video dataset (84261 labeled boxes and 19372 frames) we collected and manually labeled, PsOP gains a significant boost in pixelation accuracy as well as retains much more non-sensitive original content for audiences.

In summary, the main contributions of this paper are as follows:

  • We build the Privacy-sensitive Objects Pixelation (PsOP) framework for the pixelation of privacy-sensitive objects in live video streaming. As far as we know, PsOP is the first online method that adopts the detection, embedding, and clustering procedure and solves the over-pixelation problem.

  • We propose the Positioned Incremental Affinity Propagation (PIAP) clustering algorithm to generate trajectories for inaccurately detected, sensitiveness-varying discriminating pixelation objects. The proposed PIAP spontaneously handles cluster-number generation and clustering under unbalanced sample sizes, and further endows classic AP with noise-resistance and time-saving merits.

  • We built a live video streaming dataset to test the proposed PsOP framework and will make it publicly available. Diverse streaming videos were collected from live streaming platforms, and dense annotations on each frame were manually labeled for the evaluation of the proposed method.

Figure 2. The proposed PsOP framework. Purple and blue lines respectively indicate the pixelation process of discriminating pixelation objects and non-sensitive objects. Dashed lines and boxes present the extendability of PsOP through pre-training.

2. Related Works

To the best of our knowledge, face pixelation in live video streaming has not been well studied before. Current research on privacy-sensitive object detection and pixelation focuses on the image level (yu2016iprivacy; yu2017privacy). (yu2016iprivacy) established a multi-task learning based algorithm to extract user-dependent privacy-sensitive objects. User-blurred images are collected and fed to multi-task CNNs to find privacy-sensitive objects through a decision-tree shaped loss function. Putting aside the availability of such data in videos and its accuracy, the relatively shallow structure along with the decision-tree shaped loss function is unaware of the sensitiveness variations of discriminating pixelation objects in videos. The most widely applied commercial offline pixelation tools, developed by YouTube Studio (youtube2020) and Microsoft Azure (Juliako2020), adopt similar detection networks in their tracking-by-detection structure. However, these works do offer us an innovative way of spotting the IPOs.

As we mentioned in the introduction, current pixelation methods are incapable of functioning under poor detection. Referring to existing techniques, we first tried to solve the detection flaws in an end-to-end way by replacing 2D deep CNNs with their 3D extension (ji20133d) or with CNNs equipped with attention mechanisms (li2018learning; zhu2018end). Taking an entire video as input, 3D CNNs and attention models consider spatial-temporal information simultaneously. Having achieved great success in video segmentation (yue2015beyond; karpathy2014large) and action recognition (ji20133d; tran2015learning), they are expected to relieve the detection burden. However, tested on the live streaming dataset we collected, both 3D CNNs and attention models are quite time-consuming and extremely sensitive to video quality and length. Furthermore, without sufficient training data, and being designed to extract high-level abstractions of video contexts, 3D CNNs cannot precisely handle the multi-task regression at the individual frame level. Therefore, the mosaics generated by both models always blink during live streaming.

Another proper way is to exploit multi-object tracking (MOT) algorithms that can contend with poor initializations. Although MOT is claimed to be thoroughly studied (zhang2014technology), tracking accuracy is still challenging in unconstrained videos. Reviewing state-of-the-art trackers, two main categories of conventional trackers can be found: offline and online trackers. The former category assumes that object detection in all frames has already been conducted, and clean tracklets are achieved by linking different detections and tracks in offline mode (berclaz2011multiple). This property of offline trackers allows for global optimization of the path (zamir2012gmcp) but makes them inappropriate for dealing with live streaming.

A typical online MOT algorithm aims at continuously locating an arbitrary target in a video given a bounding box in the target’s initial frame. State-of-the-art online MOT algorithms like KCF (henriques2014high) and ECO (danelljan2017eco) are implemented through continuous motion prediction and online learning based on correlation filters. A variety of tracking-related works have been conducted; however, they concentrate on the association of learning-based appearance features across shot changes. Tracking problems give tacit consent to the precision of bounding boxes, and intrinsic detection inaccuracy is seldom mentioned or discussed. Only very few works (yu2016poi) take detection accuracy into consideration, and they merely acknowledge the false negatives.

3. Proposed Method

The procedure of the proposed Privacy-sensitive Objects Pixelation (PsOP) is shown in Figure 2. The processing of context-irrelevant Indiscriminating Pixelation Objects (IPOs) is omitted since they are pixelated without generating trajectories. Live streaming videos are sliced into video segments of a fixed size for better accuracy. Then, frame-wise detections are applied through detectors pre-trained on image datasets. Detectors, along with corresponding embedding networks for every class of context-relevant Discriminating Pixelation Objects (DPOs) (not limited to faces and texts), are applied in parallel. Embeddings yielded by the same detector and embedding network are fed to the PIAP clustering. The parallel PIAP associates the same instance across frames into the same cluster, and sequential links within a cluster form the trajectory. As not all DPOs are sensitive in a particular scene, a sensitive thesaurus (bollegala2012cross) or minor manual specification is appended to further filter the trajectories of DPOs. Gaussian filters are imposed on the final trajectories for pixelation before streaming to the audience.

3.1. Video Segments

The proposed PsOP leverages image-based pre-trained detection networks. Dashed lines and boxes in Figure 2 show the extendable detection of IPOs and DPOs through user-defined pre-training datasets. As false positives and negatives are common in detection, we design a buffer section at the beginning of each live stream to promote the accuracy of trajectories without affecting the audience experience. With PsOP’s real-time efficiency, we can stack a fixed number of frames into a short video segment by demanding a frame-buffering section at the very beginning of live video streaming without causing discontinuities in broadcasting. Video segments slice the typically hours-long live stream into numerous segments at the seconds level. Transformed into a short lag at the beginning, the conflict between accuracy and efficiency is greatly alleviated. Trajectories of privacy-sensitive objects can be smoothed within and across segments. Considering the primal latency brought by data communication and video compression, such an appended buffering latency is acceptable to users.
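As a rough illustration of this buffering step, the sketch below groups an incoming frame stream into fixed-size segments before they are handed to the detectors. The 150-frame segment length matches the setting reported later in Section 4.2, while the generator interface itself is only an illustrative assumption, not the paper's implementation.

```python
SEGMENT_LEN = 150  # frames per segment, as reported in Section 4.2

def stream_in_segments(frame_iter, segment_len=SEGMENT_LEN):
    """Group an incoming frame stream into fixed-size segments.

    `frame_iter` is any iterable of decoded frames; each yielded list
    is one video segment that PsOP processes as a unit.
    """
    buffer = []
    for frame in frame_iter:
        buffer.append(frame)
        if len(buffer) == segment_len:
            yield buffer
            buffer = []
    if buffer:          # flush the tail when the stream ends
        yield buffer
```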

3.2. Indiscriminate Pixelation Objects

The Indiscriminating Pixelation Objects (IPOs) are detected first in video segments. We adopt (yu2016iprivacy) as the detection network here. Training is conducted jointly on NPDI (avila2013pooling), METU (tursun2017large), and openALPR. The multi-task learning network is trained to handle the detection of erotic images, trademarks, and plate numbers simultaneously. As in Figure 2, video segments are directly fed to the detector of indiscriminate pixelation objects, colored in purple. Gaussian smoothing among five consecutive frames is applied to compensate for false negatives. Within a segment, a detection that has fewer than five bounding boxes with a sufficient overlapping ratio (IoU) is eliminated as a false positive to avoid blinking mosaics. Afterward, the trajectories of indiscriminate pixelation objects are established and ready for pixelation. To avoid the domain overlap of IPOs and DPOs, bounding boxes detected by both detectors with a sufficient IoU are categorized as IPOs.
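A minimal sketch of this smoothing and false-positive filtering step is given below. It greedily links per-frame IPO detections by IoU, discards tracklets shorter than five frames, and smooths the surviving box coordinates over time. The 0.7 IoU threshold follows Section 4.2; the helper names and the Gaussian sigma are illustrative assumptions rather than the paper's exact implementation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)

def filter_and_smooth(det_per_frame, iou_thr=0.7, min_len=5):
    """Link per-frame IPO detections into tracklets, drop short ones as
    false positives, and Gaussian-smooth the surviving box coordinates."""
    tracklets = []                                   # each: list of (frame_idx, box)
    for t, boxes in enumerate(det_per_frame):
        for box in boxes:
            for tr in tracklets:                     # greedy association with the previous frame
                last_t, last_box = tr[-1]
                if t - last_t == 1 and iou(last_box, box) >= iou_thr:
                    tr.append((t, box))
                    break
            else:
                tracklets.append([(t, box)])         # start a new tracklet
    smoothed = []
    for tr in tracklets:
        if len(tr) < min_len:                        # too short: treated as a false positive
            continue
        coords = np.array([b for _, b in tr], dtype=float)
        for c in range(4):                           # temporal Gaussian smoothing per coordinate
            coords[:, c] = gaussian_filter1d(coords[:, c], sigma=1.0, mode="nearest")
        smoothed.append([(t, tuple(xy)) for (t, _), xy in zip(tr, coords)])
    return smoothed
```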

3.3. Discriminate Pixelation Objects

To elucidate the details of the proposed pixelation method for Discriminate Pixelation Objects (DPOs), faces and texts are used in the following as tangible instances. Following the embedding-then-clustering pixelation process for DPOs, in this paper we use MTCNN (zhang2016joint) and CosFace (wang2018cosface) to process every frame in turn. The choice of MTCNN and CosFace considers the convenience for the latter cosine similarity in PIAP; the detection and embedding networks can be substituted by other state-of-the-art methods. MTCNN accepts inputs of arbitrary size and detects faces above a minimum pixel size. CosFace generates a fixed-dimension feature vector for each detected face. Faces are aligned to the frontal pose through an affine transformation before embedding. Similarly for texts, TextBoxes (liao2017textboxes) and FastText (joulin2016fasttext) are used for text detection and embedding. TextBoxes receives input images containing texts at arbitrary scales. Similar to MTCNN, frames are rescaled for efficiency, and Non-Maximum Suppression (NMS) is also leveraged to boost accuracy. FastText is a C-BOW-like fast text classification network, and the output of the second-to-last layer before the softmax regression is extracted as the embedding; its embedding dimension is fixed. The clustering for faces and texts is conducted in parallel.
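To make this per-frame detection and embedding step concrete, here is a small sketch that collects one feature vector per detected DPO instance together with the index of the frame it came from, which PIAP later needs for the positioned similarities. The `detector`, `embedder`, and `aligner` callables are hypothetical wrappers standing in for MTCNN/CosFace-style networks.

```python
import numpy as np

def crop_box(frame, box):
    """Crop an (x1, y1, x2, y2) box from an H x W x C frame array."""
    x1, y1, x2, y2 = map(int, box)
    return frame[y1:y2, x1:x2]

def embed_dpo_instances(frames, detector, embedder, aligner=None):
    """Run per-frame DPO detection and embedding.

    `detector(frame)` -> list of bounding boxes, `embedder(crop)` -> 1-D
    feature vector, and the optional `aligner(frame, box)` -> aligned crop
    are hypothetical wrappers around MTCNN/CosFace-style networks.
    Returns the stacked feature matrix plus, for each row, the frame index
    it came from and its bounding box.
    """
    feats, frame_ids, boxes_out = [], [], []
    for t, frame in enumerate(frames):
        for box in detector(frame):
            crop = aligner(frame, box) if aligner else crop_box(frame, box)
            feats.append(np.asarray(embedder(crop), dtype=float))
            frame_ids.append(t)
            boxes_out.append(box)
    feats = np.vstack(feats) if feats else np.empty((0, 0))
    return feats, frame_ids, boxes_out
```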

The proposed PIAP is then activated to connect the same face or the same piece of text across frames according to the face or text vectors. Note that the efficacy of PIAP remains the same when establishing trajectories for other objects of this type. DBSCAN (ester1996density) and Affinity Propagation (AP) (frey2007clustering) are the candidates under unpredictable cluster numbers. As noisy detection results are common in videos and the data size for clustering is also unbalanced, the density-based DBSCAN is excluded. To revise false detections and variations of embedded vectors, and to boost the clustering speed, we employ position information and incremental clustering in the proposed PIAP. Sequential links within each cluster form the trajectory of each face and text, and likewise, Gaussian smoothing is applied within a segment before pixelation.

3.3.1. Affinity Propagation (AP)

Similar to traditional clustering algorithms, the first step of classic AP is measuring the distances between data nodes, denoted as similarities. Following the common notation in AP, x_i and x_k (face or text feature vectors here) are two of the data nodes. S is the similarity matrix that stores the similarities between every two nodes. s(i,k), the element on row i, column k of S, denotes the similarity between data nodes x_i and x_k, thereby indicating how well node x_k is suited to be the exemplar for node x_i. Similar notations are also used below.

The core of AP is that a series of responsibility and availability messages are passed among all data nodes to reach a consensus on the selection of exemplars. R and A are the responsibility and availability matrices. The responsibility r(i,k), passed from x_i to its potential exemplar x_k, indicates the current willingness of x_i to choose x_k as its exemplar, considering all the other potential exemplars. Correspondingly, the availability a(i,k) responds from x_k to x_i to update the current willingness of x_k to accept x_i as its member, considering all the other potential members. The sum of r(i,k) and a(i,k) directly gives the fitness of choosing x_k as the exemplar of x_i. The consensus of this sum-product message-passing process (R and A remain the same after iterations) stands for the final agreement of all nodes on the selection of exemplars, by which the association of clusters is reached. Apart from handling an ill-defined cluster number, AP is not sensitive to initialization settings, selects real data nodes as exemplars, allows asymmetric matrices as input, and is more accurate when measured by the sum of squared errors. Therefore, considering the subspace distribution (least-squares regression), AP is effective in generating robust and accurate clustering results for high-dimensional data like face vectors.

The governing equations for message passing in AP are:

r(i,k) ← s(i,k) − max_{k′≠k} { a(i,k′) + s(i,k′) }    (1)

a(i,k) ← min{ 0, r(k,k) + Σ_{i′∉{i,k}} max(0, r(i′,k)) },  i ≠ k    (2)

Equation (3) fills in the elements on the diagonal of the availability matrix:

a(k,k) ← Σ_{i′≠k} max(0, r(i′,k))    (3)

Responsibilities and availabilities are updated according to (1), (2), and (3) until convergence; then the criterion matrix that determines the exemplars is the sum of R and A at each location:

c(i,k) = r(i,k) + a(i,k)    (4)

For each row of the criterion matrix, the column holding the highest value designates that node’s exemplar.
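For reference, a compact NumPy sketch of the classic AP updates (1)-(4) with the usual damping is shown below; it assumes a dense similarity matrix S whose diagonal already holds the preferences, and it is only an illustration of the standard algorithm, not the paper's code.

```python
import numpy as np

def affinity_propagation(S, damping=0.5, max_iter=200):
    """Classic AP message passing (Eqs. 1-4) over similarity matrix S.

    S is N x N with preferences on its diagonal; returns, for each node,
    the index of the exemplar it is assigned to.
    """
    S = np.asarray(S, dtype=float)
    N = S.shape[0]
    R = np.zeros((N, N))
    A = np.zeros((N, N))
    rows = np.arange(N)
    for _ in range(max_iter):
        # Eq. (1): r(i,k) <- s(i,k) - max_{k'!=k} [a(i,k') + s(i,k')]
        AS = A + S
        best_idx = AS.argmax(axis=1)
        best = AS[rows, best_idx]
        AS[rows, best_idx] = -np.inf
        second = AS.max(axis=1)
        R_new = S - best[:, None]
        R_new[rows, best_idx] = S[rows, best_idx] - second
        R = damping * R + (1 - damping) * R_new
        # Eqs. (2)-(3): availabilities from positive responsibilities
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, np.diag(R))
        col = Rp.sum(axis=0)
        A_new = np.minimum(0.0, col[None, :] - Rp)
        np.fill_diagonal(A_new, col - np.diag(Rp))
        A = damping * A + (1 - damping) * A_new
    # Eq. (4): criterion matrix; the row-wise maximum designates the exemplar
    return np.argmax(A + R, axis=1)
```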

3.3.2. Positioned Incremental Affinity Propagation (PIAP)

Clustering builds the connection of the same face across frames, since its result can easily be corrected with some intervention. With the results of face detection and recognition, the proposed PIAP processes data in a segment-wise way to group the faces within and across segments simultaneously. Apparently, longer segments offer more information but also incur more noise and reduce efficiency. In this paper, the proposed PsOP cuts every 150 frames into a segment and leverages the whole existing context to reach accurate and fast face pixelation. Define a data stream X_t as the sequentially collected face/text feature vectors of the current video segment; i.e., X_t is an n_t × d matrix of n_t feature vectors with dimension d.

For normalization in PIAP, cosine similarity is adopted as the measurement. As an instance of a DPO (like a person’s face) can only appear at one position in a single frame, similarities between instances that belong to the same frame are set to the minimum similarity value s_min. Let P(i) stand for the set of other instances that belong to the same frame as x_i. The similarity matrix can then be generated as:

s(i,k) = cos(x_i, x_k) if x_k ∉ P(i);   s(i,k) = s_min if x_k ∈ P(i)    (5)

Moreover, in (2), the left-hand side a(i,k) accumulates the evidence of how much x_k can stand for itself (r(k,k)) and how much the other nodes are also responsible for letting x_k stand for them (the positive responsibilities Σ max(0, r(i′,k))). Every positive responsibility towards x_k contributes to a(i,k). Therefore, according to (5), a(i,k) should not accumulate the choices of the instances in P(i) as evidence for x_k representing x_i. One step further, the choices of the instances in P(i) shall actually repel x_i in availability, and the message passing of a(i,k) is rewritten as:

a(i,k) ← min{ 0, r(k,k) + Σ_{i′∉{i,k}∪P(i)} max(0, r(i′,k)) − Σ_{i′∈P(i)} max(0, r(i′,k)) },  i ≠ k    (6)

Equations (5) and (6) ensure that x_i and the instances in P(i) are mutually exclusive in the clustering process.
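A small sketch of the positioned similarity matrix of Eq. (5) is given below. The median-of-similarities preference on the diagonal is a common AP default rather than the preference computation described in the paper, and s_min is an assumed large negative constant; the resulting matrix can be fed directly to an AP-style message-passing loop such as the sketch above.

```python
import numpy as np

def positioned_similarity(X, frame_ids, s_min=-1e6):
    """Cosine-similarity matrix with same-frame instances forced apart (Eq. 5).

    X is an N x d matrix of feature vectors and frame_ids[i] is the frame in
    which instance i was detected.
    """
    X = np.asarray(X, dtype=float)
    Xn = X / (np.linalg.norm(X, axis=1, keepdims=True) + 1e-9)
    S = Xn @ Xn.T                                    # cosine similarities
    frame_ids = np.asarray(frame_ids)
    same_frame = frame_ids[:, None] == frame_ids[None, :]
    np.fill_diagonal(same_frame, False)              # an instance is not excluded from itself
    off_diag = ~same_frame
    np.fill_diagonal(off_diag, False)
    pref = np.median(S[off_diag])                    # common AP default preference
    S[same_frame] = s_min                            # Eq. (5): same-frame mutual exclusion
    np.fill_diagonal(S, pref)
    return S
```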

After this revision with position information, the challenge of extending AP into an incremental form is that data nodes received at different timestamps stay at varying statuses with disproportionate responsibility and availability values.

However, video frames are strongly self-correlated data. We can assign proper values to newly arrived vectors according to the ones in the previous segment without affecting the clustering purity. The embedded vectors of a particular person or piece of text shall stay close to each other in the feature space across different frames. Thus, our incremental AP algorithm is based on the fact that if two detected faces/texts are in adjacent segments and refer to the same person/the same piece of text, they should not only be clustered into the same group but also have the same responsibilities and availabilities. Such a fact was not well considered in past studies of incremental affinity propagation (sun2014incremental; wang2013multi).

Following the common notations in AP, the similarity matrix at time t is denoted as S^t with dimension N_t × N_t, where N_t = N_{t−1} + n_t. The responsibility matrix and availability matrix at time t are R^t and A^t, with the same dimension as S^t. Then, the update rules of R^t and A^t with respect to R^{t−1} and A^{t−1} can be written as:

(7)

Note that the dimensions of the three matrices increase with time. X_t is the matrix of newly arrived face vectors of the segment at time t, and n_t stands for the number of faces at time t.

(8)

Then, A^t can easily be updated through

(9)

Denote by X_t the set of all vectors extracted in segment t. The full process of the PIAP algorithm is summarized in Algorithm 1.

Input: sequentially collected feature-vector segments X_t, t = 1, 2, …

Output: exemplars and clustering results of all vectors

1:  while segment X_t is not the end of the live stream do
2:     Compute the similarity matrix S^t according to (5).
3:     if X_t is the first video segment of the live stream then
4:        Assign zeros to all responsibilities and availabilities.
5:     else
6:        Compute responsibilities and availabilities for X_t according to equations (7), (8) and (9).
7:        Extend the responsibility matrix to R^t and the availability matrix to A^t.
8:     end if
9:     Pass messages according to equations (1), (6) and (3) until convergence.
10:     Compute exemplars and clustering results according to equation (4).
11:  end while
Algorithm 1 Positioned Incremental Affinity Propagation
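The warm start performed at lines 6-7 of Algorithm 1 could look roughly like the sketch below. Because the exact form of Eqs. (7)-(9) is not fully recoverable from the text here, this sketch simply lets every new vector copy the responsibility and availability rows and columns of its most similar node in the previous segments, which realizes the stated principle that matching instances in adjacent segments share the same messages; the function name and matrix layout are assumptions.

```python
import numpy as np

def warm_start_messages(R_prev, A_prev, S_cross):
    """Initialize messages for a new segment's vectors (cf. Eqs. 7-9).

    R_prev, A_prev: (N_prev x N_prev) messages from the previous segments.
    S_cross: (n_new x N_prev) similarities between new and previous vectors.
    Returns (N x N) responsibility and availability matrices with
    N = N_prev + n_new, where each new vector copies the messages of its
    most similar previous node.
    """
    N_prev = R_prev.shape[0]
    n_new = S_cross.shape[0]
    N = N_prev + n_new
    nearest = S_cross.argmax(axis=1)            # most similar old node per new vector
    R = np.zeros((N, N))
    A = np.zeros((N, N))
    R[:N_prev, :N_prev] = R_prev                # old entries are carried over unchanged
    A[:N_prev, :N_prev] = A_prev
    for j, m in enumerate(nearest):
        R[N_prev + j, :N_prev] = R_prev[m]      # inherit outgoing responsibilities
        R[:N_prev, N_prev + j] = R_prev[:, m]   # inherit incoming responsibilities
        A[N_prev + j, :N_prev] = A_prev[m]
        A[:N_prev, N_prev + j] = A_prev[:, m]
    # messages among the new vectors themselves start from zero
    return R, A
```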

Handled by PIAP, faces, texts, and similar DPOs can be distinguished within each class. Cluster results are smoothed, similarly to the IPOs, to exclude false negatives and false positives. Faces and texts within a cluster are then linked sequentially to form a trajectory.

3.4. Pixelation on Trajectories of Privacy-sensitive Objects

In live video streaming, the non-streamers who appear are far more numerous than the streamers, but each non-streamer owns a far smaller sample size. Thus, as in the rightmost part of Figure 1, with the trajectories built by PIAP, manual specification (like screen touching) is involved solely for the face of a newly appeared streamer. Also, texts are further filtered by checking the sensitive thesaurus based on (bollegala2012cross). With the rolling of PIAP, a previously specified cluster is capable of guiding the future aggregation of the same faces or texts.

The trajectories of the non-streamers’ faces and the sensitive texts, along with the IPOs, are gathered as the trajectories of privacy-sensitive objects. Gaussian filters are applied on these trajectories for pixelation before broadcasting to the audience.
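A possible realization of this final step, assuming OpenCV is available, is to apply a Gaussian blur to every trajectory box that falls on the current frame; the kernel size and sigma below are illustrative, as the paper only states that Gaussian filters are used.

```python
import cv2

def pixelate_frame(frame, boxes, ksize=(51, 51), sigma=20):
    """Blur every privacy-sensitive box of one frame with a Gaussian filter.

    `frame` is an H x W x 3 uint8 image and `boxes` the (x1, y1, x2, y2)
    trajectory boxes that fall on this frame.
    """
    out = frame.copy()
    h, w = frame.shape[:2]
    for x1, y1, x2, y2 in boxes:
        x1, y1 = max(0, int(x1)), max(0, int(y1))
        x2, y2 = min(w, int(x2)), min(h, int(y2))
        if x2 > x1 and y2 > y1:
            out[y1:y2, x1:x2] = cv2.GaussianBlur(out[y1:y2, x1:x2], ksize, sigma)
    return out
```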

4. Experiments and Discussion

Quantity of videos | Category | Resolution | People occurred | Frames | Privacy-sensitive objects Labels* | IPOs Labels* | DPOs Labels*
4 | a, b, c, d | 720p/1080p | 2 | 4133 | 27337 | 6255 | 21082
8 | a, b, c, d | 360p/480p | 2 | 4680 | 23869 | 3975 | 19894
4 | a, b, c, d | 360p/480p | 2 | 5692 | 16994 | 7946 | 9048
4 | a, b, c, d | 720p | 2 | 4867 | 16061 | 10805 | 5256
  • *: the value counts the occurrences of the corresponding objects, i.e., the same object is counted repeatedly across different frames.

Table 1. Details of the live video streaming dataset
Method | SOPA↑ (per sub-dataset) | SOPP↑ (per sub-dataset) | MP↑ (frames) | OPR↓ | FPS
YouTube (youtube2020) | 0.61 / 0.57 / 0.68 / 0.57 | 0.70 / 0.64 / 0.78 / 0.63 | 238 | 0.44 |
Azure (Juliako2020) | 0.63 / 0.57 / 0.66 / 0.56 | 0.71 / 0.66 / 0.77 / 0.66 | 203 | 0.43 |
KCF (henriques2014high) | 0.59 / 0.52 / 0.68 / 0.61 | 0.69 / 0.63 / 0.79 / 0.70 | 166 | 0.47 | 100
ECO (danelljan2017eco) | 0.57 / 0.61 / 0.69 / 0.66 | 0.66 / 0.75 / 0.79 / 0.74 | 178 | 0.41 | 30~50
PsOP | 0.73 / 0.69 / 0.75 / 0.71 | 0.80 / 0.79 / 0.84 / 0.83 | 387 | 0.21 | 18~30
  • MP, OPR, and FPS are reported over the entire dataset; ↑ and ↓ stand for the higher the better and the lower the better, respectively.

Table 2. Pixelation results on the collected live video streaming dataset
Method | SOPA↑ (per sub-dataset) | SOPP↑ (per sub-dataset) | MP↑ (frames) | OPR↓
YouTube (youtube2020) | 0.45 / 0.42 / 0.56 / 0.49 | 0.53 / 0.47 / 0.77 / 0.63 | 238 | 0.56
Azure (Juliako2020) | 0.43 / 0.47 / 0.54 / 0.53 | 0.50 / 0.53 / 0.70 / 0.68 | 203 | 0.54
KCF (henriques2014high) | 0.35 / 0.32 / 0.38 / 0.31 | 0.41 / 0.40 / 0.44 / 0.40 | 113 | 0.64
ECO (danelljan2017eco) | 0.27 / 0.28 / 0.34 / 0.31 | 0.37 / 0.39 / 0.41 / 0.40 | 148 | 0.59
PsOP (MTCNN+CosFace) | 0.63 / 0.60 / 0.71 / 0.66 | 0.80 / 0.77 / 0.86 / 0.85 | 362 | 0.34
PsOP (Pyramid+ArcFace) | 0.63 / 0.62 / 0.71 / 0.65 | 0.80 / 0.79 / 0.86 / 0.85 | 387 | 0.34
  • MP and OPR are reported over the entire dataset; ↑ and ↓ stand for the higher the better and the lower the better, respectively.

Table 3. Pixelation results of DPOs on the live video streaming dataset

4.1. Dataset

As the privacy-sensitive objects pixelation problem in live video streaming has not been studied before, there is no available dataset or benchmark test for reference. We collected and built a live streaming video dataset from the YouTube and Facebook platforms and manually labeled the sensitive objects for the experiments. In total, 20 streaming videos from 4 different streaming categories (a: street interview; b: street streaming; c: flash activities; d: dancing), comprising 19372 frames, are manually labeled and will be made publicly available. As demonstrated in Table 1, these videos are further divided into four groups according to the broadcasting resolution and the number of people shown in the stream. Live streaming videos with at least 720p resolution are marked as high-resolution, and the rest are low-resolution. Similarly, videos containing more than two people are sophisticated scenes, and the rest are naive ones. The numbers of labeled bounding boxes regarding privacy-sensitive objects, IPOs, and DPOs are also listed to exhibit the dense annotation.

4.2. Parameters

All parameters remain unchanged for the entire set of live-streaming video tests. Every segment contains 150 frames, and the buffering section lasts 10 seconds. Cropped faces are resized to the CosFace input size; the weight decay is 0.0005. The learning rate is 0.005 at initialization and drops every 60K iterations. The damping factor for PIAP defaults to 0.5, as in AP. The detection thresholds for MTCNN are [0.7, 0.8, 0.9]. The IoU rate is 0.7.

4.3. Evaluation Metrics

Setting our manual pixelation annotations in the collected live streaming video dataset as the baseline, we migrated and adapted algorithms including the ECO and KCF trackers and the currently applied commercial tools YouTube Studio and Microsoft Azure, and compared them with the proposed PsOP. Sensitive Objects Pixelation Accuracy (SOPA), Sensitive Objects Pixelation Precision (SOPP), and Most Pixelated (MP) frames are the metrics of most concern. Specifically:

SOPA = 1 − ( Σ_t (mp_t + fp_t + mme_t) ) / ( Σ_t gt_t )

where mp_t, fp_t, mme_t, and gt_t correspond to the missed pixelations, false positives in pixelation, mismatched pixelations, and total ground-truth pixelations in frame t. SOPA reflects the overall performance of the PsOP.

SOPP = ( Σ_{i,t} o_{i,t} ) / ( Σ_t c_t )

where o_{i,t} is the bounding-box overlap ratio of pixelated sensitive object i with its labelled ground truth, and c_t is the total number of matched pixelations in frame t. Therefore, SOPP states the degree of drifting and the pixelation precision. The Over-Pixelated Ratio (OPR) is another important metric, indicating the severity of the over-pixelation problem.
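Under the MOTA/MOTP-style reading of the definitions above, the two scores could be computed as in the sketch below; the per-frame counts are assumed to come from a separate matching step between pixelated boxes and ground-truth boxes.

```python
def sopa(mp, fp, mme, gt):
    """Sensitive Objects Pixelation Accuracy over a sequence (MOTA-style).

    mp, fp, mme, gt are per-frame lists of missed pixelations, false
    positives, mismatches, and ground-truth sensitive objects.
    """
    return 1.0 - (sum(mp) + sum(fp) + sum(mme)) / float(sum(gt))

def sopp(overlaps_per_frame, matches_per_frame):
    """Sensitive Objects Pixelation Precision (MOTP-style).

    overlaps_per_frame[t] holds the IoU of every matched pixelation in
    frame t; matches_per_frame[t] is the number of matches in frame t.
    """
    total_overlap = sum(sum(o) for o in overlaps_per_frame)
    return total_overlap / float(sum(matches_per_frame))
```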

4.4. Ablation Study

In Table 2, all the algorithms achieve their best performance on the high-resolution, naive-scene sub-dataset. KCF and ECO here are optimized for privacy-object tracking tasks with detection-based calibration. To conduct the comparisons, we activated manual specification on the occurrences of sensitive objects for the first four algorithms. However, affected by noisy detection results, KCF-style online learning algorithms are maladjusted in live video streaming. Although ECO applies higher-dimensional features than KCF, the SOPA of ECO is lower than that of KCF. Overall, our PsOP achieves better performance in terms of SOPA, SOPP, MP, and OPR. Particularly, PsOP noticeably increases the most-pixelated frames and decreases the over-pixelation ratio on the entire dataset, indicating that PsOP constructs more robust trajectories and avoids pixelation when an object is temporarily unidentifiable. The high SOPP values suggest that trajectory drifting in PsOP is relatively rare or happens at a later time.

Table 3 shows the pixelation results solely on DPOs. Again, manual specification is activated for the first four methods. PsOP gains a surpassing pixelation accuracy, which proves that the proposed embedding and PIAP clustering framework is sharp in dealing with DPOs. Listed at the bottom of Table 3, the second PsOP variant substitutes the detection and embedding networks with the more advanced PyramidBox (tang2018pyramidbox) and ArcFace (deng2019arcface). This comparison shows that the network selection is actually not that important for PIAP, as long as the networks are deep, well-tuned architectures.

Dataset | Method | Purity | Cluster number (Clustered/Truth) | Time (s)
 | AP | 0.81 | 19/17 | 1.82
 | PAP | 0.89 | 17/17 | 1.82
 | PIAP | 0.89 | 17/17 | 0.07
 | AP | 0.72 | 37/43 | 3.05
 | PAP | 0.84 | 41/43 | 3.05
 | PIAP | 0.83 | 41/43 | 0.13
 | AP | 0.86 | 6/6 | 0.75
 | PAP | 0.93 | 6/6 | 0.75
 | PIAP | 0.93 | 6/6 | 0.04
 | AP | 0.89 | 7/7 | 0.80
 | PAP | 0.92 | 7/7 | 0.80
 | PIAP | 0.91 | 7/7 | 0.03

Table 4. Clustering purity and efficiency for faces

Figure 3. Qualitative comparison of YouTube (mosaics in rectangles), KCF (mosaics in hexagons), and PsOP (mosaics in circles). Green boxes indicate where human intervention is needed for specification, yellow boxes are the auto-generated mosaics, red boxes are the over-pixelated areas, and blue boxes are the miss-pixelated areas.

Table 4 shows the face clustering performance of classic AP, PAP (AP after the position revision), and PIAP on the collected live streaming dataset. Comparing AP with PAP, the position information of face vectors significantly boosts the purity of the clustering results. According to the results of PIAP and AP, our incremental message-passing algorithm solves the time-consuming problem of AP almost without affecting the purity, as stated before. The values in the Time column are the seconds needed to reach one iteration of consensus.

4.5. Qualitative Comparison

Qualitative comparisons among the YouTube offline pixelation tool, the migrated KCF, and PsOP are shown in Figure 3. Snapshots are shown in temporal order from left to right. IPOs, including the trademark and the car plate number, and DPOs, including faces and texts, are contained in the scenes. The first row is the original scene, in which a girl streamer on the street is waving her hands to greet the audience. The 2nd, 3rd, and 4th rows are the pixelation results of YouTube, KCF, and PsOP, respectively. Besides the significant differences in the red and blue boxes indicating accuracy, the green boxes in the leftmost snapshot demonstrate the huge gap in required manual work during initialization. Only in PsOP are privacy-sensitive objects auto-detected and categorized, and DPOs clustered and pixelated with minimal intervention.

4.6. Efficiency

Training and testing are conducted on our machine with an i7-7800X CPU, 64G RAM, and dual Nvidia GTX 1080 GPUs. The main cost of PsOP lies in the face embedding algorithm: embedding one face takes 10-15 ms on this machine. Face detection, including compensation, takes 3 ms per frame. Another time-costing part is the initialization of PIAP, which takes 10-30 ms depending on the case. Furthermore, each incremental propagation loop takes 3-5 ms. Hence, PsOP satisfies the real-time efficiency requirement. In extreme circumstances where the live video stream contains many faces or other privacy-sensitive objects, we can reduce the sampling rate of the video frames to improve the efficiency of the detection and recognition processes.

5. Conclusion

In this paper, a novel, extendable, and unified Privacy-sensitive Objects Pixelation (PsOP) framework is proposed for live video streaming. The proposed PsOP shows significant improvements in pixelation accuracy, precision, and over-pixelation ratio compared with other offline pixelation algorithms. Also, our PsOP requires minimal human intervention for the pixelation of discriminating pixelation objects, represented by faces and texts. Leveraging trained deep CNNs, the proposed PsOP manages to solve the pixelation problems in live video streaming through the proposed Positioned Incremental Affinity Propagation (PIAP) clustering. PIAP deals with the ill-defined cluster number issue and boosts the accuracy of classic Affinity Propagation (AP) through position information. Moreover, the accuracy of AP is not affected by the incremental processing. However, low-resolution video streams are still quite challenging because the image resolution strongly affects the performance of the detection and embedding networks. Designed and trained on high-quality stationary images, these deep neural networks can drop to a deficient performance and break the robustness of PsOP. Our future work will focus on improving accuracy and robustness under low-resolution streaming scenes.

Acknowledgements.
This work was partly supported by the University of Macau under Grants: MYRG2018-00035-FST and MYRG2019-00086-FST, and the Science and Technology Development Fund, Macau SAR (File no. 041/2017/A1, 0019/2019/A).

References