Video live streaming platforms, sites, and streamers have rapidly become part of our daily life. Benefiting from the popularity of smartphones and 5G networks, online video live streaming has become remarkably pervasive. Live streaming technology allows people to broadcast any scene live without a filter. In contrast to traditional television programs, these personally held live streaming channels are not subject to censorship imposed by superiors and severely disregard the protection of personal privacy. Besides, even today, the placement of mosaics in post-production is still labor-intensive. Therefore, apart from their indifference to privacy, streamers and streaming platforms are also incapable of allocating adequate human labor to such censorship anytime, anywhere. The worst-case scenario is outdoor live streaming, which happens in public. Many other people, like passersby and pedestrians, become unwitting casualties once they are filmed and broadcast through the streamers’ cameras. In such cases, streamers, together with live streaming platforms, have raised a host of privacy infringement issues.
Despite the differing privacy laws across regions, respecting privacy rights and blurring the faces of content-irrelevant people are universally appropriate. As a live streaming hosting service provider, YouTube is already aware of this urgent need and has published an offline face pixelation tool in the latest YouTube Creator Studio. As far as we know, current solutions, including YouTube Studio and Microsoft Azure, mainly focus on processing offline videos, whilst face pixelation for online video live streaming is underexplored. Accordingly, to prevent disturbing disclosure and promote the sound development of the video live streaming industry, in this paper we build a privacy protection method called Face Pixelation in Video Live Streaming (FPVLS), which automatically blurs irrelevant people’s faces for personal privacy protection during video live streaming.
To yield consistent face pixelation during streaming, the key of FPVLS is to locate and identify faces effectively and efficiently in frames. The tangible reference for FPVLS is everyday TV shows, in which people who are unwilling to appear on camera are facially pixelated in post-production to protect their privacy. Intuitively, FPVLS can be regarded as an automated replica of such a manual pixelation process with real-time efficiency. Likewise, since the manual face pixelation pipeline is remarkably close to multi-face tracking (MFT) algorithms, the construction of FPVLS seems to be straightforward.
However, extensive tests conducted on our collected video live stream dataset reveal that the migrated multi-face trackers suffer from a serious drifting issue. A typical live streaming scene is shown in Fig. 1 from left to right. The leftmost snapshot shows a female streamer dressed in yellow streaming on the street while a street hawker waves to greet her. Without the hawker’s permission, his face shall be pixelated for privacy protection. Meanwhile, a passerby in a light green shirt passes through the area between the streamer and the hawker. YouTube Studio, which adopts MFT algorithms, generates the middle row’s mosaics circled in red. With camera shakes and the crossover of the passerby and the hawker, the ambiguous motion prediction and the failed face calibration jointly cause the mosaics on the hawker’s face to drift instantly. In the rightmost snapshot, the mosaics are carried to the neck region of the passerby and leave the hawker’s face exposed. Further, the over-pixelation phenomenon can also be observed in the middle snapshots of Fig. 1. Even if the multi-face trackers were not affected by the drifting issue, they would still pixelate the hawker’s face while it is fully occluded. The over-pixelation phenomenon is owing to the trackers’ continuous prediction, which accumulates more evidence for tracklet association during temporary occlusion.
The serious drifting issue is mainly caused by the underperformance of face detection techniques, owing to the inherently unconstrained nature of live streaming videos. Although neglected by most state-of-the-art MFT studies, the outcomes of the face detection networks essentially decide the quality of the tracklets. Reliable tracklets are the priors for tracking algorithms; therefore, the results of the detection networks are vital to the performance of the trackers. Benchmark tests for state-of-the-art online MFT algorithms are carried out on music videos, movie clips, TV sitcoms, or CCTV footage, which are recorded by multiple high-resolution, steady, or stably moving cameras with various shot changes. Face detection techniques function well in these videos, so tracklets can be reliably generated as the priors to boot up tracking algorithms. In these tests, clean and clear face detection results are taken for granted, and the primary concern of the trackers is tracking consistency across frequently switching shots. However, live videos are commonly recorded by handheld mobile phone cameras with a single or a few shots and broadcast after multiple rounds of compression. Specifically, live streaming videos involve crowded scenes with frequent irregular camera motions, interaction among people, and abrupt changes of broadcasting resolution. Current face detection techniques can no longer provide reliable results in live streaming videos. Tracklets generated under low confidence jeopardize the efficacy of motion prediction, face calibration, and content-based reasoning algorithms. Consequently, trackers are prone to fast corruption once scattered or inaccurate tracklets are yielded.
Consequently, abandoning the tracklet mechanism, FPVLS adopts a brand new, fast, and accurate frame-to-video framework. This framework contains two core stages, a raw face trajectory generation stage and a trajectory refinement stage, to solve the face pixelation problem in scenes with fewer priors, worse quality, and more noise but fewer shot changes. In general, once streaming starts, face detection and embedding/recognition networks are alternately applied to frames to yield face vectors. These face vectors are then clustered through Positioned Incremental Affinity Propagation (PIAP). The proposed PIAP, an extension of the classic Affinity Propagation (AP) algorithm, integrates the same person’s faces across frames into the same cluster and excludes false positives by considering both the similarity of face vectors and their physical positions. The sequential link of the faces within a single cluster forms the raw face trajectories. False negatives in detection cause intermittence in the raw trajectories. To compensate for the deep network’s insufficiency, we introduce a proposal network with a shallow structure and a high recall rate to re-detect faces within the gaps of raw trajectories. For efficiency, a two-sample test based on the empirical likelihood ratio is applied to cull the re-detected non-faces. The proposal network and the two-sample test collaborate in the trajectory refinement stage to yield seamless trajectories. The post-processing applies a Gaussian filter to the refined trajectories for the final pixelation. An example of FPVLS is presented in the bottom row of Fig. 1. First, while the target face is fully occluded, as shown in the middle pictures of Fig. 1, FPVLS does not produce baffling, over-pixelated mosaics like those in the results of YouTube Studio or other trackers. Also, the mosaics marked with yellow rectangles remain on the hawker’s face after the crossover.
The drifting issue is greatly alleviated since FPVLS directly leverages the face vectors for trajectory generation and refinement. Specifically, this paper makes the following contributions:
We are the first to study and establish a face pixelation tool for personal privacy protection in live video streaming.
The PIAP clustering algorithm is proposed to connect the same faces across frames, thereby generating robust raw face trajectories.
A non-parametric two-sample test based on empirical likelihood ratio statistics is introduced for the refinement of raw trajectories. Such an error rejection algorithm can be widely applied to serve other face detection algorithms.
We collect and build a video live stream dataset from the Internet to test our proposed method.
The remainder of the paper is organized as follows. Section II reviews the related works; Section III introduces the detailed methods of FPVLS; experiments, results, and discussion are presented in Section IV; Section V concludes and discusses future work.
II Related Works
To the best of our knowledge, face pixelation in video live streaming has not been studied before. Limited by recording conditions, streaming videos commonly involve a single shot. Likewise, the video quality can only be maintained at a relatively low level (720p/30FPS or lower). Video live streams also contain frequent camera shakes and massive motion blur. All these harsh realities cause more false positives and false negatives during face detection and embedding. Referring to existing techniques, we first tried to realize FPVLS in an end-to-end way by replacing the 2D deep CNN with its extension, the 3D CNN. Taking an entire video as input, a 3D CNN handles spatial-temporal information simultaneously and has achieved great success in video segmentation [28, 16] and action recognition [15, 22]. However, tested on the live streaming dataset we collected, 3D CNN models are quite time-consuming and extremely sensitive to video quality and length. Furthermore, as it is designed to cope with high-level abstractions of video contents, a 3D CNN cannot precisely handle the multi-task regression at the individual frame level without sufficient training data. Therefore, the face mosaics generated by an end-to-end 3D CNN blink during streaming.
Another feasible way to realize face pixelation in live streaming with existing techniques is to exploit multi-object tracking (MOT) or multi-face tracking (MFT) algorithms. YouTube Studio and Microsoft Azure published their own offline face pixelation tools for videos based on MFT. Although MOT and MFT are claimed to be thoroughly studied, achieving high tracking accuracy in unconstrained videos remains challenging. Among state-of-the-art trackers, two main categories can be found: offline and online trackers. The former category assumes that object detection in all frames has already been conducted, and clean tracklets are obtained by linking different detections and tracks in offline mode. This property of offline trackers allows for global optimization of the path but makes them incapable of dealing with live streaming.
State-of-the-art online MOT algorithms like KCF and ECO are mainly implemented through continuous motion prediction and online learning based on correlation filters. In developing these trackers, the authors assume a rather fixed or predictable position for targets in adjacent frames of the video. Although this assumption is generally applicable in conventional videos, it does not hold in video live streaming due to the frequent camera shakes. Tests of such tracking algorithms are usually conducted on high-resolution music video datasets in the absence of massive camera moves and motion blur. Besides, to work properly, widely accepted solutions of MOT and MFT [32, 14] in unconstrained videos assume some known priors, including the number of people who show up in the video, the initial positions of the people, no fast motions, no camera movements, and high recording resolution. The state-of-the-art studies on MFT manage to exclude as many priors as possible but still require the full-length video in advance for analysis.
The tracking-by-detection method is the most relevant category in MOT, considering the harsh realities of live streaming. The POI tracker and the IOU tracker are outstanding representatives of the tracking-by-detection method. The IOU tracker relies on detection networks applied to the current frame and selects the bounding box with the largest IOU value with respect to the tracking list. The advantages and disadvantages of this method are obvious. The IOU tracker is very fast in processing frames but relies too much on the detection results and does not exploit relevant information between frames. The false positives and false negatives of the detection are difficult to resolve, which causes frequent identity switching. The POI tracker substitutes the traditional correlation filters in tracking methods with deep CNN based detection and embedding. The online POI tracker applies detection networks to all frames and divides the tracking video into multiple disjoint segments. Then, it uses the dense neighbors (DN) search to associate the detection responses into short tracklets in each segment. Several nearby segments are merged into a longer segment, and the DN search is applied again in each longer segment to associate existing tracklets into longer tracklets. To some degree, the POI tracker shares a similar idea with our raw face trajectory generation stage in the video segment, detection, and embedding aspects. However, POI uses a DN search instead of clustering algorithms and does not apply further refinement. Therefore, the POI tracker is also profoundly affected by the false positives and negatives yielded by detection.
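The identity switching of a purely IOU-based association can be illustrated with a minimal Python sketch. It is a simplified greedy association written for this illustration, not the published IOU tracker; the function names, the `min_iou` threshold, and the toy boxes are assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, min_iou=0.5):
    """Greedily match each track (id -> last box) to its best-overlapping detection.

    Returns a dict mapping track id to the index of the matched detection.
    """
    unused = set(range(len(detections)))
    matches = {}
    for track_id, box in tracks.items():
        if not unused:
            break
        best = max(unused, key=lambda d: iou(box, detections[d]))
        if iou(box, detections[best]) >= min_iou:
            matches[track_id] = best
            unused.remove(best)
    return matches


# Track 0 follows face A, track 1 follows face B (hypothetical coordinates).
tracks = {0: (0, 0, 10, 10), 1: (20, 0, 30, 10)}

# Small motion: both faces keep their identities.
steady = associate(tracks, [(1, 0, 11, 10), (21, 0, 31, 10)])

# A camera shake moves the whole scene by 20 pixels: face A (detection 0)
# now overlaps the last box of track 1, so it inherits face B's identity.
shaken = associate(tracks, [(20, 0, 30, 10), (40, 0, 50, 10)])
```

In the shaken case the only surviving match assigns face A’s detection to face B’s track, which is the kind of switch that pure overlap cannot detect without appearance information.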
III Proposed Methods
Owing to the tremendous success of deep CNNs, existing works can already handle image-level face pixelation through object detection and multi-task learning. Therefore, leveraging prior-free, stationary image based face detection and recognition networks, we propose the brand new FPVLS for video live streaming. FPVLS discards tracklet association and video based CNNs, and directly generates accurate face trajectories through the raw face trajectory generation and refinement stages.
Straightforwardly, FPVLS might be realized by fine-tuning the deep face detection and recognition networks frame by frame according to the live streaming data. However, putting accuracy aside, a deep CNN based face recognition network builds a classification algorithm that demands a large number of the streamers’ faces in advance for the specification of streamers. Moreover, as no detection or recognition algorithm is perfect under all circumstances, detection errors and misclassifications are always expected from time to time. As stated previously, although there are fewer shot changes, all the other harsh realities cause more false positives and negatives in detection and recognition. Such results directly cause the generated face mosaics to either blink or be incorrectly placed during the entire stream. Notably, blinking mosaics are unacceptable as they lead to privacy leakage and instantly distract viewers’ attention. Besides, a recognition model trained on image datasets cannot represent the temporal connection of the same person’s faces across adjacent frames, which aggravates the blinking phenomenon.
Considering all the above issues that need to be handled for face pixelation in video live streaming, the framework of FPVLS is organized as illustrated in Fig. 2. Along with a preprocessing stage and a postprocessing stage, FPVLS contains two core stages: the raw face trajectory generation and the trajectory refinement. Stages are surrounded by bronze yellow rectangles, and the critical intermediate results or algorithms are surrounded by dark rectangles. (1) Preprocessing: We design a buffer section at the beginning of each live stream for FPVLS to solve the specification of the streamers and to improve the accuracy of the later trajectory refinement stage. Face detection and embedding are also applied in this stage to generate face vectors. (2) Raw Face Trajectories Generation.
Since face vectors are independently yielded in preprocessing, we iteratively link the same person’s faces across frames to plot the raw trajectory. Since the number of people who show up in a stream cannot be known in advance, and a fast algorithm is also needed for efficiency, the affinity propagation clustering algorithm is chosen and extended in an incremental way. More specifically, an incremental affinity propagation clustering algorithm associated with position information is developed to organize the face trajectories across frames in real time. Also, this Positioned Incremental Affinity Propagation (PIAP) spontaneously excludes false positives that occur in the detection network as outliers. (3) Trajectory Refinement. Despite the eliminated false positives, the false negatives caused by detection are the real challenge we face. Normally, image-based CNNs are very sensitive to resolution and motion blur since they are designed and trained on high-quality images. Thus, we always get intermittent raw trajectories for every face. To solve the intermittence, a shallow proposal network is built to propose suspicious face areas at a high recall rate to compensate for the detection network’s insufficiency. Afterward, a two-sample test based on the empirical likelihood ratio statistic is designed to reject the inappropriate bounding boxes yielded by the proposal network and accept the correct ones for refinement. Our trajectory refinement stage aims to eliminate the discontinuity of raw trajectories, thereby preventing personal privacy from leaking. (4) Postprocessing. Gaussian filters with various kernel values can be placed on the non-streamers’ trajectories to generate the final pixelated video, which is then broadcast to the audience.
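The postprocessing step can be sketched in plain Python as a separable Gaussian blur restricted to one face box. This is an illustrative sketch only: the grayscale list-of-rows image format, the function names, and the default `radius` and `sigma` are assumptions, and a real system would use an optimized image library instead.

```python
import math


def gaussian_kernel(radius, sigma):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(d * d) / (2.0 * sigma * sigma))
               for d in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_region(image, box, radius=2, sigma=1.0):
    """Return a copy of a grayscale image with box = (x1, y1, x2, y2) blurred.

    x2 and y2 are exclusive. Pixels outside the box are left untouched.
    """
    x1, y1, x2, y2 = box
    kernel = gaussian_kernel(radius, sigma)
    offsets = range(-radius, radius + 1)
    height, width = len(image), len(image[0])

    def clamp(v, size):
        return min(max(v, 0), size - 1)

    # Horizontal pass, written only inside the box.
    tmp = [row[:] for row in image]
    for y in range(y1, y2):
        for x in range(x1, x2):
            tmp[y][x] = sum(w * image[y][clamp(x + d, width)]
                            for d, w in zip(offsets, kernel))

    # Vertical pass: reads the horizontally blurred values inside the box
    # and the raw pixels just outside it.
    out = [row[:] for row in tmp]
    for y in range(y1, y2):
        for x in range(x1, x2):
            out[y][x] = sum(w * tmp[clamp(y + d, height)][x]
                            for d, w in zip(offsets, kernel))
    return out
```

A larger `sigma` (with a matching `radius`) spreads each pixel over a wider neighborhood, which corresponds to the stronger pixelation obtained with larger kernel values.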
III-A1 Video Segments
A buffer section is designed at the beginning of each live stream to solve the specification of the streamers and to improve the accuracy of trajectory refinement. Within the buffer section, the streamers’ faces shall be manually specified (for example, by touching the phone screen) to initialize the PIAP clustering. Face detection and embedding are also applied in this section. Assuming guaranteed real-time efficiency, we can stack a fixed number of consecutive frames into a short video segment by requiring a frame buffering section at the very beginning of a video live stream without affecting the audience’s experience (see proof below).
Then, leveraging well-established, stationary image based deep face detection and recognition networks frame by frame as a feature extractor, face vectors can be obtained within segments. Video segments turn the typical hours-long video live stream into numerous segments at the level of seconds. PIAP further handles the association of segments spontaneously in a binary Markov chain manner. Considering the inherent latency brought by communication and compression, the appended buffering latency is minor and may not even be noticed. Video segments greatly alleviate the conflict between accuracy and efficiency by transferring the conflict to a lag at the beginning. The length of the segments shall be controlled within a rational interval to balance the lag time, real-time efficiency, and the refinement stage’s performance.
Proposition. Stacking frames into short video segments by requiring a frame buffering section at the very beginning of a video live stream will not cause discontinuity.
Proof. The broadcasting frame rate (frames per second) is constant during a live stream. An arbitrary frame, identified by the frame number at which it is recorded, is originally broadcast to the audience at the time determined by that frame number and the frame rate.
At this time, the frame is sent to FPVLS.
FPVLS takes a certain amount of time to process it.
Some extra seconds are needed for broadcasting.
The broadcast frame number after FPVLS corresponds to the sum of the above three.
Therefore, any frame is broadcast with a time lag that can be totally covered by the buffering section. ∎
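The argument can be checked numerically with a small Python simulation. It is a sketch under stated assumptions rather than the paper’s derivation: the symbols `fps`, `seg_len`, and `proc_time`, the rule that a segment is processed only once all of its frames have arrived, and the example numbers are all introduced here for illustration.

```python
from fractions import Fraction


def broadcast_schedule(num_frames, fps, seg_len, proc_time):
    """Return (recorded_time, broadcast_time) for every frame.

    Frames are grouped into segments of seg_len frames. A segment is sent to
    processing once it is complete, processing takes proc_time seconds per
    segment, and processed frames are played out at the original frame rate.
    """
    seg_dur = Fraction(seg_len, fps)
    proc_time = Fraction(proc_time)
    free_at = Fraction(0)          # time at which the processor becomes idle
    done_at = {}                   # segment index -> time its processing ends
    schedule = []
    for k in range(num_frames):
        seg = k // seg_len
        if seg not in done_at:
            ready = (seg + 1) * seg_dur            # whole segment recorded
            start = max(ready, free_at)            # wait if still busy
            free_at = start + proc_time
            done_at[seg] = free_at
        recorded = Fraction(k, fps)
        offset = Fraction(k % seg_len, fps)        # position inside segment
        schedule.append((recorded, done_at[seg] + offset))
    return schedule


# 30 FPS, 5-second segments, 3 seconds of processing per segment.
lags = {out - rec for rec, out in broadcast_schedule(600, 30, 150, 3)}
```

As long as the processing time of a segment does not exceed the duration of the segment, every frame is delayed by the same constant (segment duration plus processing time, 8 seconds in this example), so playback after the initial buffering is continuous. If processing is slower than real time, the lag grows from segment to segment.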
III-A2 Face Detection and Recognition on Frames
MTCNN and CosFace are applied to process every frame in turn. The choice of MTCNN and CosFace considers the convenience for the later cosine similarity in PIAP and for the proposal network’s training. However, note that the detection and embedding/recognition networks could be substituted by other recently published state-of-the-art works such as PyramidBox and ArcFace. Also, instead of indulging in comparing or tuning developed face detection and embedding/recognition algorithms, we intentionally avoid using the most advanced algorithms in order to demonstrate the performance of FPVLS under relatively poor priors. MTCNN accepts inputs of arbitrary size and detects faces above a minimum pixel size. CosFace reads the face detection results as input and generates a fixed-dimensional feature vector for each face. Face alignment is applied right after detection: every detected face is cropped out of the frame and aligned to a frontal pose through an affine transformation before embedding.
III-B Raw Face Trajectories Generation
These embedded face vectors are further clustered to connect the discrete faces across frames and segments. Considering the ill-defined cluster number and the commonly used affinity based measurement, Affinity Propagation (AP) is chosen for the face clustering. Due to the requirement of real-time efficiency, we extend the classical, time-consuming AP clustering in an incremental way to process the incoming segments dynamically. When new vectors arrive, instead of propagating all vectors’ responsibilities and availabilities until they converge again, PIAP only assembles the newly arrived vectors with the ones in the last segment through matrix normalization. Besides efficiency, since face embedding networks may generate less accurate feature vectors, pair-wise affinities based on deep feature vectors are further revised according to position information. With these two extensions, Positioned Incremental AP (PIAP) is capable of rapidly clustering the dispersed faces of the same person into the same group. An inner-group sequential link then quickly generates the faces’ raw trajectories.
III-B1 Affinity Propagation
The Affinity Propagation (AP) algorithm was originally published in Science by Frey and Dueck in 2007. AP does not require the number of clusters to be specified beforehand. The key of AP is to select the exemplars, which are the representatives of clusters, and then assign the remaining nodes to their most preferred exemplars. As a member of the unsupervised learning algorithms, AP shares the same max-sum message passing methodology with Belief Propagation (BP). Similar to traditional clustering algorithms, the very first step of AP is the measurement of the distance between data nodes, also called the similarities. Following the common notation in AP clustering, $i$ and $k$ ($i \neq k$) are two of the data nodes and $S(i,k)$ denotes the similarity between data nodes $i$ and $k$, thereby indicating how well data node $k$ is suited to be the exemplar for data node $i$. $S$ is the similarity matrix that stores the similarities between every two nodes. $S(i,k)$ is the element on row $i$, column $k$ of $S$, and similar notations are also used below. The diagonal $S(k,k)$ of the similarity matrix denotes the preference of selecting $k$ as an exemplar. $S(k,k)$ is called the preference value in some studies.
Then, a series of messages are passed among all data nodes to reach an agreement on the selection of exemplars. These messages consist of responsibilities $R(i,k)$ and availabilities $A(i,k)$, where $R$ and $A$ are the responsibility and availability matrices. Node $i$ passes $R(i,k)$ to its potential exemplar $k$, indicating the current willingness of $i$ to choose $k$ as its exemplar considering all the other potential exemplars. Correspondingly, $k$ responds with $A(i,k)$ to update the current willingness of $k$ to accept $i$ as its member considering all the other potential members. Then, $R(i,k)$ is updated according to $A(i,k)$ and a new iteration of message passing is issued. The sum of $R(i,k)$ and $A(i,k)$ directly gives the fitness for choosing $k$ as the exemplar of $i$. The consensus of the max-sum message passing process ($R$ and $A$ remain the same after an iteration) means that the final agreement of all nodes on the selection of exemplars and the association of clusters is reached. Apart from not requiring a predefined cluster number, AP is not sensitive to the initialization settings; selects real data nodes as exemplars; allows an asymmetric matrix as input; and is more accurate when measured by the sum of squared errors. Therefore, considering the subspace distribution (least square regression), AP is effective in generating robust and accurate clustering results for high-dimensional data like face vectors.
We start by initializing both the responsibility and availability matrices to zero and, following the original formulation, compute $R(i,k)$ and $A(i,k)$ as:
$$R(i,k) \leftarrow S(i,k) - \max_{k' \neq k} \{ A(i,k') + S(i,k') \} \tag{1}$$
$$A(i,k) \leftarrow \min \Big\{ 0,\; R(k,k) + \sum_{i' \notin \{i,k\}} \max\{0, R(i',k)\} \Big\}, \quad i \neq k \tag{2}$$
Equation (3) is used to fill in the elements on the diagonal of the availability matrix:
$$A(k,k) \leftarrow \sum_{i' \neq k} \max\{0, R(i',k)\} \tag{3}$$
Responsibilities and availabilities are updated according to (1), (2), and (3) until convergence; then the criterion matrix $C$, which holds the exemplars as well as the clustering results, is simply the sum of the availability matrix and the responsibility matrix at each location:
$$C(i,k) = R(i,k) + A(i,k) \tag{4}$$
The highest value of each row of the criterion matrix designates the exemplar. Naturally, as each row stands for a data node, data nodes that share the same exemplar are in the same cluster. Therefore, the exemplars can be computed as:
$$\mathrm{exemplar}(i) = \arg\max_{k} C(i,k) \tag{5}$$
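The classic AP procedure described above (responsibility update, availability update, criterion matrix, and row-wise exemplar selection) can be sketched in plain Python as follows. This is a minimal illustration, not the implementation used in the paper: the damping factor, the fixed iteration count, the median default for the preference, and the one-dimensional toy data are assumptions.

```python
from statistics import median


def affinity_propagation(S, preference=None, damping=0.5, iters=200):
    """Cluster n nodes from an n x n similarity matrix S (list of lists).

    Returns, for every node, the index of the exemplar it is assigned to.
    """
    n = len(S)
    S = [row[:] for row in S]
    if preference is None:
        # Common default: median of the off-diagonal similarities.
        preference = median(S[i][k] for i in range(n) for k in range(n) if i != k)
    for k in range(n):
        S[k][k] = preference
    R = [[0.0] * n for _ in range(n)]
    A = [[0.0] * n for _ in range(n)]

    for _ in range(iters):
        # Responsibilities: compare S(i,k) with the best competing candidate.
        for i in range(n):
            scores = [A[i][k] + S[i][k] for k in range(n)]
            for k in range(n):
                rival = max(scores[j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        # Availabilities: accumulate positive responsibilities per column.
        for k in range(n):
            pos = [max(0.0, R[i][k]) for i in range(n)]
            support = sum(pos) - pos[k]            # sum over i' != k
            for i in range(n):
                if i == k:
                    new = support                  # diagonal element
                else:
                    new = min(0.0, R[k][k] + support - pos[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new

    # Criterion matrix C = R + A; the row-wise argmax gives the exemplar.
    return [max(range(n), key=lambda k: R[i][k] + A[i][k]) for i in range(n)]


# Toy data: two well-separated groups on a line, negative squared distance.
points = [0.0, 0.1, 0.25, 10.0, 10.1, 10.25]
S = [[-(p - q) ** 2 for q in points] for p in points]
labels = affinity_propagation(S, preference=-1.0)
```

Nodes that receive the same exemplar index belong to the same cluster. A lower preference yields fewer clusters, which is why the preference is exposed as a parameter here.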
III-B2 Positioned Incremental Affinity Propagation
Equations (1)-(5) are the original equations of classic AP. In most trackers, the critical assumption is that, within a single shot, the motion of the same face between consecutive frames is minor. Correlation filters, IOU trackers, and other online learning networks all work under such an assumption. However, in live stream videos, the frequent camera shakes and resolution conversions invalidate this assumption. Therefore, the only fact we can leverage to help classic AP recover from the outliers yielded by the embedding network is that any face can only appear in one position in a single frame. Consequently, the similarities among faces that come from a single frame should be small enough. Following this idea, cosine similarity is adopted for measurement, and for each face vector we consider the other face vectors that belong to the same frame. The similarity matrix is generated as in (6), where the similarity between faces of the same frame is forced to the minimal value.
The negative similarity is better suited to the convergence of message passing as well as the normalization.
We intentionally avoid manipulating the vector values directly, as is done in some other face clustering works. Although the embedding network is not perfect and will surely generate outliers due to many factors, it is still reasonably reliable considering the subspace distribution of the face vectors. Instead of any rash tuning, we solely force the similarities to the minimal value if the faces come from a single frame.
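The position constraint can be sketched in Python as follows. This is an illustration of the idea only, not the exact form of (6): using cosine similarity minus one as the negative similarity, the constant `SAME_FRAME_VALUE` standing in for the minimal value, and the helper names are assumptions made for this example.

```python
import math

# Assumed stand-in for "the minimal value" assigned to same-frame pairs.
SAME_FRAME_VALUE = -1e6


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def positioned_similarity(vectors, frame_ids, same_frame_value=SAME_FRAME_VALUE):
    """Build an n x n similarity matrix from face vectors and their frame ids.

    Off-diagonal entries are cosine similarity shifted to be non-positive,
    except for faces from the same frame, which get same_frame_value.
    The diagonal (the preference) is left at 0.0 and set by the caller.
    """
    n = len(vectors)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            if i == k:
                continue
            if frame_ids[i] == frame_ids[k]:
                S[i][k] = same_frame_value       # two faces in one frame
            else:
                S[i][k] = cosine(vectors[i], vectors[k]) - 1.0
    return S


# Faces 0 and 1 come from frame 0, face 2 from frame 1.
S = positioned_similarity([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]], [0, 0, 1])
```

Only the pair-wise similarities are changed; the face vectors themselves are never modified, which matches the intent of leaving the embedding output untouched.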
Moreover, equation (1) shows that the self-responsibility of a node reflects how suitable the node is considered as a potential exemplar for itself. Therefore, in (2), the availability gathers the evidence of how much a candidate exemplar can stand for itself (its self-responsibility) and how many other nodes it can stand for (the positive responsibilities it receives). Every positive responsibility sent to a candidate exemplar contributes to its availability. However, as mentioned above, since any face can only have one position in a single frame, the availability of a candidate exemplar toward a face should not accumulate the choices of the other faces in the same frame as evidence. One step further, the choice made by a same-frame face actually repels the face in availability, and equation (2) is rewritten accordingly as (7).
Equations (6) and (7) ensure that faces from the same frame are mutually exclusive in the clustering process. A brief justification is as follows: if the networks’ insufficiency causes variance and aggregates a face into a cluster that contains another face from the same frame, they repel each other through a significant drop in availability, which causes the corresponding responsibility to be drastically reduced as well. This chain reaction does not stop until one of them is reassigned to another cluster. Such an inner-frame repelling effect is easily transmitted across frames through message passing. Also, since the similarity between faces of the same frame is minimal, the responsibility in (1) guarantees that a face will not be selected as the potential exemplar of another face from its frame at initialization.
Although position information is applied to improve accuracy, the key challenge of building incremental AP is that the previously computed data nodes are already connected through nonzero responsibilities and availabilities, whereas the newly arrived ones remain at zero (detached). Therefore, data nodes that arrive at different timestamps stay at different statuses with disproportionate relationships. Considering the time cost of message passing, simply continuing to run affinity propagation at each time step cannot solve the incremental AP problem.
In our case, video frames are strongly correlated data. We can assign a proper value to newly arrived faces in adjacent segments according to the previous segment without affecting the clustering purity. Compared with those of other people, the face vectors belonging to a particular person are supposed to stay closer to each other. That is, the same person’s faces should gather within a small area in the feature space. Thus, our incremental AP algorithm is proposed based on the fact that if two detected faces are in adjacent segments and refer to one person, they should not only be clustered into the same group but also have the same responsibilities and availabilities. Such a fact is not well considered in past studies of incremental affinity propagation [3, 20].
Following the commonly used notation in the AP algorithm, the similarity matrix at each time step is a square matrix whose dimension equals the number of faces collected up to that time, and the responsibility and availability matrices at the same time step have the same dimension as the similarity matrix. The update rule gives the responsibility and availability matrices at the current time step with respect to those at the previous time step.
Note that the dimension of the three matrices increases with time, as each time step brings the newly arrived face vectors of a segment and thus enlarges the number of faces.
The similarity matrix can then be easily updated for the newly arrived face vectors.
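One simple way to realize the idea that a newly arrived face should share the messages of the same person’s face in the previous segment is sketched below. It is an assumed simplification and not the paper’s update rule: each new face copies the responsibilities and availabilities of its most similar earlier face, and the names `nearest_counterparts` and `extend_messages` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def nearest_counterparts(new_vectors, old_vectors):
    """For every new face vector, the index of the most similar old vector."""
    return [max(range(len(old_vectors)), key=lambda j: cosine(v, old_vectors[j]))
            for v in new_vectors]


def extend_messages(R, A, counterparts):
    """Grow n x n message matrices to (n + m) x (n + m).

    New node n + j inherits the rows and columns of old node counterparts[j],
    so it starts connected instead of starting from zero messages.
    """
    n = len(R)
    source = list(range(n)) + list(counterparts)   # old index behind each node

    def grow(M):
        return [[M[i][k] for k in source] for i in source]

    return grow(R), grow(A)


# Two faces seen so far, one new face that resembles old face 1.
R = [[1, 2], [3, 4]]
A = [[0, -1], [-2, 0]]
match = nearest_counterparts([[0.1, 0.9]], [[1.0, 0.0], [0.0, 1.0]])
R2, A2 = extend_messages(R, A, match)
```

After such an initialization, only a few message passing iterations over the enlarged matrices would be needed, instead of restarting the propagation from zero for every segment.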
Let each segment be associated with the set of all face vectors extracted in it. The full process of the PIAP algorithm is summarized in Algorithm 1.
With our proposed positioned incremental affinity propagation (PIAP) algorithm, we can quickly cluster the faces in new segments, exclude the outliers, and discover newly spotted faces. Unlike newly spotted faces, the false positives of the detection stage cannot form sufficiently steady trajectories and are quickly isolated from the real faces during clustering. As a result, the scattered faces referring to the same person can be linked together sequentially to form our raw face trajectories.
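The sequential linking of clustered faces can be sketched in Python as follows. The data layout (a frame index and a box per detection, one cluster label per detection) and the `min_len` cut-off used to drop isolated detections are assumptions made for this illustration.

```python
from collections import defaultdict


def raw_trajectories(detections, labels, min_len=2):
    """Group detections by cluster label and order each group by frame.

    detections: list of (frame_index, box); labels: one label per detection.
    Clusters with fewer than min_len detections are dropped as outliers.
    """
    groups = defaultdict(list)
    for detection, label in zip(detections, labels):
        groups[label].append(detection)
    return {label: sorted(items, key=lambda d: d[0])
            for label, items in groups.items() if len(items) >= min_len}


def gaps(trajectory):
    """Frame indices missing between the first and last frame of a trajectory."""
    present = {frame for frame, _ in trajectory}
    first, last = trajectory[0][0], trajectory[-1][0]
    return [f for f in range(first, last + 1) if f not in present]


detections = [(0, (10, 10, 30, 30)), (2, (12, 10, 32, 30)),
              (1, (80, 40, 90, 50)), (3, (13, 11, 33, 31))]
labels = [7, 7, 4, 7]          # label 4 is an isolated detection
trajectories = raw_trajectories(detections, labels)
missing = gaps(trajectories[7])
```

The missing frame indices returned by `gaps` are exactly the intermittences that the trajectory refinement stage has to fill.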
III-C Trajectory Refinement
If we blur the identified non-streamers’ faces instantly on the raw trajectories and reassemble these blurred frames into videos, fast blinking mosaics can be observed everywhere. This indicates that the created raw trajectories are intermittent and unreliable due to inadequate learning samples, poor streaming quality, and noisy scenes. False negatives and false positives are significant in frame-wise face detection results. As shown in Fig. 2, a well-trained face detector could offer us a raw trajectory with small breaks. Such jumps within a raw trajectory can be quickly recovered by interpolation.
Previous work already noticed the uncertainty caused by unreliable detection results, but only discussed the false positives, which can be eliminated by PIAP clustering. Thus, to eliminate the false negatives and reach a high accuracy, we bring the trajectory refinement stage forward. In particular, the trajectory refinement stage employs another shallow CNN, named the proposal network, to propose suspicious face areas at a high recall rate and compensate for the missed detections within each segment. As segments cannot contain enough data to support a Gaussian distribution of the face vectors, we introduce the two-sample test based on empirical likelihood ratio statistics to solve the purification problem on the proposed faces. Combining the two-sample test with the raw trajectories, we reject the inappropriate bounding boxes generated by the proposal network and accept the proper ones for refinement.
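Recovering small breaks by interpolation can be sketched in Python with linear interpolation of the box coordinates. The trajectory layout and the `max_gap` limit on how many consecutive frames may be filled are assumptions made for this example.

```python
def interpolate_gaps(trajectory, max_gap=5):
    """Fill short gaps in a trajectory by linear interpolation.

    trajectory: list of (frame_index, (x1, y1, x2, y2)) sorted by frame.
    Gaps of more than max_gap missing frames are left open.
    """
    filled = [trajectory[0]]
    for (f0, b0), (f1, b1) in zip(trajectory, trajectory[1:]):
        missing = f1 - f0 - 1
        if 0 < missing <= max_gap:
            for f in range(f0 + 1, f1):
                t = (f - f0) / (f1 - f0)           # position between the ends
                filled.append((f, tuple(p + t * (q - p) for p, q in zip(b0, b1))))
        filled.append((f1, b1))
    return filled


short = interpolate_gaps([(0, (0, 0, 10, 10)), (2, (4, 0, 14, 10))])
long_gap = interpolate_gaps([(0, (0, 0, 10, 10)), (10, (4, 0, 14, 10))])
```

Long gaps are deliberately left open in this sketch, since a face that disappears for many frames may be occluded or may have moved irregularly, and a straight-line guess would then misplace the mosaic.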
III-C1 The Proposal Network
An individual proposal network aiming at a high recall rate is tough to train due to the conflict between precision and recall. According to our tests, in extreme cases, a loss function that maximizes the recall rate degenerates such a network after training. A cascaded architecture with multiple deep convolutional networks that predict faces in a coarse-to-fine manner could solve the proposal network’s training issue. Deep CNNs are joined head-to-tail in the cascaded architecture. The latter network receives the former’s output as input and refines the result to reach high accuracy. The former network has a higher recall rate, and the latter has a higher precision. Jointly, the coarse-to-fine manner is a process that continuously refines a high-recall result toward high accuracy. Therefore, the proposal network should be constructed in a fast shrinkage style so that the shallow structure and the recall maximizing strategy can satisfy the requirement of loose face detection. Also, the proposal network shall be trained within a cascaded architecture to restrict the drawbacks of the high recall rate.
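The conflict between precision and recall can be made concrete with a short Python sketch that evaluates a detector at two confidence thresholds. The scores and ground-truth flags are made-up toy values, and the function name is hypothetical.

```python
def precision_recall(scores, is_face, threshold):
    """Precision and recall when every score >= threshold counts as a face."""
    pairs = list(zip(scores, is_face))
    tp = sum(1 for s, y in pairs if s >= threshold and y)
    fp = sum(1 for s, y in pairs if s >= threshold and not y)
    fn = sum(1 for s, y in pairs if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Toy confidence scores of six candidate windows and whether each is a face.
scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2]
is_face = [True, True, False, True, False, False]

strict = precision_recall(scores, is_face, 0.7)    # few, reliable proposals
loose = precision_recall(scores, is_face, 0.25)    # many, noisy proposals
```

With the strict threshold every proposal is a face but one face is missed; with the loose threshold all faces are proposed at the price of additional non-faces. A coarse stage operated in the loose regime therefore needs a later stage to reject the non-faces it lets through.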
To save effort in further refinement, we peel off the first network from MTCNN and cull its last regression layer. MTCNN is a state-of-the-art face detection network that leverages a cascaded architecture with three deep convolutional networks to predict face and landmark locations in images. Thanks to the coarse-to-fine cascaded structure, the very first network in MTCNN, P-net, and our proposal network essentially share the same goal: proposing as many suspicious face areas as possible. MTCNN is trained as a whole, and the output of P-net is revised by the subsequent R-net and O-net to reach high accuracy. Thus, P-net does not incur the low-precision problem. Since it is the first network in a cascaded face detection network, P-net retains a high recall rate as well. The structure of the proposal network is shown in Fig. 3.
The faces detected by the initial face detection network are already cropped in the preprocessing stages before embedding. Thus, those faces will not be detected again by the proposal network. If some of the lost detections are captured by the proposal network and accepted by the subsequent two-sample test, they are filled into the raw trajectories. Afterward, interpolation runs again for smoothing. In this way, we can recursively refine the trajectories. Since we do not require the proposal network to spot all the lost detections, we also control the threshold and apply Non-Maximum Suppression (NMS) afterward to suppress false positives under a high recall rate.
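The thresholding and NMS step described above can be sketched as follows. This is a minimal illustration, not the authors’ implementation: the function names `iou` and `nms`, the `(x1, y1, x2, y2)` box format, and the default values (a score threshold of 0.7 taken as the first MTCNN threshold, and an IoU rate of 0.7 as reported in the parameter settings) are assumptions made for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(proposals, score_threshold=0.7, iou_threshold=0.7):
    """Greedy NMS over (box, score) pairs; both thresholds are assumed values."""
    # Drop low-confidence proposals, then visit the rest from best to worst.
    candidates = sorted(
        (p for p in proposals if p[1] >= score_threshold),
        key=lambda p: p[1],
        reverse=True,
    )
    kept = []
    for box, score in candidates:
        # Keep a box only if it does not overlap too much with a better one.
        if all(iou(box, kept_box) <= iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


proposals = [
    ((0, 0, 10, 10), 0.90),
    ((1, 0, 11, 10), 0.80),    # overlaps the first box with IoU of about 0.82
    ((50, 50, 60, 60), 0.75),
    ((0, 0, 5, 5), 0.30),      # below the score threshold
]
kept = nms(proposals)
```

A looser IoU threshold keeps more overlapping proposals and therefore preserves recall at the cost of more candidates for the later two-sample test.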
III-C2 Two-Sample Test Based on Empirical Likelihood Ratio
Sending the gap areas of the raw trajectory to the proposal network, we obtain some of the lost detections along with other false positives. Associated through the Hungarian algorithm, the proposed faces form candidate trajectories, which correspond to the orange dashed lines in Fig. 4. An observation of a raw trajectory is marked as a sequence . Then, a possible candidate trajectory based on the proposal network results is a sequence . Generally, with false positives yielded by the proposal network, there is likely to be more than one . Illustrated in Fig. 3, and are and as we use the output of the last but one layer.
The price of a high recall rate is a large number of false positives. A direct method to exclude these false positives and identify the rest is to rerun the embedding network and feed the results to PIAP once again. Although this method might work in theory, we cannot ensure real-time efficiency, since the number of proposed faces could be substantial. Furthermore, there is no guarantee that the deep network embeds false positives as outliers, because the embedding network is trained on a face dataset. Besides embedding networks, a Siamese network might be helpful as well. A Siamese network is designed to identify slight differences between two similar inputs. However, apart from the unaffordable cost of training and running the Siamese network online, it is susceptible to noise due to its shared-weight structure. Therefore, with many false positives and occluded partial faces yielded by the proposal network, the Siamese network cannot sieve the partial faces out of the suspicious faces.
Fig. 4 reveals the relation between and . The solid line corresponding to indicates the seamless trajectories after interpolation. Red areas on the solid line are the breaks recovered by interpolation. Under well-established face detection networks, should hold, meaning that the number of lost detections should be less than the number of detections in the raw trajectories. Therefore, the inference we can make is that the lost detections could be obtained from the existing detections through shifts, rotations, reversals, etc. Thereby, with the raw face trajectories as priors, embedding the faces is not a necessity, and the appropriate could be obtained by excluding the candidates that consist of false positives. That is, as in Fig. 4, the trajectory refinement process is just like linking dense dots (proposed faces) into dashed lines (candidate trajectories) and then into a solid line (refined trajectories).
Some state-of-the-art studies in the MFT field applied a similar concept for prediction optimization. They apply a Gaussian Process (GP) model to reflect an inference correlation of data nodes. Under ideal conditions, is expected to be captured directly by the detection network. However, some noise stops the deep CNN from functioning and requires the help of the proposal network. Therefore, and could be written as:
is Gaussian Process with hyperparameter, and independent noise term .
is the identity matrix.
The Central Limit Theorem (CLT) implies that the outputs of converge to a multivariate Gaussian distribution. A regular way to derive high-level parametric abstractions of and is to find the maximum likelihood estimates (MLE) of , denoted as :
and then, the range of decides whether and stay close enough such that they could be regarded as samples from the same Gaussian distribution.
This GP model is sufficient for false positive rejection in most offline cases. However, in video live streaming, we manipulate short video segments that may not contain enough frames. The number of faces within a segment is not significantly greater than the dimension of the face vectors, and this violates the initial assumption of the CLT. That is, the probability density function of the face vectors’ distribution within an individual segment can no longer support the Gaussian distribution. Therefore, such a GP model cannot function well. Inspired by, for better efficiency and robustness, we propose a two-sample test algorithm that rejects the proposed false positives based on empirical likelihood ratio statistics.
The goal of the two-sample test is to determine, on the basis of samples, whether two distributions are different. In our case, we test whether the candidate trajectory and the corresponding raw face trajectory are two face samples that come from the same population (distribution). The statistical null hypothesis for the two-sample test is . stands if and only if and are from the same distribution. Otherwise, and are different at a significance level and we should reject and the corresponding candidate trajectory. As described in (12), a fully Bayesian treatment of neural networks is also a Gaussian distribution. We then assume the domain of face vectors is a compact set defined in a Gaussian Reproducing Kernel Hilbert Space (RKHS) with reproducing kernel . Following the widely used Maximum Mean Discrepancy (MMD), we measure the distributions and by embedding them into such a Gaussian RKHS . Denote the mean embeddings of distributions and in as and ; the question of finding optimal estimators of MMD in kernel-based two-sample tests then amounts to finding statistically optimal estimators for constructing these tests. Similar to the Gaussian function in (11), MMD represents the supremum of the mean embedding of and after the projection function :
according to equation (13),
stands for the Gaussian kernel function of .
As , if we pick the nearest (in time) face vectors in , and according to (15), set
for simplification, the linear-time unbiased estimator of MMD is proposed in. The unbiased estimator functions similarly to the MLE of and under a multivariate Gaussian Process with a large number of sampled points.
Denote for further simplification. and are i.i.d., which implies are also i.i.d. observations, i.e., where is the mean value. Hence, we introduce the empirical likelihood ratio (ELR) statistic [7, 18] for solving the two-sample test of as:
where is the empirical likelihood ratio (ELR), and is subject to the normalization and constraints. could ensure the empirical mean of equals zero. Therefore, any subtle difference between and can be captured by the empirical likelihood ratio statistic through the pairwise discrepancy . The explicit solution of (17) is attained from a Lagrange multiplier argument according to :
According to (17) and (18), the supremum is transferred to the ELR test for statistic . If the null hypothesis holds,
where converges in this distribution. Otherwise, we will reject the null hypothesis when , and
The rejection threshold of (20) can be computed with an off-the-shelf Python package. Therefore, leveraging as the new parametric feature to eliminate the false positives raised by the proposal network is extremely time-saving. The cost of getting is solely related to the computation of , and according to (15), this cost is linear in the sample size. Our two-sample test algorithm based on the empirical likelihood ratio statistic achieves impressive results in real tests.
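The test described in this subsection can be sketched in a few lines. The sketch below is illustrative and makes several assumptions that are not taken from the paper: the pairwise discrepancy follows the usual linear-time MMD form of the kernel two-sample test, the empirical likelihood ratio for a zero mean is solved by bisection on the Lagrange multiplier, the limiting distribution is taken as chi-square with one degree of freedom (the standard empirical likelihood result for a scalar mean), and the significance level of 0.05 with critical value 3.841 is hard-coded instead of being queried from a statistics package. All function names and the bandwidth `sigma` are hypothetical.

```python
import math

# Assumed critical value: 95th percentile of chi-square with 1 degree of freedom.
CHI2_1_CRITICAL_005 = 3.841


def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian kernel between two vectors; sigma is an assumed bandwidth."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def pairwise_discrepancies(xs, ys, sigma=1.0):
    """Linear-time MMD terms: one term per disjoint pair of samples."""
    n_pairs = min(len(xs), len(ys)) // 2
    terms = []
    for i in range(n_pairs):
        x1, x2 = xs[2 * i], xs[2 * i + 1]
        y1, y2 = ys[2 * i], ys[2 * i + 1]
        terms.append(
            gaussian_kernel(x1, x2, sigma) + gaussian_kernel(y1, y2, sigma)
            - gaussian_kernel(x1, y2, sigma) - gaussian_kernel(x2, y1, sigma)
        )
    return terms


def elr_statistic(h, iterations=200):
    """Return -2 log(ELR) for the hypothesis that the mean of h is zero."""
    if not h:
        raise ValueError("need at least one discrepancy term")
    lo_h, hi_h = min(h), max(h)
    if lo_h == 0.0 and hi_h == 0.0:
        return 0.0           # all terms are exactly zero
    if lo_h >= 0.0 or hi_h <= 0.0:
        return math.inf      # zero is not inside the range of the terms
    # The multiplier must keep every weight positive: 1 + lam * h_i > 0.
    lo, hi = -1.0 / hi_h, -1.0 / lo_h
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        # The constraint function is strictly decreasing in the multiplier.
        g = sum(v / (1.0 + mid * v) for v in h)
        if g > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return 2.0 * sum(math.log(1.0 + lam * v) for v in h)


def same_distribution(xs, ys, sigma=1.0, critical=CHI2_1_CRITICAL_005):
    """Accept a candidate trajectory when the statistic stays below the threshold."""
    return elr_statistic(pairwise_discrepancies(xs, ys, sigma)) <= critical


raw = [(0.1 * i, 0.0) for i in range(8)]            # toy 2-D "face vectors"
far = [(0.1 * i + 10.0, 0.0) for i in range(8)]     # clearly different sample
accept_same = same_distribution(raw, raw)
accept_far = same_distribution(raw, far)
```

Both the discrepancy terms and each bisection step cost time linear in the number of sample pairs, which matches the linear cost noted above. The sketch has no safeguards for roots that lie extremely close to the boundary of the admissible multiplier interval.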
With the efforts of (19) and (20), the accepted is added to ’s trajectory. Fast interpolation is applied once again to fill the tiny breaks or jumps that still exist and thus generate a refined trajectory.
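As an illustration of this gap-filling step, the sketch below fills missing frames of a trajectory by linear interpolation between the nearest observed boxes. Linear interpolation, the dictionary layout (frame index mapped to an `(x1, y1, x2, y2)` box), and the name `fill_gaps` are assumptions made for the example; the fast interpolation used in FPVLS may differ.

```python
def fill_gaps(trajectory):
    """Linearly interpolate boxes for frames missing between observed frames.

    trajectory maps a frame index to a box (x1, y1, x2, y2).
    """
    frames = sorted(trajectory)
    filled = dict(trajectory)
    for start, end in zip(frames, frames[1:]):
        span = end - start
        for frame in range(start + 1, end):
            t = (frame - start) / span
            filled[frame] = tuple(
                (1.0 - t) * a + t * b
                for a, b in zip(trajectory[start], trajectory[end])
            )
    return filled


observed = {0: (0, 0, 10, 10), 4: (8, 4, 18, 14)}
refined = fill_gaps(observed)
```

Frames before the first and after the last observation are left untouched, so the sketch only closes breaks inside a trajectory.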
The final pixelation is applied to the refined non-streamers’ trajectories through Gaussian filters. The choice of Gaussian kernel offers different pixelation styles. The video segments are broadcast directly to the audience after postprocessing. The overall procedure of FPVLS is described in Algorithm 2.
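A minimal sketch of Gaussian-filter pixelation on one face box is given below, assuming a grayscale frame stored as a list of rows and a box `(x1, y1, x2, y2)` that lies inside the frame, with `x2` and `y2` exclusive. The function names, the three-sigma kernel radius, and the edge-replication border handling are choices made for the example, not details from the paper; a production system would more likely call an optimized image library.

```python
import math


def gaussian_weights(sigma, radius):
    """Normalized 1-D Gaussian weights for offsets -radius..radius."""
    raw = [math.exp(-(d * d) / (2.0 * sigma * sigma))
           for d in range(-radius, radius + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def clamp(value, upper):
    """Clamp an index into [0, upper - 1] (edge replication)."""
    return min(max(value, 0), upper - 1)


def blur_face(image, box, sigma=2.0):
    """Return a copy of image with a separable Gaussian blur applied inside box."""
    x1, y1, x2, y2 = box
    height, width = len(image), len(image[0])
    radius = max(1, int(3 * sigma))
    weights = gaussian_weights(sigma, radius)
    offsets = range(-radius, radius + 1)

    # Horizontal pass over the box columns, for every row the vertical pass reads.
    tmp = [row[:] for row in image]
    for y in range(max(0, y1 - radius), min(height, y2 + radius)):
        for x in range(x1, x2):
            tmp[y][x] = sum(w * image[y][clamp(x + d, width)]
                            for d, w in zip(offsets, weights))

    # Vertical pass; only pixels inside the box are written to the output.
    out = [row[:] for row in image]
    for y in range(y1, y2):
        for x in range(x1, x2):
            out[y][x] = sum(w * tmp[clamp(y + d, height)][x]
                            for d, w in zip(offsets, weights))
    return out


frame = [[0.0] * 9 for _ in range(9)]
frame[4][4] = 100.0                      # a single bright pixel inside the face box
blurred = blur_face(frame, (2, 2, 7, 7), sigma=1.0)
```

A larger `sigma` spreads each pixel over a wider neighborhood and gives a stronger blur, which is one way the choice of kernel changes the pixelation style.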
IV Experiments and Discussions
As the face pixelation problem in live video streaming has not been studied before, there are no available datasets or benchmark tests for reference. We collected a live streaming video dataset from the YouTube and Facebook platforms and manually pixelated the faces for comparison. In total, 20 streaming video fragments from 4 different streaming categories (a: street interview; b: dancing; c: street streaming; d: flash activities) are labeled. These labeled fragments last 1312 seconds in total. Since they are all streamed at 30 FPS, each fragment contains around 2000 frames on average. As demonstrated in Table I, these fragments are further divided into four groups according to the broadcasting resolution and the number of people who show up in a stream. Live streaming videos with at least 720p resolution are marked as high-resolution , and the rest as low-resolution . Similarly, live streaming videos that contain more than two people are sophisticated scenes , and the rest are naive ones .
All the parameters remain unchanged across all live-streaming video tests. for every segment contains 150 frames. 10 seconds for buffering as . CosFace resizes the cropped face to ; weight decay is 0.0005. The learning rate is initialized to 0.005 and drops every 60K iterations. The damping factor for PIAP is the default 0.5, as in AP. The detection thresholds for MTCNN are [0.7, 0.8, 0.9]. The proposal network inherits the threshold from MTCNN, and the resize factor is 0.702; the IOU rate for NMS is 0.7. for ELR statistic.
IV-A3 Evaluation Metrics
Taking our manual pixelation results on the collected video live streaming dataset as the baseline, we migrated and adapted several algorithms for comparison: state-of-the-art MFT methods such as the IOU tracker and the POI tracker; commercial tools currently in use, such as YouTube Studio and Microsoft Azure; and more advanced face detection and embedding networks. Multi-Face Pixelation Accuracy (MFPA), Multi-Face Pixelation Precision (MFPP), and Most Pixelated (MP) frames are the metrics of most concern. MFPA reflects the overall performance of an algorithm in correctly placing mosaics on non-streamers’ faces, and MFPP reflects the degree of drifting and tracking loss. A low MFPP means that drifting happened in the early frames of the video and the algorithm quickly broke down in the tests. YouTube and Azure are offline pixelation tools but are still tested here as examples of real-life commercial tools. Besides, the Number of People (NoP) represents the ability of an algorithm to handle, on its own, the ill-defined number of people in a live stream. NoP is another important metric because the ability to spot new and vanished faces is critical in bringing a pixelation method to real applications. The POI tracker has two versions, and the online version with Mask R-CNN detection is used here. FPVLS (without refinement) is the result of FPVLS without the trajectory refinement stage.
As shown in Table II, all the algorithms achieve their best performance on the High-resolution Naive-scene () sub-dataset. The YouTube and Azure offline pixelation tools adopt tracker-based algorithms without calibration. Therefore, both of them are unsatisfactory in accuracy and precision. Their low pixelation precision also indicates that they are prone to breaking down quickly in the tests. KCF and ECO are optimized here for face tracking and pixelation tasks. However, affected by the noisy detection results, KCF-style online learning algorithms are ill-suited to video live streaming. Although ECO applies higher-dimensional features than KCF, the MFPA of ECO is lower than that of KCF. The problem with combining correlation filters and an online learning strategy is that the correlation filter always yields many partially occluded faces as noisy samples for the learning network. Therefore, both KCF and ECO suffer from fast and unrecoverable drifting in video live streaming. By comparison, state-of-the-art tracking-by-detection methods are quite attractive. The POI tracker shares some fundamental ideas with FPVLS in the process of detection, embedding, and then trajectory generation. FPVLS without the trajectory refinement stage achieves performance similar to that of the POI tracker. Overall, we achieve better performance in terms of MFPA, MFPP, and MP. Specifically, FPVLS noticeably increases the most pixelated (MP) frames on the entire dataset, indicating that FPVLS can robustly construct longer trajectories and that face IDs are correctly maintained within trajectories. As listed in the bottom section of Table II, the particular detection and embedding networks are not that important in FPVLS as long as they are deep, well-tuned networks.
|POI Tracker (online)||0.62||0.61||0.70||0.67||0.70||0.70||0.83||0.81||275||✓||✓||20~40|
IV-A4 Ablation Study
The performance of FPVLS results from the joint action of the raw face trajectory generation and the trajectory refinement stages. Our ablation study explores the improvement brought by each stage. In the last section of Table II, the result of FPVLS without refinement directly demonstrates the effectiveness of the trajectory refinement stage. Both MFPA and MFPP rise after refinement, which means that the refinement stage significantly improves the overall accuracy of FPVLS. The accuracy improvement and the much longer MP frames are both owed to the trajectory refinement stage, which contains the interpolation and the collaboration of the proposal network and the two-sample test. Also, switching the face detection and recognition networks does not have a profound influence on the results. However, according to our further tests, well-tuned preprocessing networks are still a necessity. If sufficiently poor detection and embedding networks are used, FPVLS fails badly.
Table III shows the clustering results generated by PIAP in the raw trajectory generation stage. Positioned AP (PAP) is the clustering algorithm that leverages position information but does not process data incrementally. As we can see, the position information helps to boost the accuracy and purity of AP clustering on every sub-dataset. Moreover, the time cost of each convergence is greatly reduced by the incremental processing. As our test data does not contain hours-long live stream videos, the actual time for AP could grow larger as the number of faces grows. Compared with PAP, PIAP improves efficiency while maintaining the accuracy of AP.
Table IV lists the recall rate and precision of the proposal network. The lost detections recovered by interpolation are excluded here. The denominator of the recall rate is the number of lost detections labeled in the gaps, which are the areas sent to the proposal network for re-detection. The proposal network performs best under high-resolution naive scenes. An imbalance between precision and recall appears in both sophisticated-scene groups, meaning that the proposal network is mainly affected by the number of people in a scene and proposes a considerable number of false positives. Therefore, the numerical two-sample test based on ELR proves superior to deep embedding or recognition networks in efficiency and robustness. Combining the bottom section of Table II with Table IV, the two-sample test based on ELR statistics does work to reject the candidate trajectories, because the low precision of the proposal network turns into an MFPP boost after the two-sample test and interpolation.
Fig. 5 shows the specific function of each algorithm in a real streaming scene. Fig. 5(a) is the same scene that we demonstrated in Fig. 1, pixelated solely through face detection and recognition networks. For comparison, we feed around 1000 pictures of the streamer to the recognition network in advance for training. The classification result of the recognition network is directly used for pixelation. In the rightmost snapshot of Fig. 5(a), the detection and recognition networks failed to capture the hawker’s face and produced a false positive, marked with a purple rectangle. Also, in the second and the penultimate snapshots, the detection network failed to locate the hawker’s face. Such a straightforward method makes the mosaics blink on faces. Then, we feed the results of (a) to PIAP clustering. The rightmost snapshot in (b) shows that PIAP excluded the false positives caused by the detection network. Raw trajectories are generated in (b). Further, the proposal network is applied to the raw trajectories to retrieve the lost detections. In (c), the proposal network returns the lost faces along with some other non-face areas marked with brown rectangles. The ELR statistics managed to cull the non-faces through the two-sample test in (d). The result in (d) is fully processed by FPVLS except for the final pixelation. According to (a)-(d), the effects of both the raw face trajectory generation and the trajectory refinement stages are significant.
More qualitative results (more video results at https://FPVLS.github.io) are shown in Fig. 6 to demonstrate the performance of FPVLS under noisy, low-resolution video live streaming scenes. The POI tracker generates the mosaics in red circles, and FPVLS generates those in yellow rectangles. The first two rows show that the same over-pixelation problem also occurs in the POI tracker. In the red circles, puzzling mosaics are placed on the frontal streamer’s face while two people’s faces cross over. However, because it is based on detection and embedding, FPVLS handles such over-pixelation correctly. The temporarily invisible non-streamer’s face is not pixelated, which leaves the streamer’s face clean. The last two rows show a crowded flash event that is hard for trackers to deal with. Limited by tracklets, trackers cannot instantly recover from drifting in noisy scenes. Besides other misplaced mosaics, the tracker blurs the streamer, who wears a pink trench coat, because it drifts and is incapable of calibration. Although it is not perfect, FPVLS does manage to pixelate most of the faces correctly and, in particular, leaves the specified streamer’s face (in the blue rectangle) unpixelated. Our method is not affected by the serious drifting issue in noisy scenes.
The main cost of FPVLS lies in the face embedding algorithm. The embedding of one face takes 10-15 ms on our machine with an i7-7800X, dual GTX 1080 GPUs, and 32 GB of RAM. The face detection for a frame, including compensation, takes 3 ms. Another time-consuming part is the initialization of PIAP, which takes 10-30 ms depending on the case. Furthermore, each incremental propagation loop takes 3-5 ms. So FPVLS can satisfy the real-time efficiency requirement, and under extreme circumstances that contain many faces, we can let the detection and recognition skip frames for efficiency.
Leveraging raw face trajectories based on PIAP clustering, and trajectory refinement based on the proposal network combined with the two-sample test, FPVLS manages to solve the face pixelation problem in video live streaming. PIAP deals with the ill-defined cluster number issue and boosts the accuracy of classic AP through position information. The two-sample test based on empirical likelihood ratio statistics is newly introduced to substitute for the widely used Gaussian process, owing to the lack of sufficient data within a single segment. The comparison in Table II shows the superior performance of FPVLS. With Apple A12 chips, FPVLS could potentially be transferred to smartphones and brought into real applications without much computation burden.
As discussed in Section IV, low-resolution videos are still quite challenging, because deep networks can degrade to poor performance and break the robustness of FPVLS under low-resolution scenarios. We will further extend FPVLS to behave better in low-resolution scenarios by involving correlation-filter-style algorithms as extra compensation.
-  (2011) Multiple object tracking using k-shortest paths optimization. IEEE transactions on pattern analysis and machine intelligence 33 (9), pp. 1806–1819. Cited by: §II.
-  (2017) High-speed tracking-by-detection without using image information. In 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), pp. 1–6. Cited by: §II.
-  (2015) Diversity-induced multi-view subspace clustering. In , pp. 586–594. Cited by: §III-B2, §III-B2.
-  (2016) Empirical likelihood test for high-dimensional two-sample model. Journal of Statistical Planning and Inference 178, pp. 37–60. Cited by: §III-C2, §III-C2.
-  (2017) Eco: efficient convolution operators for tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6638–6646. Cited by: §II.
-  (2019) Arcface: additive angular margin loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4690–4699. Cited by: §III-A2.
-  Linear kernel tests via empirical likelihood for high-dimensional data. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3454–3461. Cited by: §III-C2, §III-C2.
-  (2016) Legal and ethical implications of mobile live-streaming video apps. In Proceedings of the 18th International Conference on Human-Computer Interaction with Mobile Devices and Services Adjunct, pp. 722–729. Cited by: §I.
-  (2007) Clustering by passing messages between data points. science 315 (5814), pp. 972–976. Cited by: §III-B1, §III-B1, §III-B.
-  Dimensionality reduction for supervised learning with reproducing kernel hilbert spaces. Journal of Machine Learning Research 5 (Jan), pp. 73–99. Cited by: §III-C2.
-  (2012) A kernel two-sample test. Journal of Machine Learning Research 13 (Mar), pp. 723–773. Cited by: §III-C2.
-  (2014) High-speed tracking with kernelized correlation filters. IEEE transactions on pattern analysis and machine intelligence 37 (3), pp. 583–596. Cited by: §II.
-  (2008) Robust object tracking by hierarchical association of detection responses. In European Conference on Computer Vision, pp. 788–801. Cited by: §III-C.
-  (2017) Arttrack: articulated multi-person tracking in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6457–6465. Cited by: §II.
-  3D convolutional neural networks for human action recognition. IEEE transactions on pattern analysis and machine intelligence 35 (1), pp. 221–231. Cited by: §II.
-  (2014) Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1725–1732. Cited by: §II.
-  (2018) A prior-less method for multi-face tracking in unconstrained videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 538–547. Cited by: §II, §III-C2, §III-C2.
-  (2001) Empirical likelihood. Chapman and Hall/CRC. Cited by: §III-C2.
-  (2016) Up, periscope: mobile streaming video technologies, privacy in public, and the right to record. Journalism & Mass Communication Quarterly 93 (2), pp. 312–331. Cited by: §I.
-  (2014) Incremental affinity propagation clustering based on message passing. IEEE Transactions on Knowledge and Data Engineering 26 (11), pp. 2731–2744. Cited by: §III-B2.
-  (2018) Pyramidbox: a context-assisted single shot face detector. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 797–813. Cited by: §III-A2.
-  (2015) Learning spatiotemporal features with 3d convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 4489–4497. Cited by: §II.
-  (2017) Optimal video streaming in dense 5g networks with d2d communications. IEEE Access 6, pp. 209–223. Cited by: §I.
-  (2013) Multi-exemplar affinity propagation. IEEE transactions on pattern analysis and machine intelligence 35 (9), pp. 2223–2237. Cited by: §III-B1.
-  (2018) Cosface: large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5265–5274. Cited by: §III-A2.
-  (2016) Poi: multiple object tracking with high performance detection and appearance feature. In European Conference on Computer Vision, pp. 36–42. Cited by: §II, §III-C.
-  (2017) IPrivacy: image privacy protection by identifying sensitive objects via deep multi-task learning. IEEE Transactions on Information Forensics and Security 12 (5), pp. 1005–1016. Cited by: §III.
-  (2015) Beyond short snippets: deep networks for video classification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4694–4702. Cited by: §II.
-  (2012) Gmcp-tracker: global multi-object tracking using generalized minimum clique graphs. In European Conference on Computer Vision, pp. 343–356. Cited by: §II.
-  (2016) Tracking persons-of-interest via adaptive discriminative features. In European conference on computer vision, pp. 415–433. Cited by: §II.
-  (2014) Technology survey on video face tracking. In Imaging and Multimedia Analytics in a Web and Mobile World 2014, Vol. 9027, pp. 90270F. Cited by: §II.
-  (2016) Joint face representation adaptation and clustering in videos. In European conference on computer vision, pp. 236–251. Cited by: §II, §III-A2, §III-C1.
-  (2017) Law infringements in social live streaming services. In International Conference on Human Aspects of Information Security, Privacy, and Trust, pp. 567–585. Cited by: §I.