Game of Drones - Detecting Streamed POI from Encrypted FPV Channel

Abstract

Drones have created a new threat to people’s privacy. We are now in an era in which anyone with a drone equipped with a video camera can use it to invade a subject’s privacy by streaming the subject in his/her private space over an encrypted first person view (FPV) channel. Although many methods have been suggested to detect nearby drones, they all suffer from the same shortcoming: they cannot identify exactly what is being captured, and therefore they fail to distinguish between the legitimate use of a drone (for example, filming a selfie from the air) and illegitimate use that invades someone’s privacy (when the same operator uses the drone to stream the view into the window of his neighbor’s apartment), a distinction that in some cases depends on the orientation of the drone’s video camera rather than on the drone’s location. In this paper we shatter the commonly held belief that the use of encryption to secure an FPV channel prevents an interceptor from extracting the point of interest (POI) that is being streamed. We show methods that leverage physical stimuli to detect whether the drone’s camera is directed towards a target in real time. We investigate the influence of changing pixels on the FPV channel (in a lab setup). Based on our observations we demonstrate how an interceptor can perform a side-channel attack to detect whether a target is being streamed by analyzing the encrypted FPV channel that is transmitted from a real drone (DJI Mavic) in two use cases: when the target is a private house and when the target is a subject.

I Introduction

The proliferation of consumer drones over the last few years [1] has created a new privacy threat. We are living in an era in which anyone with a drone equipped with a video camera can invade another individual’s privacy by maneuvering the drone to the individual’s house and directing the drone’s camera to the window of the house in order to film or record the subject in his/her private space. Many privacy invasion incidents have been reported in the media, and laws are being updated to deal with this new threat [2, 3, 4, 5, 6, 7, 8].

State of the art drones provide video piloting capabilities, a.k.a. first person view (FPV), a communication channel designed to (1) stream the video captured by the drone’s video camera to the operator’s controller in order to present the video stream to the operator in real-time, and (2) maneuver the drone by transmitting commands from the controller to the drone. FPV provides excellent infrastructure for a malicious operator to invade someone’s privacy without being detected because: (1) it eliminates the need for a malicious operator to be close to the drone or target by allowing the operator to maneuver the drone from far away to a target that is also far away from his/her location, (2) its traffic can be encrypted, and (3) it supports HD resolutions that enable the attacker to obtain high resolution images and close-ups (by using the video camera’s zooming capabilities) that are captured by the drone far from the target POI.

Extracting a target POI from an FPV channel has interested researchers for a long time. Many studies have suggested methods for detecting whether a drone is nearby; however, none of them can detect exactly what is being captured, and therefore they fail to distinguish between the legitimate use of the drone (e.g., to film a selfie from the air) and illegitimate use (e.g., to stream the view into the window of someone else’s apartment), a distinction that in some cases depends on the orientation of the drone’s video camera rather than on the drone’s location. There are many known cases in which a POI was extracted from an unencrypted video stream [9, 10, 11, 12]; however, detecting a target POI from an encrypted video stream remains a challenge.

In this research we shatter the commonly held belief that the use of encryption to secure a surveillance video channel transmitting in real-time prevents an interceptor from extracting the POI that is being tracked by a drone. We present different methods that can be used by an interceptor to detect whether a particular POI (e.g., his/her house, a subject) is being tracked by a drone, even if the FPV channel is encrypted; in order to accomplish this, our proposed methods trigger a physical stimulus that influences the encrypted FPV channel, and can be used by an interceptor to detect if a specific POI is being tracked. We investigate the influence of changing pixels on the FPV channel in a lab setup. Based on our observations we demonstrate how an interceptor can perform a side-channel attack to detect whether a target is being streamed by analyzing the encrypted FPV channel that is transmitted from a real drone (DJI Mavic) in two use cases: when the target is a private house and when the target is a subject. We investigate the amount of time that the physical stimulus must be activated in order to achieve an FPR of zero in each of the use cases.

I-A Motivation

Interest in intercepting a drone’s traffic in order to extract the captured POI from an FPV channel is not limited to civilians; there are many known cases in which a military organization has successfully intercepted an unencrypted surveillance camera’s video stream transmitted from a rival’s drone and managed to identify the POI that was under surveillance [9, 10, 11, 12]. Just recently, the Israel Defense Forces confirmed that 12 soldiers were killed in 1997 as a result of intercepted intelligence that was transmitted from an Israeli surveillance drone over an unencrypted channel, giving Hezbollah advance knowledge of a naval commando operation deep inside Lebanon [13, 14]. However, to the best of our knowledge, there are no known cases in which interceptors were able to extract a POI from an encrypted FPV channel.

I-B Contributions

This study makes the following contributions:

  • Extraction of a POI from an encrypted FPV channel - We are the first to demonstrate a method that extracts a targeted POI from an encrypted FPV channel and can be used to distinguish between the legitimate use of a drone that does not invade a subject’s privacy and illegitimate use; a difference that depends on the angle of the video camera and not on the location of the drone, as demonstrated in Figure 2.

  • External Interception Model - Other studies [15, 16, 17] that aimed to classify video streams sent from a VOD supplier (e.g., Netflix, YouTube, etc.) to a client over the Internet used traffic that was captured from the client’s network. This setup raises two problems: (1) in real life, in order to capture traffic from a targeted network, a computer inside the targeted network must be infected with malware, and (2) most of these studies claim that their model works using external network interception, but major challenges arise when using an external interception model (e.g., packet loss), and their models’ robustness to such challenges is unclear. Our study presents an external interception model using a radio frequency (RF) scanner that was empirically evaluated outside a lab setup and does not require any network infection.

  • Effective for Encrypted Traffic - Our methods work by analyzing encrypted traffic without any prior knowledge regarding the cryptography algorithm that is being used. We only use the length of the cryptogram which can be extracted from the second layer of the OSI model instead of its higher levels which were used in other studies [15, 16] to classify video streams.

The rest of this paper is structured as follows: in Section II we discuss various protocols and coding algorithms for video streams. In Section III we review related works from two main areas: information leakage from video streams and known methods of detecting nearby drones. In Section IV we present (1) an adversarial model of an attacker that performs a privacy invasion attack, and (2) an external interception model for the detection of a captured POI. In Section V we investigate the influence of changing pixels on the transmitted traffic (in a lab setup), and in Section VIII we discuss legal solutions for privacy invasion attacks.

II Background

Modern drones provide video piloting capabilities (FPV channel), in which a live video stream is sent from the drone to the pilot (operator) on the ground, so the pilot can fly the drone as if he/she were onboard instead of watching the drone from his/her actual ground position. The FPV channel also allows the operator to control the drone using a remote controller. Three types of technologies dominate the FPV market: Wi-Fi FPV, 2.4GHz analog FPV, and 5.8GHz analog FPV. Wi-Fi FPV is, by far, the most popular method used to include FPV in budget RC drones because: (1) any Android/iOS smartphone (or tablet) on the market can be used to operate the drone, and (2) the additional hardware required includes only a Wi-Fi FPV transmitter (that is connected to the camera of the drone), instead of the dedicated radio transmitter and dedicated controller required by the analog FPV types. Almost every FPV-enabled drone selling for less than $100 uses Wi-Fi FPV [18]; in addition, there are dozens of kinds of Wi-Fi FPV drones available for purchase [19, 20, 21], and their prices vary from $30 for a simple drone up to hundreds of dollars for a professional drone (DJI Mavic, DJI Spark, Parrot Bebop 2) with HD resolutions (up to 4K) and extended functionality (e.g., follow me, return home, etc.).

Wi-Fi communication between the controller and the drone is sent over a secured access point that is opened by either the drone or the controller (both parties are connected to the access point). Using dedicated hardware (e.g., a controller with a Wi-Fi signal range extender), current drone models provide operators the ability to control a drone using FPV from a distance of a few kilometers over Wi-Fi channels [22, 23, 24].

The video that is captured by the drone camera is streamed to its controller using real-time end-to-end media streaming protocols (RTP). The RTP standard describes two sub-protocols: RTP (Real-Time Transport Protocol) and RTCP (RTP Control Protocol). RTP is used to transport media, and the majority of its implementations are built on the user datagram protocol (UDP), since real-time multimedia streaming applications require timely delivery of information and can often tolerate some packet loss to achieve this goal. RTCP transports statistics for a media connection and information such as the transmitted octet and packet counts, packet loss, packet delay variation, and round-trip delay time. RTP and RTCP each have a secured version SRTP and SRTCP which are intended to provide encryption, message authentication and integrity, and replay protection. RTSP is a network control protocol designed for use in entertainment and communication systems to control streaming media servers. The protocol is used for establishing and controlling media sessions between end points, and it supports a transmission of commands, such as play, record, and pause.

II-A Video Coding Algorithms

Video encoding [25, 26, 27] begins with a raw image captured from a camera. The camera converts analog signals generated from striking photons into a digital image format. Video is simply a series of such images generally captured five to 120 times per second (referred to as frames per second or FPS). The stream of raw digital data is then processed by a video encoder in order to decrease the amount of traffic that is required to transmit a video stream. Video encoders use two techniques to compress a video: intra-frame coding (spatial compression) and inter-frame coding (temporal compression).

Intra-frame coding creates an I-Frame, a time periodic reference frame that is strictly intra-coded. The receiver decodes an I-frame without additional information. Intra-frame prediction exploits spatial redundancy, i.e., correlation among pixels within one frame, by calculating prediction values through extrapolation from already coded pixels for effective delta coding. The intra-coding process contains the following stages [28, 27]:

  1. Color conversion and chroma sub-sampling - The human eye has a lower sensitivity to color information than to dark-bright contrasts. First a conversion from RGB color space into YUV color components (e.g., YCbCr) is applied, and then, some of the chrominance information of the image is removed. This is a lossy stage.

  2. Partition - The actual frame is divided into non overlapping macroblocks.

  3. Transformation - A block is represented in the frequency domain.

  4. Quantization - This process is applied to the block to remove the insignificant part (high frequencies) and results in a compressed block with a smaller amount of information. This is a lossy stage.

  5. Entropy coding - Compression algorithms are used to represent the data by mapping frequently occurring patterns with a few bits and rarely occurring patterns with many bits (e.g., using Huffman coding).
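The transformation and quantization stages listed above can be illustrated with a minimal sketch. It assumes an 8x8 block, a naive DCT-II as the frequency transform, and a uniform quantization step of 16; these choices are illustrative assumptions rather than values taken from this paper, and a real encoder uses tuned quantization matrices followed by entropy coding.

```python
import math

N = 8  # block edge length (assumed for illustration)


def dct_2d(block):
    """Naive 2-D DCT-II of an N x N block given as a list of lists."""
    def alpha(k):
        return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            total = 0.0
            for x in range(N):
                for y in range(N):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * N))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * N)))
            out[u][v] = alpha(u) * alpha(v) * total
    return out


def quantize(coeffs, step=16):
    """Uniform quantization (lossy): small coefficients collapse to zero."""
    return [[int(round(c / step)) for c in row] for row in coeffs]


def nonzero_count(block, step=16):
    """Number of coefficients that survive quantization and must be coded."""
    quantized = quantize(dct_2d(block), step)
    return sum(1 for row in quantized for c in row if c != 0)


# A uniform block (e.g., a patch of a white wall) versus a checkerboard block.
flat = [[200] * N for _ in range(N)]
checker = [[255 if (x + y) % 2 == 0 else 0 for y in range(N)] for x in range(N)]

print("flat block coefficients to code:   ", nonzero_count(flat))
print("checker block coefficients to code:", nonzero_count(checker))
```

In this sketch the uniform block keeps only its DC coefficient, while the high-contrast block keeps several more, which is why frame content with little spatial redundancy costs more bits to encode.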

Over the years various optimizations have been introduced for each of the stages, including: (1) dynamic partitioning techniques, (2) novel prediction algorithms and varying the amount of reference frames, (3) different domain transformations, and (4) quantization methods. These optimizations boost the transmission rate from 1.5 Mbps (MPEG-1) to 150 Mbps (MPEG-4).

Inter-frame coding exploits temporal redundancy by using a buffer of neighboring frames that contains the last M number of frames and creates a delta frame. A delta frame is a description of a frame as a delta of another frame in the buffer. The receiver decodes a delta frame using a received reference frame. There are two major types of delta frames: P-Frame and B-Frame. P-Frames can use data from previous frames to decompress and are more compressible than I-Frames. B-Frames can use both previous and upcoming frames for data reference to get the greatest amount of data compression. The process of generating a delta frame consists of the following stages:

  1. Partition - dividing the actual frame into nonoverlapping macroblocks.

  2. Reference block matching - finding a similar block in another frame.

  3. Motion vector extraction - extracting the difference between the two blocks by calculating the prediction error.
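The three stages above can be sketched as an exhaustive block-matching search. This is a simplified illustration under stated assumptions: 4x4 blocks, a search radius of 2 pixels, and the sum of absolute differences (SAD) as the prediction error; the helper names are hypothetical and real encoders use larger macroblocks and faster search strategies.

```python
def sad(ref, cur, rx, ry, cx, cy, size):
    """Sum of absolute differences between a reference block and a current block."""
    return sum(abs(ref[ry + j][rx + i] - cur[cy + j][cx + i])
               for j in range(size) for i in range(size))


def best_match(ref, cur, cx, cy, size=4, radius=2):
    """Search the reference frame around (cx, cy) for the most similar block.

    Returns ((dx, dy), residual), where (dx, dy) points from the current
    block to the matched reference block and residual is the SAD that
    remains to be encoded. Ties are broken in favor of the shortest vector.
    """
    height, width = len(ref), len(ref[0])
    candidates = []
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            rx, ry = cx + dx, cy + dy
            if 0 <= rx <= width - size and 0 <= ry <= height - size:
                cost = sad(ref, cur, rx, ry, cx, cy, size)
                candidates.append((cost, abs(dx) + abs(dy), (dx, dy)))
    cost, _, vector = min(candidates)
    return vector, cost


def frame_with_square(left, top, size=4, width=12, height=12):
    """Synthetic frame: black background with one white square."""
    return [[255 if left <= x < left + size and top <= y < top + size else 0
             for x in range(width)] for y in range(height)]


reference = frame_with_square(2, 4)   # square at x = 2
current = frame_with_square(3, 4)     # same square shifted right by one pixel

moved = best_match(reference, current, 3, 4)    # block covering the square
static = best_match(reference, current, 8, 0)   # block of unchanged background
print("moved block: ", moved)
print("static block:", static)
```

A block that merely moved is described by a short motion vector with no residual, whereas content that has no good match in the reference frame leaves a large residual that must be transmitted.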

The order in which I, B, and P-Frames are arranged is specified by a GOP (group of pictures) structure. A GOP is a collection of successive pictures within a coded video stream. It usually consists of two I-Frames, one at the beginning and one at the end. In the middle of the GOP structure, P and B-Frames are ordered periodically. An example of a GOP structure, with I, P, and B-Frames, can be seen in Figure 1. Occasionally B-Frames are not used in real-time streaming due to delays.

Fig. 1: GOP structure - I,B, and P-frames

Intra-framing and inter-framing techniques were integrated into the MPEG-1 standard in the 1990s. Naturally, integrating these techniques into the protocol creates a variable bitrate (VBR) in the transmission of a video which is influenced by changes between frames and the content of the frame itself. A frame that can be represented as a set of prediction blocks of a similar neighboring frame (that has already been captured and transmitted) requires a smaller amount of data to be represented. The same thing is also true for video streams with a lot of redundancy in their frames. On the other hand, a frame with less similarity to other neighboring frames (e.g., as a result of the movement of several objects) necessitates that a larger amount of data be represented as a set of prediction blocks of other frames. The same thing is also true for a frame with less redundancy. Even if the video stream is encrypted at the transport layer (e.g., using TLS), the sizes of the packets and times of arrival are visible to anyone watching the network. In terms of cyber security such coupling between the captured stream and its cryptogram series can be used to extract meaningful information, as described below in Section III.

III Related Work

Fig. 2: From left to right: (a) a drone boxed in red, two people boxed in blue, and a window of an organization boxed in yellow, (b) illegitimate use of the drone camera - filming the organization, (c) a legitimate use for selfie purposes.

In this section we describe: (1) methods that exploit information leakage of an encrypted video stream to extract insights about the stream, and (2) methods for nearby drone detection. In the area of video hosting services, several studies exploited video stream information leakage to classify a video stream sent from a video hosting service (e.g., YouTube, Netflix, etc.) using Dynamic Adaptive Streaming over HTTP (DASH) protocols (a.k.a. MPEG-DASH). This attack model relies on two steps: (A) building a database of reference traces of video streams, and (B) classifying a query trace of an intercepted video stream by matching it to the database. Saponas et al. [15] analyzed Slingbox’s encrypted streams sent over wired and wireless connections to a client installed on a computer and managed to achieve an 89% accuracy in classifying 26 different movies by analyzing 40 minutes of the stream’s bitrate. Schuster et al. [17] classified video streams sent from Netflix, Amazon, YouTube, and Vimeo by analyzing burst patterns using convolutional neural networks.

Reed et al. [29, 30] classified Netflix’s video streams and reached accuracy over 90% by analyzing only eight minutes of the stream. In an area similar to video hosting services, Liu et al. [16] constructed robust video signatures using wavelet based analysis by analyzing traffic sent over the RTP of an IPTV. In the area of VoIP, Wright et al. showed that VBR leakage in encrypted VoIP communication can be used for the detection of the speaker’s language [31] and phrases [32]. White et al. [33] extended this approach to extract conversation transcripts. Wampler et al. [34] analyzed packets’ average inter-arrival time, size, and bandwidth, and the number of received packets in a window, in order to extract hand movement and ambient light changes from IP camera traffic sent over RTP.

In terms of the attack model, the described studies did not conduct their experiments using an external RF scanner (e.g., an NIC in monitor mode), so their attack model requires malware installed on the targeted network/computer in order to detect the video streams.

In the area of drone detection, various methods were introduced over the last few years to detect a nearby drone. Radar is a traditional method of detecting drones; however, the detection of small consumer drones requires expensive high-frequency radar systems [35]. Several studies suggested computer vision techniques to detect a drone by using a camera to analyze motion cues [36, 37]. However, these methods suffer from false positive detections due to: (1) the increasing number of drone models, and (2) the similarities between the movements of drones and birds [37]. In order to distinguish between birds and drones, several approaches analyzed the noise of the rotors captured by microphones [38, 37]. However, very expensive equipment is required in order to address the challenges arising from the ambient noise and the distance between the drone and the microphone [38]. A hybrid method that combines all of the methods discussed in this section was suggested by [39] in order to improve the accuracy of detection; however, such a method is very expensive to deploy. Two other studies proposed a method to detect a consumer/civilian drone controlled using Wi-Fi signals. The first method [40] analyzes the protocol’s signatures of the Wi-Fi connection between the drone and its controller. The second method [41] analyzes the received signal strength (RSS) using an RF scanner (e.g., Wi-Fi receiver).

None of the described methods for drone detection is able to determine whether the drone was used to invade privacy (by video recording the subject/target). More specifically, they are unable to understand what exactly is being recorded by the drone. In crowded areas, the difference between legitimate and illegitimate use is based on the angle of the drone’s camera. Figure 2 presents legitimate and illegitimate uses of a drone. All of the described methods [35, 36, 37, 38, 39, 40, 41] fail to distinguish between the act of taking a selfie and a privacy invasion attack. In contrast, our method does not share this weakness. In this research we demonstrate methods for: (1) determining exactly what is being recorded, and (2) providing a subject with proof that he/she was under surveillance.

| Transmitter | Purpose | Publication | Required Time | Analyzed Protocols | Interception |
|---|---|---|---|---|---|
| Video Hosting Services (Netflix, YouTube, etc.) | Classify video stream | [15] - USENIX 2007; [16] -; [17] - USENIX 2017; [30] - CODASPY 2017; [29] - CCNC 2016 | Minutes | DASH | Internal |
| IPTV | Classify video stream | [42] - GLOBECOM 2008 | Minutes | RTP | Internal |
| IP Camera | Lights on/off, hand movement | [34] - GLOBECOM 2015 | Immediate | RTP | Internal |
| PC | Language extraction [31]; phrase detection [32]; transcripts [33] | [31] - USENIX 2007; [32] - S&P 2008; [33] - S&P 2011 | - | VoIP | Internal |
| Drone | Detecting streamed POI | - | 10 seconds | RTP | External |

TABLE I: Information leakage from VBR - related work

IV Adversary Model & Proposed Detection Scheme

We consider an adversary operator that uses a drone to film a target (subject or organization) for:

  1. Self-entertainment - the attacker considers a privacy invasion attack as a form of entertainment and performs the attack to satisfy his/her curiosity.

  2. Malicious purpose - the attacker uses the drone’s video camera to collect information about the target for malicious purposes. For example, in cases in which the target is an organization, the malicious purpose can be to break into an organization (the drone can be used to count the number of subjects that leave the building). Another malicious purpose is using the drone to disable a secret facility (in which the captured video is used to map the organizational assets). In cases in which the target is a subject, the purpose can be to understand whether the subject is cheating on his/her spouse by using the drone’s video camera to spy on the subject (as was shown in [5]).

The interceptor’s goal is to determine whether a target is being captured by a drone’s video camera. We assume that an interceptor has detected the presence of a drone nearby (using one of the known methods for drone detection [35, 36, 37, 38, 39, 40, 41]) or by analyzing suspicious access points. In addition we assume that the interceptor owns an RF scanner (e.g., an NIC) that is connected to a computer with an adequate antenna that captures the traffic being sent from the drone to the controller. The interceptor initiates a physical stimulus aimed at the captured target in a random pattern and analyzes the intercepted traffic in a detection model. Figure 3 presents the proposed target detection scheme and the parties involved.

Fig. 3: Proposed target detection scheme.

IV-A Detection Model

Algorithm 1 presents the target detection model. It receives as input: (1) the intercepted Wi-Fi stream (that was captured by the NIC), (2) a watermarking pattern (binary sequence) that was modulated by the physical stimulus, and (3) a window for each bit that was modulated. In addition, the algorithm receives the (4) start and (5) end time of the watermarked pattern (in epoch representation). First, the intercepted Wi-Fi traffic is converted to a bitrate array (line 3). A pre-stimulus interval is extracted for the following time period: four seconds before the first physical stimulus begins until the next physical stimulus starts (line 5). A stimulus interval is extracted for the following time period: four seconds from the time at which the first physical stimulus begins (line 6). The average bitrate of each of the two intervals is then calculated (lines 8-9). A cutoff is calculated as the midpoint between the two averages (line 10). Each bit interval is extracted from the bitrate array, classified as a 0 or 1 bit, and concatenated to extractedPattern (lines 11-17). Finally, a Boolean result is returned by comparing watermarkingPattern to extractedPattern (line 18).

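Input:
      1) intercepted-WiFi-Stream // intercepted by the NIC
      2) watermarkingPattern // binary sequence (e.g., 101..01)
      3) window // milliseconds for single bit modulating
      4) beginPatternTime // begin time of pattern (epoch)
      5) endPatternTime // end time of pattern (epoch)
Output:
      Boolean result

1:procedure underDetection?
2:
3:     bitrate = bitrate array of intercepted-WiFi-Stream
4:
5:     before = bitrate interval of four seconds before the first stimulus
6:     after = bitrate interval of four seconds from the first stimulus
7:
8:     avgBefore = average(before)
9:     avgAfter = average(after)
10:     cutoff = (avgBefore + avgAfter) / 2
11:     for (i = beginPatternTime; i < endPatternTime; i = i+window) do
12:         interval = bitrate interval from i to i+window
13:         avg = average(interval)
14:         if avg > cutoff then
15:              append 1 to extractedPattern
16:         else
17:              append 0 to extractedPattern
18:     return (watermarkingPattern == extractedPattern)
Algorithm 1 Privacy Invasion Attack Detection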
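The detection model described above can be sketched in Python. This is a minimal illustration, not the authors' implementation. It assumes that the intercepted stream has already been reduced to (timestamp in epoch milliseconds, packet length in bytes) pairs, that the first bit of the pattern is 1, and that this first stimulus lasts at least four seconds so it can serve as the calibration interval. The function names and the synthetic traffic generator are hypothetical.

```python
def mean_rate(packets, t0_ms, t1_ms):
    """Average bytes per millisecond observed in the window [t0_ms, t1_ms)."""
    total = sum(length for ts, length in packets if t0_ms <= ts < t1_ms)
    return total / (t1_ms - t0_ms)


def under_detection(packets, watermarking_pattern, window_ms,
                    begin_ms, end_ms, calib_ms=4000):
    """Return True if the bitrate follows the watermarking pattern."""
    # Calibration: four seconds before and four seconds after the first stimulus.
    before_avg = mean_rate(packets, begin_ms - calib_ms, begin_ms)
    after_avg = mean_rate(packets, begin_ms, begin_ms + calib_ms)
    cutoff = (before_avg + after_avg) / 2

    # Classify each bit window as 1 (above the cutoff) or 0 (otherwise).
    extracted = []
    for t in range(begin_ms, end_ms, window_ms):
        avg = mean_rate(packets, t, t + window_ms)
        extracted.append("1" if avg > cutoff else "0")
    return "".join(extracted) == watermarking_pattern


def synthetic_packets(pattern, window_ms, begin_ms,
                      idle_len=500, stim_len=1500, lead_ms=4000, step_ms=10):
    """Hypothetical traffic: larger packets while the stimulus is on."""
    packets = []
    end_ms = begin_ms + len(pattern) * window_ms
    for ts in range(begin_ms - lead_ms, end_ms, step_ms):
        stimulus_on = ts >= begin_ms and pattern[(ts - begin_ms) // window_ms] == "1"
        packets.append((ts, stim_len if stimulus_on else idle_len))
    return packets


begin = 100_000
traffic = synthetic_packets("1011", 4000, begin)
print(under_detection(traffic, "1011", 4000, begin, begin + 16_000))
print(under_detection(traffic, "1101", 4000, begin, begin + 16_000))
```

With the synthetic traffic, the pattern that was actually modulated is recovered, while a different pattern of the same length is rejected. Real traffic is noisier, so the window length and calibration period would need to be tuned empirically.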

IV-B Intercepting FPV Channels

As was discussed in Section II, many commercial drones provide FPV capabilities over Wi-Fi channels. In this case, the drone/controller exposes a secured access point that both parties connect to using authentication. The video stream is sent over Wi-Fi communication and can be intercepted using an NIC (in monitor mode). An antenna can be used by an interceptor in order to extend the interception range. DJI Mavic, DJI Spark, and Parrot Bebop 2 drones use Wi-Fi Protected Access II (WPA2) protocols to secure their networks. The access points of FPV channels can be detected by changing the NIC mode of a laptop to monitor mode (or using a software defined radio instead) and using dedicated tools (such as airmon-ng, inSSIDer, etc.), which can even detect hidden networks, in order to find suspicious access points. After identifying suspicious access points, a specific access point can be found by searching for known BSSIDs or MAC IDs of drones. In situations in which the BSSIDs or MAC IDs have been changed, the interceptor can detect the type of drone by performing forensic analysis of the access point communication, as was suggested by Peacock et al. [40]. Another option is to analyze each of the access points within range of the target in the detection model.

V Influence of Physical Stimulus

In this section we investigate the influence of physical stimuli (that change pixels in the streamed video) on the transmitted traffic in a lab setup. All of the methods described in this section rely on a simple principle: pixel changes between consecutive frames require data to encode, so changing a large number of pixels produces more data to encode and causes the FPV’s bitrate to increase (inter-frame coding). We show how the act of flickering an object within a streamed video changes the FPV’s bitrate and test the influence of dividing the flickering object and changing its flickering colors.

V-A Lab Experiments

Fig. 4: Bitrate of captured Wi-Fi signals of the white wall at different resolutions

In the preliminary lab experiments described below we assess the influence of various changes to the pixels on the traffic using a Mavic Pro [43] consumer drone. The drone was configured to transmit video at a rate of 24 frames per second (FPS), its default configuration. We used a laptop (Dell Latitude 7480) that runs Kali Linux with a standard NIC (Intel Dual-Band Wireless-AC 8265 Wi-Fi [44]) for interception. We enabled the monitor mode on the NIC using airmon-ng [45] and intercepted the encrypted video traffic of the Mavic’s AP. The Mavic’s AP uses 802.11n to transfer the data between the connected parties. From the external interception perspective, we were able to extract only the second layer (data link layer) meta-data which includes the following: BSSID, source MAC address, destination MAC address, and packet length. The payload of the packet is encrypted.

We started by analyzing the Mavic’s traffic when the captured video is steady by placing the Mavic in front of a white wall. Figure 4 shows the bitrate of the traffic that was transmitted from the drone for a period of 240 seconds and intercepted by a laptop’s NIC (in monitor mode) at 1000 ms aggregation at three different resolutions and rates (720p 30 FPS, 720p 60 FPS, and 2K 30 FPS). As can be seen from the results in Figure 4, the bitrate is fairly stable for each resolution over time; however, higher resolutions generate higher bitrates.
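The aggregation of intercepted frames into a bitrate series can be sketched as follows. This is a hedged illustration that assumes the capture has already been exported as (timestamp in milliseconds, BSSID, frame length in bytes) tuples; the tuple layout, the function name, and the example BSSID are assumptions, and only the data link layer metadata mentioned above is used.

```python
from collections import defaultdict


def bitrate_series(frames, bssid, bin_ms=1000):
    """Sum frame lengths per time bin for one access point.

    frames: iterable of (timestamp_ms, bssid, length_bytes).
    Returns a list of byte counts, one per bin, from the first bin that
    contains traffic to the last; empty bins in between are reported as 0.
    """
    bins = defaultdict(int)
    for ts, frame_bssid, length in frames:
        if frame_bssid.lower() == bssid.lower():
            bins[ts // bin_ms] += length
    if not bins:
        return []
    first, last = min(bins), max(bins)
    return [bins.get(i, 0) for i in range(first, last + 1)]


# Hypothetical capture: four frames from the drone's AP and one from another AP.
capture = [
    (0, "aa:bb:cc:00:00:01", 100),
    (500, "AA:BB:CC:00:00:01", 200),
    (1200, "aa:bb:cc:00:00:01", 50),
    (1300, "11:22:33:44:55:66", 999),
    (3100, "aa:bb:cc:00:00:01", 10),
]
print(bitrate_series(capture, "aa:bb:cc:00:00:01"))
```

The 1000 ms bin width mirrors the aggregation used for Figure 4; a shorter bin gives finer time resolution at the cost of a noisier series.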

In the rest of the experiments described in this section, we placed the Mavic in front of a laptop in order to expose the drone to specific images/objects on the monitor. The experimental setup is presented in Figure 5.

Fig. 5: Lab Setup - The DJI Mavic is placed in front of a laptop monitor. A second laptop is used to intercept the traffic (using its NIC in monitoring mode).

First, we investigated the effect of changing the percentage of captured pixels on the traffic. We wrote a Python script that presents a flickering rectangle in the middle of the monitor. We tested the effect of various rectangle sizes on the traffic that was sent from the drone to the controller using external interception (the interception laptop was not connected to an access point; its NIC was in monitor mode). The rectangle flickered from white to black, and vice versa, on a white background for 40 seconds.
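The geometry of such a stimulus can be sketched as below. This is not the script used in the experiments; it only shows one way to size a centered rectangle so that it covers a chosen fraction of the screen. The 1920x1080 resolution and the function name are assumptions, and drawing the rectangle is left to whichever GUI toolkit is available.

```python
import math


def centered_rect(screen_w, screen_h, fraction):
    """Return (x, y, w, h) of a centered rectangle covering roughly
    `fraction` of the screen area while keeping the screen's aspect ratio."""
    scale = math.sqrt(fraction)          # scale both edges by sqrt(area ratio)
    w = round(screen_w * scale)
    h = round(screen_h * scale)
    x = (screen_w - w) // 2
    y = (screen_h - h) // 2
    return x, y, w, h


# Rectangle sizes for several percentages of changing pixels.
for percent in (1.2, 2.5, 5, 10, 25, 50, 75, 100):
    x, y, w, h = centered_rect(1920, 1080, percent / 100)
    covered = 100 * w * h / (1920 * 1080)
    print(f"{percent:>5}% requested -> {w}x{h} at ({x}, {y}), {covered:.2f}% covered")
```

Because edge lengths are rounded to whole pixels, the covered area can differ slightly from the requested percentage.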

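| Percentage of changing pixels | 0% | 1.2% | 2.5% | 5% | 10% | 25% | 50% | 75% | 100% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bitrate (KB) | 120 | 130 | 135 | 161 | 170 | 230 | 260 | 290 | 320 |
| Delta Bitrate (KB) | 0 | 10 | 15 | 41 | 50 | 110 | 140 | 170 | 200 |
| Delta Bitrate (%) | 0% | 8% | 13% | 34% | 42% | 92% | 117% | 142% | 167% |

TABLE II: Influence of changing the amount of pixels on the traffic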

As can be seen from the results presented in Table II, there is a strong connection between the percentage of pixels changed and the volume of traffic that was sent from the drone. This phenomenon occurs because a larger amount of changing pixels results in a larger amount of changing macroblocks. A larger amount of changing macroblocks means that the encoder must use more data to encode the delta frames; this in turn increases the amount of traffic. In addition, as the results show, very small changes (< 2.5%) are effectively absorbed and merged with the background noise.

Next, we aimed to determine the effect of separating the pixels and dividing them across several objects (rather than centralizing them on one object) on the amount of traffic generated (given a fixed number of changed pixels). In this series of experiments we fixed the amount of changing pixels but presented a different number of rectangles, dividing the fixed number of pixels to form smaller equal sized rectangles (2, 4, 8, 16, 32), and positioned the rectangles in different places on the monitor. As can be seen from the results presented in Table III, there is a strong connection between increasing the number of rectangles and an increase in the amount of traffic that is required to encode the change. We believe that this phenomenon can be explained as follows: dividing a single rectangle (which centralizes the fixed number of pixels) into smaller pieces (thereby dividing the fixed number of pixels) and separating them from each other on the monitor results in the intersection with more macroblocks that change compared to a centralized object of the same size. Therefore, this requires more data to encode and increases the amount of traffic.

| Number of Pieces | 1 | 2 | 4 | 8 | 16 | 32 |
| --- | --- | --- | --- | --- | --- | --- |
| Bitrate (KB) | 250 | 260 | 275 | 300 | 325 | 340 |
| Bitrate Delta (KB) | 0 | 10 | 25 | 50 | 75 | 90 |
| Bitrate Delta (%) | 0% | 4% | 10% | 20% | 30% | 36% |

TABLE III: Influence of dividing an area into pieces on the traffic
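The macroblock explanation for Table III can be illustrated with a small counting sketch. It assumes 16x16-pixel macroblocks on a fixed grid (a common choice in H.264-style encoders, not something stated for the Mavic) and uses hypothetical rectangle positions; it only counts the grid cells touched and does not model the encoder itself.

```python
def macroblocks_touched(rects, mb=16):
    """Return the number of distinct mb x mb macroblocks intersected by any rectangle.

    rects: iterable of (x, y, width, height) in pixels, top-left origin.
    """
    touched = set()
    for x, y, w, h in rects:
        # Walk over every grid cell the rectangle overlaps.
        for bx in range(x // mb, (x + w - 1) // mb + 1):
            for by in range(y // mb, (y + h - 1) // mb + 1):
                touched.add((bx, by))
    return len(touched)


# One 64x64 rectangle aligned to the grid: 4096 changing pixels.
single = [(0, 0, 64, 64)]
# The same 4096 pixels as sixteen 16x16 pieces, each straddling four macroblocks.
pieces = [(8 + 48 * i, 8 + 48 * j, 16, 16) for i in range(4) for j in range(4)]

single_count = macroblocks_touched(single)
pieces_count = macroblocks_touched(pieces)
```

With these placements the single rectangle touches 16 macroblocks while the scattered pieces touch 64, even though the number of changing pixels is identical, which is consistent with the trend in Table III.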

Finally, we assessed whether the objects’ position on the monitor affects the traffic. In order to do this, we conducted an experiment in which we flickered a rectangle that is one fourth the size of the screen in four different places on the monitor: top right, top left, bottom left, and bottom right. As can be seen from the results presented in Table IV, each of the flickering rectangles had the same effect on the traffic. Therefore, we believe that when the objects’ size remains fixed, the location on the monitor has no effect, since the same number of changing macroblocks is involved.

| | White Screen | Top Left | Top Right | Bottom Left | Bottom Right |
| --- | --- | --- | --- | --- | --- |
| Bitrate (KB) | 120 | 195 | 195 | 195 | 195 |
| Bitrate Delta (KB) | 0 | 75 | 75 | 75 | 75 |
| Bitrate Delta (%) | 0% | 62.5% | 62.5% | 62.5% | 62.5% |

TABLE IV: Influence of location of an object on the traffic

From this set of experiments we were able to conclude that (1) the larger the number of changed pixels, the greater the influence on traffic (larger number of changing macroblocks), and (2) the influence is even greater if the pixels are not clustered together (intersection with a larger number of macroblocks).

After investigating the effect of the number of pixels changed, the objects’ location on the monitor, and the difference between keeping the pixels centralized vs. dividing them, we moved on to assess the effect of the object’s color on the amount of traffic. We conducted an experiment in which we flickered different colored rectangles (black, green, blue, red, orange, yellow, pink, purple, and white) of the same size on the monitor.

| | No flicker | Black | Red | Green | Orange | Yellow | Gray | Purple | Blue | Pink |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bitrate (KB) | 100 | 325 | 325 | 325 | 325 | 325 | 325 | 325 | 325 | 325 |
| Delta Bitrate (KB) | 0 | 225 | 225 | 225 | 225 | 225 | 225 | 225 | 225 | 225 |
| Delta Bitrate (%) | 0% | 225% | 225% | 225% | 225% | 225% | 225% | 225% | 225% | 225% |

TABLE V: Influence of changing the color of an object on the traffic

As can be seen from the results presented in Table V, each color had the same effect on the amount of traffic that was sent from the drone. From this experiment we can conclude that no color significantly outperforms another.

Rather than using the RGB color space, video encoders use different color spaces to represent a picture including: YCbCr, YCoCg, etc. Video encoders transform a captured picture from an RGB color space to a luma value (denoted as Y) and two chroma values. The Y component can be stored with a high resolution or transmitted at a high bandwidth, and the two chroma components can be bandwidth-reduced, subsampled, compressed, or otherwise treated separately for improved system efficiency. Considering this information, we then tested the effect of different brightness levels of the same color on the traffic. To do so, we conducted an experiment in which we flickered two colors (green and blue) at five different brightness levels.
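The RGB-to-luma transformation mentioned above can be sketched as follows. This is an illustration only: it assumes the full-range BT.601 coefficients used by JPEG/JFIF, which may differ from the matrix used by the drone's encoder, and the helper `scale_brightness` is a hypothetical stand-in for however the brightness levels were produced in the experiment.

```python
def rgb_to_ycbcr(r, g, b):
    """Convert 8-bit RGB to full-range YCbCr using BT.601 (JPEG/JFIF) coefficients."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def scale_brightness(rgb, factor):
    """Scale an RGB triple toward black; factor 1.0 leaves it unchanged."""
    return tuple(round(c * factor) for c in rgb)


GREEN = (0, 255, 0)
# Luma of full green and of a darker shade of the same color.
y_full = rgb_to_ycbcr(*GREEN)[0]
y_dark = rgb_to_ycbcr(*scale_brightness(GREEN, 0.2))[0]
```

Under these assumptions a brighter shade has a larger Y value, so a flicker between that shade and a dark background produces a larger luma swing than a darker shade would.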

| Brightness | No flicker | 0% | 20% | 40% | 60% | 80% |
| --- | --- | --- | --- | --- | --- | --- |
| Bitrate (KB) | 100 | 300 | 300 | 310 | 320 | 350 |
| Bitrate Delta (KB) | 0 | 200 | 200 | 210 | 220 | 250 |
| Bitrate Delta (%) | 0% | 200% | 200% | 210% | 220% | 250% |

TABLE VI: Influence of brightness level of an object on the traffic

As can be seen from the results presented in Table VI, increasing the level of brightness of the object increases the amount of traffic sent from the drone to the controller. The results obtained from the captured traffic were identical for blue and green colors. From these results, we concluded that brighter shades outperform darker shades of the same color.

In the next section we leverage our findings to detect privacy invasion attacks targeting a private house and a subject.

VI Evaluation

In this section we present the evaluation for target detection in two use cases: when the target is a private house and when the target is a subject.

VI-A Detecting a Privacy Invasion Attack Against a Building

In this subsection we demonstrate a method of securing a building from privacy invasion attacks by triggering a physical stimulus. As was shown in the lab experiments, flickering objects influence the amount of traffic that is required to encode the flickering objects. Taking this into consideration, we show how a smart film (a.k.a. smart glass) can be used as a means of triggering a physical stimulus in order to detect whether a building is being tracked by a drone. We purchased a smart film with an RF controller and attached it to a window of a private house that we wanted to secure. The smart film switches between two modes: transparent and white given a radio command sent from its controller.

We wrote a simple Python program that uses a software defined radio (HackRF) to modulate a given signal using on-off keying (OOK) modulation. Each bit of the signal was modulated using a window of two seconds. For 1 bits we flickered the smart film, and for 0 bits we switched the smart film so it was transparent for the entire period of time. We randomly selected the sequence 111100001111111000000 as a signal to be modulated using the physical stimulus.
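The on-off keying scheme described above can be sketched as a schedule generator. This is not the authors' program: it only maps each bit to a two-second action and leaves out the HackRF transmission entirely; the function name `ook_schedule` and the action labels are hypothetical.

```python
def ook_schedule(bits, window_s=2.0):
    """Map a bit string to (start_time_s, duration_s, action) tuples.

    A '1' bit means flicker the film for the whole window; a '0' bit means
    keep it transparent for the whole window.
    """
    schedule = []
    for i, bit in enumerate(bits):
        if bit not in "01":
            raise ValueError(f"unexpected symbol {bit!r}")
        action = "flicker" if bit == "1" else "transparent"
        schedule.append((i * window_s, window_s, action))
    return schedule


# The sequence used in the smart film experiment.
SIGNAL = "111100001111111000000"
plan = ook_schedule(SIGNAL)
```

Each entry in `plan` could then be handed to whatever code drives the film's RF controller.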

In order to demonstrate an illegitimate use of the drone, we asked an operator to fly the DJI Mavic and film his neighbor’s garden and private house from the operator’s property. The default resolution and FPS were used (720p and 24 FPS). We purchased a parabolic antenna (TP-Link TL-ANT2424B 2.4GHz 24dBi Grid Parabolic) and connected it to a laptop to intercept the drone’s outgoing traffic sent over the access point. We ran our Python code to create a physical stimulus using the smart film. Figure 6 presents two snapshots that were taken from the streamed video and the results of applying Algorithm 1 to the intercepted traffic. The peaks in the bitrate correlate with the times at which the smart film was flickered. The flicker that was used to modulate the 1 bits increased the bitrate by a factor of 1.5-2, from an average of 300-350 KB to 450-570 KB. As can be seen from the intercepted traffic, the flicker produced by the smart film watermarked the bitrate with the pattern that was programmed in the Python code.
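The way a flicker pattern shows up in the bitrate can be illustrated with a simple threshold decoder. This is a sketch and not Algorithm 1, which is defined elsewhere in the paper; the threshold of 400 KB is an assumed value chosen between the resting and flickering bitrates reported above, and the trace is synthetic.

```python
def decode_bits(bitrate, samples_per_bit, threshold):
    """Recover one bit per window: '1' if the window's mean bitrate exceeds threshold."""
    bits = []
    for start in range(0, len(bitrate) - samples_per_bit + 1, samples_per_bit):
        window = bitrate[start:start + samples_per_bit]
        mean = sum(window) / samples_per_bit
        bits.append("1" if mean > threshold else "0")
    return "".join(bits)


# Synthetic 1-second samples in KB: about 320 at rest, about 500 while the film flickers.
trace = [500, 510, 320, 330, 490, 505, 325, 315]
# Two samples per bit matches the two-second window used for the smart film.
decoded = decode_bits(trace, samples_per_bit=2, threshold=400)
```

Comparing `decoded` with the transmitted sequence is the basic idea behind checking whether the target is in the streamed view.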

Fig. 6: A smart film controlled via a HackRF connected to a laptop.
| Duration of Stimulus (in seconds) | Target’s house without stimulus | Neighbor’s house |
| --- | --- | --- |
| 2 | 0.480 | 0.535 |
| 4 | 0.294 | 0.294 |
| 6 | 0.202 | 0.200 |
| 8 | 0.154 | 0.145 |
| 10 | 0.027 | 0.032 |
| 12 | 0.015 | 0.017 |
| 14 | 0.008 | 0.011 |
| 16 | 0.003 | 0.006 |
| 18 | 0.000 | 0.002 |
| 20 | 0.000 | 0.002 |
| 22 | 0.000 | 0.002 |
| 24 | 0.000 | 0.001 |
| 26 | 0.000 | 0.001 |
| 28 | 0.000 | 0.001 |
| 30 | 0.000 | 0.001 |
| 32 | 0.000 | 0.000 |
| 34 | 0.000 | 0.000 |
| 36 | 0.000 | 0.000 |

TABLE VII: FPR based on applying Algorithm 1 on the same house without physical stimulus and another house

In order to prove that the physical stimulus is the cause of the traffic change, we conducted another experiment in which we streamed the same house for 20 minutes without initiating a physical stimulus. In addition, in order to prove that this effect could not be reproduced with another house (that is not the target), we streamed another house (the neighbor’s house) for 20 minutes. In both cases we intercepted the traffic using the same experimental setup. We applied Algorithm 1 to the intercepted traffic and calculated the false positive rate (FPR) as a function of the duration of the physical stimulus for the original signal 111100001111111000000. As can be seen from the results presented in Table VII, a pattern of 10 seconds is sufficient to rule out false detections caused by filming another target (with an FPR of 0.032). A pattern of 10 seconds is also sufficient to rule out coincidental matches that occur without any physical stimulus as a result of wind or other physical movement (with an FPR of 0.027). Figure 7 presents the FPR as a function of the duration of the physical stimulus.

Fig. 7: A graph of the FPR as a function of the duration of the physical stimulus.
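One plausible way to estimate an FPR of this kind is to slide a prefix of the pattern over bits decoded from stimulus-free traffic and count accidental matches. The sketch below is an assumption about the procedure, not a reproduction of Algorithm 1 or of the values in Table VII; `false_positive_rate` and the toy background string are hypothetical.

```python
def false_positive_rate(background_bits, pattern, n_bits):
    """Fraction of offsets in stimulus-free traffic that match the first n_bits of pattern."""
    prefix = pattern[:n_bits]
    offsets = len(background_bits) - n_bits + 1
    if offsets <= 0:
        raise ValueError("background trace is shorter than the pattern prefix")
    # Count every offset at which the background happens to equal the prefix.
    hits = sum(
        background_bits[i:i + n_bits] == prefix for i in range(offsets)
    )
    return hits / offsets


# Toy background decoded from traffic with no stimulus.
background = "1101110"
fpr_short = false_positive_rate(background, "1111", n_bits=2)
fpr_long = false_positive_rate(background, "1111", n_bits=4)
```

In this toy case the two-bit prefix matches at half of the offsets while the four-bit prefix never matches, which mirrors the way the FPR falls as the stimulus duration grows.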

VI-B Detecting a Privacy Invasion Attack Against a Subject

In this section we demonstrate a method to secure a subject from a privacy invasion attack using a physical stimulus. We show how a cyber-shirt can be used as a means of triggering a physical stimulus in order to detect whether a subject is being tracked by a drone. We connected an LED strip to a microcontroller (Arduino Uno) and attached them both to a white shirt to create a cyber-shirt. We programmed the microcontroller to modulate the pattern 01010011 01001111 01010011 (SOS in ASCII) using the LED strip as a light sequence (the physical stimulus). Again, we used on-off keying to modulate the binary sequence. Each bit was modulated using a window of five seconds. For 1 bits we flickered the LED strip, and for 0 bits we switched the LED strip off. A DJI Mavic was used by an operator to record the person wearing the cyber-shirt. We used the same experimental setup as in the previous experiment. The video of the experiment was recorded by the DJI Mavic.
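The bit pattern quoted above follows directly from the 8-bit ASCII codes of the letters S, O, and S. The short sketch below reproduces it; the function name `ascii_to_bits` is hypothetical and the microcontroller code itself is not shown.

```python
def ascii_to_bits(text):
    """Encode text as space-separated 8-bit ASCII codes."""
    return " ".join(format(ord(ch), "08b") for ch in text)


pattern = ascii_to_bits("SOS")
# Number of bits to modulate, ignoring the separating spaces.
n_bits = len(pattern.replace(" ", ""))
# Total duration of one pass over the pattern with five-second windows.
duration_s = n_bits * 5
```

Here `pattern` is `01010011 01001111 01010011`, and with 24 bits at five seconds each a full pass over the pattern takes 120 seconds.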

Fig. 8: Cyber-shirt experiment

Figure 8 presents the results of applying Algorithm 1 to the intercepted traffic. The peaks in the bitrate correlate to the time at which the LED strip was flickered. The flicker that was used to modulate the 1 bits increased the bitrate by 3-4 times, from an average of 20 KB to an average of 60-80 KB. As can be seen in Figure 8c, the light sequence that was produced by the LED strip influenced the bitrate in a way that watermarked the bitrate to the given pattern that was programmed in the microcontroller.

In order to prove that the physical stimulus is the cause of the traffic change, we conducted one more experiment in which we streamed the same subject for 20 minutes without any physical stimulus conducted by the shirt. In addition, in order to prove that this effect could not be reproduced by another object nearby (that is not the target), we streamed the area for 20 minutes. In both cases we intercepted the traffic using the same experimental setup. We applied Algorithm 1 to the intercepted traffic and calculated the false positive rate as a function of the duration of the physical stimulus for the original signal 01010011 01001111 01010011. As can be seen from the results presented in Table VIII, a pattern of 10 seconds is sufficient for excluding detection errors that occur when filming another target (with an FPR of 0.067). In addition, a pattern of 10 seconds is sufficient for excluding errors that are coincidental and the result of wind or other physical movement (with an FPR of 0.066). The graph in Figure 9 presents the FPR as a function of the duration of the physical stimulus.

| Duration of Stimulus (in seconds) | The Subject without Stimulus | Nearby Area |
| --- | --- | --- |
| 5 | 0.233 | 0.220 |
| 10 | 0.066 | 0.067 |
| 15 | 0.032 | 0.035 |
| 20 | 0.026 | 0.022 |
| 25 | 0.008 | 0.003 |
| 30 | 0.007 | 0.001 |
| 35 | 0.005 | 0 |
| 40 | 0.001 | 0 |
| 45 | 0 | 0 |

TABLE VIII: FPR based on applying Algorithm 1 on the same subject without a physical stimulus and another subject
Fig. 9: A graph of the FPR as a function of the duration of the physical stimulus.

VII Countermeasures

In this section we review countermeasures against privacy invasion attacks and more specifically, discuss countermeasures that can be used against our methods. In recent years several methods have been suggested to disable a drone. Son et al. [46] presented a method to spoof the drone’s gyroscope using ultrasound. Davidson et al. [47] spoofed the drone’s downward camera using a laser and projector that was projected on different surfaces. Luo et al. [48] showed different methods for hijacking and disabling a drone: (1) using GPS spoofing of no-fly zones, and during the return to home task, and (2) FPV jamming of the video stream. Rodday et al. [49] showed techniques to hijack a $30,000 drone by replaying maneuvering commands to the drone, and Kamkar et al. [50] performed a deauthentication attack on a Parrot AR.Drone.

In order to evade detection by an interceptor that uses our methods, an attacker can use a drone equipped with two video cameras that are directed at different angles. The first camera provides the FPV channel and is used by the operator for maneuvering. The second camera is positioned 180° from the first camera, stores the captured video on a memory card, and is used by the operator to film a target. Using a drone with two cameras has one primary advantage and one primary disadvantage. The advantage is that the second camera, which is used for recording the target, is immune to our detection methods, so an interceptor won’t be able to detect the target POI by initiating a physical stimulus. The disadvantage is that using different cameras for maneuvering and for recording results in blind or semi-blind recording: the operator won’t have full control over the specific object that is being recorded, because the FPV channel only streams the video from the first camera, which is not directed at the target.

Another method that an operator can use to evade detection is redundancy. In this case, the operator programs the drone to generate traffic at a fixed bitrate. This can be done by transmitting the raw data captured by the camera (instead of delta frames) at a low quality. Another option for implementing a fixed bitrate is to encode the captured data using delta frames (in order to support high resolutions) and to compensate for the difference in bitrate between the B- and P-frames and the I-frames with additional mock encrypted packets that are sent from the drone and discarded by the receiver. However, the main disadvantage of this method is that calculating the required volume of mock traffic will cause delays on the FPV channel.
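The fixed bitrate idea can be sketched as a padding calculation: for each interval, the drone would add enough mock bytes to bring the total up to a constant target. This is an illustration of the arithmetic only, under the assumption that the target is at least as large as the peak real traffic; the function name and the sample values (in KB per interval) are hypothetical.

```python
def padding_per_interval(payload_bytes, target_bytes):
    """Return mock bytes to add in each interval so every interval totals target_bytes."""
    peak = max(payload_bytes, default=0)
    if peak > target_bytes:
        raise ValueError("target is below the peak real traffic")
    return [target_bytes - b for b in payload_bytes]


# Hypothetical real traffic per interval and a target equal to the peak.
real = [120, 320, 200]
mock = padding_per_interval(real, target_bytes=320)
# What an interceptor would observe after padding.
observed = [r + m for r, m in zip(real, mock)]
```

After padding, every interval carries the same volume, so a flicker in the scene no longer changes what an interceptor observes; the cost is the extra traffic and the computation delay noted above.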

VIII Conclusions

In this research we demonstrated methods that can be used to detect whether an object has been captured and is being streamed from a drone camera to its controller. While many methods have been suggested in recent years to detect the presence of a nearby drone, this research is the first to introduce methods that distinguish between the legitimate and illegitimate (for purposes of privacy invasion) use of a nearby drone. These days, consumer drones are used to conduct privacy invasion attacks throughout the world; however, no tool currently exists for showing that a specific drone is being used to stream a target. Perhaps in the future, legislation will consider a watermarked PCAP as evidence, opening up new opportunities for implementing our methods in a system or service.

IX Acknowledgments

We would like to thank Roei Cohen and Ido Lavi for their help filming the videos.

References