
Selfie Drone Stick: A Natural Interface for Quadcopter Photography

A physical selfie stick extends the user's reach, enabling the creation of personal photos that include more of the background scene. Conversely, a quadcopter can capture photos from distances unattainable for a human, but teleoperating a quadcopter to a good viewpoint is a non-trivial task. This paper presents a natural interface for quadcopter photography, the Selfie Drone Stick, which allows the user to guide the quadcopter to the optimal vantage point based on the phone's sensors. The user points the phone once, and the quadcopter autonomously flies to the target viewpoint based on the phone camera and IMU sensor data. Visual servoing is achieved through the combination of a dense neural network object detector that matches the image captured from the phone camera to a bounding box in the scene and a Deep Q-Network controller that flies to the desired vantage point. Our deep learning architecture is trained with a combination of real-world images and simulated flight data. Integrating the deep RL controller with an intuitive interface provides a more positive user experience than a standard teleoperation paradigm.






Although there has been prior work on the problem of improving quadcopter teleoperation [alabachi2018intelligently, lan2017xpose], and photography [cheng2015aerial, Sukthankar-Rey-FLAIRS2016], the premise behind most of these investigations has been that the user must learn the proposed interface paradigm. Our philosophy is to make the users learn as little as possible and the system learn as much as necessary. Our Selfie Drone Stick (SDS) interface mimics the functionality of a selfie stick, enabling the user to control the quadcopter with only a mobile phone and a simple gesture.

The goal is to generate a well-framed selfie of the user against the desired background, as if it were taken using a virtual selfie stick extending from the user in the direction of the handheld smart mobile device (SMD). The user specifies the desired composition using an ordinary selfie captured using the SMD, where the relative orientation of the SMD directly specifies the azimuth and elevation of the vantage point while the desired distance is indirectly specified by the size of the user’s face in the SMD frame. The drone flies to the optimal vantage point to capture the selfie using a learned controller. The drone mirrors the bearing of the SMD as measured by its onboard IMU and selects an appropriate distance such that the user’s body visually occupies the same area in the drone selfie as the user’s face did in the SMD frame. The resulting photos frame the user against the entire background, just as if the user had used a very long selfie stick to compose the photograph.

Instead of using deep RL to learn a direct control policy based on the raw pixel data as was done in [mnih2015human], our controller utilizes an abstract state space representation. First, a dense neural network object detector (DUNet) is trained to detect the human face (which is prominent in the phone camera image) and also the human body (visible from the drone’s viewpoint). We pre-train DUNet on PASCAL VOC [Everingham15] and WIDER FACE [yang2016wider] datasets for human and face objects.

The DUNet architecture consists of a sequence of dense blocks that process the input image at different scales, connected to a sequence of prediction layers, each of which independently generates detection results. Two strengths of DUNet are its processing speed and its ability to reliably detect small objects; in prior work, we demonstrated its ability to learn customized object detection models for mobile robots [alabachi2019customizing]. This learning strategy eliminates the necessity of having extra convolutional layers to account for variations in appearance and orientation. A variant of Double DQN is used to separately learn a flight control policy in a discretized action space using simulation data. Our experiments show that the RL controller outperforms a traditional visual servoing approach.

This work introduces a novel interface for automating UAV selfie-photography based on data received from any mobile device. Our system is able to capture selfies at the correct depth, background and orientation angle in the scene, with the user placed at the right position in the frame. Once trained, our RL controller autonomously flies the quadcopter to the optimal vantage point. The system architecture for the Selfie Drone Stick is shown in Figure 1, and the ROS code has been made publicly available.

Related Work

The high-resolution cameras mounted on the new generation of UAVs have motivated work on automating photography and videography [cheng2015aerial], capturing dynamic scenes from multiple high, hard-to-reach viewpoints [nageli2017real], object recognition [pillai2015monocular], and recording cinematographic video [joubert2015interactive]. However, many tasks are too ill-specified to be performed autonomously, necessitating the development of specialized user interfaces.

Natural user interfaces (NUIs) rely on innate human actions such as gestures and voice commands for all human-robot interaction [popov2016control, fernandez2016natural, obaid2016would, ma2016studies]. Alternatively, more precise navigation in indoor and outdoor environments can be achieved through structured waypoint designation strategies [alabachi2018intelligently, liu2011roboshop, gebhardt2016airways]. Wearable sensors were employed in a point-to-target interaction scenario to control and land a drone using arm position and absolute orientation based on inertial measurement unit (IMU) readings [gromov2018video]. Our system removes the need to employ gestures, hand-crafted strokes, or wearable devices. Any mobile device equipped with a camera and IMU sensors can be used to direct the quadcopter using our SDS interface.

A subset of the human-robot interaction research has specifically addressed the problem of user interfaces for drone-mounted cameras. For instance, Alabachi et al. [alabachi2018intelligently] track user-specified objects with an adaptive correlation filter in order to create photo collections that include a diversity of viewpoints. XPose [lan2017xpose] is a touch-based system for photo taking in which the user concentrates on adjusting the desired photo rather than the quadcopter flight path. Unlike XPose, our system does not require a SLAM system and hence is robust to localization errors.

Deep RL has been used to learn specialized flight controllers; for instance, DQN was used to learn autonomous landing policies for a quadcopter with a downward-facing camera [polvara2017autonomous]. Most machine learning controllers are learned in simulation and game worlds; a notable exception is Gandhi et al. [gandhi2017learning], who learned UAV obstacle avoidance policies by collecting a large dataset of real-world crashes. Our deep RL controller is learned using a combination of real-world images used to train DUNet and Gazebo flight trajectories for training DQN. In order to capture the perfect selfie, our learned controller must be able to execute multiple types of flight paths and is not limited to a single maneuver.


Figure 2: Diagram of the three agents in our SDS system. TMA runs on the SMD and uses camera frame and IMU sensor readings to create the target vector. EMA runs on either the simulated or real-world environment. Its task is to create the drone-shot vector and state space. NPA is the DNN agent trained to autonomously navigate the UAV towards achieving the drone-shot vector.

The Selfie Drone Stick (SDS) platform is composed of three cooperating agents, as shown in Fig. 2 and described below.

Target Modeling Agent (TMA):

The goal of the TMA is to process sensor data from the user’s smart mobile device (SMD) during selfie acquisition and to generate a specification for the desired vantage point to which the drone should fly. The user triggers the SDS system by taking a regular selfie using our web-based camera app. When the shutter is pressed, the TMA captures the device’s orientation from its IMU along with the image. The IMU information partially specifies a bearing from the user along which the drone should seek to position itself in order to capture the desired shot. To fully specify the bearing, we also need to know where the user was located within the selfie frame, and this is accomplished using an object detector to detect the location of the user’s face in the selfie. We employ the DUNet [alabachi2019customizing] object detector for this purpose since it was designed to run efficiently on mobile devices. In addition to the bearing to the desired vantage point, we also need to specify the range. The key idea behind the SDS interface is to enable the user to specify the distance to the vantage point by varying the distance of the SMD from the user’s face: moving the SMD further away should cause the drone to capture photos from further away. Thus, the TMA fuses the IMU data (yaw angle) with the face bounding box (location of the centroid and ratio of the user’s face to the image) to generate the target vector. The mapping between the size of the user’s face in the SMD frame and the desired size of the user’s body in the drone frame is described below; intuitively, this is analogous to the length of the user’s virtual selfie stick.
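The fusion step can be illustrated with a minimal Python sketch. It assumes the face detector returns a pixel-space box `(x_min, y_min, x_max, y_max)` and that the "ratio" is the box area divided by the image area; the names `TargetVector` and `make_target_vector` are hypothetical and are not taken from the SDS code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetVector:
    yaw: float    # device yaw from the IMU, in degrees
    cx: float     # normalized x of the face-box centroid, in [0, 1]
    cy: float     # normalized y of the face-box centroid, in [0, 1]
    ratio: float  # face-box area divided by image area (assumed definition)


def make_target_vector(yaw_deg, box, image_size):
    """Fuse an IMU yaw reading with a face bounding box.

    box is (x_min, y_min, x_max, y_max) in pixels; image_size is (width, height).
    """
    x_min, y_min, x_max, y_max = box
    width, height = image_size
    cx = (x_min + x_max) / 2.0 / width
    cy = (y_min + y_max) / 2.0 / height
    ratio = ((x_max - x_min) * (y_max - y_min)) / float(width * height)
    return TargetVector(yaw=yaw_deg, cx=cx, cy=cy, ratio=ratio)


# A face box centered in a 640x480 frame and covering a quarter of its area.
target = make_target_vector(30.0, (160, 120, 480, 360), (640, 480))
```

Holding the phone further away shrinks the face box, which lowers `ratio` and therefore requests a more distant vantage point.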

Environment Modeling Agent (EMA):

The EMA serves a twofold function: 1) it manages the simulated environment used for training the learned drone controller, and 2) it is responsible for mapping the SMD vector (the relative position of the smartphone from the user), as generated by the TMA, to the full specification of the best vantage point for the drone.

Our simulation is based on the Gazebo simulator [koenig2004design] with ROS [quigley2009ros] and inherits from the OpenAI Gym environment [brockman2016openai] to observe a new state at a fixed time interval. Rather than utilizing the raw image, the observed state consists of an abstracted representation:

$s = (\mathit{yaw},\ C_x,\ C_y,\ \mathit{ratio})$

where $\mathit{yaw}$ denotes the yaw angle, $(C_x, C_y)$ is the location of the centroid of the bounding box surrounding the detected object in normalized coordinates, and $\mathit{ratio}$ is the ratio of the bounding box of the user to the whole image. These four observations enable straightforward transfer from simulated to real environments and are robustly estimated in the real world by a combination of IMU and object detection. While the first three quantities are the same between SMD and drone, the last is not. This is because the ratio of the user’s face to the image captured by the SMD falls in a different range from the ratio of the bounding box on the user’s body to the image captured by the drone. Each ratio is observed to fall within its own range, and we simply map linearly between the two ranges. This enables us to have a consistent specification for the vantage point across both simulated and real environments, as well as between the SMD and the drone coordinate frames.
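The linear mapping between the two ratio ranges can be sketched in Python as follows. The numeric ranges below are placeholders chosen only for illustration, not values from the paper, and the clamping step is an added assumption to keep detector noise from producing an out-of-range target.

```python
def map_ratio(face_ratio, face_range, body_range):
    """Linearly map a face-to-image ratio onto the body-to-image ratio range."""
    f_lo, f_hi = face_range
    b_lo, b_hi = body_range
    # Clamp first (assumption) so the result always stays inside body_range.
    clamped = min(max(face_ratio, f_lo), f_hi)
    t = (clamped - f_lo) / (f_hi - f_lo)
    return b_lo + t * (b_hi - b_lo)


# Hypothetical ranges, for illustration only.
FACE_RANGE = (0.1, 0.5)
BODY_RANGE = (0.0, 0.2)

nearest = map_ratio(0.5, FACE_RANGE, BODY_RANGE)   # phone held close to the face
farthest = map_ratio(0.1, FACE_RANGE, BODY_RANGE)  # phone held at arm's length
```

The map is increasing: a larger face in the SMD frame asks for a larger body in the drone frame, that is, a closer vantage point.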

For Deep Reinforcement Learning (Deep RL) methods, it is easier to learn when the drone employs a discrete action space. For such controllers, we simplify the drone controls to change each of roll, pitch, height, yaw by a fixed amount in each direction, resulting in 81 discrete actions. For Deep RL training, we also discretize the state vector as follows:

the yaw angle into 8 bins, the centroid $(C_x, C_y)$ into a grid, and the ratio into 7 bins.
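One way to read the 81-action figure is that each of the four controls is decreased, held, or increased by one fixed increment, giving 3^4 = 81 combinations. The Python sketch below enumerates that action set and shows equal-width binning for one state dimension; both the three-way interpretation and the equal-width bins are assumptions, and `to_bin` is a hypothetical helper.

```python
from itertools import product

CONTROLS = ("roll", "pitch", "height", "yaw")
STEPS = (-1, 0, 1)  # decrease, hold, or increase by one fixed increment

# Every combination of per-control steps: 3 ** 4 = 81 actions.
ACTIONS = [dict(zip(CONTROLS, combo))
           for combo in product(STEPS, repeat=len(CONTROLS))]


def to_bin(value, low, high, n_bins):
    """Return the index of the equal-width bin holding value, or None if out of range."""
    if not low <= value <= high:
        return None
    index = int((value - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)  # value == high belongs to the last bin
```

Returning `None` for an out-of-range value gives the caller a simple way to recognize the out-of-bounds case, for example when the user is no longer visible in the frame.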

It is not obvious what reward function to use for training a deep RL controller for this application. We design a normalized reward function (ranging from -1 to +1) based on the following principles. At each time-step:

  • +0.25 for each dimension where the state matches the target (+1 if all 4 dimensions match);

  • 0 if a value falls in an adjacent bin, decreasing linearly with bin distance, independently by dimension;

  • -1 if the observation falls out of bounds (e.g., user is not visible in frame).

An episode is terminated when the agent hovers at the target vantage point for three successive time-steps or if it fails to reach the goal within 80 time-steps.
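The reward principles above can be sketched in Python under explicit assumptions. The text fixes the reward at +0.25 for a match and 0 for an adjacent bin, but not the slope beyond that; this sketch assumes each dimension falls linearly to -0.25 at the largest possible bin distance, so the total stays within [-1, +1]. It also treats yaw bins as non-circular. The function names are hypothetical.

```python
def dimension_reward(distance, max_distance):
    """Reward contribution of one state dimension, given its bin distance to the target."""
    if distance == 0:
        return 0.25
    if max_distance <= 1:
        return 0.0
    # 0 for an adjacent bin, falling linearly to -0.25 at the largest distance (assumed slope).
    return -0.25 * (distance - 1) / (max_distance - 1)


def reward(state_bins, target_bins, bins_per_dim):
    """Return the step reward; state_bins is None when the observation is out of bounds."""
    if state_bins is None:
        return -1.0
    total = 0.0
    for s, t, n in zip(state_bins, target_bins, bins_per_dim):
        total += dimension_reward(abs(s - t), n - 1)
    return total


# Bin counts per dimension: yaw, Cx, Cy, ratio. The grid size of 5 is a placeholder.
BINS = (8, 5, 5, 7)
```

With these assumptions, a state matching the target in all four dimensions scores +1, and a state one yaw bin away from the target scores +0.75.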

Figure 3: Learning NPA with the 3D model observations. Each epoch is 100 episodes. The first graph shows the number of drone-shots in each epoch. The second graph shows the loss function between the predicted and the target Q values.

Figure 4: Evaluating SDS in a realistic simulation environment. Bottom left: initial view of subject from drone. Top left: user specifies a target vantage point using his SMD. Bottom: Deep RL controller navigates drone to vantage point and captures long-range selfie.

Navigation and Photography Agent (NPA):

The third agent is responsible for controlling the drone, either using a learned Deep RL controller or with the baseline closed-loop visual servoing PID controller.

For the Deep RL controller, we have to be careful during training not to suffer from the correlated nature of the episodic data. Therefore, we prioritize [schaul2015prioritized] the samples drawn from our experience replay buffer in order to obtain independent sample batches and expedite the learning of our low-capacity network. Epsilon-greedy exploration is used to randomize action selection, with epsilon annealed at a decay rate of 2e-5 as the agent learns better Q-values. The NPA predicts the actions that approximate the drone-shot vector as a deterministic off-policy agent. Randomly sampled stacked states are passed to our network. Each stacked state is a matrix of four drone-shot vectors, which handles the temporal limitation problem [mnih2015human]. Instead of a plain DQN, we adopt Double DQN [hasselt2010double] with a target network [mnih2015human] for more stable learning, within a dueling network architecture [wang2015dueling]. The dueling architecture reduces the long convergence time caused by the large action space by decomposing the tail of our network to find the advantage value of each action and aggregating it with the value function (Eq. 1).


Any bad action that happens to occur in successful episodes is eliminated by updating the expected future reward estimate at each time step, i.e., through temporal-difference (TD) learning.
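As an illustration of the two ingredients named above, the Python sketch below shows the mean-subtracted aggregation commonly used in dueling networks [wang2015dueling] and the standard Double DQN bootstrap target. Whether SDS uses exactly these variants is an assumption; the discount factor `gamma` and both function names are hypothetical.

```python
def dueling_q_values(state_value, advantages):
    """Combine a scalar state value V(s) with per-action advantages A(s, a).

    Uses Q(s, a) = V(s) + (A(s, a) - mean over a' of A(s, a')).
    """
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + (a - mean_advantage) for a in advantages]


def double_dqn_target(reward, done, gamma, q_online_next, q_target_next):
    """TD target: the online network picks the action, the target network scores it."""
    if done:
        return reward
    best_action = max(range(len(q_online_next)), key=q_online_next.__getitem__)
    return reward + gamma * q_target_next[best_action]


q_values = dueling_q_values(1.0, [1.0, 2.0, 3.0])
td_target = double_dqn_target(1.0, False, 0.5, [0.1, 0.9], [4.0, 2.0])
```

Separating action selection from action evaluation is what distinguishes the Double DQN target from the plain DQN target, which would take the maximum of `q_target_next` directly.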

Fig. 3 shows a training run for the Deep RL controller, where the system is able to learn from simulated data. The learned controller is then evaluated in realistic simulated scenarios before being deployed in real-world settings (Fig. 6).

Experimental Results

We seek to evaluate the Selfie Drone Stick (SDS) along several dimensions. The two key questions we discuss here are: 1) how does our learned Deep RL controller compare to a typical visual servoing baseline, and 2) how well do controllers trained in simulation transfer to real-world conditions? We detail the setup below.

During training, our environment consists of an empty 3D domain containing a single human and a single UAV (drone) model. This is feasible because our RL method employs an abstracted state vector rather than raw imagery. Training was performed on a single NVIDIA Titan X GPU rather than a mobile device.

We test the system in several realistic simulated environments, such as the one shown in Fig. 4. This allows us to conduct end-to-end experiments under repeatable conditions with known ground-truth, with exactly the same perception system (DUNet) as we employ for real-world testing, as shown in Fig. 6.

Both the simulated and real-world scenarios follow a consistent script:

  1. The drone is initialized facing the human subject at a safe distance. Take-off is activated by holding the SMD in a designated alignment, and the landing command is sent by tilting it.

  2. Prior to activating the SDS, the drone hovers under DUNet control, with the subject centered in the middle of the screen.

  3. Once the user activates SDS by taking a selfie using the SMD, the TMA identifies the desired vantage point for the long-range selfie and transmits this to the EMA.

  4. At each time-step, the EMA generates an observation-vector state using DUNet and stacks the previous three states into an observation (with short-term temporal context) that is sent to the NPA.

  5. The NPA autonomously navigates the drone, either using the baseline PID controller or the learned Deep RL controller.

  6. Once the drone arrives at the bearing and range consistent with the specified vantage point, it takes a long-range selfie of the subject.
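The stacking in step 4 (the current state plus the previous three) can be sketched in Python with a fixed-length buffer. The class name `StateStacker` is hypothetical, and padding the buffer with copies of the first state at the start of an episode is an assumption.

```python
from collections import deque


class StateStacker:
    """Keep the most recent observation vectors as one stacked observation."""

    def __init__(self, depth=4):
        self.depth = depth
        self.frames = deque(maxlen=depth)

    def push(self, state):
        """Add a new state and return the stacked observation, oldest first."""
        if not self.frames:
            # At the start of an episode, pad with copies of the first state (assumption).
            self.frames.extend([tuple(state)] * self.depth)
        else:
            self.frames.append(tuple(state))
        return list(self.frames)


stacker = StateStacker()
first = stacker.push((0.0, 0.5, 0.5, 0.10))
second = stacker.push((5.0, 0.5, 0.5, 0.11))
```

Each returned observation is a 4x4 matrix of (yaw, Cx, Cy, ratio) rows, which gives the controller short-term temporal context.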

Baseline: Closed-Loop PID Controller:

The baseline system is a traditional PID control loop that attempts to perform visual servoing to get the drone to the vantage point. Architecturally, it is identical to the Deep RL version except that the NPA operates in a continuous action space.
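For readers unfamiliar with this kind of baseline, a textbook discrete PID loop for a single state dimension looks like the Python sketch below. It is a generic illustration, not the SDS baseline itself; the gains and the one-controller-per-dimension layout are assumptions.

```python
class PID:
    """Textbook discrete PID controller for one state dimension."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.previous_error = None

    def update(self, target, measured, dt):
        """Return a continuous control command from the current tracking error."""
        error = target - measured
        self.integral += error * dt
        if self.previous_error is None:
            derivative = 0.0  # no history on the first call
        else:
            derivative = (error - self.previous_error) / dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Example: proportional-only control of the horizontal centroid Cx.
cx_controller = PID(kp=1.0, ki=0.0, kd=0.0)
command = cx_controller.update(target=0.5, measured=0.25, dt=0.1)
```

Unlike the discretized Deep RL policy, the command here is a continuous value proportional to the error, which matches the continuous action space of the baseline.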

Table 1 summarizes the percentage of trials for which a given controller can get to within a threshold of the target value corresponding to the vantage point, for each of the state vector dimensions, as well as the overall accuracy (percentage of trials where the drone arrived at the vantage point). We see that the deep RL controller (denoted SDS) outperforms the baseline (closed-loop) both in overall accuracy and on each dimension of the observation vector.

Fig. 5 compares the two automated systems against a purely manual interface [alabachi2018intelligently] in three simulated scenarios, where we measure time taken to get to the vantage point. The manual interface works reasonably well under easy conditions (target 1, left) but is outperformed by SDS in all scenarios, because moving the drone manually in the harder scenarios takes longer. Similarly, the baseline automated system eventually gets to the goal but takes a longer time.

Parameter   Closed Loop   SDS
ratio       83%           85%
Cx          78%           81%
Cy          76%           84%
Yaw         78%           83%
Accuracy    79%           86%

Table 1: Closed Loop vs. SDS comparison under identical conditions in realistic simulated scenarios. The mean value for each parameter is taken from 20 trials, each with 60 time steps, which is adequate for both controllers.

Figure 5: Mean and standard deviation of 20 successful tests done to achieve 3 targets. Target 1: Subject in the middle with depth ratio = 0.08. Target 2: Subject in the bottom right corner with depth ratio = 0.12. Target 3: Subject in the bottom left corner with depth ratio = 0.12.

Finally, Fig. 6 shows the SDS interface operating in a real-world setting for the three scenarios shown in the previous experiment (easy scenario shown in center). These experiments employed an iPhone SMD in conjunction with a Parrot AR.Drone 2.0 UAV with a 30 fps frame rate. The Selfie Drone Stick interface enables the user to take multiple selfies with different backgrounds as the user moves in the environment.

Figure 6: Testing with Parrot ARDrone 2.0 in indoor real-world environment


In this paper, we present the Selfie Drone Stick (SDS), our autonomous navigation and selfie-photography platform that takes long-range selfies using a drone from a vantage point specified by the user using a natural “virtual selfie stick” interface.

In future work, we plan to explore whether a Deep RL controller based on policy gradient operating in a continuous action space can further improve SDS. Additionally, we are interested in extending this natural interface to cinematography for selfie videos of moving users.


The authors would like to thank Yasmeen Alhamdan for assisting with figure generation.