Wireless software synchronization of multiple distributed smartphone cameras.
We present a method for precisely time-synchronizing the capture of image sequences from a collection of smartphone cameras connected over WiFi. Our method is entirely software-based, has only modest hardware requirements, and achieves an accuracy of less than 250 microseconds on unmodified commodity hardware. It does not use image content and synchronizes cameras prior to capture. The algorithm operates in two stages. In the first stage, we designate one device as the leader and synchronize each client device's clock to it by estimating network delay. Once clocks are synchronized, the second stage initiates continuous image streaming, estimates the relative phase of image timestamps between each client and the leader, and shifts the streams into alignment. We quantitatively validate our results on a multi-camera rig imaging a high-precision LED array and qualitatively demonstrate significant improvements to multi-view stereo depth estimation and stitching of dynamic scenes. We plan to open-source an Android implementation of our system, libsoftwaresync, potentially inspiring new types of collective capture applications.
Many computer vision algorithms require multiple images from different viewpoints as input. These algorithms can fail spectacularly when applied to dynamic scenes if the images are not captured at the same time. Depth from stereo, view interpolation, and view supervision losses are just a few examples of such algorithms. We provide a solution to temporally synchronize the cameras of multiple inexpensive smartphones, so that they can capture images simultaneously (Fig. 1).
Our solution is software-based, does not require any additional hardware, and scales up to multiple devices. Furthermore, the devices are synchronized prior to capture and can be located in any physical arrangement as long as they can communicate over a wireless network.
Traditionally, synchronizing cameras in advance of capture is solved by using specialized hardware such as analog video gen-lock or the IEEE 1394 isochronous interface. While effective and reliable, hardware solutions require cumbersome wires, limiting their portability, and more importantly, are unavailable on the vast majority of cameras. For non-specialized cameras, this limits usage to nearly static scenes where several frames of timing error are acceptable.
Another option is to capture image sequences and align the frames after capture. A clapperboard or similar “oracle” device that is visible to all cameras can assist in aligning the frames. However, accuracy is limited to half a frame duration. Post-processing methods to obtain sub-frame alignment exist [1, 2, 3, 4, 5], but can break down for low-texture or noisy scenes where finding correspondences is hard. Even if these algorithms yield perfect sub-frame alignment, it is necessary to temporally interpolate the frames to the aligned sub-frame time, which is another challenging problem. In addition, such methods may require capturing much more data than necessary to ensure that the same instant in time is captured by all cameras.
In contrast, our system is capable of achieving a synchronization accuracy of better than 250 microseconds before capture, allowing for the multi-view simultaneous capture of highly dynamic phenomena such as sports action shots, birds in flight, and splashing liquids (Fig. 2). To acquire synchronized images, users are only required to install our software, position the phones, and press the shutter button on one of the phones. Our system does not look at image content; instead, it uses only image timestamps and wireless messages for synchronization. This means the cameras can have non-overlapping fields of view and can be placed in any configuration. For example, a hand-held, inexpensive light field capture rig can be created by placing the cameras in a square array. Or the cameras can be placed in a linear array, so that a view interpolation algorithm can create a “bullet time” effect where the camera moves around an apparently frozen dynamic subject.
Our system is especially useful for collecting data for deep learning algorithms that require synchronized images for training[7, 8, 9] and also need to be deployed on a smartphone. Were such data collected with hardware-synchronized non-smartphone cameras, biases introduced by differences in the sensor or lens could ruin the algorithm’s performance. A portable rig containing smartphones synchronized using our system provides an inexpensive method to collect such datasets in the wild.
Our system operates in two stages. First, the clocks of all the phones are synchronized to the clock of a leader device using a variant of the Network Time Protocol (NTP). This synchronization alone is only enough to achieve an accuracy of about half a frame duration. In the second stage, the cameras are instructed to capture a continuous stream of images. We then shift the phase of all client streams to that of the leader and achieve an accuracy better than 250 microseconds.
In summary, we contribute the following:
Two algorithms for efficiently phase aligning camera image streams to arbitrary precision.
A system combining phase alignment with a variant of NTP that achieves an accuracy better than 250 microseconds on unmodified consumer smartphones over WiFi.
A theoretical analysis and empirical confirmation of the accuracy and limits of our system.
We plan to release libsoftwaresync, an Android implementation of our system, as open source.
Precision time synchronization in a distributed system is a classically hard problem. The fact that each device is physically independent with its own hardware clock makes synchronization necessary even if networking were reliable with low communications latency. When applied to the camera synchronization problem, misalignment is primarily due to two factors. First, each device has an independent notion of the current time, which may drift and is at best only loosely aligned to an external source (e.g., GPS or cell network). Second, consumer cameras typically do not feature a mechanism by which image capture can be triggered to occur at a specific time. Instead, there is a large firmware and software stack between the camera hardware and the application that requests an image, which introduces additional variable latency. Without physical wires to synchronize all clocks and cameras in the network, it is challenging to trigger all cameras at the same time.
A number of systems focus on aligning video sequences and correcting for rolling shutter after they have been captured. Some systems use active illumination to tag frames from which an offset may be inferred. Others rely on the scene content itself, by detecting abrupt lighting changes, tracking continuous periodic motion, or requiring motion be simple enough that the spatiotemporal transformation can be represented by a low-dimensional model [4, 5]. As the source imagery is fundamentally unsynchronized, these post-capture techniques all ultimately rely on optical flow and interpolation to bring images into sub-frame alignment. Our goal is to synchronize cameras before capture, ensuring they expose the same instant in time and sidestepping the frame interpolation problem.
Gu et al. describe a simple scheme for projector-camera synchronization by forcing the camera to wait until the projector has displayed an image before triggering capture. This technique is straightforward to implement but cannot capture moving subjects, can introduce unacceptably long shutter lag, and is impractical for video sequences. Litos et al. describe a system that uses the Network Time Protocol (NTP) to align device clocks but neglects the variable latency between when software sends a shutter command and when the camera ultimately captures a frame. Petković et al. describe a clever system for synchronizing a projector and a camera by using the camera itself to measure the signal propagation delay. But this requires a projector and also does not account for the variable latency in triggering a camera.
The system by Ahrenberg et al. is closest in spirit to ours. Like our system, they rely on NTP to synchronize device clocks before sending a single recording command. Their system relies on the IEEE 1394 hardware trigger to keep pace, but like the other systems above, assumes that the delay between the software trigger command and image capture is the same among all cameras. Our system does not require a hardware trigger and explicitly models the variable latency between the software trigger command and image exposure.
We give a high-level overview of our system. We first list our assumptions about the underlying system, which justify our algorithm and indicate where it may be deployed.
Our system consists of a collection of devices, each of which is equipped with a camera and a network card. In practice, these are smartphones with an integrated camera and WiFi. At startup, the devices connect over WiFi, synchronize their clocks, and put their cameras into continuous streaming mode (Sec. IV). After a second stage where camera streams are phase aligned (Sec. V), initialization is complete. At any time afterward, when a capture request containing a timestamp arrives, all devices simply save the appropriate images from a ring buffer to disk. Note that this requested timestamp can be in the past, enabling zero shutter lag captures. Our goal is to do the above with reasonable latency, and as such, we rely on a few assumptions about the underlying hardware:
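The ring-buffer selection described above can be sketched as follows. This is an illustrative sketch, not the paper's implementation; the class name, buffer capacity, and timestamps are assumptions:

```python
from collections import deque

class FrameRingBuffer:
    """Keeps a short window of recent frames so that a capture request
    can reference a timestamp in the past (zero shutter lag)."""

    def __init__(self, capacity=64):
        self.frames = deque(maxlen=capacity)  # (timestamp_ns, image) pairs

    def add(self, timestamp_ns, image):
        self.frames.append((timestamp_ns, image))

    def closest(self, requested_ns):
        # Save the frame whose hardware timestamp is nearest the
        # requested capture time.
        return min(self.frames, key=lambda f: abs(f[0] - requested_ns))

buf = FrameRingBuffer()
for i in range(5):
    buf.add(33_000_000 * i, f"frame{i}")  # ~30 fps timestamps in ns
ts, img = buf.closest(70_000_000)  # nearest frame to t = 70 ms
```

Because every device holds the same time window in its buffer, the leader can broadcast a single timestamp and each device independently picks its matching frame.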
Modest variance in network latency. The first stage of our algorithm is to bring the devices’ local clocks into alignment. Theoretically, our system can tolerate arbitrary latency variance, although it may take an unacceptable amount of time to converge. To support arbitrary devices without hardware timestamping of network packets, we include in our latency model the variable delays induced by the operating system’s networking stack.
Hardware camera timestamps. We require that the camera hardware tag images with timestamps in the domain of the device’s local clock. A key component of our algorithm is accurately modeling the time between requesting an image and when it is captured. If images were not timestamped in the same clock domain, this modeling would not be possible. iPhones and numerous Android devices support this feature.
Hardware image streaming. We require that the camera hardware or firmware be able to latch capture settings such that it can stream frames indefinitely with low timing variance between frames. For example, if we request images at ISO 100 with a fixed exposure time and frame duration (the time between frames), we should expect a stream of images with timestamps spaced close to that frame duration apart. Most commercially available cameras and smartphones support this feature.
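This streaming assumption is easy to verify from hardware timestamps alone. A minimal sketch, where the jitter bound and the sample timestamps are illustrative assumptions rather than measured values:

```python
def stream_is_regular(timestamps_ns, frame_duration_ns, jitter_ns=100_000):
    """Check that consecutive hardware timestamps are spaced close to the
    requested frame duration (here within an assumed 100 us jitter bound)."""
    deltas = [b - a for a, b in zip(timestamps_ns, timestamps_ns[1:])]
    return all(abs(d - frame_duration_ns) <= jitter_ns for d in deltas)

# Timestamps (ns) of a hypothetical ~30 fps stream:
ts = [0, 33_366_667, 66_733_334, 100_100_001]
```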
Like previous work [11, 14], we use a variant of the Network Time Protocol (NTP) to synchronize device clocks, for which we give a simplified overview. This step would be unnecessary if all devices supported the Precision Time Protocol (PTP), which provides sub-microsecond hardware synchronization. However, PTP hardware is typically not available on consumer smartphones. In our system, the user designates one device as the leader, which creates a WiFi hotspot. The remaining clients connect to the leader and estimate their clock offsets. Typically, device clock synchronization is performed only once at the beginning of a capture session. In our experiments, clocks remained stable over the course of one hour. If more accuracy is needed, the system can re-synchronize between captures.
A single message in the synchronization routine consists of the following handshake (Fig. 3) between the leader and one of the client devices; each client is handled independently:
At time t0 in the leader’s clock domain, the leader sends the message “t0”.
At time t1 in the client’s clock domain, it receives the message “t0”.
At time t2 in the client’s clock domain, it sends the message “t0, t1, t2”.
At time t3 in the leader’s clock domain, it receives the message “t0, t1, t2”.
We then estimate the clock offset θ and round-trip delay δ as θ = ((t1 − t0) + (t2 − t3)) / 2 and δ = (t3 − t0) − (t2 − t1).
Note that the computation of θ assumes that communications latency is symmetric, a typical assumption in networking. If it is not, there will be a systematic bias that places a limit on our overall synchronization accuracy. Since individual clock readings and network latency measurements may be noisy (due to, e.g., buffering), we collect multiple samples and filter them. To synchronize devices, each of the client devices exchanges messages with the leader in round-robin fashion. Following the best practices suggested by NTP, we determine θ using the mean and min filters.
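A single handshake's estimates follow the standard NTP algebra. A sketch (variable names are ours; times are in seconds, and the example values are illustrative):

```python
def offset_and_delay(t0, t1, t2, t3):
    """Standard NTP estimates from one handshake:
    t0: leader sends, t1: client receives,
    t2: client replies, t3: leader receives.
    Assumes the one-way latency is symmetric."""
    theta = ((t1 - t0) + (t2 - t3)) / 2.0  # client clock minus leader clock
    delta = (t3 - t0) - (t2 - t1)          # round-trip network delay
    return theta, delta

# Client clock 5 ms ahead, 2 ms latency each way, 1 ms client processing:
theta, delta = offset_and_delay(0.000, 0.007, 0.008, 0.005)
```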
The mean filter assumes that communications latency is normally distributed. This lets us compute the sample mean and variance after rejecting outliers where the round-trip delay is greater than some threshold (e.g., due to buffering or interference). Since the variance of the sample mean decreases as 1/N with the number of samples N, we can determine the number of messages required to reach a desired accuracy. However, in our experiments, we found that the mean filter converges to a systematic bias when compared to an external reference clock. To ameliorate this bias, we use the min filter.
The min filter assumes that the sample with the minimum latency is the most reliable. This heuristic, also used by the NTP standard, is based on the hypothesis that the message with the shortest delay experienced the least amount of processing and is therefore the most symmetric. We confirm this hypothesis quantitatively and compare the mean and min filters in Sec. VI. In our implementation, we can either exchange a fixed number of messages and take the minimum, or iterate until we observe a sample with latency below some threshold. To guarantee a predictable startup time of less than 10 seconds on our rig of 5 smartphones, we used a fixed number of samples (300 per client in practice; see Sec. VI).
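The two filters can be sketched over a batch of handshake samples as follows. The delay threshold and sample values are illustrative assumptions:

```python
def mean_filter(samples, delay_threshold):
    """samples: list of (offset, round_trip_delay) pairs from handshakes.
    Average the offsets after rejecting high-delay outliers."""
    kept = [offset for offset, delay in samples if delay <= delay_threshold]
    return sum(kept) / len(kept)

def min_filter(samples):
    """Trust the single handshake with the smallest round-trip delay:
    it likely saw the least queuing, so its latency is the most symmetric."""
    offset, _ = min(samples, key=lambda s: s[1])
    return offset

# Illustrative samples: (clock offset, round-trip delay) in seconds.
samples = [(0.0051, 0.004), (0.0049, 0.003), (0.0090, 0.050)]
```

With these samples, the mean filter rejects the 50 ms outlier and averages the rest, while the min filter returns the offset of the 3 ms handshake.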
Once all device clocks are synchronized, we can assume that all images are timestamped in the leader’s clock domain (by adding the appropriate offset). However, to actually capture images simultaneously on all devices, we need to account for the variable latency between when a device requests a capture and when the sensor is actually exposed. This latency can be highly variable and is hard to measure accurately. Instead, we exploit hardware image streaming: the ability to indefinitely capture a sequence of images at regular intervals. We seek to shift the phases of each stream, so that all cameras are exposing at the same time. Then, all that remains is to select from the image ring buffer which frames to save to disk.
Suppose that the leader camera and a client camera start streaming at times t_L and t_C respectively and repeatedly capture images with period T. We seek to reduce the difference in their phases to a small value by shifting the phase of the client camera (Fig. 4a). As opposed to trying to align t_L and t_C directly, phase shifting a frame stream can be done efficiently and accurately.
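Once clocks are synchronized, the relative phase of two streams can be computed from any one timestamp of each. A sketch (units are arbitrary as long as they are consistent; the function names are ours):

```python
def phase_offset(leader_ts, client_ts, period):
    """Relative phase of the client stream with respect to the leader
    stream, assuming both timestamps are in the leader's clock domain."""
    return (client_ts - leader_ts) % period

def needed_shift(offset, period):
    """How far to delay the client stream so its phase matches the leader's."""
    return (period - offset) % period
```

For example, with a period of 33 units, a client frame at t = 45 against a leader frame at t = 0 has phase 12, so the client stream must be delayed by 21 to align.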
We describe two methods to phase shift a frame stream: reset sampling, a slower-to-converge method that is applicable to many devices, and frame injection, a faster-to-converge method that may not work on every device.
In reset sampling, we restart the camera until its phase falls within a tolerance τ of the goal phase (the gray region in Fig. 4b). There is significant variance in how long it takes a camera to reset. Furthermore, this distribution can vary from camera to camera and in the worst case can result in phases always falling outside the tolerance. Therefore, we sleep for a random amount of time, drawn uniformly from [0, u], after stopping the camera and before restarting it to make the overall distribution of resets more uniform. If the latency of starting the camera is described by a random variable R, the total phase offset can then be modeled as approximately uniform on [0, T), where the approximation holds as long as u is large relative to T and the range of R. We found a value of u = 1 second to be reasonable. Since we shift the phase by a random amount in each iteration, we effectively sample phase offsets from a uniform distribution and reject those that fall outside the gray region in Fig. 4b. There is a 2τ/T chance that a sample will be accepted on any one iteration. This means that to have a 95% chance of converging within a tolerance of τ, we need to reset the camera log(0.05) / log(1 − 2τ/T) times. For the tolerances and frame durations used in our experiments, this is roughly a hundred resets (Sec. VI).
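The iteration count implied by this rejection-sampling argument can be computed directly. A sketch; the 0.5 ms tolerance and 30 fps period in the example are illustrative, not the paper's exact settings:

```python
import math

def resets_needed(tolerance, period, p_success=0.95):
    """Number of camera resets so that, with probability p_success, at
    least one uniformly random phase lands within +/- tolerance of the
    goal phase. Each reset accepts with probability 2*tolerance/period."""
    p_accept = 2.0 * tolerance / period
    return math.ceil(math.log(1.0 - p_success) / math.log(1.0 - p_accept))

# Example: +/- 0.5 ms tolerance on a 30 fps (1/30 s period) stream.
n = resets_needed(0.0005, 1.0 / 30.0)
```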
Some camera APIs (e.g., Android’s Camera2 API) allow a high-priority capture request to be injected into a normal-priority continuous image stream. Typically, the continuous stream is used for the viewfinder, while the high-priority request is used to capture a still image with different exposure or gain settings. By injecting a frame with exposure longer than the frame duration T, we can shift the phase of a stream in one iteration. In an ideal camera, if the phase offset between two image streams is φ, we can align the streams by injecting a single frame whose exposure exceeds the frame duration by φ. Real cameras can behave differently from this ideal, but as long as the camera is deterministic, we can fit a function mapping a desired phase offset φ to an exposure that will shift the camera’s stream by φ + kT for some integer k; we empirically fit such a mapping for one specific phone model. Some cameras require exposure times to be an integer multiple of a scanline’s readout time. While frame injection is deterministic, we model this quantization as Gaussian noise with a standard deviation measured to be on the order of tens of microseconds. Under this assumption, this method converges in a small number of iterations with 95% probability; if the tolerance is large relative to this quantization noise, frame injection converges in one iteration.
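Under the idealized model above, the injected frame's exposure can be chosen as follows. This is a sketch: the linear mapping and the scanline quantization step are assumptions, and as the text notes, a real device needs a per-model calibration of this function:

```python
def injection_exposure(phase, period, quantum):
    """Exposure for the single injected frame in an idealized camera:
    a frame exposed for (period + phase) delays all subsequent frames
    by `phase`. Exposures are quantized to the scanline readout time
    `quantum`, which is the source of the residual alignment noise."""
    e = period + (phase % period)
    return round(e / quantum) * quantum

# Example: shift by 12 ms on a 33 ms stream with 10 us readout quantum.
e = injection_exposure(0.012, 0.033, 1e-5)
```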
We quantitatively verify the accuracy of our method and show that it outperforms several common naive approaches by three orders of magnitude. We also demonstrate how our method qualitatively improves the quality of multi-view stereo and panorama stitching. Finally, we show simultaneous multi-view captures of several highly dynamic scenes.
To quantitatively validate our method, we use an LED panel as an external high-precision frequency reference (Fig. 5a). The panel contains a fast-changing 10×10 array and a slow-changing 10×1 array. In the fast-changing array, a column is lit for a fixed time period Δ, then the next column is lit for Δ, and so forth until the last column is reached and the pattern repeats. In the slow-changing array, each LED stays on for one full cycle of the fast array. The two arrays allow us to measure synchronization error with a resolution on the order of Δ. Since the entire panel repeats with a fixed period, it is possible, though unlikely, for the measured synchronization to be off by exactly an integer number of panel periods. We can eliminate this possibility by using multiple values of Δ.
In our first experiment, we placed two Google Pixel 3 smartphones on a tripod and captured short exposures of the LED panel (Fig. 5a), holding the LED period Δ and the frame duration fixed. To avoid confounding synchronization error with misalignment caused by mismatched rolling shutters, we oriented the phones so that their center scanlines both capture the fifth row of the fast-moving array of the LED panel.
We measured the synchronization error of our min filter and frame injection methods over 239 trials. In each trial, we first reset the synchronization process to undo clock sync and injected a long exposure frame of random length to undo the phase alignment. We then asked our system to re-synchronize to a fixed tolerance. Example captures from the two phones are shown in Fig. 5(b-c). The skewed pattern of LEDs is due to rolling shutter.
In all pairs of images, there is no more than 1 LED difference between the pairs, showing that the total synchronization error is less than one LED period Δ. We also ran experiments with other values of Δ to verify that our results were not off by an integer multiple of the panel period. To get more fine-grained error measurements, we exploit the fact that short exposures and rolling shutter mean that some LEDs will be only partially captured. This allows us to estimate the error between the two phones with temporal resolution better than Δ. We use the mean pixel intensity in each grid cell in the fifth row of Fig. 5(b-c) to estimate the sub-grid position of the LED lights.
For two devices, we found the maximum error between them to be well under 250 microseconds (Fig. 6c and Table I). With more than two devices, any pair can have this maximum error but in opposite directions, implying a worst-case error of twice the two-device maximum.
We also compared against three naive synchronization methods: wired, Bluetooth, and WiFi. In the naive wired method, we used a selfie stick connected via an audio splitter to two Pixel 3 smartphones, such that a single button press would trigger both phones. We used the same setup as in Fig. 5(a), with a larger LED period Δ to cover the larger expected errors. In the naive Bluetooth method, we connected wireless Bluetooth photo capture triggers to each device and used a mechanical button to press both triggers at the same time. In the naive WiFi method, we connected one phone to the other’s WiFi hotspot; on shutter press, the phone with the hotspot instructed itself and the client to take photos immediately. We conducted 20 trials for each of these methods. In the wired and Bluetooth methods, 2 and 1 shots respectively failed to capture due to issues with the extra hardware. These naive methods have errors over 1000 times larger than ours and occasionally fail to work (Table I). In addition, the wired and Bluetooth solutions require extra hardware, making them cumbersome to use and difficult to scale to many phones.
Table I: Per-method synchronization error (max error, mean absolute error, standard deviation).
The overall accuracy of our system, ε_total, depends on many factors. The two main ones are how well the clock offsets are estimated (ε_clock) and how well the image stream phases are aligned (ε_phase). There are also other sources that we do not account for, such as imperfect hardware timestamps and variance in the readout period. Ignoring these other sources, we model the total error as ε_total = ε_clock + ε_phase.
In the previous section, we directly measured ε_total using an external reference. We can also directly measure ε_phase by comparing an image’s actual hardware timestamp to the requested one. ε_clock can then be estimated as ε_total − ε_phase.
We first measure the phase error over the same 239 trials as in the previous section. We found that the phase error is always within our requested tolerance and that the distribution of differences has zero mean (Fig. 6a). The clock error is not zero mean and has a slight negative bias (Fig. 6b). We analyze this further in the next section.
Note that both the maximum and mean absolute errors are comfortably less than 250 microseconds. Table II also shows that the dominant component of the error in our system is the clock error, which, as noted earlier, can be reduced to less than 1 microsecond by using PTP hardware.
Table II: Breakdown of total, phase, and clock error (max error, mean absolute error, standard deviation).
We analyze the behavior and effectiveness of the mean and min clock filters. With the same setup, the two phones exchange 10,000 NTP messages and capture one image pair of our LED panel. From this image pair, we can estimate the true clock offset, from which we can derive the one-way latency of each message. As in our other experiments, images are hardware timestamped and messages are software timestamped, both in each device’s local clock domain. Each message takes a few milliseconds round-trip (the 10,000 messages complete in 34.3 seconds). Fig. 7 shows the difference in one-way latency distributions over the 10,000 messages after removing 64 outliers whose latency exceeds a threshold.
Table III: One-way latencies from leader to client and from client to leader, and their absolute difference.
Mean filter - NTP’s clock offset accuracy relies on network symmetry. Fig. 7 and Table III show that this is not the case in our setup and that computing the mean offset over many messages converges to a bias that is approximately half the absolute difference in one-way latencies. The exact values will vary with the network configuration and hardware used. The supplement contains additional details on how we model the relationship between round-trip latency and bias.
Min filter - The min filter uses the message with the smallest round-trip latency. As Table III shows, the NTP bias is approximately half the difference in one-way latencies. The min filter, by choosing the sample with the lowest latency, is subject to significantly less latency asymmetry and therefore achieves a much lower bias in its clock offset. We conducted 3 quantitative experiments with increasing sample sizes and found that the average shortest round-trip latency decreases as more samples are exchanged, with low variance. In practice, we use 300 samples, which takes under 1 second per client and still provides our total synchronization accuracy.
The smallest errors are likely to be confounded by measurement noise such as misaligned scanlines between phones, rolling shutter skew, and the finite temporal resolution of our LED panel. Phase errors are minor compared to the total error (Table II).
We measure the number of iterations and total wall time required for phase alignment using both the frame injection and reset sampling methods. Note that once a client has synchronized its clock with the leader, phase alignment can proceed independently of other clients and therefore does not scale with the number of clients.
For the frame injection method, we used the same dataset of 239 trials as in the earlier experiments. Alignment took between 4 and 7 injected frames, with a mean of 5.5 frames. On the Pixel 3, each injected frame takes a fixed, known amount of time, so phase alignment converges in less than 3 seconds.
To measure reset sampling, we ran 10 additional trials at the same tolerance. As expected, reset sampling took significantly longer than frame injection: on average 28.7 iterations before convergence, with a standard deviation of 25.2 iterations. Because reset sampling requires restarting the camera and sleeping for a random amount of time ([0, 1] seconds), each iteration took 1.23 seconds on average. According to Eq. 3, it would take 98 iterations, or about 120 seconds, to achieve a 95% probability of alignment. Although reset sampling is significantly more expensive than frame injection, it does not require any modeling of camera request latency and is deployable on a wider variety of devices.
While our system has many applications, we focus on two here: stereo reconstruction and image compositing of dynamic scenes. In addition, we use our method to capture multiple views of highly dynamic scenes (Fig. 10 and the supplement). Our system has also been used to train a machine learning algorithm shipping with a commercial smartphone [redacted].
Stereo reconstruction of dynamic scenes - Stereo techniques reconstruct 3D geometry from two or more photos of the same scene taken from different viewpoints. Assuming known camera poses, these techniques typically first identify a scene point’s projections in multiple photos and then estimate its depth by triangulation. A fundamental assumption is that the scene point is stationary across photos; therefore, we need synchronized capture to apply these techniques to dynamic scenes.
To qualitatively demonstrate the efficacy of our synchronization for depth estimation of dynamic scenes, we capture a pair of stereo images and compute depth using COLMAP, a state-of-the-art Structure-from-Motion and Multi-View Stereo system (Fig. 8). We compare against naive synchronization, which we simulate by selecting frames offset by a synchronization error comparable to the smallest average error among the naive methods listed in Table I. Clearly, synchronization enables higher quality depth. Moreover, since our approach scales to multiple cameras, we can further increase the accuracy of depth, especially around occluded regions, by adding more synchronized cameras (see the supplement).
Image compositing for dynamic scenes - Image composites stitch multiple images of the same scene from different viewpoints into a single image[22, 23] that may otherwise be difficult or impossible to capture using a single camera, e.g., an extremely wide angle image or a multi-perspective image[24, 25]. A typical approach consists of aligning the images, projecting them onto a suitable compositing surface, and blending them to minimize alignment errors and account for differences in exposure, color balance, etc. While a composite may be constructed from images captured by a single camera, a dynamic scene requires synchronized captures from multiple cameras to minimize stitching artifacts.
Fig. 9 demonstrates that our system can be used to avoid stitching artifacts due to motion. Our system can also synchronize exposure and white-balance across devices making it easier to blend between images in the composite.
We presented an entirely software-based system for capturing time-synchronized image sequences on a collection of smartphones. Our method is wireless, runs on commodity hardware, and achieves an accuracy better than 250 microseconds, which we verified against an external high-precision reference to be significantly better than naive methods that synchronize cameras prior to capture.
We empirically confirmed that the accuracy of our method largely depends on NTP clock bias, which can be minimized following the existing best practice of min filtering. We also analyzed and verified the convergence behavior of two phase alignment approaches: reset sampling, which is slow but widely applicable, and frame injection, which converges rapidly but has a minor hardware requirement.
As future work, we would like to scale up our system, i.e., to thousands of devices distributed over a large geographical area, maintaining synchronization over extended periods. Our system has only been tested on Google Pixel 1, 2, and 3 smartphones. The network is also restricted to a small number of devices (a hotspot limit imposed by the Android OS), and all clients must currently communicate directly with a designated leader device. A peer-to-peer model could address both of these limitations and adapt to both clock drift and motion. We will open-source libsoftwaresync in hopes of inspiring new possibilities in social, collective photography applications. We envision such a system being used to capture large-scale dynamic events such as concerts and sports matches.
We thank Sam Hasinoff, Jon Barron, Roman Lewkow and Ryan Geiss for their helpful comments and suggestions, as well as Nikhil Karnard, Tim Brooks, Orly Liba, and David Jacobs for their help with collecting qualitative results.
J. Xie, R. Girshick, and A. Farhadi, “Deep3D: Fully automatic 2D-to-3D video conversion with deep convolutional neural networks,” ECCV, 2016.