A Modified Fourier-Mellin Approach for Source Device Identification on Stabilized Videos

by Sara Mandelli, et al.

To decide whether a digital video has been captured by a given device, multimedia forensic tools usually exploit characteristic noise traces left by the camera sensor on the acquired frames. This analysis requires that the noise pattern characterizing the camera and the noise pattern extracted from the video frames under analysis are geometrically aligned. However, in many practical scenarios this does not occur, so a re-alignment or synchronization has to be performed. Current solutions often require a time-consuming search of the realignment transformation parameters. In this paper, we propose to overcome this limitation by searching scaling and rotation parameters in the frequency domain. The proposed algorithm, tested on real videos from a well-known state-of-the-art dataset, shows promising results.



1 Introduction

Multimedia forensics keeps developing technologies to identify the camera originating a digital image or a digital video. Currently, the most promising technique is based on the analysis of the sensor pattern noise (SPN), or photo-response non-uniformity (PRNU), left by the acquisition device in the visual content. This trace is useful to identify the video source since it is universal (i.e., every camera sensor introduces one) and unique (i.e., PRNUs from two different sensors are uncorrelated) [10, 14]. Moreover, PRNU has proved to be significantly robust to commonly used processing, like JPEG compression [10], or uploading to social media platforms [3, 1].

The PRNU-based source identification process consists of verifying the match between a query image or video frame and a fingerprint characterizing a reference camera. The strategy involves two main steps: i) a reference fingerprint is derived from still images or videos acquired by the source device; ii) the query fingerprint is estimated from the investigated content and then compared with the reference, by means of a correlation, to verify the possible match. If the query content was acquired by the reference camera, then a high correlation is expected.

The previous scheme works under the hypothesis of perfect geometrical alignment between the reference and test fingerprints. If a geometrical transformation is applied to the query content, a pixel grid misalignment between the query and the reference fingerprint arises, thus hindering the detection. Such a case occurs in multiple scenarios: when an image or a video has been acquired with different resolution settings, or has been cropped and resized upon upload to a social media platform; when a malicious user slightly distorts the content to remove the sensor traces; when a query video is tested against a reference estimated from still images; when a video has been recorded with electronic image stabilization enabled. In all these cases, the PRNU extracted from the query is misaligned with the reference fingerprint, and thus a geometric re-synchronization between them has to be carried out before the matching operation.

The first solution to this problem was proposed in [6], where the case of cropped and downscaled images was studied. The authors show that it is possible to parameterize the normalized cross-correlation (NCC) between the reference fingerprint and the query noise with respect to the scaling factor. The NCC peak position for a given scaling factor provides an estimate of the shift. While the NCC can be efficiently computed in the Fourier domain, a brute-force search is needed to determine the scaling factor. Following the same rationale, more recent papers [17, 8, 11] extend the methodology to rotation and to the more challenging scenario of video analysis. As a matter of fact, modern acquisition pipelines usually include electronic stabilization, which undermines PRNU-based attribution. In these cases, PRNU-based techniques only work if geometric transformations are properly estimated and compensated for, which is a computationally complex operation.

In this paper, we focus on the problem of camera attribution of stabilized video sequences based on PRNU. Specifically, we propose a method to align frame fingerprints with the reference PRNU by recovering the scaling, shift and rotation parameters introduced by electronic stabilization. We overcome the problem of computational complexity by searching for scaling and rotation parameters in the frequency domain, thanks to a modified version of the Fourier-Mellin (FM) transform. Results obtained on the well-known Vision dataset [16] show that the proposed method is extremely efficient whenever only rotation and scaling operations are applied to video frames. When shift is also taken into account, the gain over the state of the art [11] depends on the video content.

2 Background and problem statement

In this section we introduce the background on the Fourier-Mellin (FM) transform and define the problem we are tackling in this paper.

Fourier-Mellin Transform. The FM transform makes it possible to estimate scale, rotation and shift transformations between two images in closed form [15].

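Given an image $\mathbf{I}$, the FM transform is expressed as the log-polar mapping of the magnitude of the image Fourier transform, i.e.,

$$\mathrm{FM}(\mathbf{I}) = \mathcal{L}\left(|\mathcal{F}(\mathbf{I})|\right), \tag{1}$$

where $\mathcal{L}(\cdot)$ is the operator computing the log-polar mapping, and $|\mathcal{F}(\cdot)|$ is the magnitude of the Fourier transform.

Let us consider two images $\mathbf{I}_1$ and $\mathbf{I}_2$ that are linked through a similarity transformation, i.e., $\mathbf{I}_2 = \mathcal{T}_{\mathbf{S}}(\mathbf{I}_1)$, where $\mathcal{T}_{\mathbf{S}}(\cdot)$ applies the transformation identified by the matrix

$$\mathbf{S} = \begin{bmatrix} s\cos\theta & -s\sin\theta & t_x \\ s\sin\theta & s\cos\theta & t_y \\ 0 & 0 & 1 \end{bmatrix}, \tag{2}$$

where $s$, $\theta$, $t_x$ and $t_y$ represent scaling, rotation, and horizontal and vertical shift, respectively. In this scenario, it is possible to show that $\mathrm{FM}(\mathbf{I}_2)$ is, up to a constant amplitude factor, a shifted version of $\mathrm{FM}(\mathbf{I}_1)$. More formally, taking $\mathbf{S}$ to map the pixel coordinates of $\mathbf{I}_1$ onto those of $\mathbf{I}_2$,

$$\mathrm{FM}(\mathbf{I}_2)(\rho, \phi) = s^2 \, \mathrm{FM}(\mathbf{I}_1)(\rho + \log s, \; \phi - \theta), \tag{3}$$

where $\rho$ is the (logarithmic) radial coordinate and $\phi$ the rotational coordinate. It is therefore possible to estimate scale $s$ and rotation $\theta$ by looking at the peak position of the phase correlation function between $\mathrm{FM}(\mathbf{I}_1)$ and $\mathrm{FM}(\mathbf{I}_2)$, independently of the shift [15]. Once $s$ and $\theta$ are estimated, the two images can be realigned apart from translation. The relative shift can then be estimated by looking at the peak position of the phase correlation computed between the two realigned images in the pixel domain [15].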
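As an illustration of the phase correlation step mentioned above, the following is a minimal NumPy sketch that recovers a known circular shift between two arrays. The function names `phase_correlation` and `peak_shift` are hypothetical, and circular (wrap-around) shifts are assumed for simplicity; applied to two log-polar transforms, the same peak search would return the displacement associated with scale and rotation.

```python
import numpy as np

def phase_correlation(a, b, eps=1e-12):
    """Return the phase-correlation surface between two equally sized 2D arrays."""
    cross = np.fft.fft2(a) * np.conj(np.fft.fft2(b))
    cross /= np.abs(cross) + eps  # keep only the phase of the cross-spectrum
    return np.real(np.fft.ifft2(cross))

def peak_shift(surface):
    """Convert the argmax of a correlation surface into a signed (row, col) shift."""
    idx = np.unravel_index(np.argmax(surface), surface.shape)
    return tuple(int(i) if i <= n // 2 else int(i - n)
                 for i, n in zip(idx, surface.shape))

# Synthetic check: shift a random pattern by a known amount and recover it.
rng = np.random.default_rng(0)
ref = rng.standard_normal((64, 64))
moved = np.roll(ref, shift=(5, -7), axis=(0, 1))
estimated = peak_shift(phase_correlation(moved, ref))
print(estimated)
```

Normalizing the cross-spectrum to unit magnitude is what makes the peak sharp and insensitive to a global amplitude factor between the two inputs.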

Problem formulation. PRNU is typically modeled as a multiplicative noise pattern introduced by any device in all acquired images or videos [10, 4]. In forensic analysis, it is well known that PRNU can be exploited to infer whether an image was shot by a certain device. For instance, given a test image $\mathbf{I}$ and a device PRNU $\mathbf{K}$, we can compute the peak-to-correlation energy (PCE) between the noise residual $\mathbf{W}$ extracted from the image and the PRNU pixel-wise scaled by $\mathbf{I}$, i.e., $\mathrm{PCE}(\mathbf{W}, \mathbf{I} \circ \mathbf{K})$. Indeed, PCE measures the correlation between the noise traces left on $\mathbf{I}$ and the device PRNU independently of potential shift misalignment, as the correlation peak is searched over all possible mutual shifts between them. If the PCE is greater than a confidence threshold, we attribute $\mathbf{I}$ to the device [10, 4].
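A minimal sketch of a PCE computation is given below, assuming circular cross-correlation and a small square exclusion window around the peak. The function name `pce`, the `exclude` half-width and the synthetic data are illustrative assumptions, not the configuration used in the experiments; in the setting described above, the second argument would be the fingerprint pixel-wise scaled by the image.

```python
import numpy as np

def pce(residual, fingerprint, exclude=5):
    """Peak-to-correlation energy between two equally sized 2D arrays."""
    r = residual - residual.mean()
    k = fingerprint - fingerprint.mean()
    # Circular cross-correlation over all mutual shifts, via the Fourier domain.
    xc = np.real(np.fft.ifft2(np.fft.fft2(r) * np.conj(np.fft.fft2(k))))
    peak_idx = np.unravel_index(np.argmax(xc), xc.shape)
    peak = xc[peak_idx]
    # Exclude a small neighbourhood of the peak (with wrap-around) from the energy.
    mask = np.ones(xc.shape, dtype=bool)
    rows = np.arange(peak_idx[0] - exclude, peak_idx[0] + exclude + 1) % xc.shape[0]
    cols = np.arange(peak_idx[1] - exclude, peak_idx[1] + exclude + 1) % xc.shape[1]
    mask[np.ix_(rows, cols)] = False
    energy = np.mean(xc[mask] ** 2)
    return float(peak ** 2 / energy)

# Synthetic check: a shifted, noisy copy of K should score far higher than
# an unrelated pattern (all arrays here are made-up stand-ins).
rng = np.random.default_rng(1)
K = rng.standard_normal((64, 64))
other = rng.standard_normal((64, 64))
W = 0.5 * np.roll(K, (3, 4), axis=(0, 1)) + rng.standard_normal((64, 64))
pce_match = pce(W, K)
pce_other = pce(W, other)
print(pce_match > pce_other)
```

Because the peak is searched over all circular shifts, the score for the matching pair does not depend on the shift applied to `K`.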

The extension of PRNU-based strategies to the attribution of video frames to a specific device suffers from some issues due, for instance, to higher compression rates and lower pixel resolutions. As a matter of fact, the previously described PCE test cannot be directly performed, since the PRNU resolution is typically higher than the size of the recorded video frames. Moreover, in-camera video stabilization techniques, which are becoming a must-have device feature, strongly hinder the PRNU traces, as video frames may be warped by means of geometrical transformations (e.g., cropping, rotation, scaling, etc.) in order to generate a stable video sequence [11, 7]. As a consequence, attributing a video frame to a specific device can be a much more challenging task than common image-camera attribution.

In this paper, we exploit PRNU-based traces to investigate the problem of device attribution when testing in-camera stabilized video frames. Specifically, given a device fingerprint $\mathbf{K}$ and a frame $\mathbf{I}$ coming from a stabilized video sequence, we aim at exploiting the PRNU traces left on $\mathbf{I}$ in order to detect whether it has been recorded by the analyzed device. To do so, we assume that geometric transformations can be approximated by similarities [8, 11] and we propose a geometrical realignment strategy based on a modified version of the FM transform applied to both the device fingerprint and the frame noise residual. Specifically, the proposed modified Fourier-Mellin (MFM) transform enables comparing a device fingerprint and a noise residual independently of scaling and rotation operations. The next section provides all the details of the proposed method.

3 Proposed method

In order to attribute a video frame $\mathbf{I}$ to a device whose reference fingerprint is $\mathbf{K}$, we follow a pipeline based on a few steps: (i) noise extraction; (ii) geometric transformation estimation; (iii) geometric compensation and matching. In the following, we illustrate all the steps of the pipeline.

Noise extraction. As in the common PRNU-based attribution algorithm, we extract the noise residual $\mathbf{W}$ from frame $\mathbf{I}$. This is done using the strategy proposed in [10, 4]: (i) the noise is extracted through wavelet-based denoising; (ii) a series of post-processing steps (e.g., zero-averaging rows and columns, Wiener filtering, etc.) are applied to further enhance the noise residual $\mathbf{W}$.
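One of the post-processing steps listed above, zero-averaging rows and columns, can be sketched as follows. This is a simplified illustration: the function name `zero_mean_rows_cols` is hypothetical, and the wavelet-based denoising and Wiener filtering steps of [10, 4] are deliberately left out.

```python
import numpy as np

def zero_mean_rows_cols(residual):
    """Remove the mean of every row, then the mean of every column."""
    out = residual - residual.mean(axis=1, keepdims=True)
    out = out - out.mean(axis=0, keepdims=True)
    return out

# Toy residual with strong row and column offsets.
toy = np.arange(12.0).reshape(3, 4) ** 2
cleaned = zero_mean_rows_cols(toy)
print(np.allclose(cleaned.mean(axis=0), 0.0), np.allclose(cleaned.mean(axis=1), 0.0))
```

After the row means are removed the array has zero global mean, so subtracting the column means afterwards leaves the row means at zero as well.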

Geometric transformation estimation. In order to match $\mathbf{W}$ and the fingerprint $\mathbf{K}$, we first need to search for the geometrical transformation that might link them. In principle, assuming that video frame warping can be approximated by a similarity transformation [8, 11], aligning a noise residual and a reference device fingerprint by means of Fourier-Mellin may seem straightforward. In practice, differently from the Fourier-Mellin theory presented in Section 2, the two terms to compare (i.e., $\mathbf{W}$ and $\mathbf{K}$) are not exactly a similarity-transformed version of one another. First, the geometric transformation introduced by stabilization is not necessarily a similarity, but can include perspective distortions (on the entire frame or a localized portion of it) as well [7, 18]. Second, the noise residuals of video frames may contain scene content and noise contributions which are not present in the reference device fingerprint.

The primary consequence of this dissimilarity is that using only the Fourier magnitudes to estimate scale and rotation between the two terms, as reported in (3), may not be precise. Indeed, we verified that the phase correlation between $\mathrm{FM}(\mathbf{W})$ and $\mathrm{FM}(\mathbf{K})$ does not show a pronounced peak, thus strongly hindering the estimation of scale, rotation and shift. To overcome this issue, we modify the Fourier-Mellin pipeline in two ways.

First, we propose to embed the phase term of the Fourier transform, in addition to the magnitude, into the Fourier-Mellin pipeline. The modified Fourier-Mellin (MFM) transform of $\mathbf{I}$ can thus be defined as:

$$\mathrm{MFM}(\mathbf{I}) = \mathcal{L}\left(\mathcal{F}(\mathbf{I})\right), \tag{4}$$

where $\mathcal{L}(\mathcal{F}(\cdot))$ is the log-polar mapping of the image Fourier transform (including magnitude and phase). On the one hand, phase adds more information, which is very useful for angle and scale estimation. On the other hand, this operation comes at a cost: we can no longer isolate the estimation of scale and rotation from the estimation of the shift. Indeed, in this case, phase correlation does not depend exclusively on scale and rotation transformations, but also on the translation between the two terms. The modified pipeline works only if $\mathbf{W}$ and $\mathbf{K}$ are almost perfectly aligned in terms of translation, i.e., if their mutual shift is basically $0$ pixels in both horizontal and vertical directions. In other words, when including the Fourier phase term, we first have to correctly realign the PRNU traces left on the noise residual with those on the reference fingerprint as far as the relative shift is concerned; then we can convert the Fourier transforms into the log-polar domain and estimate the remaining parameters.
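The log-polar mapping of the complex Fourier transform can be sketched as below. This is an illustrative nearest-neighbour resampling written with NumPy only; the function name `log_polar_spectrum` and the grid sizes `n_rho` and `n_phi` are assumptions, and the experiments in this paper rely on the defaults of [13] rather than on this code.

```python
import numpy as np

def log_polar_spectrum(image, n_rho=64, n_phi=90):
    """Nearest-neighbour log-polar resampling of the centred complex 2D FFT.

    Rows index the log-radius rho, columns index the angle phi.
    """
    spec = np.fft.fftshift(np.fft.fft2(image))
    rows, cols = spec.shape
    cy, cx = rows // 2, cols // 2  # position of the DC term after fftshift
    r_max = min(cy, cx) - 1
    # Log-spaced radii from 1 to r_max, uniformly spaced angles in [0, 2*pi).
    rho = np.exp(np.linspace(0.0, np.log(r_max), n_rho))
    phi = np.linspace(0.0, 2.0 * np.pi, n_phi, endpoint=False)
    yy = np.rint(cy + rho[:, None] * np.sin(phi[None, :])).astype(int)
    xx = np.rint(cx + rho[:, None] * np.cos(phi[None, :])).astype(int)
    return spec[yy, xx]  # complex values: magnitude and phase are both kept

rng = np.random.default_rng(2)
lp = log_polar_spectrum(rng.standard_normal((64, 64)))
print(lp.shape, lp.dtype)
```

Taking `np.abs` of the result gives the magnitude-only mapping of the classical FM transform, whereas keeping the complex values corresponds to the phase-aware variant discussed above.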

The second proposed modification enables faster computations. It has been shown that a properly selected portion of the PRNU frequency spectrum can be sufficient to achieve good attribution performance (e.g., through subsampling [2]). In this vein, notice that a 2D frequency band becomes a rectangular band when the frequency spectrum is converted to the log-polar domain. We propose to cut the frequency content of $\mathbf{W}$ and $\mathbf{K}$ by cropping the log-polar Fourier transform to $N$ samples along the $\rho$ dimension. The cropping center corresponds to the $\rho$ coordinate of the highest energy peak of the log-polar spectrum, evaluated as a function of $\rho$. Although this step might seem minor, it strongly reduces the number of frequency samples to be correlated, thus lowering the computational cost. We define the modified Fourier-Mellin transform followed by cropping as:

$$\mathrm{MFM}_N(\mathbf{I}) = \mathcal{C}_N\left(\mathcal{L}(\mathcal{F}(\mathbf{I}))\right), \tag{5}$$

where $\mathcal{C}_N(\cdot)$ denotes the cropping of $N$ samples along the $\rho$ dimension.

Figure 1: Scheme of the proposed method for similarity estimation between noise residual $\mathbf{W}$ and reference device fingerprint $\mathbf{K}$. The global optimizer searches for shift candidates, while the proposed MFM transform provides an estimate of scale and rotation for each shift.
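The frequency cropping step can be illustrated with the short sketch below. It assumes that the energy profile is obtained by summing squared magnitudes over the angular axis and that rows index the radial coordinate; both choices, as well as the function name `crop_rho_band`, are assumptions made for illustration.

```python
import numpy as np

def crop_rho_band(log_polar, n_samples):
    """Keep n_samples rows (radial axis) centred on the row with the highest energy.

    Returns the cropped band and its starting row, so that the same band can
    be cut from a second transform.
    """
    if n_samples > log_polar.shape[0]:
        raise ValueError("n_samples exceeds the number of radial samples")
    energy = np.sum(np.abs(log_polar) ** 2, axis=1)
    centre = int(np.argmax(energy))
    # Clamp the window so that it stays inside the array.
    start = min(max(centre - n_samples // 2, 0), log_polar.shape[0] - n_samples)
    return log_polar[start:start + n_samples, :], start

# Toy transform whose energy is concentrated in radial row 12.
toy = np.zeros((20, 8), dtype=complex)
toy[12, :] = 1.0 + 1.0j
band, start = crop_rho_band(toy, 6)
print(band.shape, start)
```

Only the cropped band then enters the phase correlation, which is where the computational saving comes from.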

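By considering the added phase term and the frequency cropping step, the best similarity parameters can be estimated by solving a maximization problem. Formally,

$$(\hat{s}, \hat{\theta}, \hat{t}_x, \hat{t}_y) = \arg\max_{s, \theta, t_x, t_y} \mathcal{P}_{s,\theta}\left(\mathrm{MFM}_N(\mathbf{W}(\mathbf{x} - \mathbf{t})), \; \mathrm{MFM}_N(\mathbf{K}(\mathbf{x}))\right), \tag{6}$$

where $\mathcal{P}_{s,\theta}(\cdot, \cdot)$ represents the phase correlation, evaluated at the log-polar displacement corresponding to scale $s$ and rotation $\theta$, $\mathbf{t} = [t_x, t_y]$ is the shift, and vector $\mathbf{x}$ refers to horizontal and vertical pixel coordinates.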

Notice that, for each shift candidate value, scale and rotation parameters can be very quickly estimated in closed form through phase correlation. Therefore, we only need to optimize over different shift values. However, gradient descent strategies for solving (6) suffer from the non-convex behavior of phase correlation as a function of the shift. Especially in video sequences characterized by outdoor scenarios or user motion, the actual peak value can be hard to find with gradient descent algorithms. The maximization problem as a function of the shift can instead be solved by resorting to global optimization techniques. It is worth noting that the translation between $\mathbf{W}$ and $\mathbf{K}$ can be assumed, with slight approximation, to imply an integer shift in the horizontal and vertical directions, i.e., to represent a whole number of pixels. We propose to exploit a global optimization algorithm known as the genetic algorithm, which allows an efficient estimation of integer parameters [12]. The overall method is summarized in Fig. 1.
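The search over integer shifts can be illustrated with a toy evolutionary loop. This sketch is not the toolbox implementation of [12]: the function `evolve_integer_shift`, its defaults and the smooth stand-in objective are all hypothetical. In the pipeline described above, `score(tx, ty)` would return the phase correlation peak obtained after shifting the noise residual by `(tx, ty)`.

```python
import numpy as np

def evolve_integer_shift(score, bound, pop_size=40, generations=30, seed=0):
    """Toy genetic-style maximisation of score(tx, ty) over integers in [-bound, bound]."""
    rng = np.random.default_rng(seed)
    pop = rng.integers(-bound, bound + 1, size=(pop_size, 2))
    pop[0] = (0, 0)  # always include the zero-shift candidate
    n_elite = max(2, pop_size // 4)
    n_children = pop_size - n_elite
    for _ in range(generations):
        fitness = np.array([score(int(tx), int(ty)) for tx, ty in pop])
        elite = pop[np.argsort(fitness)[::-1][:n_elite]]  # best candidates survive
        # Crossover: horizontal shift from one parent, vertical shift from another.
        pa = elite[rng.integers(0, n_elite, size=n_children)]
        pb = elite[rng.integers(0, n_elite, size=n_children)]
        children = np.stack([pa[:, 0], pb[:, 1]], axis=1)
        # Mutation: small random integer steps, clipped to the search range.
        children = np.clip(children + rng.integers(-2, 3, size=children.shape),
                           -bound, bound)
        pop = np.vstack([elite, children])
    fitness = np.array([score(int(tx), int(ty)) for tx, ty in pop])
    best = pop[int(np.argmax(fitness))]
    return (int(best[0]), int(best[1])), float(fitness.max())

# Stand-in objective with a single maximum at (3, -2); a real phase
# correlation surface is far less regular than this.
def toy_score(tx, ty):
    return -float((tx - 3) ** 2 + (ty + 2) ** 2)

best_shift, best_score = evolve_integer_shift(toy_score, bound=8)
print(best_shift, best_score)
```

Keeping the elite unchanged from one generation to the next guarantees that the best score never decreases, which is a common safeguard when the objective is non-convex.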

Geometric compensation and matching. After estimating the similarity transformation $\hat{\mathbf{S}}$, the last steps consist of: (i) applying $\mathcal{T}_{\hat{\mathbf{S}}}$ to $\mathbf{W}$ in order to realign the PRNU traces left on $\mathbf{W}$ with those of $\mathbf{K}$; (ii) resorting to PCE for source device identification. We compute the PCE as

$$\mathrm{PCE}\left(\mathcal{T}_{\hat{\mathbf{S}}}(\mathbf{W}), \; \mathbf{K}\right). \tag{7}$$

As in standard PRNU attribution tests, by thresholding the PCE it is possible to detect whether the frame under analysis belongs to the tested device. In case multiple frames are available, it is possible to repeat the whole procedure and fuse the results obtained with different frames (e.g., maximum PCE picking, majority voting, etc.).
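The two fusion rules mentioned above can be sketched in a few lines. The function names, the PCE values and the threshold below are made up for illustration and do not come from the experiments.

```python
def fuse_max_pce(pce_values, threshold):
    """Attribute the video if the largest per-frame PCE exceeds the threshold."""
    return max(pce_values) > threshold

def fuse_majority(pce_values, threshold):
    """Attribute the video if more than half of the frames exceed the threshold."""
    votes = sum(p > threshold for p in pce_values)
    return votes > len(pce_values) / 2

# Hypothetical per-frame PCE values for one query video.
frame_pces = [12.0, 95.0, 8.0, 30.0, 4.0]
hypothetical_threshold = 50.0
print(fuse_max_pce(frame_pces, hypothetical_threshold),
      fuse_majority(frame_pces, hypothetical_threshold))
```

Maximum picking needs a single well-aligned frame to trigger an attribution, whereas majority voting requires most frames to agree, so the two rules trade off missed detections and false alarms differently.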

4 Experimental analysis

In this section we report all the details about the performed experimental campaign and the achieved results.

Dataset. Our datasets have been extracted from the Vision dataset, which includes both images and videos from major brand devices [16]. To build the PRNU of each device, we select all the available images taken by the device depicting flat scenes [4]. Then, each fingerprint $\mathbf{K}$ is built by scaling and cropping the PRNU, using the image-to-video warping parameters reported in [11]. Regarding video frames, we select only devices with Full-HD video resolution (i.e., $1920 \times 1080$ pixels). For the sake of clarity, we use the same device nomenclature presented in [16], creating two test datasets: a non-stabilized dataset, selecting non-stabilized devices D03, D11, D17, D21, D24 from different brands, and a stabilized dataset that includes all the available stabilized devices.

Notice that the considered video frames contain both static and motion scenes, labeled as still, panrot, move in [16], and can include almost flat content as well as significant texture, denoted as flat, indoor, outdoor in [16]. In particular, we only make use of the I-frames, as the PRNU traces left on them are likely to be more reliable than those left on inter-predicted frames [17, 5]. Furthermore, in light of past investigations about the first I-frame of stabilized video sequences, we always discard it from the experiments [11, 7].

MFM parameters. To compute the MFM transform, we evaluate the 2D Fourier transform over a fixed number of samples after zero-padding residual and reference fingerprint in the pixel domain, in order not to introduce undesired border effects. Then, we convert both terms into the log-polar domain, following the default parameters provided by [13], ending up with transforms sampled along the $\rho$ and $\phi$ dimensions. We verified that the sampling grid for the $\rho$ and $\phi$ dimensions allows a correct estimation of scaling factor and rotation angle. Finally, we crop the transforms along the $\rho$ dimension according to the chosen number of samples $N$.

The genetic algorithm mimics biological evolution to find a reliable shift estimate. It is configured with a fixed population size of individuals, which iteratively update the cost function up to a maximum number of iterations. The remaining parameters are those defined in [12].

Performance in a controlled scenario. In order to assess the accuracy in attributing video frames to the correct device, we investigate the proposed method in a controlled scenario. Specifically, considering the non-stabilized dataset, we randomly select a fixed number of I-frames per device, taking care to distribute motion and static scenes equally, as well as flat and textured content. In particular, we select only frames that report acceptable PCE values with the device fingerprint (i.e., PCE above the threshold suggested in [17, 11]). Then, we warp each frame by means of a similarity transformation, randomly selecting scale, rotation angle, and horizontal and vertical shifts from realistic ranges [7]. We verified that these ranges include the vast majority of possible similarity transformations between stabilized video frames and reference fingerprint.
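Drawing a random similarity transformation for such a synthetic test can be sketched as follows. The parameter ranges passed in the example are placeholders chosen for illustration only, not the ranges used in the experiments, and the function names are hypothetical; the matrix layout follows the usual homogeneous form of a 2D similarity.

```python
import numpy as np

def similarity_matrix(scale, angle_deg, tx, ty):
    """3x3 homogeneous matrix for scaling, rotation and shift."""
    a = np.deg2rad(angle_deg)
    c, s = scale * np.cos(a), scale * np.sin(a)
    return np.array([[c, -s, tx],
                     [s,  c, ty],
                     [0.0, 0.0, 1.0]])

def sample_similarity(rng, scale_range, angle_range, shift_range):
    """Draw scale and angle uniformly and integer shifts from inclusive ranges."""
    scale = rng.uniform(*scale_range)
    angle = rng.uniform(*angle_range)
    tx, ty = rng.integers(shift_range[0], shift_range[1] + 1, size=2)
    return similarity_matrix(scale, angle, int(tx), int(ty))

rng = np.random.default_rng(3)
# Placeholder ranges: scale in [0.9, 1.1], angle in [-5, 5] degrees,
# shifts in [-10, 10] pixels.
S = sample_similarity(rng, (0.9, 1.1), (-5.0, 5.0), (-10, 10))
print(S)
```

The determinant of the upper-left 2x2 block equals the squared scale, which gives a quick sanity check on any sampled matrix.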

Figure 2: Accuracy on synthetically warped non-stabilized video frames: (a) only scale and rotation are applied; (b) a complete similarity is applied. The proposed method (orange) can be tuned to use a different number of frequency samples ($N$), becoming slower but more accurate as $N$ grows.

We aim at estimating the applied transformation using the proposed strategy, comparing the performance with the method presented in [11]. Specifically, we exploit the same parameter configuration for the particle swarm strategy [12, 9] as reported in [11], which estimates the similarity transformation returning the highest PCE between $\mathbf{W}$ and $\mathbf{K}$. As regards the search bounds of scale and rotation parameters, we suppose these to be known at the investigation side, thus they coincide with the ranges used to generate the warps. Notice that the method of [11] does not need to fix bounds for the shift parameters, as these can be estimated without optimization. Following similar considerations, the proposed strategy restricts the search range for the shift parameters to the one used to generate the warps, while scale and rotation do not require optimization.

Computational time and true positive rate (TPR) evaluated at a fixed PCE threshold are the metrics chosen to compare the two strategies. The method of [11] has a fixed average time for estimating the similarity transformation on each frame, while the time required by the MFM strategy depends on the number of retained frequency samples $N$ and generally grows linearly with it.

Fig. 2 shows results as a function of the scene content of video frames (i.e., flat, indoor and outdoor). Specifically, Fig. 2(a) reports results where only scale-rotation transformations were applied. The shift between the noise residual $\mathbf{W}$ and the fingerprint $\mathbf{K}$ is assumed to be known. Fig. 2(b) reports results where a complete similarity transformation has been applied. It is worth noting that, when the shift is known and only scale and rotation parameters have to be estimated, our proposal can be a viable solution for very fast identification. Since scale and rotation can be estimated without optimization, the computational time reduces to less than one second. The more samples are selected, the better the accuracy of the MFM strategy, which outperforms [11]. Furthermore, in this case there is no need for global optimizers, thus the potential optimization error reduces to zero. In case (b), the MFM strategy shows better or basically equivalent results to [11] for flat and indoor scenarios, while outdoor frames seem to be more challenging for the proposed method.

Performance on stabilized videos. To show the potential of the MFM approach in dealing with source device identification on real videos, we apply the proposed method to the stabilized video sequences. Following previous considerations, we set a fixed search range for the mutual shift in both the horizontal and vertical directions. For clarity's sake, we use the very same accuracy metrics presented in [11], i.e., the area under the curve (AUC) and the true positive rate (TPR) at a fixed false positive rate, both computed on ROC curves and averaged over all devices. Precisely, the latter corresponds to the rate of correct attributions evaluated when the false positive attribution rate is equal to the chosen value.
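The two metrics can be computed from per-query scores as in the sketch below. The scores are invented for illustration, the function names are hypothetical, and the false positive rate passed to `tpr_at_fpr` is a placeholder rather than the operating point used in the experiments.

```python
import numpy as np

def roc_points(pos_scores, neg_scores):
    """ROC curve obtained by sweeping a threshold over all observed scores."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    thresholds = np.unique(np.concatenate([pos, neg]))[::-1]  # descending
    tpr = np.array([0.0] + [np.mean(pos >= t) for t in thresholds])
    fpr = np.array([0.0] + [np.mean(neg >= t) for t in thresholds])
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum((fpr[1:] - fpr[:-1]) * (tpr[1:] + tpr[:-1]) / 2.0))

def tpr_at_fpr(fpr, tpr, target):
    """Highest true positive rate reachable with a false positive rate <= target."""
    return float(np.max(tpr[fpr <= target]))

# Made-up PCE-like scores for matching and non-matching queries.
matching = [10.0, 8.0, 6.0, 3.0]
non_matching = [5.0, 4.0, 2.0, 1.0]
fpr, tpr = roc_points(matching, non_matching)
print(auc(fpr, tpr), tpr_at_fpr(fpr, tpr, 0.0))
```

With these toy scores, 14 of the 16 matching versus non-matching pairs are ranked correctly, which is exactly the area returned by `auc`.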

Figure 3: ROC curves obtained by testing I-frames with the proposed MFM strategy, as a function of the number of frequency samples used, i.e., $N$. Results are compared to those of [11] evaluated using I-frames.

Table 1: AUC and TPR obtained by testing random I-frames per video query, together with the average computational time per frame in seconds, evaluated with the MFM and [11] methods.

We show the attribution results achieved by testing random I-frames per video query and picking the maximum value among the computed PCEs. Specifically, we test different values for the number of frequency samples $N$ and always report the results achieved by [11] over the same dataset. Fig. 3 draws the ROC curves and Table 1 reports the achieved AUC and TPR as a function of $N$. Moreover, the last row of Table 1 reports the average computational time in seconds required to test one query frame with the chosen strategy, considering matching cases as well as non-matching ones. It is worth noticing that the proposed approach can outperform [11], provided that a sufficient number of frequency samples is selected. Furthermore, the MFM strategy also enables fast computations, at the expense of a slightly reduced but still acceptable accuracy.

5 Conclusions

In this paper, we propose an alternative solution for solving the source device identification problem on stabilized videos. Specifically, we re-synchronize video frames and device reference fingerprint by estimating the re-alignment transformation with a modified version of the Fourier-Mellin transform. In doing so, we search the scaling and rotation parameters in the frequency domain, whereas unknown translations can be estimated leveraging global optimization strategies. Moreover, we propose to use a reduced amount of Fourier-Mellin transform samples to estimate the warping configuration, thus enabling fast computations.

The experimental campaign is conducted on a publicly available dataset. Results are promising and show enhanced performance with respect to the state of the art. This is especially true in situations where only scale and rotation parameters have to be estimated: experiments performed in a synthetic set-up reveal that the proposed method can be much faster and more accurate than existing methodologies.


References

  • [1] F. Bertini, R. Sharma, A. Iannì, D. Montesi, and M. A. Zamboni (2016) Social media investigations using shared photos. In Proceedings of the International Conference on Computing Technology, Information Security and Risk Management (CTISRM), pp. 47.
  • [2] L. Bondi, P. Bestagini, F. Perez-Gonzalez, and S. Tubaro (2019) Improving PRNU compression through preprocessing, quantization and coding. IEEE Transactions on Information Forensics and Security (TIFS) 14, pp. 608–620.
  • [3] A. Castiglione, G. Cattaneo, M. Cembalo, and U. F. Petrillo (2013) Experimentations with source camera identification and online social networks. Journal of Ambient Intelligence and Humanized Computing 4 (2), pp. 265–274.
  • [4] M. Chen, J. Fridrich, M. Goljan, and J. Lukáš (2008) Determining image origin and integrity using sensor noise. IEEE Transactions on Information Forensics and Security 3 (1), pp. 74–90.
  • [5] W. Chuang, H. Su, and M. Wu (2011) Exploring compression effects for improved source camera identification using strongly compressed video. In IEEE International Conference on Image Processing (ICIP).
  • [6] M. Goljan and J. Fridrich (2008) Camera identification from cropped and scaled images. In Security, Forensics, Steganography, and Watermarking of Multimedia Contents X, Vol. 6819, pp. 68190E.
  • [7] M. Grundmann, V. Kwatra, and I. Essa (2018) Cascaded camera motion estimation, rolling shutter detection, and camera shake detection for video stabilization. Google Patents. US Patent 9,888,180.
  • [8] M. Iuliani, M. Fontani, D. Shullani, and A. Piva (2019) Hybrid reference-based video source identification. Sensors 19 (3). ISSN 1424-8220.
  • [9] J. Kennedy (2011) Particle swarm optimization. In Encyclopedia of Machine Learning, pp. 760–766.
  • [10] J. Lukáš, J. Fridrich, and M. Goljan (2006) Digital camera identification from sensor pattern noise. IEEE Transactions on Information Forensics and Security 1 (2), pp. 205–214.
  • [11] S. Mandelli, P. Bestagini, L. Verdoliva, and S. Tubaro (2019) Facing device attribution problem for stabilized video sequences. IEEE Transactions on Information Forensics and Security 15, pp. 14–27.
  • [12] Mathworks (2016) Global Optimization Toolbox - MATLAB R2016a. https://www.mathworks.com/products/global-optimization.html
  • [13] Mathworks (2016) Image Processing Toolbox - MATLAB R2016a. https://www.mathworks.com/products/image.html
  • [14] N. Mondaini, R. Caldelli, A. Piva, M. Barni, and V. Cappellini (2007) Detection of malevolent changes in digital video for forensic applications. In Security, Steganography, and Watermarking of Multimedia Contents IX, Vol. 6505, pp. 65050T.
  • [15] B. S. Reddy and B. N. Chatterji (1996) An FFT-based technique for translation, rotation, and scale-invariant image registration. IEEE Transactions on Image Processing 5 (8), pp. 1266–1271.
  • [16] D. Shullani, M. Fontani, M. Iuliani, O. A. Shaya, and A. Piva (2017) VISION: a video and image dataset for source identification. EURASIP Journal on Information Security 2017 (1), pp. 15. ISSN 2510-523X.
  • [17] S. Taspinar, M. Mohanty, and N. Memon (2016) Source camera attribution using stabilized video. In IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–6.
  • [18] Z. Wang, L. Zhang, and H. Huang (2018) High-quality real-time video stabilization using trajectory smoothing and mesh-based warping. IEEE Access 6, pp. 25157–25166.