1 Introduction
Multimedia forensics keeps developing technologies to identify the camera that originated a digital image or video. Currently, the most promising technique is based on the analysis of the sensor pattern noise (SPN), or photo response non-uniformity (PRNU), left by the acquisition device in the visual content. This trace is useful to identify the video source since it is universal (i.e., every camera sensor introduces one) and unique (i.e., the PRNUs of two different sensors are uncorrelated) [10, 14]. Moreover, PRNU has proved to be significantly robust to commonly used processing, like JPEG compression [10] or uploading to social media platforms [3, 1].
The PRNU-based source identification process consists of verifying the match between a query image or video frame and a fingerprint characterizing a reference camera. The strategy involves two main steps: i) a reference fingerprint is derived from still images or videos acquired by the source device; ii) the query fingerprint is estimated from the investigated content and then compared with the reference, by means of a correlation, to verify the possible match. If the query content was acquired by the reference camera, then a high correlation is expected.
The previous scheme works under the hypothesis of perfect geometrical alignment between the reference and test fingerprints. If a geometrical transformation is applied to the query content, a pixel grid misalignment between the query and the reference fingerprint arises, thus hindering the detection. Such a case occurs in multiple scenarios: when an image or a video has been acquired with different resolution settings, or has been cropped and resized during the upload to a social media platform; when a malicious user slightly distorts the content to remove the sensor traces; when a query video is tested against a reference estimated from still images; when a video has been created in the presence of electronic image stabilization. In all these cases, the PRNU extracted from the query is misaligned with the reference fingerprint, and thus a geometric resynchronization between them has to be carried out before the matching operation.
The first solution to this problem was proposed in [6], where the case of cropped and downscaled images was studied. The authors show that it is possible to parameterize the normalized cross-correlation (NCC) between the reference fingerprint and the query noise with respect to the scaling factor. The NCC peak position for a given scaling factor provides an estimate of the shift. While the NCC can be efficiently computed in the Fourier domain, a brute-force search is needed to determine the scaling factor. Following the same rationale, more recent papers [17, 8, 11] extend the proposed methodology by also considering rotation and the more challenging scenario of video analysis. As a matter of fact, modern acquisition pipelines usually include electronic stabilization, which undermines PRNU-based attribution techniques. In these cases, PRNU-based techniques only work if geometric transformations are properly estimated and compensated for, which is a computationally complex operation.
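As a rough illustration of the shift-estimation step recalled above, the following sketch computes a circular normalized cross-correlation through the FFT and reads the shift from the peak position. It is not the implementation of [6]: the function names, the random test pattern and the use of circular (wrap-around) shifts are assumptions made for the example, and the brute-force loop over scaling factors is omitted.

```python
import numpy as np

def circular_ncc(query, reference):
    """Circular normalized cross-correlation of two equally sized 2D arrays.

    Computed in the Fourier domain; entry [r, c] is the correlation obtained
    when the reference is circularly shifted by (r, c).
    """
    q = query - query.mean()
    k = reference - reference.mean()
    cross = np.fft.ifft2(np.fft.fft2(q) * np.conj(np.fft.fft2(k))).real
    return cross / (np.linalg.norm(q) * np.linalg.norm(k))

def peak_position(ncc):
    """Return the (row, column) shift at which the correlation is highest."""
    r, c = np.unravel_index(np.argmax(ncc), ncc.shape)
    return int(r), int(c)

# Toy data (assumption): a random pattern plays the role of the fingerprint,
# and the query is the same pattern circularly shifted by (5, 12) pixels.
rng = np.random.default_rng(0)
reference = rng.standard_normal((64, 64))
query = np.roll(reference, shift=(5, 12), axis=(0, 1))

ncc = circular_ncc(query, reference)
estimated_shift = peak_position(ncc)
```

For each candidate scaling factor, a search in the spirit of [6] would resample the query, repeat this correlation, and keep the factor yielding the highest peak.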
In this paper, we focus on the problem of camera attribution of stabilized video sequences based on PRNU. Specifically, we propose a method to align frame fingerprints with the reference PRNU by recovering the scaling, shift and rotation parameters introduced by electronic stabilization. We overcome the problem of computational complexity by searching for scaling and rotation parameters in the frequency domain, thanks to a modified version of the Fourier-Mellin (FM) transform. Results obtained on the well-known VISION dataset [16] show that the proposed method is extremely efficient whenever rotation and scaling operations are applied to video frames. When shift is also taken into account, the gain over the state of the art [11] depends on the video content.
2 Background and problem statement
In this section we introduce the background on the Fourier-Mellin (FM) transform and define the problem we tackle in this paper.
Fourier-Mellin Transform. The FM transform makes it possible to estimate scale, rotation and shift transformations between two images in closed form [15].
Given an image $I$, the FM transform is expressed as the log-polar mapping of the magnitude of the image Fourier transform, i.e.,
$$\mathcal{F}_M(I) = \mathcal{L}\left(|\mathcal{F}(I)|\right), \tag{1}$$
where $\mathcal{L}(\cdot)$ is the operator computing the log-polar mapping, and $|\mathcal{F}(\cdot)|$ is the magnitude of the Fourier transform.
Let us consider two images $I_1$ and $I_2$ that are linked through a similarity transformation, i.e., $I_2 = T_S(I_1)$, where $T_S(\cdot)$ applies the transformation identified by the matrix
$$S = \begin{bmatrix} s\cos\theta & -s\sin\theta & t_x \\ s\sin\theta & s\cos\theta & t_y \\ 0 & 0 & 1 \end{bmatrix}, \tag{2}$$
where $s$, $\theta$, $t_x$ and $t_y$ represent scaling, rotation and horizontal and vertical shift, respectively. In this scenario, it is possible to show that $\mathcal{F}_M(I_2)$ is a shifted version of $\mathcal{F}_M(I_1)$. More formally,
$$\mathcal{F}_M(I_2)(\rho, \alpha) \propto \mathcal{F}_M(I_1)(\rho + \log s,\ \alpha - \theta), \tag{3}$$
where $\rho$ is the (logarithmic) radial coordinate and $\alpha$ the rotational coordinate. It is therefore possible to estimate scale and rotation by looking at the peak position of the phase correlation function between $\mathcal{F}_M(I_1)$ and $\mathcal{F}_M(I_2)$, independently of the shift [15]. Once $s$ and $\theta$ are estimated, the two images can be realigned apart from translation. The relative shift can then be estimated by looking at the peak position of the phase correlation computed between the two realigned images in the pixel domain [15].
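A minimal sketch of the classical Fourier-Mellin estimate described above is given here, using NumPy only. The helper names, the bilinear log-polar sampler, the grid sizes and the toy test (an exact 90-degree rotation of a random square image) are all assumptions made for illustration; the sign convention of the returned scale and angle depends on how the warp is defined, and the angle is recovered modulo 180 degrees because the Fourier magnitude of a real image is symmetric.

```python
import numpy as np

def bilinear_sample(img, rows, cols):
    """Sample img at fractional (rows, cols) positions with bilinear weights."""
    r_floor, c_floor = np.floor(rows), np.floor(cols)
    fr, fc = rows - r_floor, cols - c_floor
    r0 = np.clip(r_floor.astype(int), 0, img.shape[0] - 1)
    c0 = np.clip(c_floor.astype(int), 0, img.shape[1] - 1)
    r1 = np.clip(r0 + 1, 0, img.shape[0] - 1)
    c1 = np.clip(c0 + 1, 0, img.shape[1] - 1)
    top = img[r0, c0] * (1 - fc) + img[r0, c1] * fc
    bottom = img[r1, c0] * (1 - fc) + img[r1, c1] * fc
    return top * (1 - fr) + bottom * fr

def fm_transform(img, n_rho=64, n_alpha=180):
    """Log-polar map of the centred Fourier magnitude of a square image.

    Angles cover [0, pi) only, since the magnitude is point-symmetric.
    Returns the transform and the logarithmic step between radial samples.
    """
    magnitude = np.abs(np.fft.fftshift(np.fft.fft2(img)))
    centre = img.shape[0] // 2
    r_max = centre - 2
    radii = np.exp(np.linspace(0.0, np.log(r_max), n_rho))
    alpha = np.pi * np.arange(n_alpha) / n_alpha
    rows = centre + radii[:, None] * np.sin(alpha)[None, :]
    cols = centre + radii[:, None] * np.cos(alpha)[None, :]
    return bilinear_sample(magnitude, rows, cols), np.log(r_max) / (n_rho - 1)

def phase_correlation(a, b):
    """Phase correlation surface between two equally sized 2D arrays."""
    cross = np.fft.fft2(a) * np.conj(np.fft.fft2(b))
    return np.fft.ifft2(cross / (np.abs(cross) + 1e-12)).real

def estimate_scale_rotation(img_a, img_b, n_rho=64, n_alpha=180):
    """Estimate scale and rotation (degrees, modulo 180) linking two images."""
    fm_a, log_step = fm_transform(img_a, n_rho, n_alpha)
    fm_b, _ = fm_transform(img_b, n_rho, n_alpha)
    surface = phase_correlation(fm_a, fm_b)
    d_rho, d_alpha = np.unravel_index(np.argmax(surface), surface.shape)
    if d_rho > n_rho // 2:          # interpret large shifts as negative ones
        d_rho -= n_rho
    return float(np.exp(d_rho * log_step)), float(d_alpha * 180.0 / n_alpha)

# Toy check (assumption): a random image rotated by exactly 90 degrees.
rng = np.random.default_rng(1)
image = rng.standard_normal((128, 128))
scale, angle_deg = estimate_scale_rotation(np.rot90(image), image)
```

Because only the Fourier magnitude is used, circularly shifting either input leaves the estimate unchanged, which is the shift independence noted in the text.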
Problem formulation. PRNU is typically modeled as a multiplicative noise pattern introduced by any device in all acquired images or videos [10, 4]. In the field of forensic analysis, it is well known that PRNU can be exploited to infer whether an image was shot by a certain device. For instance, given a test image and a device PRNU, we can compute the peak-to-correlation energy (PCE) between the noise residual extracted from the image and the PRNU pixel-wise scaled by the image itself. Indeed, PCE measures the correlation between the noise traces left on the image and the device PRNU independently of potential shift misalignment, as the correlation peak is searched over all possible mutual shifts between them. If the PCE is greater than a confidence threshold, we attribute the image to the device [10, 4].
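The PCE test described above can be sketched as follows. This is an illustrative version under stated assumptions, not the reference implementation of [10, 4]: the function name, the size of the neighbourhood excluded around the peak (`exclude`), the circular correlation and the synthetic data are all choices made for the example.

```python
import numpy as np

def pce(residual, scaled_fingerprint, exclude=5):
    """Peak-to-correlation energy between a noise residual and a fingerprint.

    The peak is searched over all circular shifts; the energy is the mean
    squared correlation outside a small neighbourhood of the peak.
    Returns the PCE value and the (row, column) position of the peak.
    """
    w = residual - residual.mean()
    k = scaled_fingerprint - scaled_fingerprint.mean()
    cc = np.fft.ifft2(np.fft.fft2(w) * np.conj(np.fft.fft2(k))).real
    peak_r, peak_c = np.unravel_index(np.argmax(cc), cc.shape)
    mask = np.ones(cc.shape, dtype=bool)
    rows = np.arange(peak_r - exclude, peak_r + exclude + 1) % cc.shape[0]
    cols = np.arange(peak_c - exclude, peak_c + exclude + 1) % cc.shape[1]
    mask[np.ix_(rows, cols)] = False
    energy = np.mean(cc[mask] ** 2)
    return float(cc[peak_r, peak_c] ** 2 / energy), (int(peak_r), int(peak_c))

# Synthetic data (assumption): multiplicative model residual ~ image * PRNU + noise.
rng = np.random.default_rng(2)
fingerprint = rng.standard_normal((128, 128))
image = rng.uniform(0.5, 1.0, size=(128, 128))
residual = 0.3 * image * fingerprint + rng.standard_normal((128, 128))

pce_match, peak_match = pce(np.roll(residual, (3, 4), axis=(0, 1)), image * fingerprint)
pce_other, _ = pce(rng.standard_normal((128, 128)), image * fingerprint)
```

In this toy setting the matching residual yields a PCE far above that of an unrelated residual, and the peak position returns the mutual shift, which is why the test is insensitive to pure translation.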
The extension of PRNU-based strategies to the attribution of video frames to a specific device suffers from some issues due, for instance, to higher compression rates and lower pixel resolutions. As a matter of fact, the previously described PCE test cannot be directly performed, since the PRNU resolution is typically higher than the size of the recorded video frames. Moreover, in-camera video stabilization techniques, which are now becoming one of the must-have device specifications, strongly hinder the traces left by PRNU, as video frames may be warped by means of geometrical transformations (e.g., cropping, rotation, scaling, etc.) in order to generate a stable video sequence [11, 7]. As a consequence, the attribution of a video frame to a specific device can be a much more challenging task than common image-camera attribution.
In this paper, we exploit PRNU-based traces to investigate the problem of device attribution when testing in-camera stabilized video frames. Specifically, given a device fingerprint and a frame coming from a stabilized video sequence, we aim at exploiting the PRNU traces left on the frame in order to detect whether it has been recorded by the analyzed device. To do so, we assume that geometric transformations can be approximated by similarities [8, 11] and we propose a geometrical realignment strategy based on a modified version of the FM transform applied to both the device fingerprint and the frame noise residual. Specifically, the proposed modified Fourier-Mellin (MFM) transform enables comparing a device fingerprint and a noise residual independently of scaling and rotation operations. The next section provides all the details of the proposed method.
3 Proposed method
In order to attribute a video frame to a device with a known reference fingerprint, we follow a pipeline based on a few steps: (i) noise extraction; (ii) geometric transformation estimation; (iii) geometric compensation and matching. In the following, we illustrate all the steps of the pipeline.
Noise extraction. As in the common PRNU-based attribution algorithm, we extract the noise residual from the frame. This is done using the strategy proposed in [10, 4]: (i) the noise is extracted through wavelet-based denoising; (ii) a series of post-processing steps (e.g., zero-averaging rows and columns, Wiener filtering, etc.) are applied to further enhance the noise residual.
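Of the post-processing steps listed above, the zero-averaging of rows and columns is simple enough to sketch in a few lines. The function name and the order of operations (columns first, then rows) are assumptions made for this example; the wavelet-based denoising and the Wiener filtering of [10, 4] are not reproduced here.

```python
import numpy as np

def zero_mean_rows_and_columns(residual):
    """Remove the mean of every column, then of every row, of a noise residual.

    This suppresses row-wise and column-wise constant artifacts that are
    shared by many devices and would otherwise bias the correlation.
    """
    out = residual - residual.mean(axis=0, keepdims=True)   # column means
    out = out - out.mean(axis=1, keepdims=True)             # row means
    return out

# Toy residual (assumption): random noise plus an artificial row-wise offset.
rng = np.random.default_rng(3)
raw = rng.standard_normal((32, 48)) + np.linspace(0.0, 1.0, 32)[:, None]
cleaned = zero_mean_rows_and_columns(raw)
```

After the two subtractions both the row means and the column means are zero, since removing the row means of an array whose column means already vanish does not reintroduce a column offset.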
Geometric transformation estimation. In order to match the noise residual and the fingerprint, we first need to search for the geometrical transformation that might link them. In principle, assuming that the warping of video frames can be approximated by a similarity transformation [8, 11], aligning a noise residual and a reference device fingerprint by means of the Fourier-Mellin transform may seem straightforward. In practice, differently from the Fourier-Mellin theory presented in Section 2, the two terms to compare (i.e., the noise residual and the fingerprint) are not exactly similarity-transformed versions of one another. First, the geometric transformation introduced by stabilization is not necessarily a similarity, but can include perspective distortions (on the entire frame or a localized portion of it) as well [7, 18]. Second, the noise residuals of video frames may contain scene content and noise contributions which are not present in the reference device fingerprint.
The primary consequence of this dissimilarity is that relying only on the Fourier magnitudes to estimate scale and rotation between the two terms, as reported in (3), may not be precise. Indeed, we verified that the phase correlation between the FM transforms of the noise residual and of the fingerprint does not show a pronounced peak, which strongly hinders the estimation of scale, rotation and shift. In order to overcome this issue, we modify the Fourier-Mellin pipeline in two ways.
First, we propose to embed the phase term of the Fourier transform, in addition to the magnitude, in the Fourier-Mellin pipeline. The modified Fourier-Mellin (MFM) transform of an image $I$ can thus be defined as:
$$\mathcal{F}_{MM}(I) = \mathcal{L}\left(\mathcal{F}(I)\right), \tag{4}$$
where $\mathcal{L}(\mathcal{F}(I))$ is the log-polar mapping of the image Fourier transform (including magnitude and phase). On the one hand, phase adds more information, which is very useful for angle and scale estimation. On the other hand, this operation comes at a cost. The natural drawback of this approach is that we can no longer isolate the estimation of scale and rotation from the estimation of the shift. Indeed, in this case, phase correlation does not exclusively depend on scale and rotation transformations, but also on the translation between the two terms. The modified Fourier-Mellin pipeline works only if the noise residual and the reference fingerprint are almost perfectly aligned in terms of translation, i.e., if their mutual shift is basically zero pixels in both horizontal and vertical directions. In other words, when including the Fourier phase term, we first have to correctly realign the PRNU traces left on the noise residual with those on the reference fingerprint as far as the relative shift is concerned; then we can convert the Fourier transforms into the log-polar domain and estimate the remaining parameters.
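The coupling between shift and phase noted above follows from the Fourier shift theorem. As a short worked reminder (the symbols $x(\mathbf{n})$, $\mathbf{t}$ and $\boldsymbol{\omega}$ are generic notation introduced here for illustration only), translating a signal by $\mathbf{t}$ leaves the Fourier magnitude untouched but adds a linear term to the phase:

```latex
\mathcal{F}\{x(\mathbf{n} - \mathbf{t})\}(\boldsymbol{\omega})
  = e^{-j\,\boldsymbol{\omega}^{\top}\mathbf{t}}\,
    \mathcal{F}\{x(\mathbf{n})\}(\boldsymbol{\omega})
\quad\Longrightarrow\quad
\begin{cases}
\bigl|\mathcal{F}\{x(\mathbf{n} - \mathbf{t})\}(\boldsymbol{\omega})\bigr|
  = \bigl|\mathcal{F}\{x(\mathbf{n})\}(\boldsymbol{\omega})\bigr|, \\[4pt]
\angle\,\mathcal{F}\{x(\mathbf{n} - \mathbf{t})\}(\boldsymbol{\omega})
  = \angle\,\mathcal{F}\{x(\mathbf{n})\}(\boldsymbol{\omega})
    - \boldsymbol{\omega}^{\top}\mathbf{t} \pmod{2\pi}.
\end{cases}
```

A magnitude-only transform is therefore blind to $\mathbf{t}$, whereas a transform that keeps the phase inherits the term $\boldsymbol{\omega}^{\top}\mathbf{t}$, which is not a simple shift in the log-polar domain; this is why the translation has to be compensated before scale and rotation are estimated.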
The second proposed modification enables faster computations. It has been shown that a properly selected portion of the PRNU frequency spectrum can be sufficient to achieve good attribution performance (e.g., through subsampling [2]). In this vein, notice that a 2D frequency band becomes a rectangular band if the frequency spectrum is converted into the log-polar domain. We propose to literally cut the frequency content of the noise residual and of the fingerprint by cropping the log-polar Fourier transform to $N$ samples along the radial dimension $\rho$. The cropping center corresponds to the $\rho$ coordinate of the highest energy peak of the log-polar Fourier transform, evaluated as a function of $\rho$. Although this step might seem minor, it strongly reduces the number of frequency samples to be correlated, thus lowering the computational cost. We define the modified Fourier-Mellin transform followed by cropping as:
$$\mathcal{F}_{MM}^{c}(I) = \mathcal{C}_N\left(\mathcal{L}\left(\mathcal{F}(I)\right)\right), \tag{5}$$
where $\mathcal{C}_N(\cdot)$ denotes the cropping to $N$ samples along $\rho$.
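The cropping step can be sketched as below. The function name, the convention that rows index the radial coordinate, and the choice of locating the energy peak on the reference transform and reusing the same band for the query are assumptions made for this example.

```python
import numpy as np

def crop_radial_band(reference_lp, query_lp, n_samples):
    """Crop two log-polar spectra to n_samples rows around the energy peak.

    Rows are assumed to index the radial coordinate and columns the angle.
    The band is centred on the row of the reference with the highest energy
    and clamped so that it stays inside the array.
    """
    energy = np.sum(np.abs(reference_lp) ** 2, axis=1)      # energy per radius
    centre = int(np.argmax(energy))
    start = min(max(centre - n_samples // 2, 0), reference_lp.shape[0] - n_samples)
    band = slice(start, start + n_samples)
    return reference_lp[band], query_lp[band], start

# Toy complex spectra (assumption): 64 radial x 90 angular samples, with the
# energy of the reference concentrated around radial index 40.
rng = np.random.default_rng(4)
shape = (64, 90)
reference_lp = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
query_lp = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
reference_lp[40] *= 10.0

ref_band, query_band, band_start = crop_radial_band(reference_lp, query_lp, 16)
```

Keeping 16 of 64 radial samples in this toy case reduces the number of values entering the subsequent phase correlation by a factor of four.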
By considering the added phase term and the frequency cropping step, the best similarity parameters can be estimated by solving a maximization problem. Formally,
$$(\hat{s}, \hat{\theta}, \hat{\mathbf{t}}) = \arg\max_{s,\,\theta,\,\mathbf{t}}\ \mathcal{P}\left(\mathcal{F}_{MM}^{c}(W_{\mathbf{t}}),\ \mathcal{F}_{MM}^{c}(K)\right)(\log s, \theta), \tag{6}$$
where $\mathcal{F}_{MM}^{c}(\cdot)$ is the cropped modified Fourier-Mellin transform of (5), $W_{\mathbf{t}}$ is the noise residual $W$ shifted by $\mathbf{t}$, $K$ is the reference fingerprint, $\mathcal{P}(\cdot,\cdot)$ represents the phase correlation, and vector $\mathbf{t} = [t_x, t_y]$ refers to horizontal and vertical pixel coordinates.
Notice that, for each shift candidate value, scale and rotation parameters can be very quickly estimated in closed form through phase correlation. Therefore, we only need to optimize over different shift values. However, gradient descent strategies to solve (6) suffer from the non-convex behavior of phase correlation as a function of the shift. Especially in video sequences characterized by outdoor scenarios or user motion, the actual peak value can be hard to find with gradient descent algorithms. The maximization problem as a function of the shift can be solved by resorting to global optimization techniques. It is worth noting that the translation between the noise residual and the fingerprint can be assumed, with slight approximation, to imply an integer shift in horizontal and vertical directions, i.e., to represent a certain number of pixels. We propose to exploit a global optimization algorithm known as genetic algorithm, which allows an efficient estimation of integer parameters [12]. Our method is summarized in Fig. 1.
Geometric compensation and matching. After estimating the similarity transformation $T_{\hat{S}}$, the last steps consist in: (i) applying $T_{\hat{S}}$ to the noise residual $W$ in order to realign the PRNU traces left on it with those of the fingerprint $K$; (ii) resorting to the PCE as the strategy for a correct source device identification. We compute the PCE value $P$ as
$$P = \mathrm{PCE}\left(T_{\hat{S}}(W),\ K\right). \tag{7}$$
As in standard PRNU attribution tests, by thresholding $P$ it is possible to detect whether the frame under analysis belongs to the tested device. In case multiple frames are available, it is possible to repeat the whole procedure and fuse the results obtained with different frames (e.g., maximum PCE picking, majority voting, etc.).
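The two fusion rules mentioned above can be sketched as follows. The function names, the per-frame PCE values and the threshold are hypothetical and serve only to illustrate how the rules can disagree.

```python
def fuse_by_maximum(pce_values, threshold):
    """Attribute the video if the highest per-frame PCE exceeds the threshold."""
    return max(pce_values) > threshold

def fuse_by_majority(pce_values, threshold):
    """Attribute the video if more than half of the frames exceed the threshold."""
    votes = sum(value > threshold for value in pce_values)
    return votes > len(pce_values) / 2

# Hypothetical per-frame PCE values and threshold (not taken from the paper).
frame_pces = [12.0, 80.0, 30.0]
THRESHOLD = 60.0

decision_max = fuse_by_maximum(frame_pces, THRESHOLD)
decision_vote = fuse_by_majority(frame_pces, THRESHOLD)
```

With these made-up numbers a single well-aligned frame is enough for the maximum rule, while majority voting rejects the attribution; the maximum rule is the more permissive of the two when realignment succeeds on only a few frames.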
4 Experimental analysis
In this section we report all the details about the performed experimental campaign and the achieved results.
Dataset. Our datasets have been extracted from the VISION dataset, which includes both images and videos from major brand devices [16]. For building the PRNU related to each device, we select all the available images taken by the device depicting flat scenes [4]. Then, each fingerprint is built by scaling and cropping the PRNU, using the image-to-video warping parameters reported in [11]. Regarding video frames, we select only devices with Full HD video resolution (i.e., $1920 \times 1080$ pixels). For the sake of clarity, we make use of the same device nomenclature presented in [16], creating two test datasets: a non-stabilized dataset, selecting non-stabilized devices D03, D11, D17, D21, D24 from different brands, and a stabilized dataset that includes all the available stabilized devices.
Notice that the considered video frames contain both static and motion scenes, denoted as still, panrot, move in [16], and can include almost flat content as well as significant texture, denoted as flat, indoor, outdoor in [16]. In particular, we only make use of the I-frames, as the PRNU traces left on them are likely to be more reliable than those left on inter-predicted frames [17, 5]. Furthermore, in light of past investigations about the first I-frame of stabilized video sequences, we always discard it from the experiments [11, 7].
MFM parameters. To compute the MFM transform, we evaluate the 2D Fourier transform over samples after zero-padding the residual and the reference fingerprint in the pixel domain, in order not to introduce undesired border effects. Then, we convert both terms into the log-polar domain, following the default parameters provided by [13], ending up with transforms having samples and samples. We verified that the sampling grid for the radial and angular dimensions allows a correct estimation of scaling factor and rotation angle. Finally, we crop the transforms along the radial dimension according to the chosen number of samples.
The exploited genetic algorithm mimics biological evolution to find a reliable shift estimation. Precisely, it has the following parameter configuration: a population size of individuals, which iteratively update the cost function for a maximum of iterations. Remaining parameters are those defined in [12].
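To make the role of the genetic search more concrete, the following is a generic, minimal genetic algorithm over integer shift pairs. It is not the MATLAB routine of [12] used in the experiments: the function name, the selection, crossover and mutation rules, the population size, the number of iterations and the toy objective are all assumptions. In the actual pipeline, the score of a candidate shift would be the phase correlation peak obtained after shifting the noise residual.

```python
import numpy as np

def genetic_integer_search(score, low, high, pop_size=40, n_iter=50, seed=0):
    """Maximize score(tx, ty) over integer shifts in [low, high] x [low, high].

    Returns the best shift found and the best score at each iteration.
    The best individual is always carried over (elitism), so the recorded
    scores never decrease.
    """
    rng = np.random.default_rng(seed)
    pop = rng.integers(low, high + 1, size=(pop_size, 2))
    history = []
    for _ in range(n_iter):
        fitness = np.array([score(int(tx), int(ty)) for tx, ty in pop])
        pop = pop[np.argsort(fitness)[::-1]]                # best first
        history.append(float(fitness.max()))
        parents = pop[: pop_size // 2]                      # truncation selection
        idx_a = rng.integers(0, len(parents), size=pop_size - 1)
        idx_b = rng.integers(0, len(parents), size=pop_size - 1)
        # Crossover: horizontal shift from one parent, vertical from another.
        children = np.stack([parents[idx_a, 0], parents[idx_b, 1]], axis=1)
        # Mutation: perturb some coordinates by a small integer step.
        mutate = rng.random(children.shape) < 0.2
        children = children + mutate * rng.integers(-2, 3, size=children.shape)
        children = np.clip(children, low, high)
        pop = np.vstack([pop[:1], children])                # keep the elite
    fitness = np.array([score(int(tx), int(ty)) for tx, ty in pop])
    best = pop[int(np.argmax(fitness))]
    return (int(best[0]), int(best[1])), history

# Toy objective (assumption): a single smooth peak at shift (3, -7).
def toy_score(tx, ty):
    return -float((tx - 3) ** 2 + (ty + 7) ** 2)

best_shift, best_history = genetic_integer_search(toy_score, low=-20, high=20)
```

Since each candidate requires only a shift and a closed-form scale and rotation estimate, the cost of such a search is roughly the population size times the number of iterations times the cost of one phase correlation.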
Performance in a controlled scenario. In order to assess the accuracy in attributing video frames to the correct device, we investigate the proposed method in a controlled scenario. Specifically, considering the non-stabilized dataset, we randomly select I-frames per device, taking care to equally distribute motion and static scenes, as well as flat and textured content. We end up with a total amount of video frames. In particular, we select only frames which report acceptable PCE values with the device fingerprint (i.e., PCE , as suggested in [17, 11]). Then, we warp each frame by means of a similarity transformation, randomly selecting the parameters from some realistic ranges [7], namely , , , related to scale, rotation angle, horizontal and vertical shifts, respectively. We verified that these ranges include the vast majority of possible similarity transformations between stabilized video frames and the reference fingerprint.
We aim at estimating the applied transformation using the proposed strategy, comparing the performance with the method presented in [11]. Specifically, we exploit the same parameter configuration for the particle swarm strategy [12, 9] as reported in [11], which makes it possible to estimate the similarity transformation returning the highest PCE between the noise residual and the reference fingerprint. As regards the search bounds of the scale and rotation parameters, we suppose these to be known at the investigation side, thus they coincide with and . Notice that the method in [11] does not need to fix bounds for the shift parameters, as these can be estimated without the need for optimization. Following similar considerations, the proposed strategy fixes the search range for the shift parameters exactly to , while scale and rotation do not require optimization.
The computational time and the true positive rate, evaluated for a PCE threshold of (i.e., ), are the metrics chosen to compare the two strategies. The average time for estimating the similarity transformation on each frame with the method in [11] is , while the time required by the MFM strategy depends on the number of retained frequency samples (e.g., using requires only on average). Generally, the required time grows linearly with the number of retained samples.
Fig. 2 shows results as a function of the scene content of video frames (i.e., flat, indoor and outdoor). Specifically, Fig. 2(a) reports results where only scale-rotation transformations were applied, and the shift between the noise residuals and the reference fingerprint is assumed to be known. Fig. 2(b) reports results where a complete similarity transformation has been applied. It is worth noting that, when the shift parameter is known and only scale and rotation parameters have to be estimated, our proposal can be a viable solution for very fast identification. Since scale and rotation can be estimated without the need for optimization, the computational time reduces to less than one second. The more samples are selected, the better the accuracy of the MFM strategy, which outperforms [11]. Furthermore, in this case there is no need for global optimizers, thus the potential optimization error reduces to zero. In case (b), the MFM strategy shows better or basically equivalent results to [11] for flat and indoor scenarios, while outdoor frames seem to be more challenging for the proposed method.
Performance on stabilized videos. In order to show the potential of the MFM approach in dealing with the source device identification problem on real videos, we apply the proposed method to the stabilized video sequences. Following previous considerations, we set as search range for the mutual shift both in horizontal and vertical directions. For clarity's sake, we use the very same accuracy metrics presented in [11], i.e., the area under the curve (AUC) and the true positive rate (TPR) of ROC curves, averaged over all devices. Precisely, the TPR corresponds to the rate of correct attributions evaluated when the false positive attribution rate is equal to .
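The two metrics can be computed from lists of matching and non-matching PCE values as sketched below. The function names, the toy scores and the target false positive rate of 0.05 are assumptions made for illustration and are not the values used in the experiments.

```python
import numpy as np

def roc_points(match_scores, other_scores):
    """Return (fpr, tpr) arrays obtained by sweeping a threshold over all scores."""
    match_scores = np.asarray(match_scores, dtype=float)
    other_scores = np.asarray(other_scores, dtype=float)
    thresholds = np.sort(np.concatenate([match_scores, other_scores]))[::-1]
    tpr = np.array([(match_scores >= t).mean() for t in thresholds])
    fpr = np.array([(other_scores >= t).mean() for t in thresholds])
    # Prepend the (0, 0) point reached with a threshold above every score.
    return np.concatenate([[0.0], fpr]), np.concatenate([[0.0], tpr])

def area_under_curve(fpr, tpr):
    """Trapezoidal area under the ROC curve."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

def tpr_at_fpr(fpr, tpr, target_fpr):
    """Highest true positive rate reached without exceeding target_fpr."""
    return float(tpr[fpr <= target_fpr].max())

# Toy PCE values (assumption): perfectly separated matching and non-matching cases.
matching = [90.0, 120.0, 75.0, 200.0]
non_matching = [5.0, 12.0, 30.0, 8.0]

fpr, tpr = roc_points(matching, non_matching)
auc_value = area_under_curve(fpr, tpr)
tpr_value = tpr_at_fpr(fpr, tpr, target_fpr=0.05)
```

In a per-device evaluation such as the one described here, these quantities would be computed for each device and then averaged.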
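Table 1: AUC, TPR and average computational time [s] per query frame for the method in [11] and for the proposed method, as a function of the number of selected frequency samples.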
We show the attribution results achieved by testing random I-frames per video query and picking the maximum value among the computed PCEs. Specifically, we test different values for the number of used frequency samples (i.e., ) and always report results achieved by [11] over the same dataset. Fig. 3 draws the ROC curves and Table 1 reports the achieved AUC and TPR as a function of the number of samples. Moreover, the last row of Table 1 reports the average computational time (in seconds) required for testing one query frame according to the chosen strategy, considering matching cases as well as non-matching ones. It is worth noticing that the proposed approach can outperform [11], provided that a sufficient number of frequency samples is selected. Furthermore, the MFM strategy enables fast computations as well, at the expense of a slightly reduced, but still acceptable, accuracy.
5 Conclusions
In this paper, we propose an alternative solution to the source device identification problem on stabilized videos. Specifically, we resynchronize video frames and the device reference fingerprint by estimating the realignment transformation with a modified version of the Fourier-Mellin transform. In doing so, we search for the scaling and rotation parameters in the frequency domain, whereas unknown translations are estimated by leveraging global optimization strategies. Moreover, we propose to use a reduced number of Fourier-Mellin transform samples to estimate the warping configuration, thus enabling fast computations.
The experimental campaign is conducted on a publicly available dataset. Results are promising and show enhanced performance with respect to the state of the art. This is especially true in situations where only scale and rotation parameters have to be estimated: experiments performed in a synthetic setup reveal that the proposed method can be much faster and more accurate than existing methodologies.
References
 [1] (2016) Social media investigations using shared photos. In Proceedings of the International Conference on Computing Technology, Information Security and Risk Management (CTISRM), pp. 47.
 [2] (2019) Improving PRNU compression through preprocessing, quantization and coding. IEEE Transactions on Information Forensics and Security (TIFS) 14, pp. 608–620.
 [3] (2013) Experimentations with source camera identification and online social networks. Journal of Ambient Intelligence and Humanized Computing 4 (2), pp. 265–274.
 [4] (2008) Determining image origin and integrity using sensor noise. IEEE Transactions on Information Forensics and Security 3 (1), pp. 74–90.
 [5] (2011) Exploring compression effects for improved source camera identification using strongly compressed video. In IEEE International Conference on Image Processing (ICIP).
 [6] (2008) Camera identification from cropped and scaled images. In Security, Forensics, Steganography, and Watermarking of Multimedia Contents X, Vol. 6819, pp. 68190E.
 [7] (2018, February 6) Cascaded camera motion estimation, rolling shutter detection, and camera shake detection for video stabilization. Google Patents. US Patent 9,888,180.
 [8] (2019) Hybrid reference-based video source identification. Sensors 19 (3). ISSN 1424-8220.
 [9] (2011) Particle swarm optimization. In Encyclopedia of Machine Learning, pp. 760–766.
 [10] (2006) Digital camera identification from sensor pattern noise. IEEE Transactions on Information Forensics and Security 1 (2), pp. 205–214.
 [11] (2019) Facing device attribution problem for stabilized video sequences. IEEE Transactions on Information Forensics and Security 15, pp. 14–27.
 [12] (2016) Global Optimization Toolbox, MATLAB R2016a. https://www.mathworks.com/products/globaloptimization.html
 [13] (2016) Image Processing Toolbox, MATLAB R2016a. https://www.mathworks.com/products/image.html
 [14] (2007) Detection of malevolent changes in digital video for forensic applications. In Security, Steganography, and Watermarking of Multimedia Contents IX, Vol. 6505, pp. 65050T.
 [15] (1996) An FFT-based technique for translation, rotation, and scale-invariant image registration. IEEE Transactions on Image Processing 5 (8), pp. 1266–1271.
 [16] (2017-10-03) VISION: a video and image dataset for source identification. EURASIP Journal on Information Security 2017 (1), pp. 15. ISSN 2510-523X.
 [17] (2016) Source camera attribution using stabilized video. In IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–6.
 [18] (2018) High-quality real-time video stabilization using trajectory smoothing and mesh-based warping. IEEE Access 6, pp. 25157–25166.