Online social networks are becoming increasingly popular. According to a related report, the number of registered accounts on online social networks has exceeded 3 billion and continues to grow. However, accounts do not correspond one-to-one to people: on average, an individual holds about 6 social media accounts. The main reason is that people usually register multiple accounts on several platforms for different social needs. For example, people often communicate with colleagues on LinkedIn, keep in touch with friends on Facebook, and follow celebrities on Twitter. Therefore, analyzing social behavior from a single account gives only a partial characterization of a user. To construct a relatively complete user profile, we need to identify all the accounts belonging to the same individual.
In recent years, user identification has drawn increasing attention, since it benefits many practical applications [2, 3, 4, 5, 6]. First, we can build a more complete user profile by fully exploiting behavior patterns aggregated from multiple accounts, which can be used to provide personalized network services such as customized content delivery or friend recommendation. In addition, a comprehensive view of users makes it easier to determine a user’s real-world identity, which is very helpful in fighting Internet crime, terrorism, Internet pornography and other social problems [7, 8, 9, 10].
The core problem of user identification is how to correctly estimate the similarity between two different accounts by employing discriminative user patterns. However, the problem is very challenging, since it is difficult to discover a universal pattern for an individual with diverse accounts. Traditionally, user patterns are extracted from either the user’s public profile (e.g., username, geographic location and e-mail address) or activity characteristics (e.g., linguistic and writing styles) [12, 13]. In fact, these patterns are not always reliable enough to establish a user’s real identity, since all of them can be easily faked. Therefore, discovering a universal and reliable user pattern is the fundamental step of user identification.
In this paper, we aim to re-inspect the user identification problem from a totally different perspective, i.e., identifying users by matching their photography devices. The basic idea rests on a common observation: most of the photos published in different accounts by the same person are captured by a few frequently used photography devices. That is, different accounts belonging to the same person carry the same implicit fingerprint information of these cameras. A camera fingerprint is a discriminative and reliable feature extracted from images, which is traditionally used for digital camera identification. Unlike explicitly published information, a camera fingerprint depends mainly on the light sensitivity of the camera sensor, which is unique to each camera and difficult to forge. The basic idea is illustrated in Fig. 1, where accounts 1 and 4 are asserted to belong to individual A since all images in these accounts come from the same camera source.
Intuitively, we can extract a camera fingerprint for each account individually and identify users by directly estimating the similarity of their camera fingerprints. However, online social networks are far more complex than they appear, and the relationships among individuals, accounts and cameras are also complex and uncertain. To make our exposition clearer, we first illustrate some difficulties that arise when identifying users from this new perspective.
Multiple camera problem. As shown in Fig. (a), a person commonly owns more than one camera. That is, the images in an account may come from more than one camera source. If we estimate only a single camera fingerprint for the account, both its discriminative capability and its reliability will inevitably decrease. Therefore, it is vital to distinguish the camera sources of images in the same account in an explicit or implicit manner.
Reposted image problem. As shown in Fig. (b), two different users may repost some popular images from the same camera source (called the reposted camera). If the reposted images are not removed before estimating camera fingerprints, the two users may not be clearly distinguished because their user patterns are confused. Therefore, it is necessary to eliminate the effect of reposted images.
Single camera sharing problem. As shown in Fig. (c), two different individuals share the same camera source. In this case, it is extremely difficult to distinguish the two users by using camera fingerprints alone.
Multiple camera sharing problem. As shown in Fig. (d), multiple individuals share multiple camera sources. In this case, the relationship between users and cameras is more complex, and new strategies need to be considered.
However, it is very challenging to address all these problems in a single scheme. Therefore, we focus mainly on the first two problems in this paper and propose a camera fingerprint based user identification framework, called User Camera Identification (UCI), to address them. To the best of our knowledge, this is the first work to re-inspect the social network reconciliation problem from the perspective of camera fingerprints. The main contributions can be summarized as follows:
A new perspective is proposed to tackle the user identification problem. Unlike previous methods, the proposed approach exploits camera fingerprints to establish a user’s identity, which is more reliable and difficult to forge. In addition, the new perspective makes it possible to meet the requirements of many practical applications, such as criminal detection.
A novel estimation approach is proposed to incrementally extract multiple camera fingerprints from an account. Unlike existing methods, the proposed approach can substantially reduce the confusion caused by multiple cameras and reposted images, which benefits both user identification and camera identification.
A new dataset is constructed to benchmark the camera fingerprint based user identification framework. We will release this dataset to provide a benchmark for evaluating new approaches.
II Problem Statement
Given a user $u$, its image set is denoted by $I_u = \{I_u^1, I_u^2, \dots, I_u^{N_u}\}$, where $I_u^i$ represents the $i$-th image of user $u$. The core problem is how to determine whether two users $u$ and $v$ belong to a single individual. In our work, we employ camera fingerprints to address this problem. The problem is therefore converted into extracting the camera fingerprints of users and measuring the similarity between any two users’ camera fingerprints.
Before defining our problem, we first discuss the process of camera fingerprint extraction. Based on [14, 15], given a set of $N$ images $\{I^1, \dots, I^N\}$ captured by the same photography device, we first extract the residual noise $W^i$ of each image $I^i$ as follows
$$W^i = I^i - F(I^i), \tag{1}$$
where $F(\cdot)$ is a wavelet denoising filter. Then, a camera fingerprint $K$ is obtained by averaging all images’ residual noises as follows
$$K = \frac{1}{N} \sum_{i=1}^{N} W^i. \tag{2}$$
For two camera fingerprints $K_u$ and $K_v$ of users $u$ and $v$, their similarity can be measured by
$$\rho(K_u, K_v) = \frac{(K_u - \bar{K}_u) \cdot (K_v - \bar{K}_v)}{\|K_u - \bar{K}_u\| \, \|K_v - \bar{K}_v\|}, \tag{3}$$
where $\bar{K}_u$ and $\bar{K}_v$ are the means of $K_u$ and $K_v$, respectively. The correlation value $\rho(K_u, K_v)$ can be taken as the similarity score of these two cameras.
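The extraction pipeline described above (residual noise, averaging, normalized correlation) can be sketched as follows. This is a minimal illustration rather than the authors’ implementation: the 3x3 mean filter in `denoise` is an assumed stand-in for the wavelet denoising filter, and the synthetic "cameras" (a fixed random sensor pattern plus per-shot noise on a flat scene) are hypothetical test data.

```python
import numpy as np

def denoise(img):
    """3x3 mean filter: an assumed, simple stand-in for the wavelet filter."""
    h, w = img.shape
    padded = np.pad(img, 1, mode="edge")
    out = np.zeros((h, w))
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w]
    return out / 9.0

def residual_noise(img):
    """Residual noise of one image: the image minus its denoised version."""
    img = np.asarray(img, dtype=float)
    return img - denoise(img)

def fingerprint(images):
    """Camera fingerprint: average of the residual noises of the images."""
    return np.mean([residual_noise(im) for im in images], axis=0)

def correlation(a, b):
    """Normalized correlation between two mean-removed fingerprints."""
    a = a - a.mean()
    b = b - b.mean()
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Synthetic check: flat scene + fixed sensor pattern + per-shot noise.
rng = np.random.default_rng(0)
pattern_a = rng.normal(0.0, 1.0, (64, 64))
pattern_b = rng.normal(0.0, 1.0, (64, 64))

def shoot(pattern, n):
    """Simulate n shots from a hypothetical camera with the given pattern."""
    return [rng.uniform(50, 200) + pattern + rng.normal(0.0, 2.0, pattern.shape)
            for _ in range(n)]

k_a1 = fingerprint(shoot(pattern_a, 30))
k_a2 = fingerprint(shoot(pattern_a, 30))
k_b = fingerprint(shoot(pattern_b, 30))
same = correlation(k_a1, k_a2)  # same camera: expected to be high
diff = correlation(k_a1, k_b)   # different cameras: expected to be near zero
print(f"same camera: {same:.2f}, different cameras: {diff:.2f}")
```

Averaging over many shots is what makes the fingerprint stable here: the per-shot noise shrinks with the number of images while the sensor pattern does not.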
Intuitively, we can extract a camera fingerprint for each account individually by directly employing Eqs. 2 and 3. However, online social networks are far more complex than they appear, and users’ behaviors also vary widely. Therefore, we have to take some practical issues into account when addressing the user identification problem.
The first is the multiple camera problem. In the real world, people generally own more than one photography device, e.g., a cellphone camera and an SLR camera. That is, the images in $I_u$ are not always captured by a single camera. If we extract only one camera fingerprint to represent a user, it will introduce extra error. Therefore, it is necessary to model the user’s camera feature by employing several camera fingerprints.
The second is the reposting problem. Generally, some popular photos in an account are frequently reposted by other users. As shown in Fig. 7, for example, users 1 and 2 do not belong to the same individual, but both of them repost images from user 3. Without tackling this issue, users 1 and 2 will be incorrectly asserted to belong to the same individual, since the reposted images have a high probability of being captured by the same camera.
To illustrate the feasibility of the new perspective and the key issues to be addressed, we conduct a series of experiments on a dataset of 1,576 images captured by 11 cameras. For each camera, all its images are randomly divided into two groups. First, we verify whether camera fingerprints are able to reconcile accounts belonging to the same individual. To this end, we treat each group of images as the album of a simulated user; groups (users) derived from the same camera are treated as positive pairs and all others as negative pairs. After estimating the camera fingerprint of each simulated user, the correlation value between any two users is calculated by Eq. 3. The results are shown in Fig. 12(a), where red points indicate the correlation values of positive pairs and blue points those of negative pairs. Clearly, positive pairs can be easily distinguished from negative ones. That is, the camera fingerprint is indeed a discriminative feature for distinguishing camera sources.
However, this experiment implicitly assumes that the album of each account contains images from a single camera only. In practical scenarios this assumption does not always hold, since people generally have more than one camera and frequently repost popular images taken by other cameras. It is therefore necessary to evaluate the effect of these confounding factors. To construct a proper dataset, we combine two groups from different cameras to simulate a user; two users who share one or two camera sources are treated as a positive pair, and otherwise as a negative pair. For each user, all images from the two cameras are used to estimate a single camera fingerprint by Eq. 2. The resulting correlation values are shown in Fig. 12(b), where many red points are mixed with blue points. That is, positive pairs cannot be clearly distinguished from negative ones. The reason is that using a single camera fingerprint to represent two cameras unavoidably leads to confusion, so the discriminative capability of the camera fingerprint decreases markedly. This is the so-called multiple camera problem.
If we can identify the camera sources of the images in an account and estimate a camera fingerprint for each camera source individually, it is possible to avoid the multiple camera problem. To verify this, we directly use the prior information on camera sources and estimate two camera fingerprints for each user. To calculate the correlation value between two users, we first calculate the correlation values between every pair of camera fingerprints drawn from the two users, and select the maximum as the final correlation value. The experimental results are shown in Fig. 12(c). As expected, the positive pairs are clearly distinguished from negative ones. That is, it is feasible to solve the multiple camera problem by estimating multiple camera fingerprints. However, introducing reposted images into the album of an account also leads to confusion, as illustrated in Fig. 12(d).
In brief, it is feasible to reconcile multiple users belonging to the same individual from the camera fingerprint perspective, and the key problems to be addressed are how to correctly extract multiple camera fingerprints and how to alleviate the effect of reposted images.
As discussed above, the key problem is to overcome the confusion caused by multiple cameras and reposted images. An intuitive solution is to first cluster the images in an account into different groups according to their camera sources, and then estimate a camera fingerprint for each group individually. However, the camera source of each image in an account is unknown beforehand. Therefore, we must design a method that avoids extracting multiple camera fingerprints in batch. In this section, we propose an incremental estimation approach for multiple camera fingerprints. The system framework is illustrated in Fig. 14 and includes four key components: seed selection, incremental estimation, reposted image removal, and account matching.
III-A Seed Selection
In essence, the proposed estimation approach is a tailored clustering method. Different from the scenarios of classic clustering methods, the multiple camera fingerprint estimation case requires accurate initial seeds (or initial camera fingerprints). To obtain several accurate initial seeds, a pair correlation based method is proposed, which selects initial seeds by thresholding the pair correlation of residual noises of any two images.
Specifically, given an image set $I_u$, its noise pattern set $W_u = \{W_u^1, \dots, W_u^{N_u}\}$ can be obtained by Eq. (1), and the correlation values between any two images’ residual noises are calculated by Eq. (3). For any correlation value greater than a predefined threshold $T_s$, the corresponding two images are grouped together, and a camera fingerprint is estimated from them as a seed for further steps. The seed selection method is based on the following assumption:
Assumption 1: If the correlation of two images’ residual noises is high enough, the two images were captured by the same camera with high probability.
To verify this assumption, a database containing 1,576 images captured by 11 cameras is used. For any image pair from the database, if the two images are captured by the same camera, we call it a positive pair; otherwise it is a negative pair. The correlation value of the two images in each pair is calculated from their residual noises, and the results are illustrated in Fig. 13. As we can see, the correlation values of negative pairs are confined to a low and narrow range, while those of positive pairs are scattered over a higher but wider range. In addition, there is a narrow but dense overlap between negative and positive image pairs, as indicated by the two blue dotted lines.
At first sight, this seems to conflict with the conclusion drawn from Fig. 12(a). In fact, the positive and negative pairs in this case are quite different from those in Fig. 12. In Fig. 12, a pair consists of two groups of images from two cameras, and a camera fingerprint is estimated from each group by Eq. 2. Since the noise component, which is essentially Gaussian white noise, is smoothed out by the averaging procedure, stable and reliable camera fingerprints can be obtained. In contrast, a pair here contains only two images. The camera fingerprints are then essentially the residual noises of the individual images, from which the Gaussian white noise has not been removed. Therefore, the correlation between the two images in a pair cannot be estimated accurately. That is, residual noise alone cannot completely distinguish images coming from different cameras. Fortunately, we can also observe that a pair of images has a high probability of being captured by the same camera if their pair correlation value is high enough, as indicated by the points above the red dotted line in Fig. 13. These observations support our assumption.
In brief, we can choose stable and reliable seeds from the images in an account by using the residual noises and setting a high threshold value. It should be noted that a high threshold leads to a high false rejection rate, i.e., many positive image pairs are treated as negative ones. We leave this problem to the next subsection.
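The seed selection step can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors’ code: the function name `select_seed_pairs`, the threshold value 0.6, and the synthetic residuals (a shared random pattern plus noise of varying strength) are all hypothetical.

```python
import numpy as np
from itertools import combinations

def correlation(a, b):
    """Normalized correlation between two mean-removed noise patterns."""
    a = a - a.mean()
    b = b - b.mean()
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def select_seed_pairs(residuals, t_seed):
    """Return index pairs whose residual-noise correlation exceeds t_seed."""
    return [(i, j) for i, j in combinations(range(len(residuals)), 2)
            if correlation(residuals[i], residuals[j]) > t_seed]

# Hypothetical data: two cameras, each with its own fixed noise pattern.
rng = np.random.default_rng(1)
cam_a = rng.normal(size=(64, 64))
cam_b = rng.normal(size=(64, 64))

def residual(pattern, noise_std):
    """Simulated residual noise: camera pattern plus random noise."""
    return pattern + rng.normal(0.0, noise_std, pattern.shape)

residuals = [
    residual(cam_a, 0.5), residual(cam_a, 0.5),  # clean shots, camera A
    residual(cam_a, 3.0),                        # noisy shot, camera A
    residual(cam_b, 0.5), residual(cam_b, 0.5),  # clean shots, camera B
]
seeds = select_seed_pairs(residuals, t_seed=0.6)
print(seeds)
```

In this toy setting the noisy shot from camera A (index 2) is not paired with anything, which mirrors the false rejection behaviour discussed above: a high threshold keeps seeds clean at the cost of leaving some same-camera images out.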
III-B Incremental Estimation of Multiple Camera Fingerprints
Our goal is to put all images captured by the same camera into a group so as to estimate a reliable camera fingerprint. To do that, we propose an incremental camera fingerprint estimation method, which includes three key steps: initializing camera fingerprints, merging consistent groups, and estimating new camera fingerprints.
Initializing camera fingerprints
Using the above-mentioned seed selection method, a number of positive image pairs are chosen and each image pair is treated as a separate group. As shown in step 1 of Fig. 14, four positive image pairs are selected as seeds. That is, the user is initially assumed to have four different cameras, each of which has captured only the two images in the corresponding group. More formally, let $C_u = \{c_u^1, c_u^2, \dots, c_u^M\}$ denote the set of all different groups (or clusters) of user $u$, where $c_u^k$ represents the $k$-th group. Then, the camera fingerprints can be estimated individually from these groups by Eq. 4,
$$K_u^k = \frac{1}{|c_u^k|} \sum_{I^i \in c_u^k} W^i, \quad k = 1, \dots, M. \tag{4}$$
Here, we denote by $\mathcal{K}_u = \{K_u^1, \dots, K_u^M\}$ the set of camera fingerprints of user $u$.
Initially, all positive image pairs are treated individually as groups, and one camera fingerprint is estimated from each group. In this way, a total of $M$ camera fingerprints are obtained, which are used as initial seeds.
Merging consistent groups
As discussed above, all positive pairs are selected by thresholding the correlation value of the residual noises of two images. Although the two images in a pair have a high probability of being captured by the same camera, we cannot ensure that any two pairs of images are captured by two different cameras. In other words, some image pairs may share the same camera source. Therefore, it is necessary to merge these groups. To this end, a strategy similar to seed selection is employed to merge consistent groups. In particular, given any two groups $c_u^i, c_u^j \in C_u$, we merge them into one cluster if the correlation value of their corresponding camera fingerprints $K_u^i$, $K_u^j$ is greater than a pre-defined threshold $T_m$. Formally, the updating procedure can be formulated as
$$c_u^i \leftarrow c_u^i \cup c_u^j, \quad \text{if } \rho(K_u^i, K_u^j) > T_m,$$
where $c_u^i \cup c_u^j$ is the merged image set. That is, any pair of groups whose fingerprint correlation is greater than $T_m$ will be merged into the same group.
After group merging, a new set of groups is generated, and an updated camera fingerprint set can be estimated. In fact, group merging will lead to two benefits. First, some redundant camera fingerprints are removed, which will result in a more compact set of feature patterns. Second, a more reliable camera fingerprint for a unique camera can be estimated, since more samples are collected individually for cameras and used to smooth Gaussian white noise. Therefore, using group merging procedure will lead to more reliable user identification.
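The merging step can be sketched as follows. This is a hedged illustration: the text only states that two groups are merged when their fingerprint correlation exceeds a threshold, so re-estimating all fingerprints after every merge and repeating until nothing changes is an assumption of this sketch, as are the function names and the threshold value 0.4.

```python
import numpy as np
from itertools import combinations

def correlation(a, b):
    """Normalized correlation between two mean-removed fingerprints."""
    a = a - a.mean()
    b = b - b.mean()
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def group_fingerprint(group, residuals):
    """Fingerprint of a group: mean residual noise of its images."""
    return np.mean([residuals[i] for i in sorted(group)], axis=0)

def merge_groups(groups, residuals, t_merge):
    """Merge groups whose fingerprints correlate above t_merge."""
    groups = [set(g) for g in groups]
    merged = True
    while merged:
        merged = False
        fps = [group_fingerprint(g, residuals) for g in groups]
        for a, b in combinations(range(len(groups)), 2):
            if correlation(fps[a], fps[b]) > t_merge:
                groups[a] |= groups[b]  # absorb group b into group a
                del groups[b]
                merged = True
                break                   # fingerprints are stale; recompute
    return groups

# Hypothetical data: images 0-3 from camera A, images 4-5 from camera B.
rng = np.random.default_rng(2)
cam_a = rng.normal(size=(64, 64))
cam_b = rng.normal(size=(64, 64))
residuals = [cam_a + rng.normal(size=(64, 64)) for _ in range(4)]
residuals += [cam_b + rng.normal(size=(64, 64)) for _ in range(2)]

seed_groups = [(0, 1), (2, 3), (4, 5)]
merged_groups = merge_groups(seed_groups, residuals, t_merge=0.4)
print([sorted(g) for g in merged_groups])
```

Here the two camera-A seeds are expected to collapse into one group of four images, whose averaged fingerprint is less noisy than either seed alone.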
Estimating new camera fingerprints
In order to obtain reliable and stable seeds, the seed threshold $T_s$ is generally set to a high value. In this way, the selected image pairs are very likely to be true positives, but many images captured by one of the cameras associated with user $u$ will be rejected. Reassigning these rejected images to the correct groups helps to generate accurate camera fingerprints. An intuitive and straightforward solution is to assign each rejected image to the most similar group by estimating the correlation values between the residual noise of the rejected image and the camera fingerprints of all groups. However, if some rejected images are assigned to the wrong group, false acceptances result. In this case, images from different cameras end up in the same group, which leads to unreliable camera fingerprints. To avoid incorrect acceptance, we further pre-define a threshold $T_a$ to decide whether a rejected image is assigned to a group. Formally, given the residual noise $W^r$ of a rejected image, we assign the image to the group $c_u^{k^*}$ with $k^* = \arg\max_k \rho(W^r, K_u^k)$ if the maximum correlation value $\rho(W^r, K_u^{k^*})$ is greater than $T_a$.
Once the rejected image $I^r$ is assigned to the group $c_u^{k^*}$, the fingerprint $K_u^{k^*}$ is updated by using the new image set $c_u^{k^*} \cup \{I^r\}$. Repeating these steps assigns most of the rejected images to their corresponding groups, and more accurate camera fingerprints are estimated incrementally for these groups.
The complete algorithm of multiple camera feature estimation is listed in Algorithm 1. It consists of an initialization step and an iterative step. In the first step, likely positive image pairs are selected to form a set of groups $C_u$, and the corresponding camera fingerprints $\mathcal{K}_u$ are extracted as seeds. Then, the group merging and rejected image reassignment steps are performed iteratively to incrementally improve the reliability of the camera fingerprints.
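The overall procedure (initialization from seeds, then alternating group merging and reassignment of rejected images) might be sketched as follows. This is only an approximation of the described steps: the stopping rule, the `max_iter` cap, merging of seed groups that share an image, and all three threshold values are assumptions of this sketch, and the residuals are synthetic.

```python
import numpy as np
from itertools import combinations

def correlation(a, b):
    """Normalized correlation between two mean-removed noise patterns."""
    a = a - a.mean()
    b = b - b.mean()
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def fingerprint(group, residuals):
    """Fingerprint of a group: mean residual noise of its images."""
    return np.mean([residuals[i] for i in sorted(group)], axis=0)

def estimate_groups(residuals, t_seed, t_merge, t_assign, max_iter=10):
    n = len(residuals)
    # Initialization: every high-correlation image pair is a seed group.
    groups = [{i, j} for i, j in combinations(range(n), 2)
              if correlation(residuals[i], residuals[j]) > t_seed]
    for _ in range(max_iter):
        changed = False
        # Merging: join groups that overlap or have consistent fingerprints.
        merged = True
        while merged:
            merged = False
            fps = [fingerprint(g, residuals) for g in groups]
            for a, b in combinations(range(len(groups)), 2):
                if groups[a] & groups[b] or correlation(fps[a], fps[b]) > t_merge:
                    groups[a] |= groups[b]
                    del groups[b]
                    merged = changed = True
                    break
        # Reassignment: give each rejected image to its best group, if
        # the best correlation clears the assignment threshold.
        assigned = set().union(*groups)
        for i in range(n):
            if i in assigned or not groups:
                continue
            fps = [fingerprint(g, residuals) for g in groups]
            scores = [correlation(residuals[i], fp) for fp in fps]
            best = int(np.argmax(scores))
            if scores[best] > t_assign:
                groups[best].add(i)  # fingerprint is re-estimated next pass
                changed = True
        if not changed:
            break
    return groups

# Hypothetical album: camera A (0-5), camera B (6-9), one repost (10).
rng = np.random.default_rng(4)
cam = {c: rng.normal(size=(64, 64)) for c in "ABC"}
layout = [("A", 0.5)] * 3 + [("A", 2.0)] * 3 + [("B", 0.5)] * 2 \
       + [("B", 2.0)] * 2 + [("C", 0.5)]
residuals = [cam[c] + rng.normal(0.0, s, (64, 64)) for c, s in layout]

groups = estimate_groups(residuals, t_seed=0.6, t_merge=0.5, t_assign=0.3)
print([sorted(g) for g in groups])
```

In this toy album the noisy images that fail the seed threshold are recovered during reassignment, while the single image from the third camera matches no group and stays unassigned.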
III-C Dealing with Reposted Images
As we mentioned before, reposted images may cause remarkable confusion. Therefore, it is necessary to take reposted images into account when extracting multiple feature patterns to represent a user. However, reposting behaviors are quite complex and it is difficult to distinguish them from normal images. Therefore, we attempt to address the problem by investigating the reposting behaviors.
Reposting behaviors are both complex and uncertain due to the diversity of users. To simplify the modeling, we consider only two common reposting behaviors. First, users repost many images that come from different sources (e.g., from several different accounts). In this case, the number of reposted images from a specific camera source is far smaller than the number of the user’s own images. Second, users repost multiple images from a single source (e.g., one account). In this case, the reposted images from a specific camera source are still scarce, since the reposted images are likely to have been captured by different cameras of the original user. In brief, we assume that the reposted images come from different cameras and that the number of reposted images captured by the same camera is small. Based on this assumption, we can efficiently suppress the confusion caused by reposted images.
In fact, the multiple camera feature estimation algorithm discussed above already has some capability to eliminate reposted images, since it directly rejects reposted images that do not meet the condition for a positive image pair. For example, when a user reposts only a single image from another user’s album, that image can neither be chosen as a seed nor be assigned to a group by the proposed algorithm. Therefore, this kind of reposted image has no effect on the estimation of the user’s camera features. However, when more than one reposted image is captured by the same camera, mistakes can occur: such images may be chosen as seeds at the initial step, and most of them end up in the same group after several iterations. To address this issue, we adopt a simple but effective post-processing step to alleviate their effect. According to our observation of users’ reposting behaviors, the number of reposted images is generally small for a specific camera. Therefore, the camera fingerprint set obtained by the incremental estimation steps can be refined as follows
$$\hat{\mathcal{K}}_u = \{ K_u^k \in \mathcal{K}_u : |c_u^k| > T_n \},$$
where $|c_u^k|$ is the number of images in class $c_u^k$, $T_n$ is a predefined size threshold, and $\hat{\mathcal{K}}_u$ is taken as the set of camera fingerprints of user $u$.
In this way, the groups with few images are treated as reposted image sets and filtered out. For any pair of social users, we then assert that they belong to a single individual if their feature correlation is greater than a predefined threshold.
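The post-processing step amounts to a size filter over the estimated groups. A minimal sketch, assuming a hypothetical function name and an arbitrary size threshold of 3 images:

```python
def remove_repost_groups(groups, t_size):
    """Keep only groups with more than t_size images; smaller groups are
    treated as reposted-image sets and dropped."""
    return [g for g in groups if len(g) > t_size]

# Hypothetical grouping result: two large groups and one two-image group.
groups = [{0, 1, 2, 3, 4, 5, 6}, {7, 8, 9, 10, 11}, {12, 13}]
kept = remove_repost_groups(groups, t_size=3)
print([sorted(g) for g in kept])
```

The filter relies entirely on the assumption stated above that reposted images from any one camera are scarce; a user who reposts heavily from a single camera would not be handled by it.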
III-D Similarity Estimation
In this section, we discuss how to identify multiple accounts belonging to the same individual by estimating their similarity based on their camera features. Using the multiple camera fingerprint estimation algorithm, a camera feature set $\hat{\mathcal{K}}_u$ is obtained for user $u$, where each element $K_u^i$ denotes one camera fingerprint. As mentioned above, we consider two users to belong to a single individual if they share at least one camera. Therefore, the problem becomes determining whether the camera feature sets of two users have similar members. Formally, for any two users $u$ and $v$, their camera features $\hat{\mathcal{K}}_u$ and $\hat{\mathcal{K}}_v$ are obtained. We estimate their similarity by the maximum correlation value between a camera fingerprint in $\hat{\mathcal{K}}_u$ and one in $\hat{\mathcal{K}}_v$, which can be formulated as follows:
$$S(u, v) = \max_{K_u^i \in \hat{\mathcal{K}}_u,\; K_v^j \in \hat{\mathcal{K}}_v} \rho(K_u^i, K_v^j),$$
where $S(u, v)$ denotes the similarity between $u$ and $v$.
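The account-matching rule, taking the maximum correlation over all cross-user fingerprint pairs, can be sketched as follows. The function name `user_similarity` and the synthetic fingerprints are hypothetical and serve only to illustrate the rule.

```python
import numpy as np

def correlation(a, b):
    """Normalized correlation between two mean-removed fingerprints."""
    a = a - a.mean()
    b = b - b.mean()
    return float(np.sum(a * b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def user_similarity(fps_u, fps_v):
    """Maximum correlation over all pairs of fingerprints of two users."""
    return max(correlation(a, b) for a in fps_u for b in fps_v)

# Hypothetical fingerprints: a camera pattern plus small estimation noise.
rng = np.random.default_rng(3)
cams = {name: rng.normal(size=(64, 64)) for name in "ABCD"}

def noisy(name):
    return cams[name] + rng.normal(0.0, 0.3, (64, 64))

user_u = [noisy("A"), noisy("B")]  # owns cameras A and B
user_v = [noisy("B"), noisy("C")]  # shares camera B with user_u
user_w = [noisy("D")]              # shares no camera with user_u

s_uv = user_similarity(user_u, user_v)
s_uw = user_similarity(user_u, user_w)
print(f"shared camera: {s_uv:.2f}, no shared camera: {s_uw:.2f}")
```

Using the maximum rather than an average means a single shared camera is enough to produce a high score, which matches the stated criterion that two users belong to one individual if they share at least one camera.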
IV UID-BJTU: A Dataset for User Identification Based on Camera Fingerprint
To the best of our knowledge, this is the first work to re-inspect the user identification problem from the perspective of camera fingerprints. Therefore, no public dataset is available for testing. In order to evaluate the performance of the proposed scheme, we collect and construct a new benchmark, named UID-BJTU. This benchmark consists of two different collections, i.e., a simulated user dataset and an online social user dataset. The images in the former are acquired directly from multiple cameras and therefore have known camera sources. This dataset is mainly used to evaluate the effectiveness of each stage of our approach. In contrast, the images in the latter are crawled from online social networks and come from real online users. Details of the benchmark are described in this section.
IV-A Simulated User Datasets
In order to comprehensively evaluate the effectiveness of the proposed approach, we need to know accurate information about users, such as the relationships among users, the number of cameras, and the number of images captured by each camera. However, it is not easy to obtain an ideal evaluation benchmark by crawling the data of online users. Generally speaking, an image’s camera source can be determined from the Exchangeable image file format (Exif) metadata, which contains the camera serial number, brand and model. Unfortunately, the Exif data of most of these images are incomplete for reasons such as privacy protection and post-processing, so it is impossible to construct a reliable camera source groundtruth from an online user dataset. Therefore, in order to evaluate the performance of the proposed incremental estimation scheme, we directly acquire a set of images using multiple pre-selected cameras to simulate the data of online users. Specifically, a total of 11 cameras are collected and 1,576 images with known camera sources are acquired.
To guarantee data diversity and avoid potential confusion between camera brands and models, we take camera brand and model into account when choosing cameras. Table I lists the details of the cameras’ brands and the number of images from each camera.
|NIKON D7000||DVTs||HUAWEI||iPhone6 Plus|
|NIKON D7000||iPhone6||Canon 900Ti||Canon 650D|
|HM-NOTE||NIKON D7000||PENTAX K-50||-|
The key advantage of the proposed method is to significantly alleviate the confusion problem from multiple cameras and reposted images. Therefore, in order to clearly evaluate the effectiveness of the proposed approach, we need to simulate users with diverse camera sources and reposted images. In our experiments, three kinds of users are simulated, which are as follows:
Offline-1 Dataset: Each of the 11 cameras corresponds to a single individual, and all images from the camera are randomly divided into two groups. Each group of images simulates one user of the individual. If two users are derived from the same camera, they are treated as a positive pair, otherwise as a negative pair. In this way, a total of 22 users, 11 positive pairs and 220 negative pairs are constructed. To include reposted images, a total of 110 images from an extra camera are randomly added to the 22 users.
Offline-2 Dataset: Each combination of two cameras corresponds to a single individual, and all images from the two cameras are randomly divided into two groups. Each group of images simulates one user of the individual. Two users belonging to the same individual are called a positive pair. Any two users that do not share a camera source are called a negative pair. In this way, a total of 110 users, 55 positive pairs and 3,960 negative pairs are constructed. Similarly, to include reposted images, a total of 550 images from an extra camera are randomly added to the 110 users.
Offline-3 Dataset: The construction process is quite similar to that of Offline-2. The main difference is that a combination of three cameras corresponds to a single individual; a total of 550 images from an extra camera are randomly added to 110 users.
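The user and pair counts quoted for the one-camera and two-camera constructions can be checked by enumeration. The sketch below is only a counting aid under the stated construction rules (every camera combination is one individual, each individual is split into two users, and a negative pair shares no camera); the function name `count_pairs` is hypothetical.

```python
from itertools import combinations

def count_pairs(cameras_per_individual, n_cameras=11):
    """Count users, positive pairs and negative pairs when every
    combination of cameras is one individual split into two users."""
    individuals = list(combinations(range(n_cameras), cameras_per_individual))
    # Each individual contributes two simulated users with the same cameras.
    users = [(k, set(cams)) for k, cams in enumerate(individuals)
             for _ in range(2)]
    positive = negative = 0
    for (ka, ca), (kb, cb) in combinations(users, 2):
        if ka == kb:
            positive += 1      # two users of the same individual
        elif not ca & cb:
            negative += 1      # users that share no camera source
    return len(users), positive, negative

print(count_pairs(1))  # (22, 11, 220)
print(count_pairs(2))  # (110, 55, 3960)
```

For the two-camera case, user pairs that share exactly one camera but belong to different individuals are counted as neither positive nor negative, which is why the negative count is well below the total number of user pairs.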
A summary of these datasets is given in Table II.
|Sum of Reposted||110||550||550||960|
IV-B Online Social User Dataset
The final goal of proposed identification method is to identify online user’s identity by employing camera fingerprints, so it is necessary to evaluate the performance on real online data. However, it is difficult to construct a reliable groundtruth for online users’ identity. Therefore, we attempt to construct a simulated network with controllable data.
To this end, we first crawled 15,328 images from the albums of 96 Flickr users, with the number of images per user varying from 82 to 250. Although a simulated network with 96 users is far from a large-scale dataset, we can manually ensure that no two of these users belong to the same individual and that no reposted images are included in these albums. That is, each original user corresponds to a single individual. To construct a positive pair of users belonging to a single individual, we randomly divide the images of each original user into two groups, where each group is treated as a new user. In this way, each individual corresponds to two users, which are treated as a positive pair. In total, 192 users, 96 positive pairs and 18,240 negative pairs are constructed. In addition, 107 images from an independent user are treated as reposted images and randomly added to the 192 users.
V Experimental Evaluation
In this section, we conduct several experiments to evaluate the proposed user identification method. In particular, the grouping performance of the proposed method on the offline datasets is evaluated first, and then the reposted image problem is examined. Finally, we test our algorithm on the online benchmark and draw some conclusions. It is worth noting that the perspective of the proposed framework is totally different from that of traditional user identification frameworks. Therefore, we cannot compare the proposed approach with previous works, owing to the lack of a test database containing both camera fingerprints and users’ public profiles or activity characteristics.
V-A Metrics and Baseline
Before presenting our experimental results, we first introduce several metrics for evaluating the algorithm’s performance. In the proposed approach, the incremental grouping method plays a very important role. Therefore, in addition to evaluating the final identification, we should also fully evaluate the grouping effectiveness. Grouping is the foundation of the proposed method and aims to identify the camera sources of images. Although the proposed grouping algorithm is quite different from classical clustering algorithms, we can still employ the evaluation metrics of classical clustering methods. Here, we employ purity, precision and recall to evaluate the grouping performance.
Purity is a simple and transparent measure for evaluating the performance of clustering. Given the incremental estimation grouping result $C_u = \{c_u^1, \dots, c_u^M\}$ and the groundtruth of camera sources $G_u = \{g_u^1, \dots, g_u^J\}$, purity is defined as follows
$$\mathrm{purity}(C_u, G_u) = \frac{1}{N_u} \sum_{k} \max_{j} |c_u^k \cap g_u^j|,$$
where $N_u$ denotes the number of images of user $u$. It should be noted that high purity is easy to achieve when the number of clusters is large (for example, purity equals 1 if each image forms its own group). To address this problem, we employ precision and recall as additional measures to further evaluate the performance. Formally,
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN},$$
where a $TP$ (true positive) decision assigns two images from the same camera to the same group, an $FN$ (false negative) decision assigns two images from the same camera to different groups, and an $FP$ (false positive) decision assigns two images from different cameras to the same group. More details can be found in . Briefly, a high precision means that images are mostly assigned to the true group, and a high recall reflects that most images are grouped. In our scenario, we mainly pursue a high precision so as to guarantee that the estimated camera fingerprint is not influenced by multiple camera sources. Given that, we allow recall to be a little lower, since not every image needs to be grouped.
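The three grouping metrics can be computed as in the sketch below. Two choices here are assumptions rather than statements from the text: purity is normalized by all images of the user (including ungrouped ones), and a same-camera pair in which at least one image is ungrouped is counted as a false negative. The function names and the toy labels are hypothetical.

```python
from collections import Counter
from itertools import combinations

def purity(groups, camera_of):
    """Sum of each group's majority-camera count over all images."""
    hits = sum(max(Counter(camera_of[i] for i in g).values())
               for g in groups if g)
    return hits / len(camera_of)

def pairwise_precision_recall(groups, camera_of):
    """Pair-counting precision and recall of a grouping."""
    group_of = {i: k for k, g in enumerate(groups) for i in g}
    tp = fp = fn = 0
    for i, j in combinations(range(len(camera_of)), 2):
        same_camera = camera_of[i] == camera_of[j]
        # Assumption: an ungrouped image shares a group with nothing.
        same_group = (i in group_of and j in group_of
                      and group_of[i] == group_of[j])
        if same_group and same_camera:
            tp += 1
        elif same_group:
            fp += 1
        elif same_camera:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy example: 8 images from 2 cameras; images 6 and 7 stay ungrouped,
# and image 3 (camera 0) is wrongly placed with camera-1 images.
camera_of = [0, 0, 0, 0, 1, 1, 1, 1]
groups = [{0, 1, 2}, {3, 4, 5}]

p = purity(groups, camera_of)
precision, recall = pairwise_precision_recall(groups, camera_of)
print(p, round(precision, 3), round(recall, 3))  # 0.625 0.667 0.333
```

Under these conventions, leaving images ungrouped lowers purity and recall but not precision, which is consistent with the preference for high precision expressed above.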
To evaluate the user identification performance, we employ the true positive rate and the false positive rate. Furthermore, we also treat the user identification problem as a user retrieval problem, and employ Mean Average Precision (MAP) to evaluate the identification performance.
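For readers unfamiliar with MAP in the retrieval setting, the following minimal sketch shows one common way to compute it. The data layout (a ranked list of candidate accounts per query, and a set of accounts that truly belong to the same individual) and the function names are assumptions for illustration only.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one query: mean of precision@k at each hit."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, uid in enumerate(ranked_ids, start=1):
        if uid in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(rankings, ground_truth):
    """rankings[q]: accounts ranked by similarity to query q;
    ground_truth[q]: accounts of the same individual as q."""
    aps = [average_precision(rankings[q], ground_truth[q]) for q in rankings]
    return sum(aps) / len(aps) if aps else 0.0


rankings = {"u1": ["b", "x", "c"], "u2": ["x", "a"]}
ground_truth = {"u1": {"b", "c"}, "u2": {"a"}}
print(mean_average_precision(rankings, ground_truth))
```

Here each account is used as a query in turn, so a higher MAP means that accounts of the same individual tend to appear near the top of each ranking.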
To fully show the feasibility of the proposed framework, we design three schemes for comparison.
SCF (Single Camera Fingerprint): It assumes that all images in a user’s album are captured by the same camera. Under this assumption, each user is represented by only one camera fingerprint estimated from all images. By estimating the similarity among different users, we can determine whether two users belong to the same individual or not. Compared with the proposed approach, this method takes neither the multiple camera problem nor reposting behaviors into account.
MCF (Multiple Camera Fingerprints): This scheme incrementally divides the images of a user into several groups, and estimates a camera fingerprint for each group individually by using Eq. 4. Then, the groups that meet the threshold are removed, and the camera fingerprints of all the other groups are treated as the user’s camera feature. MCF takes only the multiple camera problem into account.
UCI (User Camera Identification): In this scheme, both the multiple camera problem and reposting behaviors are taken into account.
V-B Grouping Performance Evaluation
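| Metric | Offline (one camera) | Offline (two cameras) | Offline (three cameras) |
|---|---|---|---|
| Purity | 0.76 (0.91) | 0.54 (0.76) | 0.45 (0.48) |
| Precision | 0.91 (0.83) | 0.90 (0.62) | 0.85 (0.35) |
| Recall | 0.72 (1.00) | 0.50 (1.00) | 0.42 (1.00) |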
The purpose of grouping is to accurately divide all images of a user into different groups according to their camera sources. In this way, a camera fingerprint can be estimated for each camera source and the multiple camera problem can be alleviated significantly. In our case, high clustering precision is more important than purity and recall, since high precision means that a group belonging to a single camera source contains fewer confusing images. In this section, we conduct a series of experiments to verify the effectiveness of the proposed incremental clustering method, and the experimental results are listed in Table 4. To clearly show the advantage of the proposed method, we also provide a baseline, which assumes that all images of one user belong to the same camera source. That is, all images of one user are treated as a single group. The baseline results on purity, precision and recall are given in brackets next to the results of the proposed method for comparison. For the Offline dataset with one camera, the baseline achieves higher purity and recall than the proposed method, since each user in this dataset has only one camera source. When more cameras are introduced for each user, the recall of the baseline remains a perfect 1, which is obviously better than that of the proposed method. The purity of the proposed method is also lower than that of the baseline. However, this does not prove that the baseline performs better. As mentioned before, not all images are assigned to a group during the incremental clustering procedure, since images whose correlations with every group centroid are lower than the iteration termination criterion are filtered out. Therefore, recall and purity inevitably decrease when they are calculated over all images. If the rejected images are not counted, high recall and precision can be obtained. For example, when we remove these rejected images from users, the recalculated recall values are 0.9524, 0.9136 and 0.9197, respectively.
In our scenario, however, alleviating the mutual confusion among multiple cameras is more important than achieving high recall. Therefore, the key to a good grouping method is to ensure that the images in a group truly come from the same camera. That is, the precision should be high. As expected, the proposed method outperforms the baseline in clustering precision on all datasets and for all numbers of cameras. In brief, with the proposed incremental grouping method, most images coming from the same camera are returned in the same group, and confusion is significantly alleviated.
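The effect of rejected images on recall can be illustrated with a small sketch. It assumes, purely for illustration, that rejected images are marked with `None` and that recall is computed over image pairs; the `skip_rejected` flag is a hypothetical name that switches between counting all images and counting only the grouped ones.

```python
from itertools import combinations


def pairwise_recall(groups, cameras, skip_rejected=False):
    """Pairwise recall; optionally ignore images rejected by grouping (None)."""
    items = [
        (g, c)
        for g, c in zip(groups, cameras)
        if not (skip_rejected and g is None)
    ]
    tp = fn = 0
    for (g1, c1), (g2, c2) in combinations(items, 2):
        if c1 != c2:
            continue  # only same-camera pairs contribute to recall
        if g1 is not None and g1 == g2:
            tp += 1
        else:
            fn += 1
    return tp / (tp + fn) if tp + fn else 0.0


groups = [0, 0, None, 1, 1, None]
cameras = ["A", "A", "A", "B", "B", "B"]
print(pairwise_recall(groups, cameras))                      # all images counted
print(pairwise_recall(groups, cameras, skip_rejected=True))  # rejected images removed
```

In this toy case every grouped image is placed correctly, so recall is low only because of the two rejected images and rises once they are excluded.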
V-C Evaluation on Reposted Image Removing
In order to alleviate the effect of reposted images, we perform a post-processing step to automatically remove the groups that likely consist of reposted images. In this step, the groups whose sizes are lower than a predefined threshold are filtered out. However, in addition to the post-processing step, the incremental clustering procedure also rejects many reposted images. Therefore, we evaluate the effectiveness of reposted image removal by taking both steps into account. In addition to the ratio of correctly rejected reposted images, we also report the ratio of falsely rejected images. The statistical results are shown in Table 5. The ratios of correctly rejected reposted images are remarkably higher than the ratios of falsely rejected images in all cases. That is, the proposed method removes reposted images while preserving most positive images, which contributes remarkably to the estimation of multiple camera fingerprints. Meanwhile, we can also observe that both ratios increase remarkably with the number of cameras, from Offline (one camera) to Offline (three cameras). This is reasonable. When the number of cameras increases, the probability that reposted images are selected as seeds also increases. In this way, more reposted images are clustered into the same group and removed by group filtering. Meanwhile, more positive images are also rejected, since the errors between images and group centroids become larger.
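The size-based filtering and the two rejection ratios can be sketched as follows. This is only an illustrative reading of the text: the threshold name `min_size`, the `None` marker for rejected images, and the exact definitions of the two ratios (rejected reposted images over all reposted images, and rejected own images over all own images) are assumptions, not taken from the paper.

```python
from collections import Counter


def filter_small_groups(groups, min_size):
    """Mark every image in a group smaller than min_size as rejected (None)."""
    sizes = Counter(g for g in groups if g is not None)
    return [g if g is not None and sizes[g] >= min_size else None for g in groups]


def rejection_ratios(kept_groups, is_reposted):
    """Return (correctly rejected reposted ratio, falsely rejected ratio)."""
    n_repost = sum(is_reposted)
    n_own = len(is_reposted) - n_repost
    correct = sum(1 for g, rep in zip(kept_groups, is_reposted) if g is None and rep)
    false = sum(1 for g, rep in zip(kept_groups, is_reposted) if g is None and not rep)
    correct_ratio = correct / n_repost if n_repost else 0.0
    false_ratio = false / n_own if n_own else 0.0
    return correct_ratio, false_ratio


# Group ids after incremental clustering; None = already rejected there.
groups = [0, 0, 0, 1, None, 2, 2]
is_reposted = [False, False, False, True, True, False, False]
kept = filter_small_groups(groups, min_size=3)
print(kept)
print(rejection_ratios(kept, is_reposted))
```

Because rejections from both the clustering step and the filtering step end up as `None`, the ratios account for both steps together, as the evaluation in this section does.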
V-D User Identification Evaluation
In this section, we investigate the performance of our identification schemes. To the best of our knowledge, this is the first work to re-inspect the user identification problem from the camera fingerprint view. Therefore, no previous works are available for direct comparison. To show the feasibility of the proposed framework, we compare only the schemes designed by us.
In our experiments, we evaluate the identification performance from two aspects. First, we treat the user identification problem as a user retrieval problem. That is, we take each user as a query and retrieve the users that belong to the same individual as the query. We evaluate the performance by MAP, and the results are listed in Table 6.
As expected, all the camera fingerprint based user identification methods work well on all datasets. This indicates that re-inspecting the user identification problem from the camera fingerprint view is quite effective. In addition, after the multiple camera and image reposting problems are taken into account, the user identification performance improves further. Therefore, it is necessary to handle both the multiple camera problem and the reposting problem.
Secondly, we also employ ROC curves to further evaluate the performance of the proposed method. The experimental results are illustrated in Fig. 19. Clearly, for both online and offline datasets, UCI remarkably outperforms the single-clustering method, which is consistent with the MAP results.
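A ROC curve of this kind is obtained by sweeping a decision threshold over the similarity scores of account pairs and recording the true positive rate and false positive rate at each threshold. The sketch below assumes a hypothetical list of pair similarity scores with binary labels (1 = same individual); it is not tied to any particular similarity measure.

```python
def roc_points(scores, labels, thresholds):
    """Return (FPR, TPR) for each threshold; a pair is accepted if score >= t."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        tpr = tp / positives if positives else 0.0
        fpr = fp / negatives if negatives else 0.0
        points.append((fpr, tpr))
    return points


scores = [0.9, 0.8, 0.3, 0.1]   # hypothetical similarities of account pairs
labels = [1, 0, 1, 0]           # 1 = both accounts belong to the same individual
print(roc_points(scores, labels, thresholds=[0.5, 0.2]))
```

Lowering the threshold accepts more pairs, so both rates can only increase; a scheme whose curve lies closer to the top-left corner separates matching and non-matching accounts better.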
VI Related Work
The problem of user identification has been studied in different research communities [17, 18, 19, 20]. In essence, the core of the problem is to find a similarity measure that assesses the relationships among accounts, i.e., the account matching problem. More specifically, the goal is to extract discriminative features from accounts so as to turn the account identification problem into a feature matching problem. Therefore, the key to account matching is to extract reliable and effective features of accounts.
One commonly used approach is to extract features from users’ public profiles, such as username, E-mail address and cellphone number. Based on these features, some simple but effective algorithms can be designed to identify multiple accounts across different platforms . Several related works ,  have been reported, and good performance on some datasets has been achieved. To further improve the reliability of account matching, more profile attributes (e.g., location , interaction activities [11, 23], friends ) are involved in the account matching process. An assumption underlying these methods is that people maintain the same or similar profiles across different accounts. However, this assumption does not always hold, since people frequently register different accounts with different profiles for certain purposes (e.g., privacy concerns, fraud). Therefore, these identification methods cannot deal with such cases. In addition, information barriers among different social network platforms further limit their scope of application.
To address the issues above, more attention has been paid to matching accounts by exploiting users’ behaviors in online social networks (e.g., linguistic stylistics, preferred geographic radius, hobbies and interests). The basic assumption is that the same person maintains similar behavior patterns across his/her multiple accounts. For example, some algorithms [25, 12, 26, 27] attempt to identify multiple accounts by matching the linguistic stylistics of their posted content. In addition, due to the popularization of GPS and cellphones, geo-tagged information posted by online users has also become an effective discriminative feature [22, 28]. Although these user identification schemes achieve good performance on some datasets, they still have some limitations. An obvious drawback is that these methods require massive training samples. Generally, long text and geo-information are incomplete for most users, which remarkably reduces the identification effectiveness.
Another kind of method turns the user identification problem into an approximate graph isomorphism problem, which can identify more users when a few seed links are given . In essence, these schemes [12, 30, 31, 32, 18, 33] identify users by matching the graph structure of accounts’ networks, i.e., they take users’ topological properties as discriminative features. This kind of method is suitable for large-scale heterogeneous network reconciliation; however, it is limited to identifying users on the same social network platforms.
In this paper, we attempt to re-inspect the user identification problem from a new perspective, i.e., camera fingerprint. Camera fingerprint is a noise-like invisible component existing in digital images, and is unique to each imaging device . With this property, camera fingerprint can be used to determine an image’s camera source [15, 34, 7, 35, 36, 37], and plenty of related research has been conducted [38, 39, 40, 41, 36].
In fact,  and  are closely related to our work. The former proposes a picture-to-identity linking algorithm to investigate the owner of a particular image, while the latter aims to find the accounts corresponding to a specific camera. However, the problems they address are quite different from that of the proposed scheme.
VII Conclusion
In this paper, we attempt to address the problem of user identification from a new perspective. Instead of using the public information explicitly released by users, we employ a more reliable feature, i.e., camera fingerprint, to identify multiple accounts belonging to the same individual. To further alleviate the difficult problems of multiple cameras and reposting, a novel incremental multi-camera fingerprint estimation algorithm is introduced into the identification process. The experimental results show that camera fingerprint information effectively tackles the social media reconciliation problem, and that the proposed method remarkably alleviates both the multiple camera and reposting problems.
[1] S. Kemp, “Digital, social & mobile worldwide in 2015,” We Are Social, Available: http://wearesocial.com/uk/special-reports/digital-social-mobile-worldwide-2015.
[2] J. Tang, C. Zhang, K. Cai, L. Zhang, and Z. Su, “Sampling Representative Users from Large Social Networks,” Association for the Advancement of Artificial Intelligence, 2015.
[3] W. Wei, G. Cong, C. Miao, F. Zhu, and G. Li, “Learning to Find Topic Experts in Twitter via Different Relations,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 7, pp. 1764–1778, 2016.
[4] J. Li, C. Liu, J. Yu, Y. Chen, T. Sellis, and J. Culpepper, “Personalized Influential Topic Search via Social Network Summarization,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 7, pp. 1820–1834, 2016.
[5] H. Li, Z. Bu, A. Li, Z. Liu, and Y. Shi, “Fast and Accurate Mining the Community Structure: Integrating Center Locating and Membership Optimization,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 9, pp. 2349–2362, 2016.
[6] Q. Fang, J. Sang, C. Xu, and M. Hossain, “Relational User Attribute Inference in Social Media,” IEEE Transactions on Multimedia, vol. 17, no. 7, pp. 1031–1044, 2015.
[7] R. Satta and P. Stirparo, “On the usage of Sensor Pattern Noise for Picture-to-Identity Linking through Social Network Accounts,” International Conference on Computer Vision Theory and Applications, pp. 5–11, 2014.
[8] S. Tan, Y. Li, H. Sun, Z. Guan, X. Yan, J. Bu, C. Chen, and X. He, “Interpreting the Public Sentiment Variations on Twitter,” IEEE Transactions on Knowledge and Data Engineering, no. 5, pp. 1158–1170, 2013.
[9] X. Zhou, X. Liang, H. Zhang, and Y. Ma, “Cross-Platform Identification of Anonymous Identical Users in Multiple Social Media Networks,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 2, pp. 411–424, 2016.
[10] X. Qian, H. Feng, G. Zhao, and T. Mei, “Personalized Recommendation Combining User Interest and Social Circle,” IEEE Transactions on Multimedia, no. 7, pp. 1763–1777, 2014.
[11] T. Iofciu, P. Fankhauser, F. Abel, and K. Bischoff, “Identifying Users Across Social Tagging Systems,” International AAAI Conference on Web and Social Media, 2011.
[12] A. Narayanan, H. Paskov, N. Z. Gong, J. Bethencourt, E. Stefanov, E. Shin, and D. Song, “On the Feasibility of Internet-scale Author Identification,” 2012 IEEE Symposium on Security and Privacy, pp. 300–314, 2012.
[13] R. Zafarani and H. Liu, “Connecting Users across Social Media Sites: A Behavioral-Modeling Approach,” Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 41–49, 2013.
[14] J. Lukáš, J. Fridrich, and M. Goljan, “Digital Camera Identification from Sensor Pattern Noise,” IEEE Transactions on Information Forensics and Security, vol. 1, no. 2, pp. 205–214, 2006.
[15] M. Chen, J. Fridrich, M. Goljan, and J. Lukáš, “Determining Image Origin and Integrity Using Sensor Noise,” IEEE Transactions on Information Forensics and Security, vol. 3, no. 1, pp. 74–90, 2008.
[16] C. D. Manning, P. Raghavan, and H. Schütze, “Introduction to Information Retrieval,” Cambridge University Press, pp. 356–360, 2008.
[17] X. Zhou, X. Liang, H. Zhang, and Y. Ma, “Cross-Platform Identification of Anonymous Identical Users in Multiple Social Media Networks,” IEEE Transactions on Knowledge and Data Engineering, vol. 28, no. 2, pp. 411–424, 2016.
[18] S. Ji, W. Li, N. Gong, P. Mittal, and R. Beyah, “On Your Social Network De-anonymizablity: Quantification and Large Scale Evaluation with Seed Knowledge,” NDSS, pp. 8–11, 2015.
[19] R. Zafarani, L. Tang, and H. Liu, “User Identification Across Social Media,” ACM Transactions on Knowledge Discovery from Data, vol. 10, no. 2, 2015.
[20] S. Liu, S. Wang, and F. Zhu, “Structured Learning from Heterogeneous Behavior for Social Identity Linkage,” IEEE Transactions on Knowledge and Data Engineering, vol. 27, no. 7, pp. 2005–2019, 2015.
[21] D. Perito, C. Castelluccia, M. Kaafar, and P. Manils, “How Unique and Traceable are Usernames?” International Symposium on Privacy Enhancing Technologies Symposium, pp. 1–17, 2011.
[22] M. Korayem and D. Crandall, “De-anonymizing Users Across Heterogeneous Social Computing Platforms,” International AAAI Conference on Web and Social Media, 2013.
[23] O. Goga, P. Loiseau, R. Sommer, R. Teixeira, and K. P. Gummadi, “On the Reliability of Profile Matching Across Large Online Social Networks,” Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 1799–1808, 2015.
[24] P. Jain, P. Kumaraguru, and A. Joshi, “@I Seek ’Fb.Me’: Identifying Users Across Multiple Online Social Networks,” Proceedings of the 22nd International Conference on World Wide Web, pp. 1259–1268, 2013.
[25] J. Novak and A. Tomkins, “Anti-Aliasing on the Web,” Proceedings of the 13th International Conference on World Wide Web, ACM, pp. 30–39, 2004.
[26] J. Liu, F. Zhang, X. Song, Y. Song, C. Lin, and H. Hon, “What’s In A Name?: An Unsupervised Approach to Link Users Across Communities,” Proceedings of the Sixth ACM International Conference on Web Search and Data Mining, pp. 495–504, 2013.
[27] K. Cortis, S. Scerri, I. Rivera, and S. Handschuh, “Discovering Semantic Equivalence of People Behind Online Profiles,” Proceedings of the Resource Discovery (RED) Workshop, 2012.
[28] A. Malhotra, L. Totti, W. Meira, P. Kumaraguru, and V. Almeida, “Studying User Footprints in Different Online Social Networks,” Proceedings of the 2012 International Conference on Advances in Social Networks Analysis and Mining, pp. 1065–1070, 2012.
[29] N. Korula and S. Lattanzi, “An Efficient Reconciliation Algorithm for Social Networks,” Proceedings of the VLDB Endowment, vol. 7, no. 5, pp. 377–388, 2014.
[30] Y. Zhang, J. Tang, Z. Yang, J. Pei, and P. Yu, “COSNET: Connecting Heterogeneous Social Networks with Local and Global Consistency,” Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1485–1494, 2015.
[31] G. Schoenebeck, “Potential Networks, Contagious Communities, and Understanding Social Network Structure,” International Conference on World Wide Web, ACM, pp. 1123–1132, 2013.
[32] S. Tan, Z. Guan, D. Cai, X. Qin, J. Bu, and C. Chen, “Mapping Users across Networks by Manifold Alignment on Hypergraph,” Association for the Advancement of Artificial Intelligence, pp. 159–165, 2014.
[33] O. Goga, “Matching User Accounts Across Online Social Networks: Methods and Applications,” Doctoral dissertation, LIP6-Laboratoire d’Informatique de Paris 6, 2014.
[34] F. Bertini, R. Sharma, A. Iannì, and D. Montesi, “Profile Resolution across Multilayer Networks through Smartphone Camera Fingerprint,” Proceedings of the 19th International Database Engineering & Applications Symposium, ACM, pp. 23–32, 2015.
[35] F. Peng, J. Shi, and M. Long, “Identifying Photographic Images and Photorealistic Computer Graphics Using Multifractal Spectrum Features of PRNU,” IEEE International Conference on Multimedia and Expo, pp. 1–6, 2014.
[36] D. Valsesia, G. Coluccia, T. Bianchi, and E. Magli, “Large-scale Image Retrieval Based on Compressed Camera Identification,” IEEE Transactions on Multimedia, vol. 17, no. 9, pp. 1439–1449, 2015.
[37] R. Caldelli, I. Amerini, F. Picchioni, and M. Innocenti, “Fast Image Clustering of Unknown Source Images,” IEEE International Workshop on Information Forensics and Security, pp. 1–5, 2010.
[38] C.-T. Li, “Source Camera Identification Using Enhanced Sensor Pattern Noise,” IEEE Transactions on Information Forensics and Security, vol. 5, no. 2, pp. 280–287, 2010.
[39] X. Kang, Y. Li, Z. Qu, and J. Huang, “Enhancing Source Camera Identification Performance with a Camera Reference Phase Sensor Pattern Noise,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 2, pp. 393–402, 2012.
[40] T. Thai, R. Cogranne, and F. Retraint, “Camera Model Identification Based on the Heteroscedastic Noise Model,” IEEE Transactions on Image Processing, vol. 23, no. 1, pp. 250–263, 2014.
[41] Y. Tomioka and H. Kitazawa, “Digital Camera Identification Based on the Clustered Pattern Noise of Image Sensors,” IEEE International Conference on Multimedia and Expo, pp. 1–4, 2011.
[42] A. Castiglione, G. Cattaneo, M. Cembalo, and U. Petrillo, “Experimentations With Source Camera Identification and Online Social Networks,” Journal of Ambient Intelligence and Humanized Computing, vol. 4, no. 2, pp. 265–274, 2013.