Pre-Trained Language Transformers are Universal Image Classifiers

by Rahul Goel, et al.

Facial images disclose many hidden personal traits such as age, gender, race, health, emotion, and psychology. Understanding these traits helps to classify people by different attributes. In this paper, we present a novel method for classifying images using a pretrained transformer model. We apply the pretrained transformer to the binary classification of facial images into criminal and non-criminal classes. The pretrained GPT-2 transformer is trained to generate text and then fine-tuned to classify facial images. During fine-tuning with images, most of the layers of GPT-2 are frozen during backpropagation, and the resulting model is a frozen pretrained transformer (FPT). The FPT acts as a universal image classifier, and this paper shows the application of the FPT to facial images. We also use our FPT to classify encrypted images. Our FPT shows high accuracy on both raw facial images and encrypted images. We hypothesize, with theory and experiments, that the FPT gains its meta-learning capacity from its large size and from being trained on a large corpus. GPT-2, trained to generate a single word token at a time through the autoregressive process, is forced toward a heavy-tailed distribution. The FPT then uses the heavy-tail property as its meta-learning capacity for classifying images. Our work shows one way to avoid bias during the machine classification of images. The FPT encodes worldly knowledge through its pretraining on text, which it uses during classification. The statistical error of classification is reduced because of the added context gained from the text. Our paper also shows the ethical dimension of using encrypted data for classification. Criminal images are sensitive to share across borders, but encryption largely evades the ethical concern. The good classification accuracy of the FPT on encrypted images shows promise for further research on privacy-preserving machine learning.




1 Introduction

Vision is a vital modality for humans to understand the complex world around them [14]. In modern life, police forces, security agencies, and defense agencies use images of a person to understand their criminal tendency [17]. As the saying goes, "a picture speaks a thousand words": we can determine people’s feelings, emotions, and aggressiveness just by looking at their faces with fairly good accuracy. Detecting a person’s criminal tendency from their face in railway stations, airports, mass gatherings, and immigration checkpoints is sensitive and demands very high accuracy. Security personnel monitor sensitive places by watching the crowd through computer monitors or in person. Any detection error may have a huge bearing on people’s lives. On top of this, defense forces nowadays rely heavily on automated machine learning systems to detect criminal tendencies from facial images. This comes from the practical necessity of analyzing images at scale in crowded areas. Convolutional neural networks (CNNs) are employed as classifiers in the automated classification of images into binary classes, e.g., Class 0 = criminal and Class 1 = non-criminal. Though CNNs act as good classifiers, they use only the features of images to classify. For example, if a CNN sees many squarish-faced, thick-nosed images labeled as criminal in the training set, it will be biased towards labeling similar features in the test set as criminal. This inductive bias comes from the combination of the CNN’s architectural constraints and the imbalanced distribution of the training data. The architectural constraints force the CNN to combine the low-level features of images and pass them to higher layers. The CNNs generally used for classification are powerful pattern matchers and can contort themselves to fit almost any unbalanced facial image dataset. A CNN model understands the world by gluing together thousands (even millions) of linear and non-linear functions and adapting each of these functions using training examples. It thus builds a high-dimensional manifold that fits the training image dataset, and it generalizes by inductively pattern-matching onto what it has seen before in the training dataset. CNNs (and a wide variety of machine learning models) thus generalize their predictions by induction and not by deduction. This explains why big deep vision models require large amounts of data to learn to classify. Big vision models are good inductive interpolators but not such good extrapolators. Thus, deep vision models such as CNNs and capsule networks can be risky for classifying images of people, and these models’ understanding of the images is superficial. Moreover, the limited availability of criminal images compared to non-criminal images adds a further problem. To better understand facial images, we need to give the models additional context in the form of architectural priors and statistical priors. Building image priors for CNNs is an arduous task, and the deep learning community relies on data at scale for better generalization.

In this paper, we circumvent the above problems of deep vision models for binary classification of images using a novel method of transfer learning. We use a language model, the Frozen Pre-Trained Transformer (FPT), for the classification task. Our FPT is pre-trained on a vast amount of text, and the learned FPT is then fine-tuned on facial images for classification. The FPT is pre-trained on publicly available datasets, e.g., Wikipedia, Google Books, etc., to generate the next word of a sequence through the autoregressive method. The autoregressive pre-training method gives a heavy-tailed distribution of the generated output words. The heavy-tailed behavior gained by the FPT during word generation acts as a prior for the fine-tuning process of the image classification. Because of the pre-training, our FPT bypasses the limitations of the inductive assumption that the unseen test data will resemble the training data. Though this assumption is true in some cases, it does not always hold. Facial images with similar features in the training and test sets may tell completely different stories, and a model may thus wrongly interpret an unseen test image. We have used the autoregressive pre-training method to overcome the inductive assumption, and our FPT models the world in a non-linear fashion. The FPT may predict similar images in the training and test data differently. The long-tail distribution helps the model to catch the criminal cases that are less common (e.g., old white women). Training examples of these outliers are the hardest for the FPT to come across. The heavy tail also helps the FPT to circumvent different biases (racial, ethnic, facial features) during classification through non-linear modeling of training and test data. To the best of our knowledge, this paper is the first attempt to use a language model to detect criminal tendency from a facial image.

The main contributions of our paper are as follows:

  • We use a pre-trained language model to classify facial images for criminal detection.

  • We have theoretically proved and experimentally validated that the Frozen Pre-Trained Transformer (FPT) is a universal image classifier.

  • We have shown experimentally that FPT can classify encrypted images.

  • We built a balanced facial images data set by collecting the images from 11 different sources through web scraping.

2 Related work

Research has shown that people judge a person by their facial features [41, 48]. People infer whether a person is likely to be trustworthy, competent, or dominant by looking at photos or computer-generated images [6]. Psychological research shows that we judge a person within a couple of seconds of seeing him/her for the first time. The long history of evolution has trained humans (and other primates and animals) to quickly infer danger, risk, etc., from the facial features of their group members. Present vision systems have developed amazingly and can detect a person’s trustworthiness, honesty, etc., simply by looking at the face [13, 4]. The police and defense agencies have largely relied on the ingenuity of the human vision system to detect criminals, robbers, and people who choose to live a shady life. Moreover, criminals are also getting smarter every day and have become quite successful in evading the tactics of the police [6]. Thus, there always lies a finite probability of a Type I error (false positive conclusion) and a Type II error (false negative conclusion) [16]. Though the initial impression comes from facial images, such impressions are inherently influenced by bias (racial, ethnic, color, personal). Multiple sources of information need to be fused to overcome the embedded bias [41, 37]. This complexity brings many challenges to empirical studies of facial images. The criminal manifold and the non-criminal manifold are mixed, and it is a nontrivial task to separate them.

Criminal tendency was thoroughly investigated by Lombroso. His works on the "born criminal" influenced European and American thinking about the causes of criminal behavior. They also laid the foundations of theories of crime based on facial structure and genetic explanations [25]. Lombroso’s work brought the entanglement of facial structure and emotions to the forefront of research. In [47], the authors study the possibility of teaching machines to pass the Turing test on the task of duplicating humans in their first impressions of facial images, personality traits, etc. Their study focuses on the face perception of criminality and seeks to validate the hypothesis of correlations between the innate traits of a person and his/her facial features. A flurry of research activity on using deep neural networks for computer vision challenges started with the 2012 paper of Alex Krizhevsky et al. [24]. Deep neural networks, mainly convolutional networks, capsule networks, and recently vision transformers, are in ubiquitous use in computer vision research [40]. Neural networks extract features from the facial image data only [33, 39]. In one line of work, these features are passed through non-linear transform functions to classify sexual orientation from facial images. The features that identify sexual orientation are learned exclusively from the data through backpropagation and achieve an accuracy of 81% for men and 71% for women. The accuracy of these deep neural systems surpasses that of human judges, who achieve an accuracy of 61% for men and 54% for women. Xin Geng et al. estimate age from facial images using the IIS-LLD and CPNN algorithms [12]. The authors exploited the changes of facial features as a slowly varying function of time to overcome the scarcity of training data. Along similar lines, Andrew G. Reece et al. used Instagram face images as markers to detect depression and psychiatric disorders [36]. Using color analysis and metadata components, they compute multiple statistical features from the Instagram photos. Their proposed models also outperformed humans’ success rate in identifying depression. The authors of [8] presented a method for detecting criminal activity in a video stream by recording a person’s aberrant actions in subsequent video frames. They employed a hybrid deep learning algorithm to analyze video stream data for surveillance in urban residential areas. Umadevi V. Navalgund et al. presented a fascinating work on capturing crime intention in public places such as ATMs and banks by detecting weapons in hand [28]. This is in the direction of pre-crime technology, where the machine learning model tries to predict the crime before it happens. The authors used the pre-trained deep models GoogleNet and VGGNet-19 for the predictive behavior. In closely related works, pre-trained deep learning models such as VGG-19 [27] and GoogleNet [49] have been used to identify a knife or pistol in a person’s hand aimed towards another person. The authors of [35] advocated real-time criminal detection utilizing machine learning and deep learning for crime prevention. Pre-trained deep learning models, e.g., VGG-16, VGG-19, GoogleNet, and Inception V3, are trained on a wider variety of datasets [43, 27, 49]. The pre-trained models thus encode worldly knowledge through pre-training, which is used to detect criminal tendencies or intentions from facial images, crowd images, etc. As deep models build their features from the data (without any hand coding), their unreasonable effectiveness (accuracy in predicting criminal tendencies) comes from the volume and veracity of the data.

3 Theory

The FPT is a transformer-decoder block trained on massive web text crawled from the internet. The FPT is trained to generate one word token at a time. Our FPT architecture, unlike a standard transformer, has no encoder. For example, starting from an initial token, the FPT generates the next word token, say “cat”. The extended sequence is then fed back into the FPT to generate the following word token. This is the autoregressive method by which the FPT is pretrained on web text to generate a single word token at every step. We model this autoregressive generation of word tokens as a stochastic recurrence equation. The word token $x_t$ generated at time $t$ depends on $x_{t-1}$ as [7, 22]:

$$x_t = A_t x_{t-1} + B_t, \qquad (1)$$

where the matrices $A_t$ and $B_t$ come from the FPT architecture. The solution of Equation 1 is obtained recursively:

$$x_t = A_t A_{t-1} x_{t-2} + A_t B_{t-1} + B_t. \qquad (2)$$

Thus the word token generated at time $t$ depends on the initial word token $x_0$ and follows:

$$x_t = \sum_{i=0}^{t-1} \Big( \prod_{j=0}^{i-1} A_{t-j} \Big) B_{t-i} + \Big( \prod_{j=0}^{t-1} A_{t-j} \Big) x_0. \qquad (3)$$

If the expected logarithm of the norm of the matrix $A_t$ is finite and negative,

$$\mathbb{E}\big[\log \lVert A_t \rVert\big] < 0,$$

then the second term of Equation 3 converges to $0$ exponentially fast, and thus, for independent and identically distributed pairs $(A_t, B_t)$, the distribution of the word token generated at time $t$ converges to that of

$$x_\infty = \sum_{i=0}^{\infty} \Big( \prod_{j=1}^{i} A_j \Big) B_{i+1}.$$

This shows that the word token generated at time $t$ becomes independent of the initial word $x_0$. This is of extreme importance, as violating this condition would make future word tokens highly sensitive to the initial word $x_0$. The distribution of the generated word depends on the pair $(A_t, B_t)$. Renewal theory shows that, under reasonable conditions, there exist $\alpha > 0$ and a constant $c > 0$ such that the tail probability of the word token follows [9]:

$$P\big(\lVert x_\infty \rVert > u\big) \sim c\, u^{-\alpha} \quad \text{as } u \to \infty.$$

Thus, our FPT model generates words with long-range interactions because of the heavy-tailed distribution. This heavy-tailed distribution introduces variance into the learning, so our model moves beyond memorization of the dataset to true learning during the fine-tuning process. This true learning, or meta-learning capability, is exploited for better generalizability during the fine-tuning process for binary classification.
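The heavy-tail argument above can be illustrated numerically with a scalar stochastic recurrence x_t = a_t x_{t-1} + b_t. The sketch below is an illustration under stated assumptions, not the model used in this paper: the log-normal multiplier, the Gaussian noise, and the function names `simulate_recurrence` and `tail_ratio` are all hypothetical choices. It compares a random multiplier that contracts on average (E[log a_t] < 0) with a constant multiplier, using the ratio of the 99% quantile to the median of |x_t| as a rough measure of tail weight.

```python
import random


def simulate_recurrence(sample_a, steps=100, chains=4000, seed=0):
    """Run independent chains of x_t = a_t * x_{t-1} + b_t; return the final |x_t|."""
    rng = random.Random(seed)
    finals = []
    for _ in range(chains):
        x = 0.0
        for _ in range(steps):
            # b_t is standard Gaussian noise (an illustrative assumption).
            x = sample_a(rng) * x + rng.gauss(0.0, 1.0)
        finals.append(abs(x))
    return finals


def tail_ratio(values, q=0.99):
    """Ratio of the q-quantile to the median; a larger value means a heavier tail."""
    ordered = sorted(values)
    median = ordered[len(ordered) // 2]
    upper = ordered[int(q * (len(ordered) - 1))]
    return upper / median


def random_multiplier(rng):
    """Log-normal a_t with E[log a_t] = -0.5 < 0, but a_t > 1 with positive probability."""
    return rng.lognormvariate(-0.5, 1.0)


def constant_multiplier(rng):
    """Constant a_t = 0.5: the stationary law is Gaussian, hence light-tailed."""
    return 0.5


if __name__ == "__main__":
    heavy = tail_ratio(simulate_recurrence(random_multiplier))
    light = tail_ratio(simulate_recurrence(constant_multiplier))
    print(f"tail ratio, random multiplier:   {heavy:.1f}")
    print(f"tail ratio, constant multiplier: {light:.1f}")
```

With a random multiplier, occasional values of a_t above 1 amplify the state and produce rare, very large excursions, which is the mechanism behind the power-law tail; with a constant multiplier the same recurrence stays light-tailed.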

Figure 1: Learning Pipeline for Binary Classification with FPT

4 Dataset Description

We have collected facial image data from multiple sources through web crawling. The advantage of using multiple web sources to collect facial images is a balanced distribution between criminal and non-criminal classes. The facial dataset drawn from multiple web sources yields a mixed data distribution for training during the fine-tuning process. Thus, collecting from multiple websites reduces the bias embedded in the training set. We describe the data collection method and the preprocessing in detail in Section 4.1. In Section 4.2, we explain the Chaos-Based Image-Encryption algorithm [2] used to encrypt facial images. The encrypted images are used as input images for training. The purpose of using encryption on facial images is to show that pretrained transformers are powerful enough for binary classification even on encrypted inputs. The practical usage comes from the fact that sharing criminal images across nations, continents, etc. is sensitive and subject to legal and ethical constraints, while sharing encrypted images (without revealing the true identity) is more viable, as explained in Section 4.2.

4.1 Data Collection and Preprocessing

Dataset                                                   Total     Pre-processed
Criminal Dataset
  Smoking Gun                                             10,990    8,217
  National Institute of Standards and Technology (NIST)   1,756     1,512
  Drug Enforcement Administration (DEA)                   587       337
  Crime stoppers                                          125       125
  Federal Bureau of Investigation (FBI)                   118       72
  Office of Inspector General (OIG)                       42        38
  U.S. Immigration and Customs Enforcement (ICE)          41        12
  National Criminal Agency (NCA)                          22        17
  Tennessee Bureau of Investigation (TBI)                 9         6
Non-Criminal Dataset
  10k US Adult Faces Database                             10,168    10,168
  Flickr-Faces-HQ Dataset (FFHQ)                          70,000    168
Table 1: Criminal and non-criminal dataset collection information.

We have collected a total of 13,690 images of arrested or wanted criminals, including mugshots, from nine different sources (see Table 1, Columns 1 and 2 for details). Of the collected images, a total of 11,934 RGB facial images are collected using web scraping from eight sources (Smoking Gun [15], Drug Enforcement Administration (DEA) [10], Crime stoppers [38], Federal Bureau of Investigation (FBI) [11], Office of Inspector General (OIG) [31], U.S. Immigration and Customs Enforcement (ICE) [20], National Criminal Agency (NCA) [1], Tennessee Bureau of Investigation (TBI) [30]), and 1,756 gray-scale mugshot images of arrested individuals are obtained from the National Institute of Standards and Technology (NIST) Special Database [29]. Images are in different formats: PNG, JPG, or JPEG. The dataset contains images of individuals of various races, genders, and facial expressions and contains both front and side (profile) views. Since we focus on frontal face shots, we need to eliminate profile views. A Haar basis function-based cascade classifier detects images containing frontal face views and also detects the rectangular area containing the face [44]. Images are first passed to a pre-trained version of this classifier, available in the OpenCV library in Python [5], to select only the images containing frontal face views, and then we crop the rectangular area containing the face. Cropping the facial rectangle from the rest of the image prevents the classifier from being biased by peripheral or background effects surrounding the face. The non-frontal face images that the Haar feature-based cascade classifier misclassifies as frontal are manually deleted. To further reduce the effect of peripheral or background surroundings, we crop the face using an ellipse mask of the same size, as shown in Figure 2. The result contains approximately 10k front-view face images of different formats (see Table 1, Column 3 for details). We have also converted all images to PNG format and all RGB images to gray-scale to preserve consistency. To keep the output dimension the same as the input, we resize images with padding to 256×256.

Figure 2: Face Detection & Image Crop Using Ellipse Mask: In (a), we start by showing the original photo. The face is highlighted using the green box in (b) once detected. Finally, face is cropped using an ellipse mask (shown in (c)).
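The masking and padding steps described above can be sketched in a few lines. This is a minimal illustration on a gray-scale image stored as a nested list, assuming the face rectangle has already been detected and cropped (the paper uses OpenCV's pre-trained Haar cascade for that step). The function names `ellipse_mask` and `pad_to_square` and the choice of 0 as background value are hypothetical, and the final resize to 256×256 is omitted.

```python
def ellipse_mask(image, background=0):
    """Keep pixels inside the ellipse inscribed in the image; blank the rest."""
    height, width = len(image), len(image[0])
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0  # ellipse centre
    ry, rx = height / 2.0, width / 2.0              # semi-axes
    masked = []
    for y, row in enumerate(image):
        new_row = []
        for x, pixel in enumerate(row):
            inside = ((x - cx) / rx) ** 2 + ((y - cy) / ry) ** 2 <= 1.0
            new_row.append(pixel if inside else background)
        masked.append(new_row)
    return masked


def pad_to_square(image, background=0):
    """Pad the shorter side symmetrically so that the result is square."""
    height, width = len(image), len(image[0])
    size = max(height, width)
    top = (size - height) // 2
    left = (size - width) // 2
    padded = [[background] * size for _ in range(size)]
    for y, row in enumerate(image):
        padded[top + y][left:left + width] = row
    return padded


if __name__ == "__main__":
    face = [[9] * 4 for _ in range(6)]  # toy 6x4 "face crop"
    for row in pad_to_square(ellipse_mask(face)):
        print(row)
```

Padding before resizing preserves the aspect ratio of the face, so the ellipse is not distorted when the image is scaled to the model's input size.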

A total of 10,168 RGB facial images are obtained from the 10k US Adult Faces Database and converted to the standard PNG format [3]. We consider these images as non-criminal face shots. Images are sampled from different races, genders, and facial expressions. The final database contains only front views, with the effect of peripheral and background surroundings reduced through ellipse masking. We additionally added some images from the Flickr-Faces-HQ Dataset (FFHQ) [21] to balance the criminal and non-criminal datasets. FFHQ is a dataset of human faces that includes more variation in terms of age, ethnicity, and image background. Since our focus is on frontal face shots, we eliminate profile views using the Haar feature-based cascade classifier [44]. The remaining non-frontal face images are then manually deleted. Also, to keep the age distribution uniform, images of the elderly and children are manually deleted from this dataset.

4.2 Data Encryption

We have used the Chaos-Based Image-Encryption algorithm to encrypt the images [2]. Chaos-based encryption uses a dynamical system to generate random-looking sequences of numbers. Such a sequence is then used to generate the public key for the encryption. Chaotic systems are sensitive to initial conditions, which makes their behavior similar to random behavior. For key generation, a chaotic map of degree $n$ is defined by a recurrence relation (e.g., the Chebyshev map) [23]:

$$T_n(x) = 2x\,T_{n-1}(x) - T_{n-2}(x), \qquad T_0(x) = 1, \quad T_1(x) = x.$$

To generate keys, we first generate a large integer $s$ and a random number $x \in [-1, 1]$ and then compute the chaotic map $T_s(x)$. The public key is the pair $(x, T_s(x))$ and the private key is $s$. To encrypt a facial image $M$, we again generate a large integer $r$ and compute $T_r(x)$ and $T_r(T_s(x))$. The encrypted image is then $M \cdot T_r(T_s(x))$, on which our FPT performs the binary classification. We view our chaos-based encryption of the facial image as a non-linear transformation on the pixel space. We show some examples of chaos-based encrypted images and their corresponding pre-processed images in Figure 3. Though the encrypted images are indistinguishable and unidentifiable to human eyes, our pretrained transformer FPT exploits equivariance on encrypted images to classify them [32]. The high accuracy of binary classification in the encrypted space points to an exciting direction: the equivariance principle that deep models use. The practical advantage of working in the encrypted pixel space comes from the fact that sharing sensitive data (e.g., criminal images) is not always possible. Such sharing of data raises both legal and ethical concerns. Sharing encrypted images across borders will alleviate the ethical and legal issues. Transnational police agencies that work across nations, e.g., Interpol and Europol, will be able to operate more smoothly because of the encrypted images.

(a) Criminal X encrypted image
(b) Criminal X preprocessed image
(c) Non-criminal Y encrypted image
(d) Non-criminal Y preprocessed image
Figure 3: A few examples of encrypted and preprocessed criminal and non-criminal images.
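The role of the Chebyshev map in chaos-based encryption can be illustrated with a short sketch. It is a simplified illustration and not the algorithm of [2]: it checks the semigroup property T_r(T_s(x)) = T_s(T_r(x)) that key agreement with Chebyshev maps relies on, and then XORs pixel values with a keystream obtained by iterating the map. The function names `chebyshev`, `keystream`, and `xor_pixels`, the byte quantization, and the seed and degree values are assumptions made for illustration only.

```python
import math


def chebyshev(n, x):
    """Chebyshev polynomial T_n(x) for x in [-1, 1], using T_n(x) = cos(n * arccos(x))."""
    return math.cos(n * math.acos(x))


def keystream(seed, degree, length):
    """Iterate x <- T_degree(x) and map each state in [-1, 1] to a byte in 0..255."""
    x = seed
    stream = []
    for _ in range(length):
        x = chebyshev(degree, x)
        stream.append(min(255, int((x + 1.0) * 128.0)))
    return stream


def xor_pixels(pixels, seed, degree):
    """XOR 8-bit pixel values with the chaotic keystream; applying it twice decrypts."""
    stream = keystream(seed, degree, len(pixels))
    return [p ^ k for p, k in zip(pixels, stream)]


if __name__ == "__main__":
    x, s, r = 0.3, 7, 11
    # Both orders of composition equal T_{r*s}(x), up to rounding error.
    print(chebyshev(r, chebyshev(s, x)), chebyshev(s, chebyshev(r, x)))

    pixels = [0, 17, 128, 255, 42]
    encrypted = xor_pixels(pixels, seed=0.3, degree=5)
    print(encrypted, xor_pixels(encrypted, seed=0.3, degree=5) == pixels)
```

Floating-point evaluation of the map is adequate for a demonstration like this one, but it loses precision for large degrees, so the sketch should not be read as a secure or production-ready scheme.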

5 Architecture

The model we use in our experiments is a frozen pretrained transformer (FPT) based on the GPT-2 model [26]. Our FPT model belongs to the Transformer class of architectures introduced by Ashish Vaswani et al. [42]. We chose the FPT, a decoder-only transformer, because of its capability to generate a single word at a time. This recursive generation of words, along with the forward masking, gives the model heavy-tail properties. This heavy tail gives the model the ability to understand the pixel content. The architecture of FPT has the following attributes [34, 46]: embedding size is , the number of layers is , output dimension is . The FPT is pretrained (for word token generation) on a large corpus of 40 GB of text data. The pretrained model is then fine-tuned with 20K frontal images as explained in Section 4.1. During fine-tuning, only the linear input layer, the linear output layer, the positional embeddings, and the layer norm parameters are updated by stochastic gradient descent. The rest of the architecture is not updated and thus remains frozen.
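The freezing rule described above amounts to a simple selection over parameter names. The sketch below is a hedged illustration rather than the training code of this paper: the parameter names follow the naming used by the Hugging Face GPT-2 implementation, while `input_projection` and `output_head` are hypothetical names for the new linear input and output layers.

```python
# Hypothetical parameter names (assumption: Hugging Face GPT-2 style naming).
PARAMETER_NAMES = [
    "input_projection.weight",  # new linear input layer for image patches
    "wpe.weight",               # positional embeddings
    "h.0.ln_1.weight",          # layer norm inside transformer block 0
    "h.0.attn.c_attn.weight",   # self-attention weights (frozen)
    "h.0.mlp.c_fc.weight",      # feed-forward weights (frozen)
    "ln_f.weight",              # final layer norm
    "output_head.weight",       # new linear output layer
]

# Substrings that mark the parameter groups updated during fine-tuning.
TRAINABLE_MARKERS = ("input_projection", "output_head", "wpe", "ln_")


def is_trainable(name):
    """Return True if the parameter belongs to a group that is fine-tuned."""
    return any(marker in name for marker in TRAINABLE_MARKERS)


def split_parameters(names):
    """Split parameter names into (trainable, frozen) lists."""
    trainable = [n for n in names if is_trainable(n)]
    frozen = [n for n in names if not is_trainable(n)]
    return trainable, frozen


if __name__ == "__main__":
    trainable, frozen = split_parameters(PARAMETER_NAMES)
    print("trainable:", trainable)
    print("frozen:   ", frozen)
```

In a PyTorch setting, the same predicate could be applied while iterating over `model.named_parameters()` to set `requires_grad` on each parameter, so that the self-attention and feed-forward weights receive no gradient updates.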

6 Results

In this section, we first describe the performance metrics used to measure the generalization capacity of the FPT (Section 6.1), and then we discuss the results of binary classification (Section 6.2). We fine-tune our model with approximately 20k images (criminals plus non-criminals), separately on the original images and on their encrypted form, as shown in Table 2. A total of 16,672 training images and 4,000 test images are used in our experiments, the results of which are reported in Section 6.2. Our experiments are conducted on an NVIDIA Tesla V100 GPU with 512 GB of memory.

                 Criminal    Non-Criminal
Train dataset    8,336       8,336
Test dataset     2,000       2,000
Table 2: Train-test data information.

6.1 Measuring Performance

6.1.1 AUC (AUROC)

The area under the receiver operating characteristic curve (AUROC) is a performance metric used to evaluate classification models. The AUROC is the probability that a randomly selected positive example has a higher predicted probability of being positive than a randomly selected negative example. The AUROC is calculated as the area underneath a curve that measures the trade-off between the true positive rate (TPR) and the false positive rate (FPR) at different decision thresholds d. A random baseline classifier has an AUROC of 0.5, while a perfect classifier has an AUROC of 1.0.
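The probabilistic definition of the AUROC given above translates directly into code. The following minimal sketch counts, over all (positive, negative) pairs, how often the positive example receives the higher score; the function name `auroc` and the convention of counting ties as one half are illustrative choices, and the quadratic pair loop is meant for clarity rather than speed.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Three of the four (positive, negative) pairs are ordered correctly.
    print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```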

6.1.2 Average Precision

Average precision or AUPRC (Area Under the Precision-Recall Curve) is calculated as the area under Precision-Recall (PR) curve. A PR curve shows the trade-off between precision and recall across different decision thresholds. Thus, the average precision is high when a model can correctly handle positive examples. With AUPRC, the baseline is equal to the fraction of positive examples. The fraction of positive examples is calculated as the number of positive examples divided by the total number of examples. This gives different classes different AUPRC baselines. For balanced two-class data, the AUPRC baseline is 0.5, while a perfect classifier has an average precision of 1.0.
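Average precision can be sketched in the same spirit. In the version below, examples are ranked by decreasing score and the precision is averaged over the ranks at which a positive example appears. The function name `average_precision` is illustrative, and tied scores are not given any special treatment in this sketch.

```python
def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive example."""
    total_positives = sum(1 for y in labels if y == 1)
    if total_positives == 0:
        raise ValueError("average precision needs at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall level
    return precision_sum / total_positives


if __name__ == "__main__":
    # Positives appear at ranks 1 and 3, so AP = (1/1 + 2/3) / 2.
    print(average_precision([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```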

(a) Pre-processed data: Accuracy
(b) Pre-processed data: Loss
(c) Encrypted data: Accuracy
(d) Encrypted data: Loss
Figure 4: Train-test accuracy and loss for pre-processed facial images (a, b) and encrypted facial images (c, d) respectively.
(a) Pre-processed data
(b) Encrypted data
Figure 5: AUROC and AUPRC plots for pre-processed, and encrypted data.

6.2 Generalization Results

We have used two datasets, (1) pre-processed frontal facial images and (2) chaos-based encrypted images, and fine-tuned two separate FPT models. Our models’ train and test accuracies and losses over epochs are plotted in Figure 4. Figures 4(a) and 4(b) show the accuracy and loss of the fine-tuned FPT model on pre-processed frontal image data. Figures 4(c) and 4(d) show the accuracy and loss of the fine-tuned FPT model on encrypted data. For pre-processed data, we can infer that the model is trained properly, as the test and train accuracies reach their maximum in around 350-400 epochs and the corresponding loss also saturates to its minimum value in a similar number of epochs (see Figures 4(a) and 4(b)). On the other hand, for the encrypted data, the training accuracy is always higher than the test accuracy even after 1500 epochs, and the corresponding loss is always higher for test data. Although the case of encrypted images can be explained as a standard case of overfitting under statistical learning theory, we instead view it from the perspective of the scaling hypothesis. The billion-parameter FPT model is trained in an unsupervised manner on large, internet-scale text for word generation. While trained to generate text, our FPT has no choice but to solve many hard problems. This drives our FPT towards meta-learning. Our FPT is forced to move from memorizing parts of the data during pretraining to true learning. Thus, our FPT builds informative priors while pretraining on the large text dataset. The FPT as a meta-learner learns as an amortized Bayesian inference model. Even though we show the accuracy only for binary classification in this paper, we propose two hypotheses based on our observations:

  • Accuracy won’t suffer much even for the problem of multi-label classification of the facial images with FPT.

  • We will further improve the training and accuracy of the encrypted images, by choosing carefully the initial conditions of Chaotic map (Chebyshev map) in chaos-based encryption.

Next, we calculate the performance of our fine-tuned FPT models for pre-processed facial images and encrypted images separately using two metrics, AUROC and average precision. The FPT generates many true positives and true negatives during testing, as our AUROC and average precision are both high (Figure 5). In the ROC plot (red), we observe that the decision thresholds d = 0.9 to d = 0.5 span only a small interval of the FPR. Because of the high decision thresholds, the false positives are low and the true negatives are high, thus producing a small FPR. The average precision is not improved by the number of true negatives, as true negatives are not used in its calculation. The average precision is improved by the decrease in false positives. This is because some examples are shifted from false positives to true negatives, which keeps the dataset balanced. Precision = tp/(tp+fp), so when we make fp smaller, we increase the precision. We use a confusion matrix, shown in Figure 6, that counts how many images from the criminal and non-criminal classes are correctly classified. We observe that 0.02% of images from the criminal class and 0.01% of images from the non-criminal class are misclassified using pre-processed images. Similarly, for the encrypted data, we observe that 0.03% of images from the criminal class and 0.02% of images from the non-criminal class are misclassified.

(a) Pre-processed data
(b) Encrypted data
Figure 6: Normalized confusion matrix
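A normalized confusion matrix and the precision formula used above can be computed as in the following sketch. The class encoding follows the introduction (0 = criminal, 1 = non-criminal); the function names and the toy labels are illustrative and are not the paper's evaluation data.

```python
def normalized_confusion_matrix(true_labels, predicted_labels, num_classes=2):
    """Row-normalized confusion matrix: entry [t][p] is the fraction of class t predicted as p."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


def precision(tp, fp):
    """Precision = tp / (tp + fp); fewer false positives give higher precision."""
    return tp / (tp + fp)


if __name__ == "__main__":
    y_true = [0, 0, 1, 1]  # toy labels: 0 = criminal, 1 = non-criminal
    y_pred = [0, 1, 1, 1]
    print(normalized_confusion_matrix(y_true, y_pred))
    print(precision(tp=3, fp=1))
```

Each row of the normalized matrix sums to one, so the off-diagonal entries can be read directly as per-class misclassification rates.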

7 Conclusion

We have presented a novel method of using a frozen pretrained transformer (FPT) that is trained on a text corpus and then fine-tuned on images for binary classification. We have shown that large transformer language models are expressive enough to solve vision tasks. We have used a pretrained transformer model trained to generate text to solve the binary classification of facial images. The high accuracy of our pretrained model on the classification task shows the phenomenon of the "blessings of scale". Our FPT, with billions of parameters, started showing its meta-learning capability during its pretraining for word generation. Because of these blessings of scale, our FPT becomes powerful and more generalizable when solving the image classification problem [19]. The other surprising capacity of our model is its ability to classify even encrypted images. Our work contributes towards using transformer models in privacy-preserving machine learning.

On the practical side, our paper contributes a step towards the automated classification of criminal images with high accuracy. These models can be deployed by police and defense agencies and by border security at airports, railway stations, and other important and crowded venues as a pre-crime tech tool. The other important aspect we address is the inherent bias of security personnel that leads to people being wrongly convicted. Social movements like Black Lives Matter highlight police brutality towards particular sections of society. We need to build more ethical tools that are free from human bias, and one crucial way is to build large machine learning models trained on a wide variety of data. This helps the model avoid a myopic view of the world and judge humans ethically.

Ethical Statement

While collecting and analyzing the dataset for this investigation, the researchers were committed to high ethical standards. All researchers were guided by three ethical principles throughout the various stages of the research, including data collection, data processing, data analysis, and dissemination of the key findings:

  1. The researchers ensured the quality and integrity of all research practices.

  2. Personal information contained in the collected data has been treated as confidential and was anonymized in public presentations and publications.

  3. The researchers testify that their research is independent and impartial.

Throughout the project, the researchers considered both positive and negative outcomes of the research. For instance, the publication of outcomes, such as the identification of people in the dataset, could possibly affect their well-being. To mitigate such risks, the researchers anonymized any personal information in published materials to elicit informed consent on the draft paper. Finally, the data sets used for the investigation were stored at protected locations in the cloud, and all personal information gathered throughout the research will be destroyed after the last publication of the project.


This research (or publication) has been financed by the European Social Fund, Estonian Research Council Institutional Research grant PUT PRG306 and ERDF via the IT Academy Research Programme, the H2020 framework project SoBigData++, and the CHIST-ERA project SAI. This work has also been supported by NASA (Awards 80NSSC20K1720 and 521418-SC) and NSF (Awards 2007202 and 2107463).


  • [1] National Criminal Agency. National criminal agency., 2020.
  • [2] Somaya Al-Maadeed, Afnan Al-Ali, and Turki Abdalla. A new chaos-based image-encryption and compression algorithm. Journal of Electrical and computer Engineering, 2012, 2012.
  • [3] Wilma A Bainbridge, Phillip Isola, and Aude Oliva. The intrinsic memorability of face photographs. Journal of Experimental Psychology: General, 142(4):1323, 2013.
  • [4] Flavio Bertini, Rajesh Sharma, Andrea Iannì, and Danilo Montesi. Profile resolution across multilayer networks through smartphone camera fingerprint. In Proceedings of the 19th International Database Engineering & Applications Symposium, pages 23–32, 2015.
  • [5] Gary Bradski and Adrian Kaehler. Opencv. Dr. Dobb’s journal of software tools, 3, 2000.
  • [6] Gavin Buckingham, Lisa M DeBruine, Anthony C Little, Lisa LM Welling, Claire A Conway, Bernard P Tiddeman, and Benedict C Jones. Visual adaptation to masculine and feminine faces influences generalized preferences and perceptions of trustworthiness. Evolution and Human Behavior, 27(5):381–389, 2006.
  • [7] D. Buraczewski, E. Damek, and T. Mikosch. Stochastic Models with Power-Law Tails: The Equation X = AX + B. Springer Series in Operations Research and Financial Engineering. Springer International Publishing, 2018.
  • [8] Sharmila Chackravarthy, Steven Schmitt, and Li Yang. Intelligent crime anomaly detection in smart cities using deep learning. In 2018 IEEE 4th International Conference on Collaboration and Internet Computing (CIC), pages 399–404. IEEE, 2018.
  • [9] D.R. Cox. Renewal Theory. Methuen’s monographs on applied probability and statistics. Methuen, 1962.
  • [10] DEA. Drug enforcement administration., 2020.
  • [11] FBI. Federal bureau of investigation., 2020.
  • [12] Xin Geng, Chao Yin, and Zhi-Hua Zhou. Facial age estimation by learning from label distributions. IEEE transactions on pattern analysis and machine intelligence, 35(10):2401–2412, 2013.
  • [13] Rahul Goel, Tymofii Brik, and Rajesh Sharma. Do facial trait correlates with roll call voting in parliament? using fwhr to study performance in politics. arXiv preprint arXiv:2110.00780, 2021.
  • [14] Elkhonon Goldberg. The new executive brain: Frontal lobes in a complex world. Oxford University Press, 2009.
  • [15] The Smoking Gun. The smoking gun: Public documents, mug shots., 2020.
  • [16] Michael P Haselhuhn and Elaine M Wong. Bad to the bone: facial structure predicts unethical behaviour. Proceedings of the Royal Society B: Biological Sciences, 279(1728):571–576, 2012.
  • [17] Mahdi Hashemi and Margeret Hall. Retracted article: Criminal tendency detection from facial images and the gender bias effect. Journal of Big Data, 7(1):1–16, 2020.
  • [18] Nima Hatami, Yann Gavet, and Johan Debayle. Classification of time-series images using deep convolutional neural networks. In Tenth international conference on machine vision (ICMV 2017), volume 10696, page 106960Y. International Society for Optics and Photonics, 2018.
  • [19] Zhichun Huang, Shaojie Bai, and J Zico Kolter. $(\textrm{Implicit})^2$: Implicit layers for implicit representations. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021.
  • [20] U.S. Immigration and Customs Enforcement. U.s. immigration and customs enforcement., 2020.
  • [21] Vahid Kazemi and Josephine Sullivan. One millisecond face alignment with an ensemble of regression trees. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1867–1874, 2014.
  • [22] Harry Kesten. Random difference equations and renewal theory for products of random matrices. Acta Math., 131:207–248, 1973.
  • [23] L. Kocarev and S. Lian. Chaos-based Cryptography: Theory, Algorithms and Applications. Studies in Computational Intelligence. Springer Berlin Heidelberg, 2011.
  • [24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012.
  • [25] Cesare Lombroso. Criminal man. Duke University Press, 2006.
  • [26] Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Pretrained transformers as universal computation engines, 2021.
  • [27] Muhammad Mateen, Junhao Wen, Sun Song, Zhouping Huang, et al. Fundus image classification using vgg-19 architecture with pca and svd. Symmetry, 11(1):1, 2019.
  • [28] Umadevi V Navalgund and K Priyadharshini. Crime intention detection system using deep learning. In 2018 International Conference on Circuits and Systems in Digital Enterprise Technology (ICCSDET), pages 1–6. IEEE, 2018.
  • [29] NIST. National institute of standards and technology (nist)., 2020.
  • [30] Tennessee Bureau of Investigation. Tennessee bureau of investigation., 2020.
  • [31] OIG. Office of inspector general., 2020.
  • [32] Chris Olah, Nick Cammarata, Chelsea Voss, Ludwig Schubert, and Gabriel Goh. Naturally occurring equivariance in neural networks. Distill, 2020.
  • [33] Wanli Ouyang, Xiaogang Wang, Xingyu Zeng, Shi Qiu, Ping Luo, Yonglong Tian, Hongsheng Li, Shuo Yang, Zhe Wang, Chen-Change Loy, et al. Deepid-net: Deformable deep convolutional neural networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2403–2412, 2015.
  • [34] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • [35] Charuni Rajapakshe, Shashikala Balasooriya, Hirumini Dayarathna, Nethravi Ranaweera, Namalie Walgampaya, and Nadeesa Pemadasa. Using cnns rnns and machine learning algorithms for real-time crime prediction. In 2019 International Conference on Advancements in Computing (ICAC), pages 310–316. IEEE, 2019.
  • [36] Andrew G Reece and Christopher M Danforth. Instagram photos reveal predictive markers of depression. EPJ Data Science, 6:1–12, 2017.
  • [37] Janka I Stoker, Harry Garretsen, and Luuk J Spreeuwers. The facial appearance of ceos: Faces signal selection but not performance. PloS one, 11(7):e0159950, 2016.
  • [38] Crime Stoppers. Crime stoppers., 2020.
  • [39] Yi Sun, Ding Liang, Xiaogang Wang, and Xiaoou Tang. Deepid3: Face recognition with very deep neural networks. arXiv preprint arXiv:1502.00873, 2015.
  • [40] Anu Taneja and Anuja Arora. Modeling user preferences using neural networks and tensor factorization model. International Journal of Information Management, 45:132–148, 2019.
  • [41] Alexander Todorov, Christopher Y Olivola, Ron Dotsch, and Peter Mende-Siedlecki. Social attributions from faces: Determinants, consequences, accuracy, and functional significance. Annual review of psychology, 66, 2015.
  • [42] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • [43] Harsh Verma, Siddharth Lotia, and Anurag Singh. Convolutional neural network based criminal detection. In 2020 IEEE REGION 10 CONFERENCE (TENCON), pages 1124–1129. IEEE, 2020.
  • [44] Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition. CVPR 2001, volume 1, pages I–I. Ieee, 2001.
  • [45] Yilun Wang and Michal Kosinski. Deep neural networks are more accurate than humans at detecting sexual orientation from facial images. Journal of personality and social psychology, 114(2):246, 2018.
  • [46] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019.
  • [47] Xiaolin Wu and Xi Zhang. Automated inference on criminality using face images. arXiv preprint arXiv:1611.04135, pages 4038–4052, 2016.
  • [48] Wenyi Zhao, Rama Chellappa, P Jonathon Phillips, and Azriel Rosenfeld. Face recognition: A literature survey. ACM computing surveys (CSUR), 35(4):399–458, 2003.
  • [49] Zhuoyao Zhong, Lianwen Jin, and Zecheng Xie. High performance offline handwritten chinese character recognition using googlenet and directional feature maps. In 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pages 846–850. IEEE, 2015.