On the Inference of Soft Biometrics from Typing Patterns Collected in a Multi-device Environment

06/16/2020 ∙ by Vishaal Udandarao, et al. ∙ Haverford College ∙ IIIT Delhi

In this paper, we study the inference of gender, major/minor (computer science, non-computer science), typing style, age, and height from the typing patterns collected from 117 individuals in a multi-device environment. The inference of the first three identifiers was treated as a classification task, and that of the remaining two as a regression task. For the classification tasks, we benchmark the performance of six classical machine learning (ML) and four deep learning (DL) classifiers. For the regression tasks, we evaluate three ML and four DL-based regressors. The overall experiment consisted of two text-entry (free and fixed) and four device (Desktop, Tablet, Phone, and Combined) configurations. The best arrangements achieved accuracies of 93.02%, 87.80%, and 96.15% for gender, major/minor, and typing style, respectively, and mean absolute errors of 1.77 years and 2.65 inches for age and height, respectively. The results are promising considering the variety of application scenarios that we have listed in this work.


1. Introduction

“Everyone is special, and nobody is like anyone else. Everyone’s got an act.” – The Greatest Showman.

While we interact with computing devices, we leave a variety of footprints such as typing, swiping, and walking patterns, among others. These footprints have been studied for authentication, identification, forensic analysis, health monitoring, cognitive assessment, and inferring soft biometric traits (Banerjee and Woodard, 2012; Vizer and Sears, 2015; Neal and Woodard, 2019; Buriro et al., 2016; Brizan et al., 2015; Dantcheva et al., 2016; Nixon et al., 2015; Miguel-Hurtado et al., 2016a). Typing (Banerjee and Woodard, 2012; Teh et al., 2013; Roth et al., 2015, 2014), swiping (Frank et al., 2013; Patel et al., 2016; Serwadda et al., 2016), gait (Kumar et al., 2018, 2016b, 2015; Primo et al., 2014), body movements (Kumar et al., 2017), and their fusion are some of the widely studied behavioral patterns in the context of desktop, mobile, and wearable devices. Typing is commonly characterized by key press and release timings, keystroke sounds, and video sequences (Banerjee and Woodard, 2012; Teh et al., 2013; Roth et al., 2015, 2014). Security-critical organizations such as the Defense Advanced Research Projects Agency (DARPA) have already adopted typing-based active authentication technology for desktops (Keromytis, 2015).

However, the majority of keystroke studies focus on either authentication or identification under free- or fixed-text entry environments (Kumar et al., 2016a; Teh et al., 2013; Banerjee and Woodard, 2012; Belman and Phoha, 2020). Studies on the inference of soft biometrics from typing patterns are limited in number, confined to a particular device or environment, or both (Fairhurst and Da Costa-Abreu, 2011; Giot and Rosenberger, 2012; Plank, 2018; Tsimperidis et al., 2018; Li et al., 2019; Buker et al., 2019; Akis et al., 2014; Idrus et al., 2014; Buriro et al., 2016; Bandeira et al., 2019; Uzun et al., 2015; Pentel, 2017). The inference of a variety of personal attributes, including but not limited to age, gender, cognitive assessment, handedness, typing hand, and number of fingers used for typing, has been explored in the past (Tsimperidis et al., 2018; Antal and Nemes, 2016; Pentel, 2017; Idrus et al., 2014; Buriro et al., 2016; Rattani and Agrawal, 2019). Considering that typing is an indispensable part of our lives, we believe that it reveals a great deal of information and should be studied in depth for the inference of useful identifiers. The identifiers inferred from typing patterns can be used in a variety of ways. Some of them are listed below:

Figure 1. A person (on the left) impersonated Benjamin (in the middle, a handsome American businessman) to fool Rosely (on the right), a divorced and lonely woman, and scam her out of her lifelong savings by promising her lifelong love (Australia, 2019). The typing patterns of the person could have been used to estimate gender, age, height, and weight, and to alert Rosely that the person she believes to be the love of her life may be fake because his or her soft traits do not match the information provided to her. In addition, law enforcement personnel could use soft biometrics to trace and convict the person.

  • Personalized user experience: Consumers often refrain from providing too much information while signing up for an information technology-enabled service. Besides, people with disabilities may find it difficult to enter too much information to start using a software platform. The automated estimation of soft biometrics can be useful in such cases. Organizations can tailor their platforms and services as per the user’s demography for a seamless and personalized experience. Moreover, estimated soft biometrics can be used for controlling access to certain resources or platforms. For example, access to certain TV channels and websites can be restricted to individuals of certain age groups.

  • Improved recognition rate: The performance of authentication and identification systems can be improved by incorporating inferred soft biometrics such as age, gender, weight, and height into the pipeline (Rattani and Agrawal, 2019; Dantcheva et al., 2016; Thanganayagam et al., 2019; Syed Idrus et al., 2015).

  • Targeted advertising: Organizations can use soft biometrics to customize their advertisements and target people of a specific height, weight, gender, or age group who might be more interested in the product than the rest (Rattani and Agrawal, 2019; Dantcheva et al., 2016).

  • Identification of fake profiles on social media: The social-media platforms are suffering from fake profiles and fake news spread. It is not uncommon for individuals to fake their identity, i.e., to be a different gender, height, age, and profession. It is difficult to determine the legitimacy of individuals based on the type of information they post. The accurately estimated soft identifiers based on the typing pattern can help detect these profiles and take appropriate actions (Li et al., 2019; Fairhurst and Da Costa-Abreu, 2011).

  • Forensics: Covert identification of individuals has never been more critical than today as the number and nature of cybercrimes are rapidly evolving (Li et al., 2019). The Federal Bureau of Investigation (FBI)’s 2019 Internet Crime Report documents the online scams registered in 2019 alone and the financial losses they caused to innocent people (Federal Bureau Investigation (FBI), 2019). Business email compromise, romance fraud, and spoofing caused the highest financial losses. Several victims ended up losing their entire life savings or even sinking into debt. Law enforcement agencies often lack credible information to trace and convict these scammers. Soft biometrics inferred from the typing footprints that scammers leave while they interact with the victims could be useful in such scenarios (see Figure 1 for an example).

The above-mentioned applications motivated us to study the inference of soft biometrics from typing patterns of individuals in a multi-device environment. In summary, this work makes the following set of contributions:

  • Investigate the inference of five soft biometrics, namely gender, major/minor, typing style, age, and height, from typing patterns collected from 117 individuals while they typed a predefined text and answered a series of questions on a desktop, tablet, and smartphone.

  • Benchmark six Machine Learning (ML) and four Deep Learning (DL) algorithms for the classification of gender, major/minor, and typing style. Additionally, we benchmark eight different configurations generated from two text-entry modes (free and fixed) and four device settings (Desktop, Phone, Tablet, and Combined).

  • Besides using unigraphs, digraphs, and word-level features with a mutual information-based feature selector, we explore a novel method of constructing the feature space for the application of DL methods.

  • Provide detailed results and discussion on the inference of gender, major/minor, typing style, age, and height of the participants. Besides, present a qualitative performance comparison with the existing studies.

  • Share the code base for reproducibility of results and to foster future research in this direction. (Footnote 1: Code is available upon request. Please send an email to the last author.)

The rest of the paper is organized as follows. Section 2 discusses the closely related works. Section 3 presents the design of experiments. Sections 4 and 5 present and discuss the results, respectively. Finally, we conclude the paper and provide future research directions in Section 6.

2. Related work

The inference of soft biometrics (gender, age, ethnicity, hair/eye/skin colors, and hairstyle) from physical biometrics (e.g., face, fingerprint, iris, hand, and body), as well as gait and voice, has been substantially covered by Dantcheva et al. (Dantcheva et al., 2016). Thus, in this section, we describe the works related to the inference of soft biometrics from typing patterns and the gap that this work attempts to fill.

Early attempts to infer the gender of typists from keystroke analysis were made in (Fairhurst and Da Costa-Abreu, 2011; Giot and Rosenberger, 2012). One (Fairhurst and Da Costa-Abreu, 2011) was motivated by building trust and reliability among social network users, while the other (Giot and Rosenberger, 2012) was motivated by improving the performance of user recognition systems by including estimated soft biometrics as features. For example, Idrus et al. (Giot and Rosenberger, 2012; Syed Idrus et al., 2015) utilized the determined gender, age, and handedness to achieve about a 7% reduction in user recognition error rate. A separate study by Idrus et al. (Idrus et al., 2014) was conducted under fixed- and free-text entry environments to predict the hand category (one or both hands used), gender (male, female), age (one of two groups), and dominant hand (left or right). Brizan et al. (Brizan et al., 2015) used a hybrid set of features (keystroke, stylometry, and language production) to predict the cognitive demands of a given task. Yasin et al. (Uzun et al., 2015) were able to differentiate between children (below 15) and adults (above 15) by analyzing the participants’ typing behaviors. Recently, Abeer et al. (Buker et al., 2019) predicted gender from live chats.

Pentel (Pentel, 2017) combined mouse patterns with keystrokes to predict the age and gender of individuals. Likewise, Li et al. (Li et al., 2019) analyzed stylometry and keystroke dynamics to predict the gender of a person from 15 minutes of chat with 72% accuracy. Bandeira et al. (Bandeira et al., 2019) combined handwritten signature and keystroke dynamics for gender prediction. Abreu et al. (Julliana Caroline Gonçalves de A.S.M, 2019) also combined three modalities (keystrokes, touch strokes, and handwritten signature) to predict the gender of the typists. The authors suggested that the fusion-based system outperformed the rest. Buriro et al. (Buriro et al., 2016) estimated age, gender, and operating hands from the typing behavior of individuals collected on smartphones.

Other than age, gender, handedness, and dominant hand, researchers have predicted some interesting indicators from typing patterns. For example, Epp et al. (Epp et al., 2011) investigated the prediction of fifteen emotional states, including confidence, hesitance, nervousness, relaxation, sadness, and tiredness from typing patterns. Tsimperidis et al. (Tsimperidis et al., 2020) predicted the educational level of participants based on the keystroke dynamics information only. Beyond typing patterns, researchers have explored other behavioral patterns such as walking patterns, swiping patterns, calling patterns, device usage patterns to estimate a wide variety of soft identifiers (Neal* and Woodard, 2018; Miguel-Hurtado et al., 2016b; Garofalo et al., 2019; Acien et al., 2018; Neal and Woodard, 2018a).

The aforementioned studies have shown that an individual’s behavioral patterns reveal information about their gender, age, handedness, dominant hand, emotional stress, cognitive ability, etc. These studies, however, were limited in terms of the types of devices used in the experiments, the data collection protocol (free or fixed text), the algorithms applied, and the specific soft biometrics predicted. The majority of the application scenarios mentioned in the introduction would require a more thorough study of the inference of soft biometrics from behavioral patterns. By thorough, we mean the inclusion of a variety of users, devices, text-entry modes, and learning paradigms that could be more suitable, in addition to collecting the absolute ground truth.

Conducting such a comprehensive study on this topic would require a grand data collection experiment. One of the datasets that aligned well with our hypothesis is the dataset recently posted by Belman et al. (Belman et al., 2019), which includes fixed as well as free text collected from 117 users who answered a wide variety of questions on a desktop, tablet, and smartphone. The specific soft traits that we included in this study are age, gender, height, typing style (must look at the keyboard, occasionally looks at the keyboard, and need not look at the keyboard), and major/minor (computer science or non-computer science). Apart from considering five soft traits, we study keystroke features (e.g., word-level features) that have not been studied in this context but have been shown to be better than traditional keystroke features in the context of user recognition (Sim and Janakiraman, 2007; Belman and Phoha, 2020). Moreover, we apply numerous learning algorithms which, to the best of our knowledge, have not been studied in this context before.

3. Design of experiments

3.1. Dataset

We used the Syracuse University and Assured Information Security-Behavioral Biometrics Multi-Device and Multi-Activity Data from the Same Users (SU-AIS BB-MAS) dataset (Belman et al., 2019). The dataset consists of multiple modalities; however, we consider only the keystroke part and therefore refer to it as BB-MAS-Keystroke in this document.

BB-MAS-Keystroke consists of millions of keystrokes collected from 117 users who typed two given sentences (fixed-text) and answered a series of questions (free-text) on a desktop (Dell kb212-b), tablet (HTC-Nexus-9), and smartphone (Samsung-S6). A summary of the dataset is provided in Table 1. Please see (Belman et al., 2019) for more details.

3.2. Feature extraction and analysis

Following previous studies (Sim and Janakiraman, 2007; Belman and Phoha, 2020; Huang et al., 2016; Teh et al., 2013), we extracted unigraph (Key Hold Time), digraph (Flight or Key Interval Time), and word-level features. Before feature extraction, we removed outliers using the interquartile range (IQR) method. The computation of the features is described below and illustrated in Figure 2:

  • Unigraphs: A unigraph (key hold time) is defined as the difference between the release and press times of a key. These features were extracted for all unigraphs in the data and aggregated per key. For example, if a key $K$ is pressed and released $n$ times in the dataset, the key hold feature of $K$ would be a list of $n$ values.

  • Digraphs: A digraph captures information about the press and release timings of two consecutive keys. Let $P(K)$ and $R(K)$ denote the press and release times of a key $K$. Four different digraphs can be defined for two consecutive keys (say $K_i$ and $K_{i+1}$):

    $F_1 = P(K_{i+1}) - R(K_i)$
    $F_2 = R(K_{i+1}) - R(K_i)$
    $F_3 = P(K_{i+1}) - P(K_i)$
    $F_4 = R(K_{i+1}) - P(K_i)$

    We observed that in some cases, the key $K_{i+1}$ was pressed before the release of key $K_i$, which resulted in negative values for the features $F_1$ and $F_2$ for those occurrences. The aggregation process was the same as for unigraphs.

  • Word-level features: The word-level features capture different characteristics of the data than the unigraphs and digraphs. They have also been shown to be highly discriminative among users (Sim and Janakiraman, 2007; Belman and Phoha, 2020). Thus, we adopted these features in this study.
    Consider a word $W$ of length $n$ consisting of the keys {$K_1$, $K_2$, …, $K_n$} in that order. The word-level features were then defined and extracted as follows:

    1. Word Hold Time ($WHT$): The time between the press of the first key and the release of the last key of the word, $WHT(W) = R(K_n) - P(K_1)$.

    2. Word-unigraph features ($WU$): These features consisted of the mean, standard deviation, and median of the unigraphs of $K_1, …, K_n$. Assume we use an aggregation function $f$, then for the word $W$, $WU_f(W) = f(KHT(K_1), …, KHT(K_n))$, where $KHT(K_i) = R(K_i) - P(K_i)$.

    3. Word-digraph features ($WD$): Similar to the word-unigraph features, we computed the word-digraph features. Assume the aggregation function $f$ and the flight features $F_j$ (where $j \in \{1, 2, 3, 4\}$), then for the word $W$, $WD_{f,j}(W) = f(F_j(K_1, K_2), …, F_j(K_{n-1}, K_n))$.

More details on how these features were used during classification are provided in Sections 3.3.1 and 3.3.2.
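The feature definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the event format `(key, press, release)`, the function names, the use of the population standard deviation, and the IQR multiplier of 1.5 are all assumptions made for the example.

```python
from statistics import mean, median, pstdev, quantiles

# Assumed event format: (key, press_time_ms, release_time_ms), ordered by press time.

def iqr_filter(values, k=1.5):
    """Drop values outside [Q1 - k*IQR, Q3 + k*IQR]; k=1.5 is an assumed default."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v for v in values if low <= v <= high]

def key_hold_times(events):
    """Unigraph feature: release minus press, collected as a list per key."""
    holds = {}
    for key, press, release in events:
        holds.setdefault(key, []).append(release - press)
    return holds

def flights(first, second):
    """The four digraph features for two consecutive keystrokes."""
    _, p1, r1 = first
    _, p2, r2 = second
    return {
        "F1": p2 - r1,  # negative when the second key is pressed before the first is released
        "F2": r2 - r1,
        "F3": p2 - p1,
        "F4": r2 - p1,
    }

def word_features(word_events):
    """Word hold time plus aggregated word-unigraph and word-digraph features."""
    aggregators = {"mean": mean, "std": pstdev, "median": median}
    feats = {"WHT": word_events[-1][2] - word_events[0][1]}
    holds = [release - press for _, press, release in word_events]
    for name, f in aggregators.items():
        feats[f"WU_{name}"] = f(holds)
    pairs = [flights(a, b) for a, b in zip(word_events, word_events[1:])]
    if pairs:  # single-letter words have no digraphs
        for j in ("F1", "F2", "F3", "F4"):
            values = [pair[j] for pair in pairs]
            for name, f in aggregators.items():
                feats[f"WD_{j}_{name}"] = f(values)
    return feats

# Hypothetical timings for the word "hi", where "i" is pressed before "h" is released.
events = [("h", 0, 80), ("i", 60, 150)]
print(flights(events[0], events[1]))
print(word_features(events)["WHT"])
```

In this toy example the overlap between the two keys makes `F1` negative, which is the situation described in the digraph bullet.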

Figure 2. An illustration of the extraction of unigraph, digraph, and word-level features.
Figure 3. Training, cross-validation, and testing setup. The users were divided into two disjoint sets: one for training and cross-validation (hyperparameter tuning) and one for testing. Adapted from (scikit, 2020).
| Soft biometric | Description |
|---|---|
| Gender | male (72), female (45) |
| Major/minor | CS (66), non-CS (50), missing (1) |
| Typing style | a: must look at the keypad (6); b: occasional look at the keypad (31); c: need not look at the keyboard (80) |
| Age (years) | range (19, 35), mean = 24.97, median = 24.0, std = 3.11 |
| Height (inches) | range (54, 74), mean = 66.96, median = 67.0, std = 4.02 |
| Ethnicity | Asian (104), non-Asian (13) |
| Handedness | right (114), left (1), ambidextrous (2) |

Table 1. Number of samples available in the dataset (Belman et al., 2019). We studied only the first five traits, as the last two were extremely imbalanced, which is one of the limitations of the dataset.

3.3. Learning framework

Prediction of gender (female or male), typing style (must look at the keyboard, occasionally looks at the keyboard, or need not look at the keyboard), and major/minor (computer science or non-computer science) were treated as classification tasks, whereas age and height estimation were treated as regression tasks. The block diagram of the learning framework adopted in this study is illustrated in Figure 3. We divided the dataset into two parts, Training and Testing. The Training data consisted of 70% of the users and, as the name indicates, was used to train the model and tune the hyperparameters using five-fold cross-validation. The best-performing values of the hyperparameters were then used to train the model again on the Training dataset. The trained model was then tested on the Testing dataset, which consisted of the remaining 30% of the users. The adopted learning framework creates a realistic experimental setup, as it allowed us to test our model on completely unseen data, unlike some previous works (Miguel-Hurtado et al., 2016a; Tsimperidis et al., 2018; Fairhurst and Da Costa-Abreu, 2011; Plank, 2018; Belman and Phoha, 2020), which reported results using k-fold cross-validation on the whole dataset. Nevertheless, we tried this strategy as well and obtained near-perfect results.
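The user-level split described above can be sketched as follows. This is an illustrative example only: the function names, the seed, and the integer user IDs are hypothetical, and building the cross-validation folds from whole users is an assumption, since the text does not state how the folds were formed.

```python
import random

def split_users(user_ids, train_fraction=0.7, seed=0):
    """Shuffle user IDs and split them so that no user appears in both sets."""
    ids = sorted(user_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]

def k_folds(train_ids, k=5):
    """Yield (fit_ids, validation_ids) pairs for k-fold cross-validation."""
    folds = [train_ids[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        fit = [u for j, fold in enumerate(folds) if j != i for u in fold]
        yield fit, validation

# 117 users, as in the dataset; the IDs 1..117 are placeholders.
train_ids, test_ids = split_users(range(1, 118))
print(len(train_ids), len(test_ids))

for fit_ids, validation_ids in k_folds(train_ids):
    # Tune hyperparameters on fit_ids and score them on validation_ids here.
    pass
```

Splitting by user rather than by sample is what keeps the test users completely unseen during training and tuning.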

We also observed that the dataset has a class imbalance problem. For example, the number of males was higher than the number of females (see Table 1 for more details). Borderline over-sampling based on SMOTE (Synthetic Minority Oversampling Technique) (Nguyen et al., 2011) was included in the classification pipeline to over-sample the minority class until it had as many samples as the majority class. Borderline SMOTE was chosen over vanilla SMOTE (Chawla et al., 2002) and the Adaptive Synthetic (ADASYN) sampling technique (He et al., 2008) based on the loss obtained during training.
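To make the over-sampling idea concrete, the sketch below implements the interpolation step of vanilla SMOTE, not the borderline variant used in this study: a synthetic sample is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. The borderline variant additionally concentrates this synthesis on minority samples near the class boundary. The function name, `k`, and the toy data are assumptions for illustration.

```python
import math
import random

def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by neighbour interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself).
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to the neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(x, neighbour)])
    return synthetic

# Toy two-dimensional minority class; n_new would normally be the
# difference between the majority and minority class sizes.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_samples = smote(minority, n_new=3, k=2)
print(len(new_samples))
```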

3.3.1. Classical Machine Learning (ML)

We included a variety of algorithms for implementing the classification and regression tasks. The decision to include algorithms such as Naive Bayes, Decision Trees, Support Vector Machine (SVM), Adaptive Boosting (AdaBoost), and Multi-Layer Perceptron (MLP) with a single hidden layer was based on previous studies (Brizan et al., 2015; Buriro et al., 2016; Miguel-Hurtado et al., 2016a; Morales et al., 2016; Baluja and Rowley, 2007; Neal and Woodard, 2018b; Plank, 2018; Tsimperidis and Katos, 2013; Na Cheng et al., 2009). In addition, we included extreme gradient boosting (XGBoost), which has rarely been studied in this context but has drawn attention due to its success on online competition platforms such as Kaggle (Kaggle, 2020). The hyperparameters associated with these algorithms were tuned using five-fold cross-validation and grid search (see Figure 3).

In addition to tuning the hyperparameters, we also experimented with the number of features and present the best results obtained. The encouraging performance of the ML algorithms, as well as the size of the data, motivated us to experiment with deep learning methods, which have recently been used effectively for typing pattern-based identification and authentication (Baldwin et al., 2019; Bernardi et al., 2019; Sun et al., 2017; Acien et al., 2020).

3.3.2. Deep Learning (DL)

Deep learning has been used with great success in recent years. The combination of deep networks, along with the non-linear activation, has been influential in the popularity of deep learning algorithms. Recently, there have been several attempts at using deep learning architectures for analyzing keystroke biometric data (Baldwin et al., 2019; Bernardi et al., 2019; Sun et al., 2017; Acien et al., 2020). Inspired by these approaches, we leverage the following deep learning models:

  • Fully Connected (FC) Network: We use a four-layered neural network with ReLU activation. We additionally incorporate dropout as a regularization technique for our model. We believe that using a deep FC network will help capture the intrinsic differentiating factors within the aggregated feature vectors to help discern the privacy factors better.

  • Convolutional Neural Network (CNN): We use a seven-layer CNN with four 2D convolution layers and three fully connected layers. We further use dropout and batch normalization to regularize our network. Since our data features are in the form of vectorized arrays, we use a trick of converting them into square images. For a given feature vector of dimensionality $N$, we find the largest perfect square $S$ just smaller than $N$ and convert the feature vector to an image of size $\sqrt{S} \times \sqrt{S}$. We hypothesize that the trick will help us leverage CNNs to exploit the structural and spatial biases present in our feature data efficiently.

  • Recurrent Neural Network (RNN): We use a three-layer RNN with tanh activation functions and a final softmax classification layer. RNNs require sequential input data; however, our data is in the form of tabulated feature vectors. We use a heuristic to convert our feature vectors into sequential data points to feed into the RNN. For a given feature vector of dimensionality $N$, we find the largest non-prime number $M$ just smaller than $N$ and find two factors $a$ and $b$ such that $a \times b = M$. We then reshape the feature vector into proxy sequential data of sequence length $a$ and vector dimension $b$. The trick, therefore, can help us utilize the episodic nature of RNNs to gauge sequential correlations in our data.

  • Long Short Term Memory (LSTM) Network: We use a three-layer LSTM network with a final softmax classification layer, similar to the one used for the RNN model. We make use of LSTMs to mitigate the widely known vanishing gradient problem (Hochreiter, 1998) of simple RNNs. We follow the same heuristic procedure to make our feature vectors suitable for training a sequential LSTM network. We believe that the LSTM should further help capture sequential dependencies inherent in our feature vectors.
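The two reshaping heuristics used for the CNN and for the RNN/LSTM can be sketched as below. Several details are assumptions, since the text does not specify them: the sketch treats "just smaller than" as "less than or equal to", drops the trailing features that do not fit, and picks the factor pair closest to a square for the sequence shape. The function names are hypothetical.

```python
import math

def to_square_image(features):
    """Reshape a flat vector into the largest square image that fits.

    Assumption: features beyond the largest perfect square are dropped.
    """
    side = math.isqrt(len(features))
    kept = features[: side * side]
    return [kept[r * side:(r + 1) * side] for r in range(side)]

def is_prime(n):
    if n < 2:
        return False
    return all(n % d for d in range(2, math.isqrt(n) + 1))

def to_sequence(features):
    """Reshape a flat vector into (sequence length, vector dimension) form.

    Assumption: among all factor pairs of the largest non-prime length,
    the pair closest to a square is used.
    """
    m = len(features)
    while m > 1 and is_prime(m):
        m -= 1  # step down to the largest non-prime length
    steps = next(d for d in range(math.isqrt(m), 0, -1) if m % d == 0)
    dim = m // steps
    return [features[t * dim:(t + 1) * dim] for t in range(steps)]

vector = list(range(13))          # toy feature vector with N = 13
image = to_square_image(vector)   # 3 x 3 image built from the first 9 features
sequence = to_sequence(vector)    # 3 time steps of dimension 4 (M = 12)
print(len(image), len(image[0]), len(sequence), len(sequence[0]))
```

Both helpers only rearrange existing feature values, so any spatial or sequential structure the networks pick up depends on the order in which the features were laid out in the vector.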

3.4. Performance evaluation

The performance of the classification and regression models was evaluated on the test dataset, which was kept separate from the training and validation process (see Figure 3). Accuracy and mean absolute error (MAE) were used as the performance evaluation metrics for the classification and regression models, respectively. Accuracy is defined as the ratio of the number of correctly predicted instances to the number of instances tested. MAE is defined as the average of the absolute differences between the actual and predicted values. Accuracy can be biased when the number of instances per class is unequal. However, as we had applied SMOTE to oversample the instances of minority classes and make the number of instances belonging to each class equal, accuracy in our case is an unbiased measure.
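The two metrics are simple enough to state directly in code. The sketch below is illustrative; the function names and toy predictions are hypothetical, and the last example uses the gender counts from Table 1 to show why accuracy can mislead on imbalanced data.

```python
def accuracy(y_true, y_pred):
    """Fraction of instances whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy predictions (hypothetical values).
print(accuracy(["m", "f", "m", "m"], ["m", "f", "f", "m"]))
print(mean_absolute_error([24, 30, 21], [25, 27, 21]))

# With 72 males and 45 females, a model that always predicts "male"
# already reaches 72/117 accuracy without learning anything.
baseline = accuracy(["m"] * 72 + ["f"] * 45, ["m"] * 117)
print(round(baseline, 4))
```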

| Device | Setting | Naive Bayes | SVM | Decision Trees | AdaBoost | MLP | XGBoost | RNN | LSTM | FC | CNN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Desktop | Free | 72.09 | 81.39 | 76.74 | 81.39 | 83.72 | 83.72 | 77.50 | 72.50 | 72.09 | 86.04 |
| Desktop | Fixed | 72.09 | 86.04 | 79.06 | 81.39 | 74.41 | 79.06 | 77.50 | 77.50 | 62.50 | 82.50 |
| Phone | Free | 53.48 | 83.72 | 67.44 | 81.39 | 76.74 | 81.39 | 80.00 | 75.00 | 67.44 | 79.07 |
| Phone | Fixed | 55.81 | 76.74 | 74.41 | 74.41 | 72.09 | 72.09 | 75.00 | 85.00 | 62.79 | 88.37 |
| Tablet | Free | 60.46 | 79.06 | 76.74 | 76.74 | 76.74 | 79.06 | 83.33 | 72.50 | 69.76 | 79.06 |
| Tablet | Fixed | 67.44 | 72.09 | 67.44 | 72.09 | 67.44 | 67.44 | 82.50 | 75.00 | 65.11 | 79.07 |
| Combined | Free | 67.44 | 83.72 | 79.06 | 79.06 | 76.74 | 81.39 | 80.00 | 77.50 | 74.42 | 93.02 |
| Combined | Fixed | 67.44 | 79.06 | 74.41 | 81.39 | 74.41 | 72.09 | 77.50 | 62.50 | 67.44 | 83.72 |

Table 2. Percentage accuracies (the higher, the better) obtained by different ML and DL algorithms for gender classification. Arrangement-wise, Combined-Free-CNN (93.02%) outperformed the rest. Device-wise, Combined (93.02%), Phone (88.37%), Desktop (86.04%), and Tablet (83.33%) closely followed each other in that order.
| Device | Setting | Naive Bayes | SVM | Decision Trees | AdaBoost | MLP | XGBoost | RNN | LSTM | FC | CNN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Desktop | Free | 68.29 | 78.04 | 73.17 | 73.17 | 73.17 | 73.17 | 80.00 | 75.00 | 70.73 | 78.04 |
| Desktop | Fixed | 75.60 | 70.73 | 70.73 | 75.60 | 60.97 | 78.04 | 67.50 | 70.00 | 56.09 | 85.37 |
| Phone | Free | 60.97 | 51.21 | 70.73 | 65.85 | 53.65 | 53.65 | 75.00 | 77.50 | 68.29 | 82.92 |
| Phone | Fixed | 63.41 | 60.97 | 68.29 | 58.53 | 58.53 | 53.65 | 72.50 | 77.50 | 63.41 | 78.04 |
| Tablet | Free | 63.41 | 53.65 | 68.29 | 73.17 | 53.65 | 58.53 | 83.33 | 82.50 | 68.29 | 82.92 |
| Tablet | Fixed | 75.60 | 56.09 | 68.29 | 73.17 | 56.09 | 73.17 | 72.50 | 85.00 | 63.41 | 78.04 |
| Combined | Free | 65.85 | 75.60 | 73.17 | 68.29 | 65.85 | 68.29 | 85.00 | 80.00 | 73.17 | 85.37 |
| Combined | Fixed | 70.73 | 73.17 | 63.41 | 68.29 | 53.65 | 60.97 | 82.50 | 72.50 | 65.85 | 87.80 |

Table 3. Percentage accuracies (the higher, the better) obtained by different ML and DL algorithms for major/minor classification. Arrangement-wise, Combined-Fixed-CNN (87.80%) outperformed the rest. Device-wise, Combined (87.80%), Desktop (85.37%), Tablet (85.0%), and Phone (82.92%) closely followed each other in that order. The results align with the common intuition that, compared with non-CS majors, CS majors may be more comfortable and fluent on Desktop and Tablet keypads than on Phone keypads.
| Device | Setting | Naive Bayes | SVM | Decision Trees | AdaBoost | MLP | XGBoost | RNN | LSTM | FC | CNN |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Desktop | Free | 77.27 | 93.18 | 76.92 | 90.38 | 86.53 | 81.81 | 80.00 | 83.33 | 82.85 | 91.42 |
| Desktop | Fixed | 76.92 | 90.38 | 86.53 | 90.38 | 90.38 | 88.46 | 50.00 | 48.00 | 82.14 | 66.07 |
| Phone | Free | 78.84 | 88.63 | 82.69 | 86.36 | 86.53 | 86.36 | 83.33 | 83.33 | 80.70 | 85.71 |
| Phone | Fixed | 86.53 | 96.15 | 80.76 | 84.61 | 96.15 | 86.53 | 50.00 | 42.00 | 91.22 | 49.12 |
| Tablet | Free | 65.38 | 95.55 | 82.22 | 82.69 | 78.84 | 80.00 | 86.67 | 83.33 | 90.47 | 82.85 |
| Tablet | Fixed | 78.84 | 90.38 | 78.84 | 82.69 | 88.46 | 88.46 | 56.00 | 44.00 | 78.57 | 57.14 |
| Combined | Free | 76.92 | 96.15 | 82.69 | 88.46 | 92.30 | 94.23 | 83.33 | 80.00 | 84.21 | 88.57 |
| Combined | Fixed | 86.53 | 94.23 | 82.69 | 90.38 | 90.38 | 90.38 | 70.00 | 56.00 | 89.47 | 64.91 |

Table 4. Percentage accuracies (the higher, the better) obtained by different ML and DL algorithms for typing style classification. Arrangement-wise, Combined-Free-SVM (96.15%) was closely followed by Combined-Fixed-SVM (94.23%) and outperformed the rest. Device-wise, Combined (96.15%), Phone (96.15%), Tablet (95.55%), and Desktop (93.18%) closely followed each other in that order. The results are in line with our expectations, as we hypothesized that the typing patterns of individuals who must look, occasionally look, and need not look at the keypad would be very different, in general.
| Device | Free/Fixed | Age: SVM | Age: KNN | Age: XGBoost | Age: RNN | Age: LSTM | Age: FC | Age: CNN | Height: SVM | Height: KNN | Height: XGBoost | Height: RNN | Height: LSTM | Height: FC | Height: CNN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Desktop | Free | 2.37 | 2.38 | 2.26 | 5.53 | 2.24 | 2.26 | 3.78 | 2.97 | 3.02 | 2.84 | 8.67 | 10.70 | 7.33 | 7.21 |
| Desktop | Fixed | 2.43 | 2.54 | 2.27 | 5.24 | 2.04 | 2.92 | 4.97 | 2.92 | 3.20 | 2.82 | 9.54 | 10.66 | 8.63 | 7.24 |
| Phone | Free | 2.46 | 2.41 | 2.59 | 7.11 | 2.03 | 1.77 | 6.10 | 2.94 | 3.04 | 2.70 | 10.43 | 10.39 | 4.75 | 7.20 |
| Phone | Fixed | 2.38 | 2.36 | 2.42 | 8.41 | 2.48 | 2.36 | 5.44 | 2.87 | 2.65 | 2.92 | 10.55 | 11.10 | 5.72 | 7.20 |
| Tablet | Free | 2.42 | 2.47 | 2.38 | 6.19 | 2.45 | 2.39 | 5.02 | 2.85 | 3.18 | 3.23 | 8.75 | 9.57 | 4.83 | 7.22 |
| Tablet | Fixed | 2.43 | 2.49 | 2.34 | 9.41 | 2.73 | 2.09 | 5.20 | 2.74 | 2.95 | 3.02 | 8.42 | 9.95 | 5.74 | 7.20 |
| Combined | Free | 2.37 | 2.40 | 2.21 | 5.61 | 2.23 | 2.84 | 5.41 | 2.93 | 2.99 | 3.23 | 8.52 | 9.16 | 7.06 | 7.20 |
| Combined | Fixed | 2.32 | 2.34 | 2.27 | 9.17 | 2.11 | 3.63 | 4.33 | 3.09 | 3.01 | 2.67 | 7.79 | 10.61 | 11.57 | 7.20 |

Table 5. MAE (the lower, the better) for age (years) and height (inches) estimation. Arrangement-wise, Phone-Free-FC (1.77 years) and Phone-Fixed-KNN (2.65 inches) were the best performers. Device-wise, for age, Phone (1.77 years), Desktop (2.04 years), Tablet (2.09 years), and Combined (2.11 years) closely followed each other in that order. Similarly, for height, Phone (2.65 inches), Combined (2.67 inches), Tablet (2.74 inches), and Desktop (2.82 inches) closely followed each other in that order. An interesting observation is that, for height estimation, the ML algorithms outclassed the DL algorithms.
| Ref. | Users | Free/Fixed | Class | Desktop/Phone | kFCV/HOCV | Accuracy/MAE |
|---|---|---|---|---|---|---|
| (Giot and Rosenberger, 2012) | 133 | Fixed | Gender | Desktop | kFCV | 91.63 |
| (Fairhurst and Da Costa-Abreu, 2011) | 133 | Fixed | Gender | Desktop | kFCV | 97.50 |
| (Uzun et al., 2015) | 100 | Fixed | Age | Desktop | kFCV | 91.20 |
| (Pentel, 2017) | 1519 | Both | Both | Desktop | kFCV | 73.00 |
| (Plank, 2018) | 144 | Free | Age | Desktop | kFCV | 63.50 |
| | | | Gender | | | 73.25 |
| (Tsimperidis et al., 2018) | 75 | Free | Gender | Desktop | kFCV | 95.60 |
| (Li et al., 2019) | 45 | Free | Gender | Desktop | kFCV | 72.00 |
| (Buker et al., 2019) | 60 | Free | Gender | Desktop | kFCV | 98.30 |
| (Akis et al., 2014) | 132 | Fixed | Age | Phone | HOCV | 60.30 |
| | | | Gender | | | 75.20 |
| (Idrus et al., 2014) | 110 | Both | Age | Desktop | HOCV | 78.00 |
| | | | Gender | | | 86.00 |
| (Buriro et al., 2016) | 150 | Fixed | Age | Phone | HOCV | 82.80 |
| | | | Gender | | | 87.70 |
| (Bandeira et al., 2019) | 100 | Both | Gender | Desktop | HOCV | 71.30 |
| This work | 117 | Free | Gender | The best of Desktop, Phone, Tablet, and Combined | HOCV | 93.02 |
| | | | Major | | | 85.37 |
| | | | Style | | | 96.15 |
| | | | Age | | | 1.77 (MAE) |
| | | | Height | | | 2.70 (MAE) |
| This work | 117 | Fixed | Gender | The best of Desktop, Phone, Tablet, and Combined | HOCV | 88.37 |
| | | | Major | | | 87.80 |
| | | | Style | | | 96.15 |
| | | | Age | | | 2.04 (MAE) |
| | | | Height | | | 2.65 (MAE) |

Table 6. Qualitative comparison with previous works that attempted to infer the soft biometrics that we have considered. kFCV means k-Fold Cross-Validation, while HOCV means Hold the test set Out Cross-Validation in this study (see Figure 3). We achieved almost perfect accuracy, and MAE between 1 and 2 for both age and height, under kFCV. We are not reporting kFCV results as it is a less realistic evaluation setup than HOCV, especially for the application scenarios listed in this paper.

4. Results

4.1. Classification results

The following subsections discuss the results obtained by different ML and DL based classification models used in this study:

4.1.1. Gender classification

The gender classification accuracies are presented in Table 2. In terms of devices, the combined case achieved the best results (93.02%) followed by Phone (88.37%), Desktop (86.04%), and Tablet (83.33%). Overall, Free-text (93.02%) yielded better results than Fixed-text (88.37%). Classifier-wise, CNN (93.02%), SVM (86.04%), LSTM (85.00%), MLP/XGBoost (83.72%), and RNN (83.33%) outperformed the rest.
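The device-wise and text-entry-wise figures quoted above are the maxima over the corresponding rows of Table 2. The short sketch below reproduces that bookkeeping from the table values; the variable and function names are chosen for this illustration.

```python
CLASSIFIERS = ["Naive Bayes", "SVM", "Decision Trees", "AdaBoost", "MLP",
               "XGBoost", "RNN", "LSTM", "FC", "CNN"]

# Gender classification accuracies (%) copied from Table 2.
TABLE_2 = {
    ("Desktop", "Free"): [72.09, 81.39, 76.74, 81.39, 83.72, 83.72, 77.50, 72.50, 72.09, 86.04],
    ("Desktop", "Fixed"): [72.09, 86.04, 79.06, 81.39, 74.41, 79.06, 77.50, 77.50, 62.50, 82.50],
    ("Phone", "Free"): [53.48, 83.72, 67.44, 81.39, 76.74, 81.39, 80.00, 75.00, 67.44, 79.07],
    ("Phone", "Fixed"): [55.81, 76.74, 74.41, 74.41, 72.09, 72.09, 75.00, 85.00, 62.79, 88.37],
    ("Tablet", "Free"): [60.46, 79.06, 76.74, 76.74, 76.74, 79.06, 83.33, 72.50, 69.76, 79.06],
    ("Tablet", "Fixed"): [67.44, 72.09, 67.44, 72.09, 67.44, 67.44, 82.50, 75.00, 65.11, 79.07],
    ("Combined", "Free"): [67.44, 83.72, 79.06, 79.06, 76.74, 81.39, 80.00, 77.50, 74.42, 93.02],
    ("Combined", "Fixed"): [67.44, 79.06, 74.41, 81.39, 74.41, 72.09, 77.50, 62.50, 67.44, 83.72],
}

def best_by(index):
    """Best accuracy per device (index 0) or per text-entry mode (index 1)."""
    best = {}
    for key, row in TABLE_2.items():
        best[key[index]] = max(best.get(key[index], 0.0), max(row))
    return best

def best_arrangement():
    """Return (device, setting, classifier, accuracy) of the top table cell."""
    (device, setting), row = max(TABLE_2.items(), key=lambda item: max(item[1]))
    return device, setting, CLASSIFIERS[row.index(max(row))], max(row)

print(best_by(0))
print(best_by(1))
print(best_arrangement())
```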

4.1.2. Major/Minor classification

The accuracies for the major/minor classification task can be found in Table 3. In terms of devices, the combined-device setting achieved the best results (87.8%) followed by Desktop (85.37%), Tablet (85%), and Phone (82.92%). Overall, Fixed-text (87.8%) yielded slightly better results than Free-text (85.37%). The top-performing classifiers were CNN (87.8%), LSTM (85%), RNN (85%), SVM (78.04%), and XGBoost (78.04%), followed by the rest.

4.1.3. Typing style classification

The accuracies for the typing style classification task can be found in Table 4. In terms of devices, the combined-device setting and Phone achieved the best results (96.15%) followed by Tablet (95.55%), and Desktop (93.18%). Overall, both Fixed-text and Free-text yielded the same best results (96.15%). The top-performing classifiers were SVM (96.15%), MLP (96.15%), CNN (91.42%), FC (91.22%) and AdaBoost (90.38%) followed by the rest.

4.2. Regression results

The following subsections discuss the results obtained by the ML- and DL-based regression models used in this study:

4.2.1. Age estimation

The collated results for both ML and DL models for the task of age prediction can be found in Table 5. All values are mean absolute errors (MAE) in years, so lower is better. In terms of devices, the phone-only setting achieved the best results (1.77), followed by desktop (2.04), tablet (2.09), and combined (2.11). Overall, Free-text (1.77) yielded better results than Fixed-text (2.04). Regressor-wise, FC (1.77), LSTM (2.04), and XGBoost (2.21) outperformed the rest.

4.2.2. Height estimation

The results for both the ML and DL models for the height prediction problem can be found in Table 5. All values are mean absolute errors (MAE) in inches, so lower is better. In terms of devices, the phone-only setting achieved the best results (2.65), followed by combined (2.67), tablet (2.74), and desktop (2.82). In contrast to age regression, Fixed-text (2.65) yielded better results than Free-text (2.70), overall. Regressor-wise, KNN (2.65), XGBoost (2.67), and SVM (2.74) outperformed the rest. For the height prediction problem, ML regressors clearly outperformed DL regressors.
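
The age and height errors in this subsection are mean absolute errors (MAE). As a reminder of what that number measures, the following is a minimal sketch of the computation. The function name and the four age values are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical ages (years) and model predictions, for illustration only.
true_ages = [21, 24, 19, 30]
pred_ages = [22.5, 23.0, 20.0, 27.0]

mae = mean_absolute_error(true_ages, pred_ages)
print(f"MAE = {mae:.3f} years")
```

An MAE of 1.77 years therefore means that, averaged over the test users, the predicted age was off by 1.77 years in either direction.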

5. Discussion

5.1. Limitations

As mentioned earlier, one of the major limitations of studying the inference of soft biometrics is the lack of a quality dataset. Although every participant provided about thirty thousand keystrokes, the number of subjects in the dataset is limited, which makes training, validation, and testing difficult. In particular, we used the data collected from 70% of the users (i.e., 82 users) for training and cross-fold validation, while the data collected from the remaining 30% (i.e., 35 users) was used for testing. Another limitation of the dataset is that the samples for the recorded soft biometrics are severely imbalanced in some cases (see Table 1). For example, of the total 117 participants, 105 are Asian, and 114 identified themselves as right-handed.
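
The user-level split described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function name `split_users`, the integer user IDs, the fixed seed, and the choice of k = 5 folds are all hypothetical, since the text does not specify how users were shuffled or how many folds were used.

```python
import random


def split_users(user_ids, train_fraction=0.7, seed=0):
    """Shuffle user IDs and split them into disjoint train and test groups."""
    ids = sorted(user_ids)
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatability
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# Hypothetical IDs 1..117 standing in for the 117 participants.
train_users, test_users = split_users(range(1, 118))

# Assumed k = 5 folds built from the training users only;
# the held-out test users never enter cross-validation.
k = 5
folds = [train_users[i::k] for i in range(k)]

print(len(train_users), len(test_users))
```

Splitting by user rather than by sample keeps every keystroke of a test user out of training, which is what makes the hold-out setup more realistic than sample-level k-fold cross-validation.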

We expect the performance of the proposed approaches to scale to a larger dataset, but we cannot claim this without further experiments. Nonetheless, the results are comparable to or better than those of existing mechanisms for inferring soft biometrics from keystrokes (see Table 6).

5.2. Ethical implications

While we have listed positive application scenarios in the introduction, people with malicious intent can use the research presented in this work for destructive purposes. We believe, however, that such misuse can be prevented by enforcing existing public policies and developing new ones (Plank, 2018).

6. Conclusion and future work

We conclude that soft biometrics such as gender, typing style, major/minor, age, and height can be inferred from typing patterns of individuals with reasonably good accuracy. Free-text showed more promise than fixed-text for gender and age, fixed-text was better for major/minor prediction and height estimation, and the two tied on typing style. DL methods outclassed ML methods overall except for the height estimation task. The Phone-Fixed configuration achieved the highest gender recognition accuracy among the individual device setups (88.37%), while combining the data collected from all three devices improved the result to 93.02%. The Desktop-Fixed setup outclassed the rest of the individual device setups, achieving 85.37% accuracy in the major/minor classification, while the combined experimental setup reached 87.80%. The Phone-Fixed configuration attained the highest accuracy in typing style classification (96.15%), and the combination of data from multiple devices did not help in this case. The Phone-Free setting predicted the age with an MAE of 1.77 years, while the Phone-Fixed setup estimated the height with an MAE of 2.65 inches.

We would like to test the proposed approaches on multiple datasets, especially on a dataset that consists of participants from more diverse backgrounds and demographics. We would also like to investigate the other modalities available in the SU-AIS BB-MAS dataset, as well as the fusion of all the modalities, for better estimation and prediction of the soft biometrics. Finally, based on our observation that the deep learning-based models outperformed traditional ML algorithms, we would like to explore more deep learning architectures for the soft biometric prediction task.

References

  • A. Acien, J. Fierrez, A. Morales, R. Rodriguez, and J. Hernandez-Ortega (2018) Active detection of age groups based on touch interaction. IET Biometrics. Cited by: §2.
  • A. Acien, J. V. Monaco, A. Morales, R. Vera-Rodriguez, and J. Fierrez (2020) Typenet: scaling up keystroke biometrics. arXiv preprint arXiv:2004.03627. Cited by: §3.3.1, §3.3.2.
  • B. Akis, M. Sorgente, and A. Starosta (2014) Typeguess: using mobile typing dynamics to predict age, gender and number of fingers used for typing. Stanford University. Cited by: §1, Table 6.
  • M. Antal and G. Nemes (2016) Gender recognition from mobile biometric data. In 2016 IEEE 11th SACI, Cited by: §1.
  • 60 Minutes Australia (2019) Exposing Nigerian online love scammers | 60 Minutes Australia. Note: https://www.youtube.com/watch?v=nTorFTRcYDQ&t=778s Online; accessed February 8, 2020 Cited by: Figure 1.
  • J. Baldwin, R. Burnham, A. Meyer, R. Dora, and R. Wright (2019) Beyond speech: generalizing d-vectors for biometric verification. In AAAI’19, Cited by: §3.3.1, §3.3.2.
  • S. Baluja and H. A. Rowley (2007) Boosting sex identification performance. International Journal of Computer Vision. Cited by: §3.3.1.
  • D. R. C. Bandeira, A. M. de Paula Canuto, M. Da Costa-Abreu, M. Fairhurst, C. Li, and D. S. C. do Nascimento (2019) Investigating the impact of combining handwritten signature and keyboard keystroke dynamics for gender prediction. In 2019 8th Brazilian Conference on Intelligent Systems (BRACIS), pp. 126–131. Cited by: §1, §2, Table 6.
  • S. P. Banerjee and D. L. Woodard (2012) Biometric authentication and identification using keystroke dynamics: a survey. Journal of Pattern Recognition Research. Cited by: §1, §1.
  • A. Belman, L. Wang, S. Iyengar, P. Sniatala, R. Wright, R. Dora, J. Baldwin, Z. Jin, and V. Phoha (2019) SU-AIS BB-MAS (Syracuse University and Assured Information Security - Behavioral Biometrics Multi-device and Multi-Activity data from Same users) dataset. Cited by: §2, §3.1, §3.1, Table 1.
  • A. K. Belman and V. V. Phoha (2020) Discriminative power of typing features on desktops, tablets, and phones for user identification. ACM T-OPS. Cited by: §1, §2, 3rd item, §3.2, §3.3.
  • M. L. Bernardi, M. Cimitile, F. Martinelli, and F. Mercaldo (2019) Keystroke analysis for user identification using deep neural networks. In 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. Cited by: §3.3.1, §3.3.2.
  • D. G. Brizan, A. Goodkind, P. Koch, K. Balagani, V. V. Phoha, and A. Rosenberg (2015) Utilizing linguistically enhanced keystroke dynamics to predict typist cognition and demographics. International Journal of Human-Computer Studies 82, pp. 57–68. Cited by: §1, §2, §3.3.1.
  • A. Buker, A. Vinciarelli, and G. Roffo (2019) Type like a man! inferring gender from keystroke dynamics in live-chats. IEEE Intelligent Systems. Cited by: §1, §2, Table 6.
  • A. Buriro, Z. Akhtar, B. Crispo, and F. Del Frari (2016) Age, gender and operating-hand estimation on smart mobile devices. In 2016 IEEE-BIOSIG, pp. 1–5. Cited by: §1, §1, §2, §3.3.1, Table 6.
  • N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer (2002) SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research. Cited by: §3.3.
  • A. Dantcheva, P. Elia, and A. Ross (2016) What else does your biometric data reveal? a survey on soft biometrics. IEEE T-FIS. Cited by: 2nd item, 3rd item, §1, §2.
  • C. Epp, M. Lippold, and R. L. Mandryk (2011) Identifying emotional states using keystroke dynamics. In Proceedings of the SIGCHI’11, New York, NY, USA. Cited by: §2.
  • M. Fairhurst and M. Da Costa-Abreu (2011) Using keystroke dynamics for gender identification in social network environment. In 4th International Conference on Imaging for Crime Detection and Prevention 2011 (ICDP 2011), Cited by: 4th item, §1, §2, §3.3, Table 6.
  • U.S. Federal Bureau of Investigation (FBI) (2019) 2019 internet crime report. Note: https://pdf.ic3.gov/2019_IC3Report.pdf Online; accessed February 8, 2020 Cited by: 5th item.
  • M. Frank, R. Biedert, E. Ma, I. Martinovic, and D. Song (2013) Touchalytics: on the applicability of touchscreen input as a behavioral biometric for continuous authentication. IEEE Transactions on Information Forensics and Security 8 (1), pp. 136–148. Cited by: §1.
  • G. Garofalo, E. Argones Rúa, D. Preuveneers, W. Joosen, et al. (2019) A systematic comparison of age and gender prediction on imu sensor-based gait traces. Sensors 19 (13), pp. 2945. Cited by: §2.
  • R. Giot and C. Rosenberger (2012) A new soft biometric approach for keystroke dynamics based on gender recognition. International Journal of Technology Management. Cited by: §1, §2, Table 6.
  • H. He, Y. Bai, E. A. Garcia, and S. Li (2008) ADASYN: adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE IJCNN, Cited by: §3.3.
  • S. Hochreiter (1998) The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems. Cited by: 4th item.
  • J. Huang, D. Hou, S. Schuckers, and S. Upadhyaya (2016) Effects of text filtering on authentication performance of keystroke biometrics. In 2016 IEEE International Workshop on Information Forensics and Security (WIFS), Cited by: §3.2.
  • S. Z. S. Idrus, E. Cherrier, C. Rosenberger, and P. Bours (2014) Soft biometrics for keystroke dynamics: profiling individuals while typing passwords. Computers & Security. Cited by: §1, §2, Table 6.
  • M. D. C. Julliana Caroline Gonçalves de A.S.M (2019) An evaluation of a three-modal hand-based database to forensic-based gender recognition. In 19th SBSeg 2019, Cited by: §2.
  • Kaggle (2020) What is xgboost. Note: https://www.kaggle.com/dansbecker/xgboost Online; accessed February 8, 2020 Cited by: §3.3.1.
  • Dr. A. Keromytis (2015) DARPA active authentication program. Note: https://www.securetechalliance.org/wp-content/uploads/keromytisa.pdf Online; accessed February 8, 2020 Cited by: §1.
  • R. Kumar, V. V. Phoha, and A. Serwadda (2016a) Continuous authentication of smartphone users by fusing typing, swiping, and phone movement patterns. In 2016 IEEE BTAS, Cited by: §1.
  • R. Kumar, P. P. Kundu, and V. V. Phoha (2018) Continuous authentication using one-class classifiers and their fusion. In 2018 IEEE 4th International Conference on Identity, Security, and Behavior Analysis (ISBA), pp. 1–8. Cited by: §1.
  • R. Kumar, P. P. Kundu, D. Shukla, and V. V. Phoha (2017) Continuous user authentication via unlabeled phone movement patterns. In 2017 IEEE International Joint Conference on Biometrics (IJCB), pp. 177–184. Cited by: §1.
  • R. Kumar, V. V. Phoha, and A. Jain (2015) Treadmill attack on gait-based authentication systems. In 2015 IEEE 7th International Conference on Biometrics Theory, Applications and Systems (BTAS), pp. 1–7. Cited by: §1.
  • R. Kumar, V. V. Phoha, and R. Raina (2016b) Authenticating users through their arm movement patterns. arXiv preprint arXiv:1603.02211. Cited by: §1.
  • G. Li, P. R. Borj, L. Bergeron, and P. Bours (2019) Exploring keystroke dynamics and stylometry features for gender prediction on chat data. In 2019 42nd MIPRO, Cited by: 4th item, 5th item, §1, §2, Table 6.
  • O. Miguel-Hurtado, S. V. Stevenage, C. Bevan, and R. Guest (2016a) Predicting sex as a soft-biometrics from device interaction swipe gestures. Pattern Recognition Letters. Cited by: §1, §3.3.1, §3.3.
  • O. Miguel-Hurtado, S. V. Stevenage, C. Bevan, and R. Guest (2016b) Predicting sex as a soft-biometrics from device interaction swipe gestures. Pattern Recognition Letters. Cited by: §2.
  • A. Morales, J. Fierrez, R. Tolosana, J. Ortega-Garcia, J. Galbally, M. Gomez-Barrero, A. Anjos, and S. Marcel (2016) Keystroke biometrics ongoing competition. IEEE Access. Cited by: §3.3.1.
  • N. Cheng, X. Chen, R. Chandramouli, and K. P. Subbalakshmi (2009) Gender identification from e-mails. In 2009 IEEE Symposium on Computational Intelligence and Data Mining, Cited by: §3.3.1.
  • T. J. Neal and D. L. Woodard (2018a) A gender-specific behavioral analysis of mobile device usage data. In 2018 IEEE 4th -ISBA, Cited by: §2.
  • T. J. Neal and D. L. Woodard (2019) You are not acting like yourself: a study on soft biometric classification, person identification, and mobile device use. IEEE-T-BIOM. Cited by: §1.
  • T. J. Neal and D. L. Woodard (2018b) A gender-specific behavioral analysis of mobile device usage data. In 2018 IEEE -ISBA, Cited by: §3.3.1.
  • T. J. Neal and D. L. Woodard (2018) On the use of mobile calling patterns for soft biometric classification. In 2018 IEEE BTAS, Cited by: §2.
  • H. M. Nguyen, E. W. Cooper, and K. Kamei (2011) Borderline over-sampling for imbalanced data classification. International Journal of Knowledge Engineering and Soft Data Paradigms. Cited by: §3.3.
  • M. S. Nixon, P. L. Correia, K. Nasrollahi, T. B. Moeslund, A. Hadid, and M. Tistarelli (2015) On soft biometrics. Pattern Recognition Letters 68, pp. 218–230. Cited by: §1.
  • V. M. Patel, R. Chellappa, D. Chandra, and B. Barbello (2016) Continuous user authentication on mobile devices: recent progress and remaining challenges. IEEE Signal Processing Magazine 33 (4), pp. 49–61. Cited by: §1.
  • A. Pentel (2017) Predicting age and gender by keystroke dynamics and mouse patterns. In UMAP ’17, Cited by: §1, §2, Table 6.
  • B. Plank (2018) Predicting authorship and author traits from keystroke dynamics. In Proceedings of the Second Workshop on Computational Modeling of People’s Opinions, Personality, and Emotions in Social Media, Cited by: §1, §3.3.1, §3.3, Table 6, §5.2.
  • A. Primo, V. V. Phoha, R. Kumar, and A. Serwadda (2014) Context-aware active authentication using smartphone accelerometer measurements. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 98–105. Cited by: §1.
  • A. Rattani and M. Agrawal (2019) Soft-biometric attributes from selfie images. In Selfie Biometrics, Cited by: 2nd item, 3rd item, §1.
  • J. Roth, X. Liu, and D. Metaxas (2014) On continuous user authentication via typing behavior. IEEE Transactions on Image Processing 23 (10), pp. 4611–4624. Cited by: §1.
  • J. Roth, X. Liu, A. Ross, and D. Metaxas (2015) Investigating the discriminative power of keystroke sound. IEEE T-FIS. Cited by: §1.
  • scikit (2020) Cross-validation: evaluating estimator performance. Note: https://scikit-learn.org/stable/modules/cross_validation.html Online; accessed May 1, 2020 Cited by: Figure 3.
  • A. Serwadda, V. V. Phoha, Z. Wang, R. Kumar, and D. Shukla (2016) Toward robotic robbery on the touch screen. ACM Trans. Inf. Syst. Secur. 18 (4). External Links: ISSN 1094-9224. Cited by: §1.
  • T. Sim and R. Janakiraman (2007) Are digraphs good for free-text keystroke dynamics?. In 2007 IEEE CVPR, Cited by: §2, 3rd item, §3.2.
  • L. Sun, Y. Wang, B. Cao, P. S. Yu, W. Srisa-An, and A. D. Leow (2017) Sequential keystroke behavioral biometrics for mobile user identification via multi-view deep learning. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 228–240. Cited by: §3.3.1, §3.3.2.
  • S. Z. Syed Idrus, E. Cherrier, C. Rosenberger, S. Mondal, and P. Bours (2015) Keystroke dynamics performance enhancement with soft biometrics. In IEEE-ISBA, Cited by: 2nd item, §2.
  • P. S. Teh, A. B. J. Teoh, and S. Yue (2013) A survey of keystroke dynamics biometrics. The Scientific World Journal 2013. Cited by: §1, §1, §3.2.
  • R. Thanganayagam, S. Kannan, and T. Arivoli (2019) Machine learning based soft biometrics for enhanced keystroke recognition system. Multimedia Tools and Applications. Cited by: 2nd item.
  • I. Tsimperidis, P. D. Yoo, K. Taha, A. Mylonas, and V. Katos (2020) R2BN: an adaptive model for keystroke-dynamics-based educational level classification. IEEE Transactions on Cybernetics. Cited by: §2.
  • I. Tsimperidis, A. Arampatzis, and A. Karakos (2018) Keystroke dynamics features for gender recognition. Digital Investigation. Cited by: §1, §3.3, Table 6.
  • I. Tsimperidis and V. Katos (2013) Keystroke forensics: are you typing on a desktop or a laptop?. In Proceedings of the 6th Balkan Conference in Informatics, Cited by: §3.3.1.
  • Y. Uzun, K. Bicakci, and Y. Uzunay (2015) Could We Distinguish Child Users from Adults Using Keystroke Dynamics?. arXiv e-prints. Cited by: §1, §2, Table 6.
  • L. M. Vizer and A. Sears (2015) Classifying text-based computer interactions for health monitoring. IEEE Pervasive Computing. Cited by: §1.