Continuous Authentication of Smartphones Based on Application Usage

07/18/2018 ∙ by Upal Mahbub, et al. ∙ University of Maryland University of Oulu 0

An empirical investigation of active/continuous authentication for smartphones is presented in this paper by exploiting users' unique application usage data, i.e., distinct patterns of use, modeled by a Markovian process. Variations of Hidden Markov Models (HMMs) are evaluated for continuous user verification, and challenges due to the sparsity of session-wise data, the explosion of states, and the handling of unforeseen events in the test data are tackled. Unlike traditional approaches, the proposed formulation does not depend on the top N apps; rather, it uses the complete app-usage information to achieve low latency. Through experimentation, the impact of unforeseen events, i.e., unknown applications and unforeseen observations, on user verification is assessed empirically via a modified edit-distance algorithm for simple sequence matching. It is found that, for enhanced verification performance, unforeseen events should be incorporated into the models by adopting smoothing techniques with HMMs. For validation, extensive experiments are performed on two distinct datasets. The marginal smoothing technique is the most effective for user verification in terms of equal error rate (EER), and with a sampling rate of one sample every 30 seconds and 30 minutes of historical data, the method is capable of detecting an intrusion within 2.5 minutes of application use.


I Introduction

With the rapid increase of smartphone users worldwide, mobile applications are growing both in number and popularity [38]. The number of apps in the Google Play store is around 8 million, while the Apple App Store, Windows Store and Amazon Appstore hold around 2.2 million, 669 thousand, and 600 thousand applications, respectively (https://www.statista.com/statistics/276623/number-of-apps-available-in-leading-app-stores/). It has been estimated that a total of 197 billion mobile applications were downloaded in 2017 (https://www.statista.com/statistics/271644/worldwide-free-and-paid-mobile-app-store-downloads/). A retrospective study in 2016 showed that, on average, a smartphone user uses a large number of different mobile applications per month and multiple different applications per day [1]. As for usage duration in 2016, smartphone users in the USA spent over two hours per day on mobile applications, i.e., more than a month of cumulative application usage in a year [1]. With growing concerns over smartphone security, monitoring application usage, coupled with the diverse pool of applications, can make a difference in user authentication systems.

Smartphone application usage data can provide several interesting insights about device users, leading to different use cases for such data. There are several research works on user profiling and on predicting behavioral patterns using application usage data [39][38][34][37]. Predicting application usage patterns can also help optimize smartphone resources and simulate realistic usage data for automated smartphone testing [20][15][18][8][16][4]. The open foreground application can also serve as a context for active authentication using other modalities [7][17][26][27]. For example, when verifying with touch and accelerometer data, the application currently running can provide useful context for robust authentication. Intuitively, the way a user handles and swipes on a phone for a banking application is very different from the way they do for a gaming application. The foreground application context can be even more useful for active authentication if more insightful information about the applications is available as metadata. For example, one key idea of active/continuous authentication is to gradually block a probable intruder, starting from the most sensitive applications, such as banking and social media accounts [31][23]. If the sensitivity level or the type of application is known as metadata, it would be possible to attain enhanced security. Also, some applications, if permitted, can access location data and store click information for targeted advertisement and similar purposes [21]. A more active use case of application-usage data is verifying users solely from their pattern of usage. The different use cases of app-usage data are shown in Fig. 1.

Fig. 1: Use cases for smartphone app-usage data.

In this paper, the suitability of application-usage data as a modality for smartphone user verification is thoroughly investigated. The main contributions of this paper are:

  • An innovative formulation that utilizes the application usage data pattern as a biometric for user verification. The formulation tackles key challenges such as data sparsity and accounting for unforeseen test observations. Unlike traditional approaches that use the top N applications for authentication [11], the proposed formulation considers the full list of applications in the verification models in order to ensure the low latency that is essential for active authentication systems.

  • Insight into the application usage similarity among different users and statistics on unforeseen applications.

  • A thorough investigation of the impact of unknown applications and unforeseen observations on the verification task.

  • A Modified Edit-Distance (M-ED) algorithm and experiments to demonstrate the advantage of including unforeseen events during sequence matching.

  • Modeling the Person Authentication using Trace Histories (PATH) problem as a variation of the person authentication using location histories [22].

The paper is organized as follows. In Section II, background and related works on this topic are discussed. In Section III, the approach is explained in detail, along with associated challenges and possible solutions. The impact of unknown applications and unforeseen events is investigated in Section IV, and several methods for handling the active authentication problem effectively are described in Section V. Finally, a detailed analysis of the application usage data, experimental results and discussions are presented in Section VI, followed by conclusions and suggestions for future work in Section VII.

II Related Work

In this section, some of the most recently published literature on active authentication and on the utilization of application usage data is reviewed. The section also discusses the use of two active authentication datasets (UMDAA-02 [23] and Securacy [9]) in this research work.

II-A Active/Continuous Authentication of Smartphones

Active, continuous, or implicit authentication are different terms for the same authentication approach, in which the rightful user of a mobile device is authenticated throughout the entire session of usage [13][29][23]. In recent years, active authentication research has gained a lot of attention because of increased security risks and the complexity of password-based, token-based, multi-factor and other explicit authentication systems [29]. In active authentication, the wide range of sensor data available on mobile devices is utilized to learn one or more templates for the legitimate user during a training session. The templates are used in the background to continuously authenticate the user during regular usage and, based on the amount of deviation from the templates, the device itself starts restricting access to phone applications and utilities, starting from the most sensitive ones [23]. The most popular modalities for active authentication are front-camera face images [32][14][5], touch-screen gesture data [10][6][40], accelerometer and gyroscope data [12][30][28], location data [22], etc. The suitability of different behavioral biometrics, such as touch and keystroke dynamics, phone pick-up patterns, gait dynamics, and patterns from location trace history, has been explored for active authentication [25][19][22]. Combinations of multiple biometrics have been demonstrated to produce robust authentication on real-life data (http://www.biometricupdate.com/201506/atap-division-head-previews-behavioral-biometrics-system-at-google-io).

II-B Prior Research on Application-Usage Data

In recent years, there has been a lot of focus on predicting individual and community-wise application usage patterns [2]. For example, in [38], the authors investigate the ratio of local and global applications in the top-usage list, the traffic pattern for different application categories, the likelihood of co-occurrence of two different applications, and other such usage patterns. In that work, the authors identify traffic from distinct applications using HTTP signatures. On the other hand, in [35] the authors use mobile in-app advertisements to identify the applications in network traces. Using the ad-flow data, the authors analyze the usage behavior of different types of applications. In [39], the authors analyzed the application-usage logs of a large number of smartphone users worldwide to develop an app-usage prediction model that leverages user preferences, historical usage patterns, activities, and shared aggregate patterns of application behavior.

On the authentication front, in [17] the authors proposed an application-centric decision approach for active or implicit authentication, in which applications are used as context to decide which modalities to use to authenticate a user and when to do so. Application usage data has also been used to generate scores for user authentication in [11]. However, the authors only considered the frequency of occurrence of an application in the training set to determine the likelihood of it being a particular user, missing the temporal variation in the usage pattern.

An interesting use case of application-usage data is presented in [33]. The authors used a large-scale annotated application-usage dataset to build a predictor that can estimate where a person is (e.g., at home or at the office) and whether he/she is with a close friend or a family member. In [20], the authors used application usage traces, along with system status and sensor indicators, to predict the battery life of phones using machine learning techniques.

II-C Datasets on Application Usage

Even though there have been diverse research approaches that need application-usage data, publicly available datasets are scarce. Also, many of the application-usage datasets cover a limited number of applications or do not contain unbounded real-life usage data, but instead contain data generated under supervision or by following certain instructions. In this work, all experiments are performed on two well-known, large-scale public datasets suitable for investigating the active authentication problem, namely the application-usage data of the University of Maryland Active Authentication Dataset-02 (UMDAA-02, available at https://umdaa02.github.io/) [24] and the Securacy dataset (available at http://ubicomp.oulu.fi/securacy-understanding-mobile-privacy-and-security-concerns/) [9] from the Center for Ubiquitous Computing, University of Oulu.

II-C1 UMDAA-02 Application-Usage Dataset

  • No. of subjects with sufficient training and test samples at the chosen sampling rate (train/test)
  • Avg. no. of sessions per user with app-usage data for the selected subjects (train/test)
  • Train/test split for the experiment
  • Total number of unique applications used by the selected subjects (train/test)
  • Average number of samples per user for the selected subjects (train/test)
TABLE I: General information on application-usage data available in the UMDAA-02 dataset.

The UMDAA-02 dataset is specifically designed for evaluating active authentication systems in the wild. The dataset consists of 141.14 GB of smartphone sensor data collected from volunteers who used Nexus 5 phones in their regular daily activities over a period of two months. The data collection application ran completely in the background, and the collected data include the front-facing camera, touchscreen, gyroscope, accelerometer, magnetometer, light sensor, GPS, Bluetooth, WiFi, proximity sensor, temperature sensor and pressure sensor, along with the timing of screen unlock and lock events, start and end timestamps of calls, the currently running foreground application, etc. The application usage data from the users is summarized in Table I. However, not all users have an adequate amount of usage data. For all the experiments in this paper, only those users are included who have sufficient training and test samples at every sampling rate considered. The usage statistics for the top 20 applications for the selected subjects are presented in Table II. The usage rate of the top applications for each user is shown in Fig. 2(a). From the table and the figure, it is readily seen that the applications ranked 6th, 8th, 12th and 20th are in the top list because of excessive usage by very few users, whereas the remaining applications are genuinely popular among the users.

Rank App Name No. of Users Per User Usage Overall Usage
1 com.google.android.googlequicksearchbox 26 283.27 283.27
2 com.android.dialer 25 255.24 245.42
3 com.whatsapp 15 303.6 175.15
4 com.android.chrome 26 141.42 141.42
5 com.facebook.katana 11 308.18 130.38
6 com.nextwave.wcc2 1 2366 91
7 com.google.android.youtube 16 144.38 88.85
8 com.ea.game.pvzfree 2 872.5 67.12
9 com.google.android.gm 24 51.04 47.12
10 com.android.mms 22 52.09 44.08
11 com.google.android.talk 18 62.28 43.12
12 com.andrewshu.android.reddit 1 842 32.38
13 com.nextbus.mobile 19 41.89 30.62
14 com.google.android.apps.docs 24 33 30.46
15 com.android.settings 24 27.71 25.58
16 com.google.android.apps.maps 14 44 23.69
17 com.android.camera2 22 20.5 17.35
18 com.google.android.gallery3d 17 24.94 16.31
19 com.android.vending 21 20.1 16.23
20 com.viber.voip 5 74.6 14.35
TABLE II: App-usage statistics for the top apps for the selected users of the UMDAA-02 dataset.
Fig. 2: Similarity matrix depicting top application-usage rate among users in the training set of the (a) UMDAA-02 dataset and, (b) Securacy dataset.

II-C2 Securacy Application Usage Dataset

The Securacy dataset was originally created to explore the privacy and security concerns of smartphone users by analyzing the locations of the servers that different applications use and whether secure network connections are used. Over a period of approximately six months, data was collected from 218 anonymous participants who installed the data collection application from the Google Play store. The collected data, 679.90 GB in total, include the currently running foreground application, installed, removed or updated applications, application server connections, device location, etc. A subset of the 218 users of the original Securacy dataset is used for this experiment, based on the limits on training and test observations mentioned for the UMDAA-02 dataset. The application usage data for the subjects in the Securacy dataset are summarized in Table III, and the corresponding usage statistics for the top 20 applications are presented in Table IV. The usage rate of the top applications for each user is shown in Fig. 2(b). Note that the applications ranked 1st, 2nd and 4th in Table IV are actually the same application written in Spanish, English and Finnish, respectively. Similarly, rank 12, 'Horloge', is 'Clock' in French, and is therefore the same application as rank 19. However, these applications are shown separately here because, for the active authentication problem, even the preferred language of the user is a type of biometric metadata and can be used to discriminate between users. Also, similar to the UMDAA-02 usage statistics, several applications in the top ranks were actually used very frequently by only a few users (ranked 1, 4, 9, 12, 16).
For this dataset, this phenomenon can partly be attributed to language differences as well: if the language differences were nullified, then ranks 1, 2 and 4 would collapse into rank 1, and ranks 12 and 19 into rank 12, thereby removing three applications from the list (ranks 1, 4 and 12) that have very few users. For the user verification research presented here, the language variation is kept unaltered in order to retain the naturalness of the dataset, and the algorithms are expected to learn to discriminate between users based on language as well as on usage pattern.

  • No. of subjects with sufficient training and test samples at the chosen sampling rate (train/test)
  • Avg. no. of sessions per user with app-usage data for the selected subjects (train/test)
  • Train/test split for the experiment
  • Total number of unique applications used by the selected subjects (train/test)
  • Average number of samples per user for the selected subjects (train/test)
TABLE III: General information on application-usage data available in the Securacy dataset.
Rank App Name No. of Users Per User Usage Overall Usage
1 Sistema Android 4 9972.25 402.92
2 Android System 80 480.44 388.23
3 com.android.keyguard 34 802.79 275.71
4 Android-järjestelmä 5 4820.8 243.47
5 System UI 80 242 195.56
6 Nova Launcher 19 794.79 152.54
7 Maps 38 363.08 139.36
8 Google Search 53 214.3 114.73
9 Launcher 12 650 78.79
10 Chrome 60 128.2 77.7
11 Facebook 49 154.53 76.48
12 Horloge 1 7328 74.02
13 YouTube 49 144.94 71.74
14 TouchWiz home 20 348.3 70.36
15 Securacy 84 75.39 63.97
16 Internet 16 371.25 60
17 WhatsApp 37 154.62 57.79
18 Google Play Store 72 71.83 52.24
19 Clock 44 113.89 50.62
20 Package installer 36 138.69 50.43
TABLE IV: App-usage statistics for the top apps for the selected users of the Securacy Dataset.

III Problem Formulation

The application usage data from smartphones, coupled with timing information, can be used to determine the exact time of day and duration of use of any application. It is assumed here that there might be certain patterns in the usage of different applications at different times of the day, or during weekdays versus weekends. Hence, a state-space model can intuitively be considered for modeling the pattern of application usage of a particular user. The models of different users are assumed to differ because of the differences in lifestyle of each individual. Therefore, the state-space model of a user can effectively be considered a model of the pattern of life of that user, and can be used to differentiate the user from others. There are, however, several challenges to this approach to solving the authentication problem using application usage:

  • Forming observation states from the application data and corresponding timing information.

  • Training a state-space model in a way that it can handle unforeseen observations during testing.

  • Generating verification scores from sequential observation data.

Each of these challenges and the proposed solutions are discussed here.

III-A Application Names to Observation States

Incorporating the temporal information with the application name is a challenge because a user can use an application at any time, and therefore the set of all combinations of applications and probable times is intractable even if we sample at a relatively high frequency. For example, if there are N applications and we sample every τ minutes, then there would be 1440/τ unique timestamps in a day and 10080/τ timestamps in a week. This would mean a total of N × 10080/τ observation states for the N applications in a week. However, for a single application, most of these observation states will either not occur at all or occur very infrequently in the training set. Hence, training reliable state-space models with these sparsely occurring observation states would be difficult.
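To make the combinatorial argument above concrete, here is a back-of-envelope sketch; the function name and the example values (100 applications, 1-minute sampling) are illustrative assumptions, not figures from the paper.

```python
# Naive count of observation states when every (app, weekly timestamp)
# pair becomes its own state; this is the explosion the text warns about.
def naive_state_count(n_apps: int, sample_minutes: int) -> int:
    stamps_per_day = 24 * 60 // sample_minutes   # 1440 for 1-minute sampling
    stamps_per_week = 7 * stamps_per_day         # 10080 for 1-minute sampling
    return n_apps * stamps_per_week

print(naive_state_count(100, 1))   # -> 1008000
```

Even a modest app pool at minute-level sampling yields on the order of a million states, most of which never occur in any finite training set.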

In this regard, the time-zone and weekday/weekend-flag idea is adopted from [22]. By dividing the day into three distinct time zones (TZs), namely (12:01 am to 8:00 am), (8:01 am to 4:00 pm) and (4:01 pm to 12:00 am), and denoting weekday/weekend with a binary flag, the total number of possible observation states per application is kept limited to six. Two mapping functions assign any timestamp to its corresponding time zone and weekday/weekend flag, respectively. The impact of converting application tags into such observations on verifying the users of the UMDAA-02 app-usage data and the Securacy dataset can be visualized from Figs. 3(a)-(b) and 3(c)-(d), respectively. The similarity matrix in Fig. 3(a) depicts the percentage of common applications between two users in the UMDAA-02 training dataset, whereas the similarity matrix in Fig. 3(b) depicts the percentage of common observations between any two users on the same dataset. It is clear that the similarity of observations between two different users is less than the similarity of applications. The effect is less visible on the Securacy dataset (Figs. 3(c)-(d)) because its subjects come from a more diverse population than the subjects of the UMDAA-02 dataset. Hence, the similarity of applications is less pronounced; yet the difference between application similarity and observation similarity is still present.
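The observation construction described above can be sketched as follows; the time-zone boundaries follow the text, while the function names and the numeric labels 1-3 for the time zones are assumptions.

```python
from datetime import datetime

def time_zone(ts: datetime) -> int:
    """1 for 12:01am-8:00am, 2 for 8:01am-4:00pm, 3 for 4:01pm-midnight."""
    minute_of_day = ts.hour * 60 + ts.minute
    if minute_of_day <= 8 * 60:        # up to and including 8:00 am
        return 1
    if minute_of_day <= 16 * 60:       # up to and including 4:00 pm
        return 2
    return 3

def day_flag(ts: datetime) -> int:
    """1 for a weekday, 0 for a weekend day."""
    return 1 if ts.weekday() < 5 else 0

def observation(app: str, ts: datetime) -> tuple:
    """Map an (application, timestamp) pair to a coarse observation state."""
    return (app, time_zone(ts), day_flag(ts))

print(observation("com.whatsapp", datetime(2016, 3, 14, 9, 30)))
# -> ('com.whatsapp', 2, 1), i.e. a weekday-morning observation
```

Collapsing timestamps to six (time zone, flag) slots is what keeps the state space to six observations per application instead of thousands.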

Fig. 3: Similarity matrices depicting (a) application-name overlap and (b) observation overlap for the training set of the UMDAA-02 dataset. Similarly, (c) and (d) depict the application overlap and observation overlap for the training set of the Securacy dataset.

III-B Taking Unknown Applications into Account

Now, in order to handle unknown applications that might be present in the test set, an additional application label, unknown, is considered. This label adds six observation states when combined with the TZs and the weekday/weekend flag. Note that the unknown application never occurs in the training set, so all observations with the unknown label are assigned a very small prior probability when the state-space models are trained. It is also ensured for the state-space models that the emission probabilities of the states with the unknown application do not go to zero, in order to prevent zero probability scores during testing when unknown applications are encountered. If the set of unique applications used by a user in the training set is known, then any application in a test sequence that falls outside this set is denoted as unknown. In [22], the authors addressed similar issues for geo-location data by considering even more additional states, such as nearby unknowns. However, proximity is a vague concept for application data, and therefore only the single unknown label is considered here. Note that any observation with an unknown application is unforeseen by default, but an unforeseen observation involving a known application is not unknown.

Note that, apart from unknown applications, unforeseen observations might be present in the test set. For example, in the training set an application might occur only on weekdays in two of the time zones, while the same application might be used in the test set in the third time zone on a weekday. In that case, the test observation would be unforeseen in the training set. For state-space models, this problem is handled by generating all possible combinations of application, time zone and day flag and using them to construct the model. If such an observation is not present in the training set, it is assigned non-zero prior and emission probabilities to ensure that it does not bring the probability of a test sequence down to zero.
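A minimal sketch of this scheme: the vocabulary enumerates every (application, time zone, flag) combination plus an explicit unknown label, and observations never seen in training receive a small probability floor. The label UNK and the floor EPS are assumptions; the paper does not specify such values.

```python
from itertools import product

UNK = "<unknown>"   # assumed label for any app absent from training
EPS = 1e-6          # assumed probability floor for unseen observations

def build_vocab(train_apps):
    """All (app, time-zone, weekday-flag) combinations, unknown app included."""
    apps = sorted(set(train_apps)) + [UNK]
    return [(a, tz, f) for a, tz, f in product(apps, (1, 2, 3), (0, 1))]

def smoothed_counts(train_obs, vocab):
    """Relative frequencies with a non-zero floor on every legal observation."""
    counts = {o: EPS for o in vocab}
    for o in train_obs:
        counts[o] = counts.get(o, EPS) + 1
    total = sum(counts.values())
    return {o: c / total for o, c in counts.items()}

vocab = build_vocab(["com.whatsapp", "com.android.dialer"])
probs = smoothed_counts([("com.whatsapp", 2, 1)], vocab)
assert min(probs.values()) > 0   # no observation can zero out a test sequence
print(len(vocab))                # -> 18, i.e. (2 apps + unknown) * 3 TZ * 2 flags
```

Because every legal observation, foreseen or not, keeps non-zero mass, a single surprising event lowers a sequence's likelihood without collapsing it to zero.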

III-C Handling Uncertainty

Now that unknown applications and unforeseen observation states have been addressed, we tackle the creation of observation states via binning of time-stamped data. In most cases, data collection is done in sessions, where a session starts with unlocking the phone and stops when the phone is locked again. Even when this is not the case, there can be very long idle times between consecutive usages of a phone, during which authentication is a redundant operation and no application is running in the foreground [36]. Hence, there can be a big gap between the start time of an application and the stop time of the previous application in the data log. This time gap may range from several seconds to several days, even for a user who owns a smartphone for regular use [3]. The sparsity introduced by this time gap is handled in two ways. At the beginning of each session (unlocking of the phone), a dummy session-start observation state is introduced. The state-space model is expected to learn that this marker represents a time gap which might or might not cause a change in the time zone. For example, the last used application might fall in one time zone before the closing of a session, while the next session may occur in any of the time zones of the same day. If the next session occurs on the next day, or if the day changes within a running session, an additional day-change flag is introduced which denotes the transition into the next day. The time-zone and weekday/weekend flags are ignored for the session-start and day-change observations.

So, taking the six probable observations for the unknown application and the session-start and day-change observations into consideration, the total number of possible observation states for a user with N unique training applications would be 6(N + 1) + 2.
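This count can be expressed as a one-line helper. The formula is a reconstruction from the pieces enumerated in the text (six states per application, six more for the unknown application, plus the two special markers), and the function name is illustrative.

```python
def total_states(n_user_apps: int) -> int:
    # 6 (time-zone, flag) combinations per app, including the unknown app,
    # plus the session-start and day-change markers
    return 6 * (n_user_apps + 1) + 2

print(total_states(50))   # -> 308 states for a user with 50 training apps
```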

Fig. 4: Overview of an application-usage-based user verification system for mobile devices.

III-D System Overview

A diagram depicting an application-usage-based user verification system is shown in Fig. 4. Once the observation sequence is extracted, a verification model can be trained based on the patterns in the sequence. The verification model can be a state-space model, a string matching approach, or even a recurrent neural network, depending on data availability and need. For state-space models, once training for a user is done, the model can be used to generate scores for the last n test observations, with sequences created using the same protocol as in the training phase. The score can be thresholded to obtain the verification decision. For simpler methods such as sequence matching, unknown applications and unforeseen observations are difficult to handle. For the authentication problem, unknown applications and unforeseen observations play key roles, as described in the next section.
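The score-and-threshold loop of Fig. 4 can be sketched generically. Everything here is a placeholder: `score_fn` stands in for whichever verification model is used (HMM likelihood, sequence matching, etc.), and the toy whitelist score exists only to make the example runnable.

```python
from collections import deque

def verify_stream(observations, score_fn, n=20, threshold=0.5):
    """Score the last n observations under the user's model; accept if above threshold."""
    window = deque(maxlen=n)   # keeps only the n most recent observations
    decisions = []
    for obs in observations:
        window.append(obs)
        decisions.append(score_fn(list(window)) >= threshold)
    return decisions

# toy score: fraction of observations drawn from a known-app whitelist
known = {"mail", "maps"}
score = lambda seq: sum(o in known for o in seq) / len(seq)
print(verify_stream(["mail", "maps", "game", "game"], score, n=2, threshold=0.5))
# -> [True, True, True, False]
```

The sliding window is what gives the system its latency: the intrusion is flagged only after enough anomalous observations displace the legitimate ones.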

IV The Role of Unknown Applications and Unforeseen Observations in User Verification

IV-A Statistics of unknown applications in the test data

If an application is present in the test set but was not encountered in the training set, the application is denoted as an unknown application in the proposed formulation. Intuitively, the prevalence of unknown applications will be much higher if the test set comes from a different user or from an intruder, while for the legitimate user the test set will have fewer unknown applications. This intuition is verified on the application usage data from both the UMDAA-02 and Securacy datasets, as can be seen from the box plots in Fig. 5.

Fig. 5: Boxplots depicting the percentage of unknown applications in the test data for (a) the UMDAA-02 dataset and (b) the Securacy dataset, for different sampling rates. Note that on both datasets the average percentage of unknown applications is much larger when the test data comes from a different user than when it comes from the same user.

Note that the gap between the whisker plots for the same user and for different users is larger for the Securacy dataset in comparison to the UMDAA-02 dataset. Securacy is a larger dataset with more users, more data per user, and more variation in user demographics compared to UMDAA-02, in which the subjects were from a narrow age range and were all affiliated with the same institution. Hence, it shows that among the general population, even the selection of applications varies widely between users.

IV-B Impacts of binary decisions based on unforeseen events

Two simple experiments with unknown applications and unforeseen observations are performed on the UMDAA-02 and Securacy datasets to evaluate their role in user verification. The observations for each user are chronologically sorted, and the earliest observations are used for training and the rest for testing. For any user in the training set, a sequence of training observations is obtained along with that user's set of unique applications. Each test sequence of a user is then compared with the training sequences and application lists of the training subjects, and different binary hard decision rules are applied in the two experiments. In the first experiment, the binary decision rule is based on the occurrence of an application in the test set that is not present in the training set. In the second experiment, the decision is based on the occurrence of an unforeseen observation in the test set. In both cases, if there is even a single occurrence of an unknown application or an unforeseen observation, the match score is set to 0; otherwise it is set to 1. The matching algorithms for the two experiments are shown in Algorithms 1 and 2, respectively. The data sampling rate for both of these experiments was set to one sample every 30 seconds, resulting in a large set of training-test sequence pairs from the users of the UMDAA-02 application-usage dataset and of the Securacy dataset who had adequate training and test data.

procedure BinUnk(A, s)    ▷ A: list of unique training applications of user u; s: n-last test sequence of user v
     for each observation o in s do    ▷ loop through all test observations
          a ← app(o)    ▷ get the application name from the test observation
          if a ∉ A then
               return 0.0    ▷ return score 0.0 if any unknown application is encountered
          end if
     end for
     return 1.0    ▷ return score 1.0 if no unknown application in the test sequence
end procedure
Algorithm 1 Binary Decision Rule based on Unknown Applications
procedure BinUnfore(S, s)    ▷ S: sequence of training observations of user u; s: n-last test sequence of user v
     for each observation o in s do    ▷ loop through all test observations
          if o ∉ S then
               return 0.0    ▷ return score 0.0 if any unforeseen observation is encountered
          end if
     end for
     return 1.0    ▷ return score 1.0 if no unforeseen observation in the test sequence
end procedure
Algorithm 2 Binary Decision Rule based on Unforeseen Observations
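The two hard decision rules can also be written as runnable Python; observations are encoded as (app, time-zone, flag) tuples per the formulation above, and the function and variable names are assumptions.

```python
def bin_unk(train_apps: set, test_seq) -> float:
    """Algorithm 1: score 0.0 if any test observation uses an app unseen in training."""
    for obs in test_seq:
        app = obs[0]                 # application name is the first tuple element
        if app not in train_apps:
            return 0.0
    return 1.0

def bin_unfore(train_obs: set, test_seq) -> float:
    """Algorithm 2: score 0.0 if any full test observation never occurred in training."""
    for obs in test_seq:
        if obs not in train_obs:
            return 0.0
    return 1.0

train = {("mail", 2, 1), ("maps", 3, 1)}
apps = {a for a, _, _ in train}
print(bin_unk(apps, [("mail", 1, 0)]))     # known app, unforeseen slot -> 1.0
print(bin_unfore(train, [("mail", 1, 0)])) # same sequence -> 0.0
```

The example highlights the distinction the text draws: an observation can be unforeseen (new time slot) without its application being unknown.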
Fig. 6: (a) Sensitivity, (b) Specificity, (c) F1-Score, and (d) Accuracy (in %) obtained by varying the sequence length for the Securacy and UMDAA-02 application-usage data using the binary hard decision rules based on unknown applications and unforeseen observations.

Results for several evaluation metrics, namely sensitivity, specificity, F1-score and accuracy (all in percent), obtained through the two experiments on the two datasets are shown in Figs. 6(a)-(d). The definitions of these metrics are as follows:

Sensitivity = TP / (TP + FN),    (1)
Specificity = TN / (TN + FP),    (2)
F1-Score = 2TP / (2TP + FP + FN),    (3)
Accuracy = (TP + TN) / (TP + TN + FP + FN),    (4)

where TP, TN, FP and FN are the numbers of true positive, true negative, false positive and false negative detections, respectively. High sensitivity implies a smaller number of false negatives, while high specificity implies fewer false positives. Accuracy over 50% denotes that correct predictions outweigh the false ones. Finally, a higher F1-score implies better overall precision and recall.
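These four metrics follow directly from the confusion counts; the sketch below uses the standard definitions, which are assumed to match the paper's, and the counts are made-up numbers for illustration.

```python
def metrics(tp, tn, fp, fn):
    sensitivity = tp / (tp + fn)                  # true positive rate (recall)
    specificity = tn / (tn + fp)                  # true negative rate
    f1 = 2 * tp / (2 * tp + fp + fn)              # harmonic mean of precision/recall
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, f1, accuracy

print(metrics(tp=40, tn=30, fp=20, fn=10))
# -> (0.8, 0.6, 0.7272727272727273, 0.7)
```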

Fig. 6 gives the following interesting insights about the impact of the unknown applications and unforeseen observations on the performance metrics for the two datasets.

  • With increasing sequence length n, the specificity increases gradually in all cases, while the sensitivity decreases. The decrease in sensitivity is probably due to the fact that the probability of encountering an unknown application in a sequence increases with increasing sequence length, thereby increasing the chance of false negatives. On the other hand, with increasing n more sequences are labeled as negatives, which in effect reduces the number of false positives and therefore increases the specificity.

  • The sensitivity drops drastically when unforeseen observations are used instead of unknown applications as the decision criterion. This is understandable, since the number of false negatives increases rapidly when all sequences with at least one unforeseen observation are marked as data from a different user.

  • The number of false positives decreases when unforeseen observations are used for the decision instead of unknown applications. This leads to a jump in specificity for a fixed n. In general, the specificity is much higher for the Securacy dataset in comparison to UMDAA-02. This indicates that there are more unknown applications and unforeseen observations in Securacy when comparing a user with others. Securacy, being a more diverse and larger dataset, has a wider variation of information, which leads to this phenomenon. Here, the training data for each user is longer, meaning that it is a much closer representation of real life, and therefore an unknown application or unforeseen observation does indicate a different user's data in most cases.

  • Higher specificity, however, does not mean that a simple binary classifier based on unforeseen observations is reasonably good for real-life data. The F1-Score is very low for both datasets, which means either precision or recall or both of them are very low. Since, in active authentication using application usage, the number of positive pairs is largely outweighed by the number of negative pairs, it can be assumed that most errors are false negatives, i.e., FN > FP. Since Precision = TP/(TP+FP) and Recall = TP/(TP+FN), that means Recall < Precision. With increasing n, FN increases while FP decreases, leading to a reduction in recall and an increase in precision. However, given that the F1-Score does not improve much with increasing n, it can be assumed that recall falls steeply while precision does not improve much.

  • Irrespective of whether the decision is based on unknown applications or unforeseen observations, the accuracy is always lower for the UMDAA-02 dataset than for the Securacy dataset. Even though the application-usage information in Securacy is much larger than in UMDAA-02, the binary hard measure performs poorly on UMDAA-02, probably due to the high demographic similarity among its subjects. In practice, no assumption can be made about the demographic similarity or dissimilarity of a user and an intruder; hence, neither unknown applications nor unforeseen observations can serve as a practical hard decision metric for the active authentication problem.

  • The experiment once again shows that ’accuracy’ is not a good performance metric when the number of samples between classes is severely imbalanced. In this example, the percentage of positive pairs is very small for both the UMDAA-02 and the Securacy datasets. Being an open-set problem, the task involves data heavily biased towards negative samples, and better performance measures in this regard are receiver operating characteristic (ROC) curves and equal error rates (EER) instead of accuracy.
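The accuracy pitfall noted above is easy to demonstrate: under a heavy negative-class majority, even a trivial classifier that rejects every pair scores high accuracy while never accepting the genuine user. The class ratio below is invented for illustration.

```python
# Why 'accuracy' misleads on heavily imbalanced verification data:
# a trivial rejector that labels every pair "impostor" scores high
# accuracy but has zero recall.

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

# 1% positive (genuine) pairs, 99% negative (impostor) pairs -- illustrative.
n_pos, n_neg = 10, 990

# Trivial classifier: reject everything.
tp, fp, tn, fn = 0, 0, n_neg, n_pos
print(accuracy(tp, fp, tn, fn))   # 0.99 despite never accepting the genuine user
```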

procedure M-ED(T, S) ▷ T: training observation sequence of a user of length m; S: n-last test observation sequence of any user, where n = length(S) ≤ m.
     Dmin ← ∞
     for i ← 1 to m − n + 1 do
          NOp ← 0
          W ← (T(i), …, T(i + n − 1)) ▷ Window of T aligned with S.
          for j ← 1 to length(S) do
               if W(j) = S(j) then
                    ▷ Exact match, no operation needed.
               else
                    (aW, tW, dW) ← W(j) ▷ Extract application name, timezone and day flag from the observation.
                    (aS, tS, dS) ← S(j)
                    if aW = aS and (tW = tS or dW = dS) then
                         NOp ← NOp + 1 ▷ One substitution needed if only the timezone or the day does not match.
                    else if aW = aS then
                         NOp ← NOp + 2 ▷ Two substitutions needed if neither the timezone nor the day matches.
                    else
                         NOp ← NOp + 3 ▷ Three substitution operations for no match.
                    end if
               end if
          end for
          Dmin ← min(Dmin, NOp)
     end for
     return Dmin
end procedure
Algorithm 3 Pseudocode for the modified edit-distance (M-ED) algorithm.

Iv-C Impacts of ignoring unforeseen events

Now that the impact of unforeseen events on the authentication problem is established, a slightly more advanced sequence matching approach based on the Levenshtein distance, a.k.a. edit distance (ED), is used to study the impact of ignoring unknown applications and unforeseen events. When matching one sequence to another of the same length, the original ED counts the number of deletions, insertions, or substitutions required to transform one into the other. For the active authentication problem, assume that a test observation sequence of length n is to be matched against a training observation sequence of length m, where, intuitively, m ≫ n. Since each observation consists of an application name, a timezone, and a day flag, the cost of a mismatch can be made to depend on how much of the observation matches. For example, if only the application name matches, then the timezone and the day flag need to be substituted, leading to two operations. Based on this idea, a modified edit distance algorithm (M-ED) is presented in Algorithm 3. Using this algorithm, three different tests are performed on the UMDAA-02 dataset, the results of which are given in Table V. In the first test, all test observations are included, while in the next two tests the observations with unknown applications and the unforeseen observations, respectively, are ignored. To ignore the unforeseen observations, each test sequence is compared against the training sequence to identify the unforeseen observations, which are then removed from the test sequence; for unknown applications, the corresponding observations are removed. These operations reduce both the number of samples per user and the number of unique applications in the test data. As can be seen from Table V, the lowest EERs for any value of n are obtained when all observations are considered; ignoring either unknown applications or unforeseen observations makes the verification task harder. Also, for practical purposes, ignoring samples causes latency in decision making, which can greatly reduce the recall of an active authentication system.
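The distance above can be sketched in Python. This is a hedged reimplementation of Algorithm 3: the per-observation cost structure (1, 2, or 3 substitutions) follows the paper, while the sliding-window minimum over the longer training sequence and the names used here are assumptions.

```python
# Sketch of the modified edit distance (M-ED). Each observation is an
# (app, timezone, day_flag) triple; substitution cost is 1 if only the
# timezone or the day flag differs, 2 if both differ but the app matches,
# and 3 if the app itself differs. The test sequence S is slid over the
# training sequence T and the smallest window cost is returned.

def med_cost(obs_t, obs_s):
    app_t, tz_t, day_t = obs_t
    app_s, tz_s, day_s = obs_s
    if obs_t == obs_s:
        return 0                      # exact match, no operation needed
    if app_t == app_s:
        # 1 if exactly one of timezone/day differs, 2 if both differ.
        return 1 if (tz_t == tz_s or day_t == day_s) else 2
    return 3                          # no match at all

def m_ed(T, S):
    n = len(S)
    assert len(T) >= n, "training sequence must be at least as long as S"
    return min(
        sum(med_cost(T[i + j], S[j]) for j in range(n))
        for i in range(len(T) - n + 1)
    )

# Toy sequences (hypothetical app names).
T = [("mail", 1, 0), ("maps", 2, 0), ("mail", 1, 1)]
S = [("maps", 2, 0), ("chat", 1, 1)]
print(m_ed(T, S))
```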

n | All Obs. | No Unknown Apps. | No Unforeseen Obs.
20 | 43.20 | 49.22 | 48.96
30 | 39.03 | 44.72 | 46.70
40 | 36.97 | 43.64 | 45.01
50 | 35.53 | 42.19 | 44.16
60 | 34.31 | 42.47 | 43.29
TABLE V: Performance of the M-ED algorithm in terms of EER (%) for three types of test sequences: all observations, all except those with unknown applications, and all except unforeseen observations. Experiment performed on the UMDAA-02 dataset with the sampling rate fixed at 1/30 s^-1.

In the next section, some suitable modeling approaches for the application usage-based active authentication problem are discussed.

V Suitable Modeling Techniques

In light of the outcomes of the experiments presented in the previous section, it can be asserted that application-usage-based verification models must be capable of taking unknown applications and unforeseen observations into account. A popular approach to modeling temporal data sequences is to use state-space models, such as Mobility Markov Chains or Hidden Markov Models (HMMs), which capture the time variation of the data. However, these methods cannot handle unforeseen events by default: any unforeseen event will be given a zero emission probability, so the models degenerate into something like the binary decision model discussed earlier. Simple modifications can improve the usability of these methods when unforeseen events are present, as discussed in [22] for geo-location data. In this paper, the three state-space models described in [22], namely Markov Chain (MC)-based verification, HMM with Laplacian smoothing (HMM-lap), and the Marginally Smoothed HMM (MSHMM), are applied to the application-usage-based verification task and their performances are compared.

For the MC method, the prior probability of unknown and unforeseen events is set to a very small nonzero value (Laplace smoothing) when training a model for observation sequences of length n. For MC, the probability of transitioning to an observation state o_t depends only on the last observation state o_{t−1}, i.e.,

P(o_t | o_1, o_2, …, o_{t−1}) = P(o_t | o_{t−1})    (5)

If the prior probability of entering state o_1 is π(o_1) with respect to the set of observations for user-x, then the total probability of traversing any sequence of n consecutive observations is calculated as

P(o_1, o_2, …, o_n) = π(o_1) ∏_{t=2}^{n} P(o_t | o_{t−1})    (6)
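A minimal Python sketch of the MC scoring of Eqs. (5) and (6) with Laplace smoothing, so that unknown applications and unforeseen transitions receive a small nonzero probability instead of zero. The smoothing constant, the `<UNK>` catch-all state, and the log-domain scoring are implementation assumptions, not details from the paper.

```python
# Markov-chain verification sketch: log-probability of a test sequence
# under a first-order transition model trained on one user's data,
# with Laplace (add-alpha) smoothing for unforeseen events.
import math
from collections import Counter, defaultdict

def train_mc(sequence, alpha=1.0):
    states = set(sequence) | {"<UNK>"}          # reserve a state for unknowns
    trans = defaultdict(Counter)
    for prev, cur in zip(sequence, sequence[1:]):
        trans[prev][cur] += 1
    prior = Counter(sequence)
    return states, trans, prior, alpha

def log_score(model, test_seq):
    states, trans, prior, alpha = model
    V = len(states)
    mapped = [s if s in states else "<UNK>" for s in test_seq]
    # Smoothed prior of the first observation, Eq. (6) leading factor.
    lp = math.log((prior[mapped[0]] + alpha) / (sum(prior.values()) + alpha * V))
    # Smoothed transition probabilities, Eq. (5), for the rest.
    for prev, cur in zip(mapped, mapped[1:]):
        row = trans[prev]
        lp += math.log((row[cur] + alpha) / (sum(row.values()) + alpha * V))
    return lp

model = train_mc(["mail", "maps", "mail", "chat", "mail"])
own = log_score(model, ["mail", "chat", "mail"])
other = log_score(model, ["game", "game", "game"])   # unforeseen apps
print(own > other)
```

Note that without the smoothing, the `other` sequence would receive probability zero, reproducing the hard binary behavior criticized above.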

Similar to the MC method, the HMM-lap method applies Laplacian smoothing to the emission probabilities of an HMM in order to incorporate unforeseen observations, as discussed in [22]. The number of hidden states and the maximum number of iterations are kept fixed for all the experiments.

The most suitable approach for handling unforeseen observations is the Marginally Smoothed Hidden Markov Model (MSHMM) introduced in [22]. To adapt the approach to the active authentication problem, the marginal probabilities of the presence of an application in the training sequence of a user are precomputed for each timezone and day flag. Assuming that the probability of user-x using an application at a given timezone is independent of the probability of user-x using that application on a given day, the emission probability from a state to an observation takes its unsmoothed value if the observation was emitted from that state during training. Otherwise,

(7)

By definition, the MSHMM approach is capable of differentiating between unknown applications and unforeseen observations involving known applications, as well as between the more frequent and less frequent applications occurring at different timezones and days.
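The marginal-smoothing idea can be illustrated with a short sketch: precompute, per timezone and per day flag, how often each application occurs in a user's training data, and use the product of the two marginals (per the independence assumption above) as a smoothed emission value for observations not seen in training. This is an illustration of the idea only, not the exact MSHMM formulation of [22]; the fallback floor and all names here are assumptions.

```python
# Marginal emission smoothing sketch for (app, timezone, day_flag) triples.
from collections import Counter

def marginal_tables(train_obs):
    """Count app occurrences per timezone and per day flag."""
    by_tz, by_day = Counter(), Counter()
    tz_tot, day_tot = Counter(), Counter()
    for app, tz, day in train_obs:
        by_tz[(app, tz)] += 1
        tz_tot[tz] += 1
        by_day[(app, day)] += 1
        day_tot[day] += 1
    return by_tz, tz_tot, by_day, day_tot

def smoothed_emission(obs, tables, floor=1e-6):
    """Marginal-product emission for an observation unseen from a state."""
    app, tz, day = obs
    by_tz, tz_tot, by_day, day_tot = tables
    p_tz = by_tz[(app, tz)] / tz_tot[tz] if tz_tot[tz] else 0.0
    p_day = by_day[(app, day)] / day_tot[day] if day_tot[day] else 0.0
    # Unknown app: both marginals are zero -> fall back to a small floor.
    return max(p_tz * p_day, floor)

train = [("mail", 1, 0), ("mail", 1, 1), ("maps", 2, 0)]
tables = marginal_tables(train)
# A known app at a seen timezone/day scores higher than an unknown app.
print(smoothed_emission(("mail", 1, 0), tables) >
      smoothed_emission(("game", 1, 0), tables))
```

This captures the property stated above: a known application in an unforeseen context still receives a graded probability, while a fully unknown application falls back to the floor.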

In the next section, experimental results for these three verification methods are discussed in detail for performance comparison.

n | Method | 1/5 | 1/10 | 1/15 | 1/20 | 1/25 | 1/30
20 | M-ED | 42.96 | 42.92 | 44.12 | 43.64 | 43.09 | 43.2
20 | MMC | 40.86 | 40.53 | 40.27 | 39.48 | 40.39 | 36.78
20 | HMM-lap | 38.49 | 38.35 | 37.82 | 37.39 | 38.83 | 36.77
20 | MSHMM | 37.3 | 37.3 | 36.67 | 35.93 | 35.63 | 34.82
30 | M-ED | 42.7 | 41.71 | 40.18 | 38.17 | 37.58 | 39.03
30 | MMC | 40.29 | 39.18 | 38.21 | 40 | 39.04 | 36.82
30 | HMM-lap | 37.28 | 37.2 | 36.68 | 37.73 | 37.89 | 37.45
30 | MSHMM | 36.23 | 36.87 | 35.74 | 36.87 | 35.99 | 35.79
40 | M-ED | 41.7 | 38.64 | 38.41 | 38.13 | 37.45 | 36.97
40 | MMC | 39.29 | 40.57 | 38.13 | 39.62 | 41.97 | 35.89
40 | HMM-lap | 37.37 | 37.88 | 36.75 | 36.07 | 39.11 | 34.62
40 | MSHMM | 35.4 | 35.65 | 34.026 | 34.4 | 36.58 | 32.54
50 | M-ED | 40.69 | 37.98 | 36.19 | 35.55 | 35.58 | 35.53
50 | MMC | 40.34 | 37.92 | 38.67 | 36.96 | 39.57 | 33.56
50 | HMM-lap | 36.97 | 36.01 | 36.48 | 34.72 | 36.7 | 33.95
50 | MSHMM | 35.95 | 34.41 | 34.67 | 32.41 | 35.27 | 30
60 | M-ED | 38.69 | 35.93 | 35.32 | 35.72 | 34.97 | 34.31
60 | MMC | 38.33 | 37.5 | 37.5 | 38.01 | 35.91 | 34.35
60 | HMM-lap | 35.31 | 35.48 | 34.18 | 33.15 | 36.05 | 34.35
60 | MSHMM | 34.036 | 34.92 | 32.78 | 33.33 | 34.3 | 31.93
TABLE VI: Application-usage-based verification performance comparison on the UMDAA-02 dataset across methods, in terms of EER (%), for varying sequence length (n) and sampling rate (samples per second). The number of hidden states and the maximum number of iterations are fixed for the HMM-based methods.

Vi Experimental Results

The performances of the M-ED, MMC, HMM-lap, and MSHMM algorithms on the full test sequences of the UMDAA-02 application-usage dataset are shown in Table VI, where the sampling rate is varied from one sample every 5 seconds to one sample every 30 seconds in steps of 5 seconds, while the number of previous observations n is varied from 20 to 60 in steps of 10. It can be seen from the table that with a lower sampling rate and a larger n, the EER drops for all the methods. MSHMM outperforms every other method in every case, which can be attributed to its improved modeling capability due to marginal smoothing. For a practical verification system, the sampling rate and the value of n determine the latency of decision making. In many cases, one sample every 30 seconds might be too slow, and therefore the system designer should choose these parameters carefully.

Fig. 7: Average change in MSHMM scores in response to intrusion on the UMDAA-02 application-usage data.

As for n, intuitively the performance should always improve with more historical data. To determine the impact of n and to get an idea of the latency of MSHMM when an intrusion occurs, a different experiment was performed in which another user’s data is appended to the legitimate user’s data to simulate an intrusion. To be more precise, for each user of the UMDAA-02 dataset, consecutive observations from the test sequence starting at a random index are appended with consecutive observations from the test sequences of all the other users (start index picked randomly), and the whole sequence is evaluated using MSHMM for different values of n. The average score values across all users are plotted in Fig. 7 for the different n values. When the observations from a different user start to enter a batch, the average score returned by MSHMM for each batch drops sharply, as can be seen from the figure. The figure also clearly shows that the drop is larger for larger n values, confirming the intuition that considering more historical data is advantageous in this regard. As for latency, if an appropriate score threshold is used for decision making, then for all n the intrusion will be detected within a few batches, i.e., within 2.5 minutes from the start of the intrusion.
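The intrusion experiment can be mimicked with a toy sketch: score the observation stream batch by batch under a model of the legitimate user and watch the score drop once another user's data enters. A simple smoothed unigram model stands in for MSHMM here, and the batch size, app names, and threshold idea are all assumptions for illustration.

```python
# Toy intrusion-latency sketch: per-batch average log-score of an
# observation stream under the legitimate user's (smoothed unigram) model.
import math
from collections import Counter

def batch_scores(model_counts, stream, batch=5):
    total = sum(model_counts.values())
    V = len(model_counts) + 1                 # +1 for unseen apps
    scores = []
    for i in range(0, len(stream), batch):
        chunk = stream[i:i + batch]
        # Add-one smoothing so unseen apps get a small nonzero probability.
        lp = sum(math.log((model_counts[a] + 1) / (total + V)) for a in chunk)
        scores.append(lp / len(chunk))        # per-observation log-score
    return scores

user_model = Counter(["mail", "maps", "mail", "chat", "mail", "mail"])
legit = ["mail", "chat", "mail", "maps", "mail"]
intruder = ["game", "bank", "game", "game", "bank"]
scores = batch_scores(user_model, legit + intruder)
print(scores[1] < scores[0])                  # intruder batch scores lower
```

Thresholding such per-batch scores is the mechanism behind the detection-latency argument: the intrusion is flagged as soon as a batch score falls below the chosen threshold.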

Finally, for the Securacy dataset, the performances of MSHMM, HMM-lap, MMC, and M-ED are presented in Table VII. As with the UMDAA-02 results, MSHMM outperforms the other methods by a good margin. Note that the EER values are much lower for this dataset for the state-space models, which is understandable since it has already been demonstrated in Fig. 3(c) that the users are quite separable in this dataset even when only application names are considered. However, M-ED has difficulty exploiting the separability of the observations since it is not capable of modeling temporal variations as effectively as the state-space models.

n | MSHMM | MMC | HMM-lap | M-ED
20 | 17.23 | 19.286 | 19.66 | 35.09
30 | 16.75 | 18.9967 | 19.59 | 32.88
40 | 16.38 | 18.7074 | 19.19 | 31.4
50 | 16.26 | 17.9475 | 19.22 | 30.53
60 | 16.16 | 17.6443 | 18.38 | 30.58
TABLE VII: App-based verification EER (%) comparison for the Securacy dataset across the different methods [22] for different values of n. The number of hidden states and the sampling rate are fixed.

Based on the results of the experiments presented in this paper, it can be asserted that application-usage data can be useful as a soft biometric for user verification, bolstering the decision in a multi-modal authentication scenario. Given that application-usage data is readily available and easy to track without much battery or computational cost, real-time score generation is possible. The experiments also show that the verification scores change rapidly, within a few minutes, in response to an intrusion; hence, the latency is not too high for a soft biometric measure. However, even though state-space models can be made to work well with some modifications, the equal error rate for a diverse dataset is still around 16%, which needs further improvement. In this regard, larger training datasets and longer usage histories might be helpful. In addition, if the computational constraints can be relaxed, then more sophisticated high-performance methods such as deep neural networks can be employed to minimize the EER.

Vii Conclusion

In this paper, the challenging problem of active authentication using application usage data has been formulated and systematically tackled to obtain viable solutions. Through several experiments, the impact of unknown applications and unforeseen observations on the authentication problem has been investigated, and it is established that, for this problem, inclusion of the uncertain events is necessary to obtain better performance. In this regard, a modified edit distance algorithm has been introduced, and its performance is compared, in terms of EER, with three state-space models, namely the Markov Chain, HMM with Laplacian Smoothing, and the Marginally-Smoothed HMM. Experiments were performed on the UMDAA-02 and the Securacy application-usage datasets and revealed some very interesting insights about the differences between the two. The paper also addressed important practical considerations such as intrusion detection, latency, observation history, and sampling rate. As for future work, the M-ED method might be further improved by varying the distances for the three different cases based on the marginal probabilities. Also, recurrent neural network (RNN)-based models might be able to learn more discriminative properties of application-usage patterns; however, RNNs require huge amounts of data for useful training, which the two datasets presented here lack. Another interesting research direction would be the joint training of application sequences with some other sequential data, such as location data, to improve the authentication performance. Finally, since application information is also a suitable context for other modalities, application data sequences can have dual utilization (as a separate modality and also as context) in more advanced active authentication schemes.

Acknowledgment

This work is partially funded by the Emil Aaltonen Foundation (170117 KO), the Finnish Foundation for Technology Promotion (PoDoCo program), Academy of Finland (Grants 286386-CPDSS, 285459-iSCIENCE, 304925-CARE, 313224-STOP, SENSATE), and Marie Skłodowska-Curie Actions (645706-GRAGE)

References

  • [1] Spotlight on consumer app usage part-1. Technical report, App Annie, 2017.
  • [2] H. Cao and M. Lin. Mining smartphone data for app usage prediction and recommendations: A survey. Pervasive and Mobile Computing, 37:1 – 22, 2017.
  • [3] K. Church, D. Ferreira, N. Banovic, and K. Lyons. Understanding the challenges of mobile phone usage data. In Proceedings of the 17th International Conference on Human-Computer Interaction with Mobile Devices and Services, MobileHCI ’15, pages 504–514, New York, NY, USA, 2015. ACM.
  • [4] H. Falaki, R. Mahajan, S. Kandula, D. Lymberopoulos, R. Govindan, and D. Estrin. Diversity in smartphone usage. In Proceedings of the 8th International Conference on Mobile Systems, Applications, and Services, MobiSys ’10, pages 179–194, New York, NY, USA, 2010. ACM.
  • [5] M. E. Fathy, V. M. Patel, and R. Chellappa. Face-based active authentication on mobile devices. In IEEE Int. Conf. Acoust., Speech and Signal Process. (ICASSP), 2015.
  • [6] T. Feng, Z. Liu, K.-A. Kwon, W. Shi, B. Carbunar, Y. Jiang, and N. Nguyen. Continuous mobile authentication using touchscreen gestures. In Homeland Security (HST), 2012 IEEE Conf. on Technologies for, pages 451–456, Nov. 2012.
  • [7] T. Feng, J. Yang, Z. Yan, E. M. Tapia, and W. Shi. Tips: Context-aware implicit user identification using touch screen in uncontrolled environments. In Proceedings of the 15th Workshop on Mobile Computing Systems and Applications, HotMobile ’14, pages 9:1–9:6, New York, NY, USA, 2014. ACM.
  • [8] D. Ferreira, E. Ferreira, J. Goncalves, V. Kostakos, and A. K. Dey. Revisiting human-battery interaction with an interactive battery interface. In Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp ’13, pages 563–572, New York, NY, USA, 2013. ACM.
  • [9] D. Ferreira, V. Kostakos, A. R. Beresford, J. Lindqvist, and A. K. Dey. Securacy: An empirical investigation of android applications’ network usage, privacy and security. In ACM Conference on Security & Privacy in Wireless and Mobile Networks, WiSec ’15, pages 11:1–11:11. ACM, 2015.
  • [10] M. Frank, R. Biedert, E. Ma, I. Martinovic, and D. Song. Touchalytics: On the applicability of touchscreen input as a behavioral biometric for continuous authentication. IEEE Transactions on Information Forensics and Security, 8(1):136–148, Jan. 2013.
  • [11] L. Fridman, S. Weber, R. Greenstadt, and M. Kam. Active authentication on mobile devices via stylometry, application usage, web browsing, and GPS location. IEEE Systems Journal, 11(2):513–521, 2017.
  • [12] S. Gambs, M.-O. Killijian, and M. N. n. del Prado Cortez. Show me how you move and i will tell you who you are. In Proc. 3rd ACM SIGSPATIAL Int. Workshop on Security and Privacy in GIS and LBS, SPRINGL ’10, pages 34–41, 2010.
  • [13] S. Gupta, A. Buriro, and B. Crispo. Demystifying authentication concepts in smartphones: Ways and types to secure access. Mobile Information Systems, 2018:16 pages, 2018.
  • [14] A. Hadid, J. Heikkila, O. Silven, and M. Pietikainen. Face and eye detection for person authentication in mobile phones. In Distributed Smart Cameras. ICDSC ’07. First ACM/IEEE Int. Conf., pages 101–108, Sept. 2007.
  • [15] S. L. Jones, D. Ferreira, S. Hosio, J. Goncalves, and V. Kostakos. Revisitation analysis of smartphone app use. In Proceedings of the 2015 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp ’15, pages 1197–1208, New York, NY, USA, 2015. ACM.
  • [16] J. Kaasila, D. Ferreira, V. Kostakos, and T. Ojala. Testdroid: Automated remote ui testing on android. In Proceedings of the 11th International Conference on Mobile and Ubiquitous Multimedia, MUM ’12, pages 28:1–28:4, New York, NY, USA, 2012. ACM.
  • [17] H. Khan and U. Hengartner. Towards application-centric implicit authentication on smartphones. In Proceedings of the 15th Workshop on Mobile Computing Systems and Applications, HotMobile ’14, pages 10:1–10:6, New York, NY, USA, 2014. ACM.
  • [18] V. Kostakos, D. Ferreira, J. Goncalves, and S. Hosio. Modelling smartphone usage: A markov state transition model. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp ’16, pages 486–497, New York, NY, USA, 2016. ACM.
  • [19] W.-H. Lee, X. Liu, Y. Shen, H. Jin, and R. B. Lee. Secure pick up: Implicit authentication when you start using the smartphone. In Proceedings of the 22Nd ACM on Symposium on Access Control Models and Technologies, SACMAT ’17 Abstracts, pages 67–78, New York, NY, USA, 2017. ACM.
  • [20] H. Li, X. Liu, and Q. Mei. Predicting smartphone battery life based on comprehensive and real-time usage data. CoRR, abs/1801.04069, 2018.
  • [21] Y. Liu and A. Simpson. Privacy-preserving targeted mobile advertising: requirements, design and a prototype implementation. Software: Practice and Experience, 46(12):1657–1684.
  • [22] U. Mahbub and R. Chellappa. Path: Person authentication using trace histories. In 2016 IEEE 7th Annual Ubiquitous Computing, Electronics Mobile Communication Conference (UEMCON), pages 1–8, Oct 2016.
  • [23] U. Mahbub, S. Sarkar, V. M. Patel, and R. Chellappa. Active user authentication for smartphones: A challenge data set and benchmark results. In Biometrics Theory, Applications and Systems (BTAS), 2016 IEEE 7th Int. Conf., Sep. 2016.
  • [24] U. Mahbub, S. Sarkar, V. M. Patel, and R. Chellappa. Active user authentication for smartphones: A challenge data set and benchmark results. In IEEE International Conference on Biometrics Theory, Applications and Systems (BTAS), 2016.
  • [25] A. Mahfouz, T. M. Mahmoud, and A. S. Eldin. A survey on behavioral biometric authentication on smartphones. Journal of Information Security and Applications, 37:28 – 37, 2017.
  • [26] S. Mondal and P. Bours. Does context matter for the performance of continuous authentication biometric systems? An empirical study on mobile device. In International Conference of the Biometrics Special Interest Group (BIOSIG), pages 1–5, 2015.
  • [27] R. Murmuria, A. Stavrou, D. Barbará, and D. Fleck. Continuous authentication on mobile devices using power consumption, touch gestures and physical movement of users. In International Workshop on Recent Advances in Intrusion Detection, pages 405–424. Springer, 2015.
  • [28] N. Neverova, C. Wolf, G. Lacey, L. Fridman, D. Chandra, B. Barbello, and G. Taylor. Learning human identity from motion patterns. IEEE Access, 4:1810–1820, 2016.
  • [29] V. M. Patel, R. Chellappa, D. Chandra, and B. Barbello. Continuous user authentication on mobile devices: Recent progress and remaining challenges. IEEE Signal Processing Magazine, 33(4):49–61, July 2016.
  • [30] A. Primo, V. Phoha, R. Kumar, and A. Serwadda. Context-aware active authentication using smartphone accelerometer measurements. In Comput. Vision and Pattern Recognition Workshops, IEEE Conf., pages 98–105, June 2014.
  • [31] O. Riva, C. Qin, K. Strauss, and D. Lymberopoulos. Progressive authentication: Deciding when to authenticate on mobile phones. In Proc. of the 21st USENIX Conf. on Security Symp., Security’12, pages 15–15, Berkeley, CA, USA, 2012. USENIX Association.
  • [32] P. Samangouei, V. M. Patel, and R. Chellappa. Facial attributes for active authentication on mobile devices. Image and Vision Computing, 58:181 – 192, 2017.
  • [33] A. Shema and D. E. Acuna. Show me your app usage and i will tell who your close friends are: Predicting user’s context from simple cellphone activity. In Proceedings of the 2017 CHI Conference Extended Abstracts on Human Factors in Computing Systems, CHI EA ’17, pages 2929–2935, New York, NY, USA, 2017. ACM.
  • [34] V. Srinivasan, S. Moghaddam, A. Mukherji, K. K. Rachuri, C. Xu, and E. M. Tapia. Mobileminer: Mining your frequent patterns on your phone. In Proceedings of the 2014 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp ’14, pages 389–400, New York, NY, USA, 2014. ACM.
  • [35] A. Tongaonkar, S. Dai, A. Nucci, and D. Song. Understanding mobile app usage patterns using in-app advertisements. In M. Roughan and R. Chang, editors, Passive and Active Measurement, pages 63–72, Berlin, Heidelberg, 2013. Springer Berlin Heidelberg.
  • [36] N. van Berkel, C. Luo, T. Anagnostopoulos, D. Ferreira, J. Goncalves, S. Hosio, and V. Kostakos. A systematic assessment of smartphone usage gaps. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI ’16, pages 4711–4721, New York, NY, USA, 2016. ACM.
  • [37] P. Welke, I. Andone, K. Blaszkiewicz, and A. Markowetz. Differentiating smartphone users by app usage. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, UbiComp ’16, pages 519–523, New York, NY, USA, 2016. ACM.
  • [38] Q. Xu, J. Erman, A. Gerber, Z. Mao, J. Pang, and S. Venkataraman. Identifying diverse usage behaviors of smartphone apps. In Proceedings of the 2011 ACM SIGCOMM Conference on Internet Measurement Conference, IMC ’11, pages 329–344, New York, NY, USA, 2011. ACM.
  • [39] Y. Xu, M. Lin, H. Lu, G. Cardone, N. Lane, Z. Chen, A. Campbell, and T. Choudhury. Preference, context and communities: A multi-faceted approach to predicting smartphone app usage patterns. In Proceedings of the 2013 International Symposium on Wearable Computers, ISWC ’13, pages 69–76, New York, NY, USA, 2013. ACM.
  • [40] H. Zhang, V. M. Patel, and R. Chellappa. Robust multimodal recognition via multitask multivariate low-rank representations. In IEEE Int. Conf. Automat. Face and Gesture Recogn. IEEE, 2015.