The House That Knows You: User Authentication Based on IoT Data

08/01/2019 ∙ by Talha Ongun, et al. ∙ Northeastern University ∙ Visa

Home-based Internet of Things (IoT) devices have gained in popularity, and many households have become "smart" through devices such as smart sensors, locks, and voice-based assistants. Given the limitations of existing authentication techniques, we explore new opportunities for user authentication in smart home environments. Specifically, we design a novel authentication method based on behavioral features extracted from user interactions with IoT devices. We perform an IRB-approved user study in the IoT lab at our university over a period of three weeks. We collect network traffic from multiple users interacting with 15 IoT devices in our lab and extract a large number of features to capture user activity. We experiment with multiple classification algorithms and also design an ensemble classifier with two models using disjoint sets of features. We demonstrate that our ensemble model can classify six users with an F1 score of 0.86, and five users with an F1 score of 0.97.


1. Introduction

User authentication is one of the major challenges in security, and while a diverse set of solutions has been proposed, the quest for the perfect solution continues. Traditional user authentication methods include passwords, hardware tokens, and biometrics. Given the limitations of passwords (Habib et al., 2018; Melicher et al., 2016; Ur et al., 2016), several well-known techniques are used in practice to augment them, such as multi-factor authentication based on either a hardware token or a different channel (e.g., email, SMS). These methods impede the usability of authentication and introduce additional latency (Siadati et al., 2017; De Cristofaro et al., 2013). Biometrics-based authentication has been extensively used in various forms (facial recognition, fingerprints, or retina scans), but revocation and user impersonation are well-recognized limitations of these methods.

The omnipresence of IoT devices such as sensors, locks, entertainment devices, and voice-based personal assistants has made authentication in smart households more complex, bringing new challenges. IoT devices are fundamentally different from personal devices, and traditional authentication methods are either not applicable or break the natural flow of interaction with these devices. Deploying a screen or input pad for each of these devices is impractical or costly. Some devices, such as the Amazon Echo Dot, can perform voice-based biometric authentication as an optional feature. However, this feature can be circumvented because voices can be spoofed, and users might not be comfortable with their voices being recognized due to privacy concerns (Naeini et al., 2017). Therefore, better authentication methods are needed in the context of smart homes.

One of the opportunities in solving this problem in smart home environments is the relatively small number of users for which authentication must be provided. Most American households range in size from two to six members. Multi-class classification algorithms can be trained for such small sets, and with minimal effort it can be ensured that adversaries do not interfere at training time. However, there are several challenges in designing user authentication methods for household IoT devices, including supporting a diverse set of devices with minimal changes to their operation, maintaining user privacy, and supporting a diverse set of policies. Data that captures detailed user behavior (such as device logs) tends to be device-specific and more invasive in terms of privacy.

Our approach addresses these challenges by using the IoT devices' network traffic from the home router as the data source. In order to preserve user privacy, we collect and leverage the information in the headers of HTTPS packets for our authentication system. This data includes minimal information (timing, ports, bytes sent and received), with low risk to user privacy. Additionally, to protect user privacy we make the design choice of performing the model training and testing locally in the user's home, rather than remotely in the cloud. This implies that the authentication module needs to reside in the smart home. To address the mismatch between user-level semantics and network-level semantics, we observe the network traffic for a continuous time window and design a machine learning (ML) model that uses features aggregated over a recent time window to predict the user in the room. Finally, we train a multi-class classifier to predict the likelihood that a certain user is in the room. The model is trained on historical data obtained in controlled settings (knowing which user is in the room) and then used at testing time to classify users by generating, in real time, an authentication score based on the most recently observed activity.

We leverage the IoT lab at our institution set up as a studio apartment and purchase a set of 15 IoT devices from different categories, including voice assistants, smart kitchen appliances, and entertainment devices. We make use of the existing monitoring infrastructure in the lab to collect network packets (pcap files) from all these devices. We design an IRB-approved user study with multiple users participating in data collection over a period of three weeks. During the training period, we use labeled data of user sessions and extract features from HTTPS headers that capture user interaction with the IoT devices over a continuous time window. We design and test several ML classification algorithms for predicting the likelihood that a certain user is in the room. During the regular user interaction with the IoT devices, the authentication module continuously receives data extracted from the most recent observation window and computes an authentication score for each user. The score is made available to upper-level authorization systems that can implement flexible policies according to the level of risk tolerated. For example, these policies can be used for local services such as log in to a laptop, to perform operations on IoT devices, or to access cloud services, such as performing financial transactions.

Our experimental evaluation shows that a Gradient Boosting classifier achieves 81% precision and 80% recall at classifying six users, based on IoT device features. For five users, precision and recall are both 92%, for a 25-minute observation window. We also design a high-confidence ensemble classifier with two models trained on disjoint sets of features (one on 420 device features and the second on 2910 domain features). The ensemble generates an authentication score only when the two models agree on the prediction. An ensemble of two Gradient Boosting models achieves an F1 score of 0.86 for six users and 0.97 for five users (at the cost of not generating scores in 16 out of 91 sessions).

The rest of the paper is organized as follows. We define our problem and adversarial model in Section 2. We describe our system design in Section 3 and our user study and data collection in Section 4. We present the evaluation in Section 5 and related work in Section 6. We conclude the paper in Section 7 and include some additional results in the Appendix.

2. Preliminaries and Background

In this section we provide an overview of methods for user authentication, define the problem we are solving, and describe the adversarial model we consider in this work.

2.1. User Authentication

User authentication is a fundamental security process that has been studied extensively throughout the years. Traditional user authentication methods are based on what users know (e.g., passwords), what they have (e.g., hardware tokens), or what they are (e.g., biometrics). Even though extensive research shows password limitations (Habib et al., 2018; Melicher et al., 2016; Ur et al., 2016), passwords are still the de facto standard for authentication in corporate and home environments. Users are known to pick weak passwords, and they tend to reuse them across different services (wor, 2019). Critical services do not always deploy proper password data protection techniques, resulting in increased risk for users in the face of data breaches (fac, 2019). Several well-known techniques are used in practice to augment passwords. Multi-factor authentication methods use either a hardware token or a different channel (e.g., email, SMS) to increase the confidence of authenticating users. These methods impede the usability of authentication and introduce additional latency (Siadati et al., 2017; De Cristofaro et al., 2013). Biometrics-based authentication methods, such as keystroke dynamics, fingerprints, facial recognition, and voice recognition, provide a convenient and reliable way to identify users (Wayman et al., 2005; Meng et al., 2015; Blasco et al., 2016). However, revocation and user impersonation are well-known challenges for these methods, in addition to their privacy implications for users (Wolf et al., 2019).

Behavior-based authentication provides a safer and more convenient way to identify users, based on how they behave. Such methods build behavioral models from the unique way users interact with devices like smartphones, tablets, or touch screens. In the context of mobile devices, several systems for continuous authentication have been designed, for instance (Shi et al., 2010; Riva et al., 2012). They leverage a number of behavioral features extracted from user interaction with mobile devices, such as call logs, browser history, application usage, and location. User authentication based on the WiFi signals of IoT devices has also been investigated recently (Shi et al., 2017). These systems can be used within a multi-factor authentication system to further increase the confidence of user authentication.

2.2. Problem Definition

With the widespread use of IoT devices in many households, new challenges and opportunities emerge. We pose for the first time the problem of designing usable, behavior-based authentication models built on users naturally interacting with IoT devices in their homes or familiar environments. We are interested in designing a lightweight authentication module for smart homes that uses information extracted from monitoring user interaction with IoT devices. The system should be able to identify with high accuracy each user from a fixed set of known users based on their behavioral characteristics. With user privacy as the primary consideration in our design, we intend to leverage a minimal number of attributes that are universally applicable to a range of IoT devices from different manufacturers. We consider here mainly authentication based on behavioral features extracted from network traffic generated by IoT devices in a smart home. We envision that such a system would learn from historical observations and create user profiles during a training phase. At testing time, the authentication module generates an authentication score mapped to the likelihood that a certain user is in the room, based on the most recent observation of IoT device activity. The authentication module needs to observe users during a contiguous observation window to capture a sequence of user actions. The authentication scores can be used by higher-level applications to create flexible authorization policies that support various levels of risk.

2.3. Adversarial Model

We consider two main classes of adversaries:

Local adversaries. This adversary might be in close proximity to, or even inside, the home environment, and interact with some of the IoT devices in the home. This model captures visitors, neighbors, kids, or roommates sharing the home. We would like to prevent such an adversary from performing actions like changing smart lock settings or making purchases on Alexa on behalf of real users.

Remote adversaries. Remote adversaries might not have direct access to IoT devices, but they could compromise user machines and accounts by leveraging potential vulnerabilities in user devices. This adversary is motivated to impersonate the real user to gain access to sensitive devices in the home or perform more harmful actions. Remote attackers can also inject network traffic into the smart home. This is possible via a malicious app installed on a user's smartphone, or an adversary sending traffic to the home router. The adversary is aware of standard network protocols and applications, but does not have full knowledge of the exact user behavior.

We assume that adversaries are not in the system when ML training is performed. A training-time attack against ML models is called a poisoning attack (e.g., (Xiao et al., 2015; Biggio et al., 2012; Steinhardt et al., 2017)) and protecting against it is orthogonal to our work. This guarantees that the trained ML models for user authentication are trustworthy. Both types of adversaries might act at testing time with the goal of influencing the decision of the ML model. We assume that adversaries have relatively limited interaction with the IoT home devices and they cannot perfectly imitate users in their behavior.

We envision that data collection and the authentication module reside in a trusted device, like the home router or a server. Compromising the authentication system by physically accessing it or by intercepting the communications is not within our scope.

3. System design

In this section we describe the goals motivating the design of our system and then give an overview of our approach.

3.1. Design Goals

In designing our system for user authentication in smart homes, we set forward the following design goals:

Operate in a smart home environment with multiple users. We intend our system to be used in a smart home environment with several users actively interacting with diverse IoT devices. The average American household size in 2018 was 2.53 members, with most households (98.66%) having at most six members. Our system should provide good accuracy at the task of classifying users from a small set of previously seen users. Multi-class classifiers can be trained to predict the likelihood that a certain user is in the room.

Support multiple heterogeneous IoT devices. Compared to traditional computing platforms, IoT devices are extremely heterogeneous and diverse. This raises the challenge of designing a flexible, general monitoring platform that can support devices from different manufacturers and can be easily extended as new devices become available. The available data usually consists of event logs and network traffic. We observe that network traffic is much more general and more likely to support the diversity of IoT devices.

Maintain user privacy. User privacy is an important design consideration, given the nature of the applications and the environment where users perform their actions. We intend our system to work without access to users' personally identifiable information (PII). Most network traffic of IoT devices is over HTTPS, with encrypted payloads in the transmitted packets.

Capture user actions for behavioral modeling. User behavior is defined by their interaction with the devices and the applications. User (and application) behavior results in several packets disseminated over the network. An approach based on more general information, like network traffic, needs to map low-level information from network packets to higher-level user actions (for instance, brew a coffee or change the smart lock settings) in order to capture user behavior. This creates a challenge as a single action could generate many packets addressed to different external destinations.

Measure authentication confidence. Most systems for access control and authorization need to make an ultimate decision whether a user is allowed to access a critical resource or not. Our goal is to create an authentication service that generates authentication scores on a continuous basis. These scores should provide a measure of authentication confidence, which is leveraged by higher-level authorization systems. For instance, to perform a high-value financial transaction, the authorization system could set the authentication score threshold at 0.95 and require another authentication factor (such as SMS). We believe that reliable authentication measures as we aim to design can greatly augment the flexibility of current authorization systems.
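To make this concrete, the following is a minimal sketch of how such a threshold-based policy layer could consume the scores our module produces; the policy table, action names, and the authorize function are hypothetical illustrations, not part of our system.

```python
# Hypothetical policy layer consuming continuous authentication scores.
POLICY_THRESHOLDS = {
    "laptop_login": 0.70,           # local, lower-risk action
    "iot_device_control": 0.80,     # e.g., change smart lock settings
    "financial_transaction": 0.95,  # high-value cloud operation
}

def authorize(action: str, authentication_score: float) -> str:
    """Map a continuous authentication score to an authorization decision."""
    if authentication_score >= POLICY_THRESHOLDS[action]:
        return "allow"
    # Below threshold: require another factor (e.g., SMS) instead of denying.
    return "require_second_factor"

print(authorize("financial_transaction", 0.91))  # -> require_second_factor
```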

Figure 1. Architecture diagram of the data-flow and proposed authentication process

3.2. System Overview

Our Approach. We address the challenges raised by our design goals by using the IoT devices' network traffic from the home router. In order to preserve user privacy, we collect the headers of the HTTPS packets, which include minimal information such as timing, ports, destination address, and bytes sent and received, posing low risk to user privacy. Additionally, to protect user privacy we make the design choice of performing the model training and testing locally in the user's home, rather than remotely in the cloud. This implies that the authentication module needs to reside in the smart home. To address the mismatch between user-level semantics and network-level data, we observe the network traffic for a continuous time window and design an ML model that uses features aggregated over a recent time window to predict the user in the room. Finally, we train a multi-class classifier to predict the likelihood that a certain user is in the room based on the most recent activity.

Figure 1 gives an overview of our system architecture. We installed 15 IoT devices in the IoT lab at our institution. We make use of the existing monitoring infrastructure in the lab to collect pcap files from all these devices. We design an IRB-approved user study with multiple users participating in data collection over a period of three weeks. During the training period, we use labeled data of user sessions and extract features from HTTPS headers over a continuous time window (of varying length). We design and test multiple ML classification algorithms with the aim of obtaining good accuracy at identifying users. During regular user interaction with the IoT devices, the authentication module continuously receives data extracted from the most recent observation window and computes an authentication score for each user. The authentication score can be used by upper-level authorization systems for flexible policies according to various levels of risk. For example, logging in to a local device might require a lower score than performing a financial transaction in the cloud.

Below we provide more details about the machine learning models we used and the features we selected for our multi-class classifier.

3.3. Machine Learning Design for User Classification

We design a machine learning (ML) system that: (1) learns user profiles over time, using data extracted from network traffic generated by IoT devices; and (2) computes an authentication score in real time based on the most recent observation interval. The authentication score estimates the confidence or probability of the actual user being in the room.

In more detail, the ML system learns profiles for a set of users $\mathcal{U} = \{u_1, \ldots, u_n\}$. In our specific application, user observation is performed over time, and thus the features or user attributes need to be defined over an observation window. We design the system to predict at time $t$ the probability that user $u_i$ is in the room, i.e., $P[y_t = u_i \mid X_t^{\Delta}]$, where $X_t^{\Delta}$ denotes the features extracted from the window of length $\Delta$ ending at time $t$.

Attribute Type Description
Device Name Categorical Name of the smart device (extracted from the MAC address)
Time Integer The timestamp of the packet
Packet length Integer Size of the packet in bytes
Domain Categorical Domain name of the destination, if available
Direction Categorical Direction of the packet (outgoing/incoming)
Protocol Categorical Transport protocol of the packet (TCP/UDP/ICMP)
Destination Port Categorical Port number of the destination address
Source Port Categorical Port number of the source address
Table 1. Packet-level attributes: These fields are available for each HTTPS packet from its header.
Type Feature Category Operations Description
Device/Domain Incoming packet size count, sum, min, max, std. dev., mean, median Statistics on incoming packets
Device/Domain Outgoing packet size count, sum, min, max, std. dev., mean, median Statistics on outgoing packets
Device/Domain Protocols count Packet counts per protocol (TCP, UDP, ICMP)
Device/Domain Inter-event times count, sum, min, max, std. dev., mean, median Statistics on inter-packet timing
Device Domains distinct Distinct domain count
Table 2. Features aggregated by time window.

Given an observation or feature set $X_t^{\Delta}$ computed over the time window of size $\Delta$ (the most recent history of length $\Delta$), the system aims to predict an authentication score at time $t$ for each user $u_i \in \mathcal{U}$. This can be achieved by training a multi-class classifier $f \in \mathcal{H}$, where $\mathcal{H}$ is the hypothesis space. The ML algorithm is given historical training data of labeled user activities $D = \{(X_j, y_j)\}_{j=1}^{m}$, with labels $y_j \in \mathcal{U}$ and the feature sets $X_j$ computed over observation intervals of length $\Delta$. At the end of training, a model $f \in \mathcal{H}$ is selected to minimize a certain loss function on the training set.

A probabilistic model $f$ in fact estimates the probability that a user generates the observed features in the most recent time window of length $\Delta$. The sum of the predicted probabilities is always 1: $\sum_{i=1}^{n} P[y_t = u_i \mid X_t^{\Delta}] = 1$.

At testing time, the model $f$ is applied at time $t$ to the most recent observation window (i.e., features extracted during the time interval $[t - \Delta, t]$) and generates an authentication score $s_i(t) = P[y_t = u_i \mid X_t^{\Delta}]$ for each user $u_i \in \mathcal{U}$.

We experiment with multiple ML classification algorithms, including Logistic Regression, Random Forest, and Gradient Boosting that fit our framework very well. The most challenging issue in our design is to create appropriate feature representations that capture user behavior over the recent time window and can be used to effectively differentiate multiple users with high confidence. We discuss multiple feature representations next.
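As a minimal sketch of this formulation, the snippet below trains one of these classifiers on synthetic stand-in data and turns the predicted class probabilities into per-user authentication scores; the array shapes and random data are illustrative assumptions, not our dataset.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X_train = rng.random((200, 420))   # one row per window; 420 device features
y_train = rng.integers(0, 6, 200)  # labels for six users

clf = GradientBoostingClassifier().fit(X_train, y_train)

# At test time, the most recent window yields one feature vector; the
# predicted class probabilities sum to 1 and act as per-user scores.
x_recent = rng.random((1, 420))
for user, score in zip(clf.classes_, clf.predict_proba(x_recent)[0]):
    print(f"user {user}: authentication score {score:.3f}")
```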

3.4. Feature Representation

Packet-level attributes. Our system design is based on features extracted from HTTPS traffic captured at the router in the smart home. To protect user privacy, one of our design considerations is to inspect only the headers of the HTTPS packets. The fields that we leverage in our design are shown in Table 1. They include the device identifier (extracted from the MAC address in each packet), timing, the packet length, the direction (outgoing or incoming), the transport-level protocol (TCP, UDP, or ICMP), the destination port, the source port, and the destination domain. We analyze the DNS queries in the network traffic and extract a mapping of external IP addresses to domain names (FQDNs).
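The sketch below illustrates one way to build that IP-to-domain mapping from DNS answers in a pcap; the paper does not prescribe tooling, so the use of the dpkt parser and the file name here are assumptions.

```python
import socket
import dpkt

ip_to_domain = {}  # external IP address -> queried FQDN
with open("lab_traffic.pcap", "rb") as f:          # illustrative path
    for _, buf in dpkt.pcap.Reader(f):
        eth = dpkt.ethernet.Ethernet(buf)
        if not isinstance(eth.data, dpkt.ip.IP):
            continue
        ip = eth.data
        if not isinstance(ip.data, dpkt.udp.UDP) or ip.data.sport != 53:
            continue                                # keep only DNS responses
        try:
            dns = dpkt.dns.DNS(ip.data.data)
        except dpkt.dpkt.UnpackError:
            continue
        for rr in dns.an:                           # answer records
            if rr.type == dpkt.dns.DNS_A:           # IPv4 A records
                ip_to_domain[socket.inet_ntoa(rr.rdata)] = rr.name
```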

Feature definition. Using packet-level attributes directly as features in an ML model is not feasible due to the large number of packets usually generated in a network. Moreover, for our user authentication task we need aggregated behavioral indicators beyond the scope of a single network packet. Therefore, we experiment with multiple feature representations that capture user behavior over a time interval of length $\Delta$. We use a sliding-window method that advances the time at which we compute authentication scores by one minute. We experiment with multiple values for the window length $\Delta$ (from 5 to 30 minutes). For instance, if a user is in the room for an interval of 30 minutes during training and $\Delta = 5$ minutes, we generate 26 sliding windows of size 5 minutes. Features are generated for each sliding window and labeled with the identity of the user. At testing time, we compute an authentication score at each time $t$ based on the most recent sliding window of size $\Delta$, i.e., the interval $[t - \Delta, t]$. That means, implicitly, that we need to observe the user for a period of at least $\Delta$ before computing an authentication score.
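A minimal sketch of this windowing scheme follows, assuming session boundaries are known from the labels; function and variable names are illustrative.

```python
from datetime import datetime, timedelta

def sliding_windows(session_start, session_end, delta_minutes):
    """Yield (window_start, window_end) pairs, advancing one minute at a time."""
    delta = timedelta(minutes=delta_minutes)
    start = session_start
    while start + delta <= session_end:
        yield start, start + delta
        start += timedelta(minutes=1)

s = datetime(2019, 3, 1, 14, 0)
windows = list(sliding_windows(s, s + timedelta(minutes=30), 5))
print(len(windows))  # 26 windows of 5 minutes in a 30-minute session
```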

The first source of data for feature extraction is user interaction with different IoT devices. It is well known that people have different interests and prefer to interact with certain types of devices. Moreover, the same user tends to keep using the set of IoT devices they are most familiar with over time. We believe that device usage is a strong behavioral indicator, as also confirmed in our evaluation.

Therefore, we define the first set of device-level features, extracted from packet attributes aggregated per device. In particular, we would like to capture information about user interaction with each IoT device during the time window $\Delta$, such as: (1) various statistics on incoming and outgoing packet lengths; (2) packet counts per transport-level protocol; (3) the number of distinct domains contacted by each device; and (4) statistics on the inter-arrival times of packets. The list of aggregated features is given in Table 2, and all the features are aggregated over the packets sent and received by each device in each time window of size $\Delta$. An important design decision is to separate incoming and outgoing traffic, as they might have different distributions. In particular, user interaction with voice assistants could result in large volumes of incoming traffic (for instance, if a user listens to music), while device communication with the cloud induces regular outgoing communication. We denote this set of device features in the time window ending at $t$ as $X_t^{dev}$.
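As a sketch of this aggregation, the snippet below computes the Table 2 statistics per device for one window, assuming the packets of the window are rows of a pandas DataFrame with the Table 1 fields (the column names are assumptions); protocol counts and inter-event times are omitted for brevity.

```python
import pandas as pd

STATS = ["count", "sum", "min", "max", "std", "mean", "median"]

def device_features(window_df: pd.DataFrame) -> pd.DataFrame:
    """Aggregate packet attributes per device, keeping directions separate."""
    frames = []
    for direction in ("incoming", "outgoing"):
        sizes = (window_df[window_df["direction"] == direction]
                 .groupby("device")["packet_length"]
                 .agg(STATS)
                 .add_prefix(f"{direction}_"))
        frames.append(sizes)
    # Number of distinct domains contacted by each device in the window.
    frames.append(window_df.groupby("device")["domain"]
                  .nunique()
                  .rename("distinct_domains"))
    return pd.concat(frames, axis=1).fillna(0)
```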

The second source of data useful for feature generation is user communication with external destinations (domain names). This is motivated by our observation that different Alexa skills connect to different external domains. For instance, listening to music may stream data from the spotify.com domain, whereas playing fireplace sounds uses kwimer.com. As the set of external domain names is very large and cloud services use many sub-domains for load balancing, we reduce domain names to second-level domains and compute features per second-level domain. For each second-level domain, we use aggregated features including statistics on incoming and outgoing packet lengths, inter-event times, and packet counts per transport protocol, computed over all the packets exchanged with that external domain in the time window of size $\Delta$. We denote this set of domain features in the time window ending at $t$ as $X_t^{dom}$.
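A minimal sketch of the second-level-domain reduction is shown below; it assumes simple two-label suffixes, so a public-suffix list (e.g., via the tldextract package) would be needed to handle suffixes like .co.uk correctly.

```python
def second_level_domain(fqdn: str) -> str:
    """Keep only the last two labels of a fully qualified domain name."""
    labels = fqdn.rstrip(".").split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else fqdn

print(second_level_domain("p1.music.spotify.com"))                # spotify.com
print(second_level_domain("audio-ak.spotify.com.edgesuite.net"))  # edgesuite.net
```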

For our task of user authentication, we consider several feature representations:

  1. Device-only: estimate and predict $P[y_t = u_i \mid X_t^{dev}]$;

  2. Domain-only: estimate and predict $P[y_t = u_i \mid X_t^{dom}]$;

  3. Both device and domain: estimate and predict $P[y_t = u_i \mid X_t^{dev}, X_t^{dom}]$.

Device  User Activities
Echo Dot - Echo Spot - Echo Plus  Alexa voice assistants to support voice commands and control other smart devices
Google Home Mini  Google voice assistant to support voice commands and control other smart devices
Harman Kardon Invoke  Cortana voice assistant to support voice commands and control other smart devices
LG Smart TV  Watch TV or stream content through third-party applications
Roku TV  Stream content through third-party applications
Amazon Fire TV  Stream content through third-party applications
Philips Hue Bridge  Set the color and brightness of the bulbs
Samsung Smart Fridge  Use the fridge, interact with the LCD display
Smart Microwave  Open, heat up food, close
iKettle Smart Kettle  Boil water
Behmor Smart Brewer  Brew coffee
SmartThings Hub  Trigger motion and contact sensors
Table 3. IoT devices in the IoT lab.

4. System Implementation

User Sessions Total Time Avg. Duration
1 9 184 20
2 5 181 36
3 11 330 30
4 8 206 25
5 2 38 19
6 12 263 21
7 4 71 17
8 9 154 17
9 6 112 18
10 8 121 15
Table 4. The number of sessions, total time (in minutes), and average session duration (in minutes) per user.

4.1. User study

We performed a user study with multiple users over three weeks to generate the dataset used to train and test our authentication models. We utilize the IoT lab at our institution to conduct our user study and monitor users while they interact with a smart-home-like environment. The lab is an enclosed studio with a kitchen and living area, equipped with a range of Internet-connected devices and appliances to simulate a smart household. The lab is available for use by researchers at our institution. We collected data from the 15 IoT devices installed in our IoT lab, as described in Table 3.

We asked users to use the room for multiple sessions, each lasting at least 15 minutes. One of our requirements was that users be alone in the room during the study sessions. The reason is to facilitate data labeling and to ensure that we do not monitor data generated by other lab users. Users logged their start and end times using their mobile phones in order to label the data for each session. We held several orientation sessions before the actual data collection, to help users become familiar with the devices. We did not provide any scripts, and users were asked to interact with the devices in a natural manner.

In total, we recruited 10 users, who participated in the study and generated 74 sessions with a total duration of 1,660 minutes. The usage statistics per user are in Table 4. We collected 4,082,975 packets, including connections to 316 external destinations and 97 second-level domains. In the rest of the paper, we focus on the six users (1, 3, 4, 6, 8, and 10) who generated more than 6 sessions each.

IRB approval. Our study was approved by the IRB at our institution. All researchers with access to the collected data performed the IRB training and users were required to consent to the data being collected while they use the room. Users were informed of the project goal before participating in the study. As we only extracted fields from packet headers of HTTPS traffic, we did not have access to any user personal information. To further protect the privacy of users, we anonymize the collected data by creating a unique user ID for each study participant. Our data is stored in anonymized format on our servers, not directly identifying the participants.

Implementation. The lab network is monitored at all times, and the network traffic (pcap files) of all the IoT devices is collected on a server located in the room. We did not collect data from users' personal devices such as laptops and phones, as we used a MAC address filter to collect only IoT device traffic. We created software that parses the HTTPS packet headers and stores the fields from Table 1 in a Postgres database. We extracted the device and domain features and stored them in a separate table in our database. For training our models, we use the ML implementations from the Python scikit-learn package. We performed cross-validation for all our experiments and varied the ML hyper-parameters.

Figure 2. Device usage statistics during the user study.
Total bytes, packet count, and average packet length are plotted per device from left to right.
Figure 3. Domain usage statistics during the user study.
Total bytes, packet count, and average packet length are plotted per domain from left to right.

4.2. Data Exploration

We analyzed the collected data to compare different users in terms of device and domain usage. In Figure 2, we show the total data exchanged by device and user, the number of packets, as well as the average packet length. For some devices, such as the Amazon Echo Spot and Google Home, the amount of data transferred varies significantly by user. On the other hand, some smart devices (such as the microwave and smart kettle) are either not utilized by users or do not generate much network activity. Statistical features such as average packet length provide more distinguishing patterns across users than the total number of packets or the total data transferred.

In Figure 3 we display the amount of data transferred, the total number of packets, and the average packet length by user and second-level domain. We only included domains with more than 10,000 packets for at least one user. While all devices communicate with cloud and CDN domains (e.g., cloudfront.com), some domains (such as iheart.com) exhibit different patterns across users.

Next, we are interested in validating the hypotheses that users behave consistently across their own sessions and that their behavior differs from that of other users. For this, we plot the amount of data transmitted by different devices over time for User 6 (in Figure 4) and User 10 (in Figure 5) in three distinct sessions. Interestingly, some sessions of the same user are similar (for instance, the first and third sessions of User 6). However, users sometimes deviate in their behavior across sessions, as demonstrated by the third session of User 10 (in which interaction with the Echo Dot is lower). Overall, Users 6 and 10 behave differently in their interactions with smart devices. However, while there is some similarity across the same user's sessions, these plots also show that user behavior varies across sessions, making behavioral authentication in this context challenging.

5. Evaluation

We evaluate our system using the data collected in the user study that is discussed in Section 4.1. We present results for different feature representations, user classification performance, and receiver operating characteristic (ROC) curves for three different classification models and a high-confidence ensemble model.

For the user classification task, we label the data using the session start and end times of each user participating in our user study. As it is usually not possible to train accurate machine learning models with a limited amount of data, we decided to exclude users with too few sessions. We perform our analysis on the six users (1, 3, 4, 6, 8, and 10) who generated more than 6 sessions each. We further excluded 7 sessions in total during which the network monitoring tool was not active.

5.1. Comparison of Feature Representations

Figure 4. Total amount of data transmitted by five devices for User 6 in three sessions.
Figure 5. Total amount of data transmitted by five devices for User 10 in three sessions.
Figure 6. Accuracy metrics for different time windows for 6 users using different feature representations. From left to right: device based features, domain based features, device and domain based features.

We experiment with the three feature representations discussed in Section 3.4: Device-only (420 features); Domain-only (2910 features); and Both device and domain (3330 features). We considered time windows of different lengths $\Delta$, from 5 to 30 minutes. The windows are created by sliding over the user sessions at one-minute intervals. If a session is shorter than $\Delta$, we generate only one window covering the entire session.

To evaluate our ML models, we perform 7-fold cross-validation across sessions. Thus, in each fold we keep one session for each user for testing and use all other sessions for training. We first consider a Random Forest (RF) classifier to evaluate the feature representations and the time window length. RF is an ensemble learning method that constructs many decision trees using a subset of features chosen at random at each split. RFs are robust models that work well in multi-class classification settings with many features, such as ours. We experiment with and compare other models in the next section.
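The sketch below shows one way to implement this session-level fold construction, assuming each window carries a session identifier; the paper does not specify how held-out sessions are chosen per fold, so the randomized selection here is an assumption.

```python
import numpy as np

def session_folds(session_ids, user_ids, n_folds=7, seed=0):
    """Yield (train_idx, test_idx); each fold holds out one session per user."""
    rng = np.random.default_rng(seed)
    sessions_by_user = {}
    for sid, uid in zip(session_ids, user_ids):
        sessions_by_user.setdefault(uid, set()).add(sid)
    for _ in range(n_folds):
        held_out = {rng.choice(sorted(s)) for s in sessions_by_user.values()}
        is_test = np.array([sid in held_out for sid in session_ids])
        yield np.where(~is_test)[0], np.where(is_test)[0]

# Usage: for train_idx, test_idx in session_folds(sess_ids, user_ids):
#            clf.fit(X[train_idx], y[train_idx])
```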

We plot the recall, precision, and F1 score of the three feature representation methods for varying time windows in Figure 6. The results show that model performance generally increases with the time window, up to 25 minutes, at which point the best results are obtained. We thus select a time window of 25 minutes for the rest of our experiments. We note the inherent tradeoff between classification accuracy and the length of the observation period for computing authentication scores. With smaller window lengths, we can generate authentication scores after limited observation, at the cost of decreased classification performance. Comparing the feature representations, we observe that Device-only performs well and is very similar to Both device and domain. The worst performance is provided by the Domain-only representation.

We also generate the user confusion matrix for the RF model in Table 5. Interestingly, we obtain very good classification for some users (for instance, User 4), but worse results for others (e.g., User 6). User 6 is often mis-classified as User 3, meaning that their behavior is fairly similar. Thus, classification performance varies considerably across users. We believe multiple factors contribute to this phenomenon, among them the variability of user behavior and the amount of data used for training these models.

5.2. Model Comparison

We compare three different models, Logistic Regression with regularization (LR), Random Forest (RF), and Gradient Boosting (GB), for time windows of length 25 minutes. We varied the hyper-parameters of these classifiers, and we select 2000 estimators for RF and 2000 estimators for GB, with the learning rate set at 0.01 and the maximum depth at 3. For this experiment, we use the Device-only features. In Table 6 we show user-level performance metrics for the three classifiers when all six users are considered. As observed, GB outperforms both RF and LR, with average precision and recall of 0.81 and 0.80, respectively. The performance of RF and GB is fairly close, while LR has lower classification performance (being a linear model with lower complexity). We also generate ROC curves in Figure 7 by training six classifiers, each one classifying one user class versus the rest. Figure 8 shows the micro-averaged results for each of the three models with six users.
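For reference, below is a sketch of the three classifiers with the hyper-parameters reported above, using scikit-learn as in our implementation; settings not stated in the paper (such as the LR iteration cap) are assumptions.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

models = {
    "LR": LogisticRegression(max_iter=1000),  # L2-regularized by default
    "RF": RandomForestClassifier(n_estimators=2000),
    "GB": GradientBoostingClassifier(n_estimators=2000,
                                     learning_rate=0.01,
                                     max_depth=3),
}
# Each model is trained on device-only features of 25-minute windows, e.g.:
# models["GB"].fit(X_train, y_train)
```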

Actual \ Predicted  1 3 4 6 8 10  Count Recall Precision F1
1 5 0 1 1 0 0 7 0.71 0.71 0.71
3 0 47 0 1 0 0 48 0.97 0.81 0.88
4 0 1 20 1 0 0 22 0.90 0.95 0.93
6 0 9 0 4 0 1 14 0.28 0.57 0.38
8 1 0 0 0 4 2 7 0.57 0.8 0.66
10 1 1 0 0 1 4 7 0.57 0.57 0.57
Table 5. RF user confusion matrix and accuracy metrics for 25 minute intervals: avg. recall=0.8, avg. precision=0.78
Logistic Regression Random Forest Gradient Boosting
Recall Precision F1 Recall Precision F1 Recall Precision F1
User 1 0.42 0.6 0.5 0.71 0.71 0.71 0.85 0.75 0.79
User 3 0.52 0.69 0.59 0.97 0.81 0.88 0.93 0.86 0.9
User 4 0.81 0.62 0.70 0.90 0.95 0.93 0.90 1.0 0.95
User 6 0.71 0.45 0.55 0.28 0.57 0.38 0.5 0.53 0.51
User 8 0.42 0.5 0.46 0.57 0.8 0.66 0.42 0.75 0.54
User 10 0.71 0.71 0.71 0.57 0.57 0.57 0.57 0.5 0.53
Average 0.60 0.62 0.60 0.80 0.78 0.78 0.80 0.81 0.80
Table 6. Comparison of three ML classifiers per user and micro-average results for 6 users.

Additionally, we consider a setting with only five users (removing User 6) and show classification results in Table 9 in the Appendix. Surprisingly, by removing one of the users, we obtain classification precision of 0.92 and recall of 0.92 with GB, an increase of more than 10% compared to the setting of six users. Figure 9 in the Appendix shows the user-level ROC curves for five users and Figure 10 in the Appendix shows the micro-averaged results for each of three models with five users.

Figure 7. User-level ROC curves for 25 minute time windows for three models with 6 users.

5.3. High-Confidence Ensemble Model

So far, we showed that models based on device-only features perform relatively well at classifying users at time windows of length 25 minutes. As we discussed initially, different applications might have different requirements in terms of the confidence offered by the authentication score. For instance, a financial application might require a high confidence in the user classification module before allowing a financial transaction. When the authentication score is used as a factor in a multi-factor authentication system, it could be acceptable to not compute an authentication score when the confidence is very low.

To account for these settings, we propose the idea of designing an ensemble of two models that are built independently and can be used in combination to increase the confidence in the authentication score. Our main insight is that we can build models leveraging the device-only and domain-only features independently, and compute an authentication score only when the two models agree on the user prediction. As a result, the overall confidence in the classification will increase. The cost is that, in situations when the two models disagree, no authentication score will be computed. In this case, the upper-level applications might wait for additional time, or leverage other authentication factors.
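A minimal sketch of this agreement rule follows; the function and variable names are illustrative, and since the paper does not specify how the two models' confidences are combined into the final score, the averaging below is an assumption.

```python
import numpy as np

def ensemble_score(clf_dev, clf_dom, x_dev, x_dom):
    """Return (user, score) if the two models agree, else None (abstain)."""
    p_dev = clf_dev.predict_proba(x_dev)[0]
    p_dom = clf_dom.predict_proba(x_dom)[0]
    u_dev = clf_dev.classes_[np.argmax(p_dev)]
    u_dom = clf_dom.classes_[np.argmax(p_dom)]
    if u_dev != u_dom:
        return None  # disagreement: no authentication score is computed
    return u_dev, (p_dev.max() + p_dom.max()) / 2
```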

Actual \ Predicted  1 3 4 6 8 10  Disagreed Agreed Recall Precision F1
1 2 0 0 1 0 0 4 3 0.66 0.66 0.66
3 0 35 0 3 0 0 10 38 0.92 0.92 0.92
4 0 0 13 0 0 0 9 13 1.0 1.0 1.0
6 0 3 0 6 0 0 5 9 0.66 0.54 0.6
8 1 0 0 0 3 0 3 4 0.75 0.75 0.75
10 0 0 0 1 1 4 1 6 0.66 1.0 0.8
Total 32 73 0.86 0.87 0.86
Table 7. Confusion matrix for a Gradient Boosting ensemble with one model using the device features, and the other the domain features for 25 minute window for 6 users. If the classifiers disagree, no authentication score is computed.
Actual \ Predicted  1 3 4 8 10  Disagreed Agreed Recall Precision F1
1 3 0 0 0 0 4 3 1.0 1.0 1.0
3 0 48 0 0 0 0 48 1.0 0.97 0.98
4 0 0 15 0 0 7 15 1.0 1.0 1.0
8 0 0 0 3 0 4 3 1.0 0.75 0.85
10 0 1 0 1 4 1 6 0.66 1.0 0.8
Total 16 75 0.97 0.97 0.97
Table 8. Confusion matrix for the same ensemble model as in Table 7, but with 5 users.

We test the ensemble of two Gradient Boosting models, one built on the device-only features and the second on the domain-only features. We show the results of our ensemble user classification model for six users in Table 7 and for five users in Table 8. The F1 score of the six-user model improves from 0.80 with a single GB model to 0.86 with the ensemble. In this setting, the models disagree on 32 out of 105 sessions. For five users, the ensemble F1 score reaches 0.97 (compared to 0.92 for a single GB model). In this case, the ensemble does not compute a score on only 16 out of 91 sessions. Therefore, the ensemble can reliably increase the confidence of the model, at the cost of not always providing an authentication score.

Figure 8. Model ROC curves for 25-minute windows, 6 users.

6. Related work

Behavior-based authentication systems identify users based on their behavior. Implicit authentication (Shi et al., 2010) shows the applicability of behavior modeling for authentication by utilizing call/message information, browser activity, and GPS history. Itus provides an extensible implicit authentication framework for Android (Khan et al., 2014b). A survey of multiple implicit authentication methods is given in (Khan et al., 2014a). Progressive Authentication (Riva et al., 2012) models user behavior on mobile devices by combining biometric features and sensor data. Another emerging continuous authentication approach leverages sensor information from wearable devices (e.g., smart watches, activity trackers, glasses, bracelets) to learn user behavior (Gafurov et al., 2006; Mare et al., 2014; Peng et al., 2017).

Behavioral authentication in other contexts has also been studied. Freeman et al. (Freeman et al., 2016) designed an ML approach for clustering logins based on their IP, geolocation, browser, and time, for authenticating users to online services. Device fingerprinting has been used to augment user authentication on the web (Alaca and van Oorschot, 2016).

Authorization and access policy frameworks in the IoT systems have also been an active research area. He et al. (He et al., 2018) conducted a survey of IoT device authorization preferences and observed that most devices have limited authorization flexibility. ContexIoT (Jia et al., 2017) is a permission system for IoT apps that provides users contextual information to enable them to make better access control decisions. SmartAuth (Tian et al., 2017) generates a user interface for authorization decisions in IoT apps. Soteria (Celik et al., 2018) performs static analysis to identify if IoT apps respect security policies. IoTGUARD dynamically enforces security policies in IoT devices based on monitoring event handlers by IoT apps (Celik et al., 2019).

There has been prior work investigating user authentication in smart home environments. Emami-Naeini et al. (Naeini et al., 2017) perform a user study on IoT privacy and demonstrate that users are not comfortable with biometric data collection in IoT settings. Biometric-based authentication is optional in Alexa (ale, 2019), but it is prone to recording and replay attacks. Shi et al. (Shi et al., 2017) proposed an authentication system leveraging physical properties of the WiFi signals generated by IoT devices. Apthorpe et al. (Apthorpe et al., 2017) show that user actions can be inferred from encrypted network traffic generated by IoT devices. In this work, we utilize the network communication statistics of IoT devices during user interactions to build a user authentication framework using machine learning. Our system computes an authentication score for each user, which can be used by services to enforce flexible policies depending on the sensitivity of the service.

7. Discussion and Conclusions

We propose a novel user authentication method based on analyzing the HTTPS network traffic generated by home-based IoT devices. We conduct a user study and experiment with several ML algorithms using features extracted from network traffic, with the goal of classifying users from a known set of users. We showed that random forest and gradient boosting models perform well in this setting and obtain high accuracy at user classification. Moreover, an ensemble of two models can increase the confidence of classification, at the expense of abstaining from producing a score when the models produce different predictions.

We also tested several deep neural network (DNN) architectures, including feed-forward neural networks and LSTMs. We did not include the details in the paper, as the accuracy of these models was not comparable with that of gradient boosting, random forest, and our two-model ensemble. The best F1 score was 0.54 for a feed-forward architecture and 0.51 for an LSTM, whereas our gradient boosting ensemble achieves an F1 score of 0.86. We suspect the reason is that we have a relatively limited labeled dataset available for model training. While in principle it is possible to conduct the user study over longer periods of time, we found it challenging to recruit and retain users over time.

In a practical deployment, we envision that the ML model for user classification is continuously re-trained with new behavioral data over time. This is particularly important as new IoT devices are added to the smart home and users slowly shift their interests over time. Again, a longer-term user study would provide sufficient data to experiment with model re-training and to analyze user behavior over time as IoT devices evolve. Another research direction would be to investigate the performance of the models in the presence of multiple active users in the smart home.

The behavioral authentication scores computed by our authentication module could be applied in various settings. Most importantly, they open up the possibility of creating flexible policies for authorization and access control, replacing today's rigid, fixed policies. We believe that our work opens new research avenues in this direction, and this is a topic of interest for our future work.

Acknowledgements

The authors would like to thank David Choffnes, Daniel Dubois, and Jingjing Ren for providing us access to the Mon(IoT)r Lab at Northeastern University and setting up the data collection infrastructure that enabled IoT device monitoring. We thank Visa Research for funding this research. We would also like to thank Andres Molina-Markham for many discussions on contextual authentication.

References

  • fac (2019) 2019. Facebook Stored Hundreds of Millions of User Passwords in Plain Text. https://krebsonsecurity.com/2019/03/facebook-stored-hundreds-of-millions-of-user-passwords-in-plain-text-for-years/
  • ale (2019) 2019. Manage Alexa Voice Purchasing Settings. https://www.amazon.com/gp/help/customer/display.html?nodeId=201952610
  • wor (2019) 2019. Passwords, passwords everywhere. https://www.ncsc.gov.uk/blog-post/passwords-passwords-everywhere
  • Alaca and van Oorschot (2016) Furkan Alaca and P. C. van Oorschot. 2016. Device Fingerprinting for Augmenting Web Authentication: Classification and Analysis of Methods. In Proceedings of the 32Nd Annual Conference on Computer Security Applications (ACSAC ’16). ACM, New York, NY, USA, 289–301. https://doi.org/10.1145/2991079.2991091
  • Apthorpe et al. (2017) Noah Apthorpe, Dillon Reisman, Srikanth Sundaresan, Arvind Narayanan, and Nick Feamster. 2017. Spying on the smart home: Privacy attacks and defenses on encrypted IoT traffic. arXiv preprint arXiv:1708.05044 (2017).
  • Biggio et al. (2012) Battista Biggio, Blaine Nelson, and Pavel Laskov. 2012. Poisoning attacks against support vector machines. In ICML.
  • Blasco et al. (2016) Jorge Blasco, Thomas M. Chen, Juan Tapiador, and Pedro Peris-Lopez. 2016. A Survey of Wearable Biometric Recognition Systems. ACM Comput. Surv. 49, 3, Article 43 (Sept. 2016), 35 pages. https://doi.org/10.1145/2968215
  • Celik et al. (2018) Z. Berkay Celik, Patrick McDaniel, and Gang Tan. 2018. Soteria: Automated IoT Safety and Security Analysis. In 2018 USENIX Annual Technical Conference (USENIX ATC 18). USENIX Association, Boston, MA, 147–158. https://www.usenix.org/conference/atc18/presentation/celik
  • Celik et al. (2019) Z. Berkay Celik, Gang Tan, and Patrick D. McDaniel. 2019. IoTGuard: Dynamic Enforcement of Security and Safety Policy in Commodity IoT. In 26th Annual Network and Distributed System Security Symposium, NDSS 2019, San Diego, California, USA, February 24-27, 2019. https://www.ndss-symposium.org/ndss-paper/iotguard-dynamic-enforcement-of-security-and-safety-policy-in-commodity-iot/
  • De Cristofaro et al. (2013) Emiliano De Cristofaro, Honglu Du, Julien Freudiger, and Greg Norcie. 2013. A comparative usability study of two-factor authentication. arXiv preprint arXiv:1309.5344 (2013).
  • Freeman et al. (2016) David Freeman, Sakshi Jain, Markus Dürmuth, Battista Biggio, and Giorgio Giacinto. 2016. Who Are You? A Statistical Approach to Measuring User Authenticity.. In NDSS. 1–15.
  • Gafurov et al. (2006) Davrondzhon Gafurov, Kirsi Helkala, and Torkjel Søndrol. 2006. Biometric Gait Authentication Using Accelerometer Senso. JCP 1, 7 (2006), 51–59.
  • Habib et al. (2018) Hana Habib, Pardis Emami Naeini, Summer Devlin, Maggie Oates, Chelse Swoopes, Lujo Bauer, Nicolas Christin, and Lorrie Faith Cranor. 2018. User Behaviors and Attitudes Under Password Expiration Policies. In Fourteenth Symposium on Usable Privacy and Security (SOUPS 2018). USENIX Association, Baltimore, MD, 13–30. https://www.usenix.org/conference/soups2018/presentation/habib-password
  • He et al. (2018) Weijia He, Maximilian Golla, Roshni Padhi, Jordan Ofek, Markus Dürmuth, Earlence Fernandes, and Blase Ur. 2018. Rethinking Access Control and Authentication for the Home Internet of Things (IoT). In 27th USENIX Security Symposium (USENIX Security 18). USENIX Association, Baltimore, MD, 255–272. https://www.usenix.org/conference/usenixsecurity18/presentation/he
  • Jia et al. (2017) Yunhan Jack Jia, Qi Alfred Chen, Shiqi Wang, Amir Rahmati, Earlence Fernandes, Z. Morley Mao, and Atul Prakash. 2017. ContexIoT: Towards Providing Contextual Integrity to Appified IoT Platforms. In 21st Network and Distributed Security Symposium.
  • Khan et al. (2014a) Hassan Khan, Aaron Atwater, and Urs Hengartner. 2014a. A Comparative Evaluation of Implicit Authentication Schemes. In Research in Attacks, Intrusions and Defenses, Angelos Stavrou, Herbert Bos, and Georgios Portokalidis (Eds.). Springer International Publishing, Cham, 255–275.
  • Khan et al. (2014b) Hassan Khan, Aaron Atwater, and Urs Hengartner. 2014b. Itus: An Implicit Authentication Framework for Android. In Proceedings of the 20th Annual International Conference on Mobile Computing and Networking (MobiCom ’14). ACM, New York, NY, USA, 507–518. https://doi.org/10.1145/2639108.2639141
  • Mare et al. (2014) Shrirang Mare, Andrés Molina Markham, Cory Cornelius, Ronald Peterson, and David Kotz. 2014. ZEBRA: Zero-Effort Bilateral Recurring Authentication. In Proceedings of the 2014 IEEE Symposium on Security and Privacy (SP ’14). IEEE Computer Society, Washington, DC, USA, 705–720. https://doi.org/10.1109/SP.2014.51
  • Melicher et al. (2016) William Melicher, Darya Kurilova, Sean M. Segreti, Pranshu Kalvani, Richard Shay, Blase Ur, Lujo Bauer, Nicolas Christin, Lorrie Faith Cranor, and Michelle L. Mazurek. 2016. Usability and Security of Text Passwords on Mobile Devices. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16). ACM, New York, NY, USA, 527–539. https://doi.org/10.1145/2858036.2858384
  • Meng et al. (2015) Weizhi Meng, Duncan S. Wong, Steven Furnell, and Jianying Zhou. 2015. Surveying the Development of Biometric User Authentication on Mobile Phones. IEEE Communications Surveys and Tutorials 17, 3 (2015), 1268–1293. https://doi.org/10.1109/COMST.2014.2386915
  • Naeini et al. (2017) Pardis Emami Naeini, Sruti Bhagavatula, Hana Habib, Martin Degeling, Lujo Bauer, Lorrie Faith Cranor, and Norman Sadeh. 2017. Privacy Expectations and Preferences in an IoT World. In Thirteenth Symposium on Usable Privacy and Security (SOUPS 2017). USENIX Association, Santa Clara, CA, 399–412. https://www.usenix.org/conference/soups2017/technical-sessions/presentation/naeini
  • Peng et al. (2017) Ge Peng, Gang Zhou, David T Nguyen, Xin Qi, Qing Yang, and Shuangquan Wang. 2017. Continuous authentication with touch behavioral biometrics and voice on wearable glasses. IEEE Transactions on Human-Machine Systems 47, 3 (2017), 404–416.
  • Riva et al. (2012) Oriana Riva, Chuan Qin, Karin Strauss, and Dimitrios Lymberopoulos. 2012. Progressive Authentication: Deciding When to Authenticate on Mobile Phones.. In USENIX Security Symposium. 301–316.
  • Shi et al. (2017) Cong Shi, Jian Liu, Hongbo Liu, and Yingying Chen. 2017. Smart user authentication through actuation of daily activities leveraging WiFi-enabled IoT. In Proceedings of the 18th ACM International Symposium on Mobile Ad Hoc Networking and Computing. ACM, 5.
  • Shi et al. (2010) Elaine Shi, Yuan Niu, Markus Jakobsson, and Richard Chow. 2010. Implicit authentication through learning user behavior. In International Conference on Information Security. Springer, 99–113.
  • Siadati et al. (2017) Hossein Siadati, Toan Nguyen, Payas Gupta, Markus Jakobsson, and Nasir Memon. 2017. Mind your SMSes: Mitigating social engineering in second factor authentication. Computers & Security 65 (2017), 14–28.
  • Steinhardt et al. (2017) Jacob Steinhardt, Pang Wei Koh, and Percy Liang. 2017. Certified Defenses for Data Poisoning Attacks. In Advances in Neural Information Processing Systems (NIPS).
  • Tian et al. (2017) Yuan Tian, Nan Zhang, Yueh-Hsun Lin, XiaoFeng Wang, Blase Ur, Xianzheng Guo, and Patrick Tague. 2017. SmartAuth: User-Centered Authorization for the Internet of Things. In 26th USENIX Security Symposium (USENIX Security 17). USENIX Association, Vancouver, BC, 361–378. https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/tian
  • Ur et al. (2016) Blase Ur, Jonathan Bees, Sean M. Segreti, Lujo Bauer, Nicolas Christin, and Lorrie Faith Cranor. 2016. Do Users’ Perceptions of Password Security Match Reality?. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16). ACM, New York, NY, USA, 3748–3760. https://doi.org/10.1145/2858036.2858546
  • Wayman et al. (2005) James Wayman, Anil Jain, Davide Maltoni, and Dario Maio. 2005. An introduction to biometric authentication systems. In Biometric Systems. Springer, 1–20.
  • Wolf et al. (2019) Flynn Wolf, Ravi Kuber, and Adam J Aviv. 2019. Pretty Close to a Must-Have: Balancing Usability Desire and Security Concern in Biometric Adoption. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. ACM, 151.
  • Xiao et al. (2015) Huang Xiao, Battista Biggio, Gavin Brown, Giorgio Fumera, Claudia Eckert, and Fabio Roli. 2015. Is feature selection secure against training data poisoning?. In Proc. 32nd International Conference on Machine Learning (ICML), Vol. 37. 1689–1698.

Appendix A Appendix

We show here additional results for classifying five users. Figure 9 shows the user-level ROC curves and Figure 10 shows the micro-averaged results for each of the three models. Table 9 shows the per-user results for the three models when classifying among five users.

Figure 9. User-level ROC curves for 25 minute time windows for three models with 5 users.
Figure 10. Averaged ROC curves for 25 minute time windows for 5 users.
Logistic Regression Random Forest Gradient Boosting
Recall Precision F1 Recall Precision F1 Recall Precision F1
User 1 0.57 0.66 0.61 0.85 0.75 0.8 0.85 1.0 0.92
User 3 0.79 0.88 0.83 0.97 0.95 0.96 1.0 0.92 0.96
User 4 0.90 0.62 0.74 0.90 0.95 0.93 0.95 1.0 0.97
User 8 0.42 0.75 0.54 0.57 0.8 0.66 0.71 0.83 0.76
User 10 0.71 0.83 0.76 0.71 0.62 0.66 0.57 0.66 0.61
Average 0.76 0.79 0.78 0.90 0.90 0.90 0.92 0.92 0.92
Table 9. Comparison of three ML classifiers per user and micro-average results for 5 users.