A Study on Priming Methods for Graphical Passwords

03/16/2021
by Zach Parish et al.

Recent work suggests that a type of nudge or priming technique called the 'presentation effect' may improve the security of PassPoints-style graphical passwords. These nudges attempt to prime or non-intrusively bias user password choices (i.e., point selections) by gradually revealing a background image from one edge to the opposite edge at password creation time. We conduct a large-scale user study (n=865) to develop further insights into the presence of this effect and to perform the first evaluations of its usability and security impacts. Our usability analyses indicate that these priming techniques do not harm usability. Our security analyses reveal that the priming techniques measurably alter the security of graphical passwords; however, this effect depends on the combination of the image and the priming technique used.


1. Introduction

User authentication is an integral component for the security of computer systems, ranging from mobile devices to critical systems. Its most common form is knowledge-based authentication systems, based on a secret that you know (e.g., passwords, PINs, and passphrases) (Herley2012). Knowledge-based authentication systems, particularly passwords, are widely popular due to their low cost and lack of specialized hardware requirements. Unfortunately, password security has become a serious concern, due to recent advances in password attacks using publicly leaked passwords, personal information, and advanced guessing techniques (CCS2016-TARGETEDONLINE; liu19; PERSONAL-PCFG; NNGuessing). These attacks call into question the viability of existing password systems and demonstrate the need for new approaches to improve security. One current approach is to employ complex password-composition policies (e.g., symbol or digit requirements, minimum length requirements, or banning dictionary words), which aim to improve the entropy of the password space. Restrictive policies, however, often lead to user frustration by increasing the difficulty of remembering a secure password (Komanduri2011). These memorability and security concerns have motivated new authentication methods such as graphical passwords (SUO2005), where a user remembers an image, or parts of an image, instead of a text password. The promise of graphical passwords is to improve password memorability by using people’s superior memory for images, and possibly to inspire other ways to improve text passwords (BCV12). One such graphical password system is PassPoints (wiedenbeck2005), in which a user is asked to select a sequence of five click-points on a background image as a password. The simplicity of PassPoints allows it to serve as a building block or special case of several more complex graphical password systems (e.g., MPP (MPP), BDAS (BDAS), CCP (CCP), PCCP (PCCP)).
Of practical importance is Microsoft Picture Password (MPP) (MPP), an optional login mechanism in Windows 8 and newer versions, for which PassPoints can be viewed as a special case (see related work for further discussion).

Unfortunately, PassPoints, like many other graphical password systems, remains vulnerable because users tend to create predictable graphical passwords, which makes them easier for attackers to guess (Zhao2015; vanOorschot2010; Thorpe2007; Dirik2007; Sadovnik). Successful attacks against PassPoints passwords (vanOorschot2010; Thorpe2007; Dirik2007) have motivated approaches to help users choose unpredictable graphical passwords (e.g., using persuasion (PCCP), saliency masks (Bulling2012), and adaptive mechanisms based on cognitive and visual behaviors (katsini2018)).

One recent approach of interest is a priming technique or nudge that uses an image presentation to unobtrusively influence users’ graphical password choices at the time of password creation (Thorpe2014). A special instance of an image presentation is drawing the curtain, which slowly reveals a background image, as though a curtain is initially covering it. A small-scale study found significant differences in the distribution of first click-points between groups with two different directions of drawing the curtain: right-to-left (RTL) and left-to-right (LTR) (Thorpe2014). Because the image presentation used in password creation is unknown to an adversary, a system with a sufficiently large number of image presentation styles was suggested to hold promise for complicating guessing attacks and consequently enhancing security.

While interesting, the initial image presentation study offered only a suggestion of security improvements: it showed a statistically significant impact on user choices, but it offered no security analyses. Additionally, it suffers from a number of shortcomings: a small sample size, the lack of a control group, the lack of serious usability analyses, and a focus on only one engineered background image. As such, this paper aims to take a more rigorous approach to evaluating priming techniques by addressing three specific questions: (1) How does drawing the curtain affect usability? (2) How does drawing the curtain affect security in practice (i.e., password guessability)? (3) Are the presentation effect and its security consequences image dependent (and if so, on which types of image)?

We tackle the above questions through a large-scale study of the image presentations (n=865), involving three different images (from distinct classes of images) and a control group for each. Our contributions include:

  1. The design of an image selection methodology that automatically selects images of varying levels of eye-tracking complexity. Our image selection methodology removes researcher bias towards images of personal interest while making this important decision more objective and data-driven, and also has the potential to open interesting research directions such as automated secure image selection.

  2. A usability analysis that confirms that the tested priming techniques have no usability impact (measured by SUS scores, login times, password reset rates, and login success rates).

  3. A security analysis that employs a well-known class of automated attacks (vanOorschot2010) in order to better quantify the extent of security improvements that actually arise in practice from such priming techniques.

Our analyses suggest that the effectiveness of priming techniques in improving security is image dependent. For one class of our examined images, security was considerably improved (in some cases, twofold), while for other classes security was degraded by the examined priming techniques. Surprisingly, our results indicate that the positive impact of priming techniques on password security is more pronounced on images with low security potential (e.g., having a single dense saliency region). Our analyses shed light on the challenges of designing effective priming techniques for graphical passwords, in line with related research for text passwords (Coventry2018; Renaud2017).

2. Related Work

User authentication methods fall into three categories: what you know (i.e., knowledge-based authentication such as passwords and PINs), what you have (e.g., physical tokens and smartphones), and what you are (i.e., biometrics such as fingerprints). Knowledge-based authentication offers many benefits including low cost and easy recovery (FRAMEWORK). Graphical passwords are one knowledge-based proposal that aims to harness the memorability of images.

There is a large body of research on graphical passwords (see these comprehensive surveys (BCV12; SUO2005) for review). Graphical password systems are typically classified into recall-based (e.g., BDAS (BDAS), PassPoints (wiedenbeck2005)) and recognition-based (e.g., PassFaces (PassFaces), VIP (VIP)), depending on whether or not the login phase involves recognizing an image (or set of images). Recall-based systems are typically further classified into cued-recall or pure-recall, depending on whether the user is provided an image “cue” at login time. Our focus is a cued-recall system called PassPoints (wiedenbeck2005), whereby a password is a sequence of five click-points on a system-provided background image. A login involves a user selecting the same ordered sequence of five click-points within some acceptable margin of error. The simplicity of PassPoints allows it to serve as a building block or special case of several more complex graphical password systems discussed below. One real-world example is Microsoft Picture Password (MPP) (MPP), which serves as an optional login mechanism in Windows since Windows 8. In MPP, users draw gestures (e.g., lines, curves, taps) on a background image to serve as their password. This image is then shown to the user during login to cue password entry. An MPP password created using only the tap gesture is analogous to a PassPoints password.

Security issues in PassPoints have been extensively studied, including the consequences of users choosing popular points (or hot-spots) (Thorpe2007) and the success of attacks that use image processing (Dirik2007), human computation (Thorpe2007), and geometric patterns in the sequence of click-points (vanOorschot2010).

To improve security, persuasive techniques have been proposed that limit users’ choices during password creation. One approach, Persuasive Cued Click Points (PCCP) (PCCP), is based on another cued-recall system whereby a user selects a single point on each of five images (CCP). PCCP selects a random location to place a viewport, which contains a small region of the image where the user can choose a click-point. Another approach (Bulling2012) uses saliency masks to reduce a user’s interest in the most salient parts of the image.

Recent work has explored quantifying the security of background images based on their underlying saliency (Alshehri2016). In particular, the Graph-Based Visual Saliency (GBVS) model with binary thresholding is used to assess the security of background images. Our work extends this approach by placing images into security clusters using their ground truth saliency (as collected by Borji et al. (Borji2015)) and by considering additional saliency map features alongside regions of interest found through binary thresholding.

Priming effects are another class of methods proposed to positively influence security outcomes without limiting a user’s choice in graphical passwords (Thorpe2014). That work also demonstrated that a simple image presentation of drawing a curtain over the background image to reveal it slowly (from left to right or right to left) produces a different distribution in user-chosen click-points. Following this work, another form of image presentation was devised and studied: revealing the image starting from the least salient parts, and showing the most salient parts last (katsini2018). Other forms of nudging a user towards more secure choices during password creation have also been successfully employed in grid-based graphical passwords (vonZezschwitz2016) and traditional text passwords (Renaud2019). Despite the interest in priming techniques for user authentication, many questions remain open for the initial image presentation approach (Thorpe2014): (1) How do these techniques affect usability? (2) How do these techniques affect security? (3) Is the effect and its security impact image dependent (and if so, on which types of image)? Answering these questions in a principled way is what motivates this paper.

Priming has also been explored as a method to help users learn and later recognize a “recognition-based” graphical password (denning2011exploring). A system called MooneyAuth involved priming users with Mooney images to aid in the long-term recollection of a set of image labels (MOONEYAUTH). Passphrase memorability was improved by an approach that employed a training period involving semantic priming and a visual implicit learning technique (Joudaki18).

3. Graphical Password System

We implemented PassPoints (wiedenbeck2005), in which a user is required to select and recall a sequence of 5 click-points (or pixels) on a given background image as his/her password, denoted by

$$P_u^I = \big( (x^I_{u,1}, y^I_{u,1}), \ldots, (x^I_{u,5}, y^I_{u,5}) \big). \qquad (1)$$

Here, $(x^I_{u,i}, y^I_{u,i})$ is the (x,y)-coordinates that the user $u$ has selected for his/her $i$-th click-point on the image $I$. For enhancing usability, some error tolerance is permitted for a login attempt’s click-points, meaning that each click-point can be within a tolerance distance $\tau$ from each originally selected point. Assuming the login attempt $\big( (\hat{x}_1, \hat{y}_1), \ldots, (\hat{x}_5, \hat{y}_5) \big)$, it can successfully login as user $u$ on image $I$ if and only if for all $i \in \{1, \ldots, 5\}$:

$$|\hat{x}_i - x^I_{u,i}| \le \tau \quad \text{and} \quad |\hat{y}_i - y^I_{u,i}| \le \tau.$$

In this paper, we set $\tau = 10$ pixels, consistent with the other similar studies (Thorpe2014). This restriction creates a square 21x21 pixel error tolerance region centered on the selection point.
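The login check described above can be sketched in a few lines of Python. This is an illustrative sketch, not the study's implementation: the function name `login_matches` and the constant `TOLERANCE` are hypothetical, and the per-axis tolerance of 10 pixels is inferred from the 21x21 pixel tolerance square.

```python
from typing import Sequence, Tuple

Point = Tuple[int, int]

# Assumption: a 21x21 square centred on each original click-point
# corresponds to a per-axis tolerance of 10 pixels.
TOLERANCE = 10


def login_matches(password: Sequence[Point], attempt: Sequence[Point],
                  tolerance: int = TOLERANCE) -> bool:
    """Return True if every attempted click lies inside the tolerance
    square of the stored click-point at the same position in the sequence."""
    if len(password) != len(attempt):
        return False
    return all(
        abs(ax - px) <= tolerance and abs(ay - py) <= tolerance
        for (px, py), (ax, ay) in zip(password, attempt)
    )
```

Because the points are compared position by position, clicking the correct five locations in the wrong order is rejected, which matches the ordered-sequence requirement of PassPoints.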

3.1. Priming Methods

During password creation, our system can apply randomly selected priming methods (e.g., drawing the curtain (Thorpe2014)) to bias a user’s click-points. These priming methods are intended to counteract the tendency of users to select similar sequences of click-points when presented with the same background image. Similar password choices form hot spots (Thorpe2007) and click-order patterns (vanOorschot2010), which undermine the theoretical security promises of PassPoints.

We implement the drawing the curtain priming method (Thorpe2014), where users watch the target image being gradually revealed before selecting their click-points. In our system, the user is first presented with a blank white image, and the target image is gradually revealed starting from one edge over 20 seconds. In replicating the drawing-the-curtain method, we apply curtain drawing in the left-to-right (LTR) and right-to-left (RTL) directions. We chose to study RTL and LTR exclusively, rather than alternative methods such as top-to-bottom or bottom-to-top, so that our work could be based on a known result before being extended to other possible priming techniques.
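To make the reveal timing concrete, the following sketch computes which pixel columns are visible at a given moment. The 20-second duration and 640-pixel width come from the text; the constant-speed reveal and the name `revealed_columns` are assumptions made for illustration, since the text does not specify the reveal curve.

```python
REVEAL_SECONDS = 20.0  # reveal duration stated in the text
IMAGE_WIDTH = 640      # background images are 640x480


def revealed_columns(elapsed: float, direction: str,
                     width: int = IMAGE_WIDTH,
                     duration: float = REVEAL_SECONDS) -> range:
    """Return the range of pixel columns visible after `elapsed` seconds.

    Assumption: the curtain edge moves at constant speed.
    """
    if direction not in ("LTR", "RTL"):
        raise ValueError("direction must be 'LTR' or 'RTL'")
    # Clamp progress to [0, 1] so times outside the animation are safe.
    fraction = min(max(elapsed / duration, 0.0), 1.0)
    visible = int(round(fraction * width))
    if direction == "LTR":
        return range(0, visible)          # revealed from the left edge
    return range(width - visible, width)  # revealed from the right edge
```

Halfway through the animation, an LTR reveal shows the left half of the image and an RTL reveal shows the right half; everything else is still covered by the blank white curtain.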

4. Image Selection Methodology

The presentation effect was previously examined on a grid image composed of smaller images (see Figure 1) (Thorpe2014). We investigate the drawing-the-curtain technique on this grid image, and two additional non-composite images, carefully selected to present different underlying saliency with more natural real-life stimuli. We select these images from the CAT2000 dataset (Borji2015) which contains thousands of images and their ground truth saliency maps. We preprocess the dataset and extract features from the saliency maps to perform clustering. Our clustering aims to identify distinct classes of images with structurally different distributions of saliency regions. We hypothesize that the manner in which saliency is distributed over an image will play a role in the security of passwords created on the image, and have an impact on the effect of priming techniques. In particular, we expect that images where saliency is less evenly distributed will produce more predictable passwords, and thus be more susceptible to guessing attacks. We find three distinct clusters and select the center image of each cluster as its representative. For comparability with the grid image, the selected images are scaled to a 640x480 resolution. We explain the details of this selection process below.

Figure 1. Grid image (Thorpe2014)

4.1. CAT2000 Dataset

The CAT2000 dataset (Borji2015) contains 2000 images from 20 categories (e.g., Indoor, Outdoor Natural, Object, etc.) with eye-tracking fixation point data generated by 18 observers performing a five-second free look on each image. For each image, a greyscale saliency map is generated by smoothing the fixation points of all viewers to approximate the continuous distribution generated by infinite viewers. Each pixel in the saliency map ranges from 0 to 255. The higher a pixel value is, the higher its saliency is. A pixel with a value of 0 indicates that the pixel was not salient to the observers. The distribution of saliency within each map varies greatly, ranging from maps with a single small region of high-intensity saliency to maps with saliency distributed more uniformly across the image in large regions.

4.2. Preprocessing

We focus on seven image categories of the CAT2000 dataset: Action, Indoor, Object, OutdoorManMade, OutdoorNatural, Random and Social. We have excluded categories composed of art or computer generated graphics (e.g., Sketch, Cartoon, LineDrawing) and categories with a particular visual effect (e.g., noisy, jumbled, inverted). This exclusion allows us to focus on images primarily drawn from real life scenes with minimal artificial visual stimuli types. We also exclude the Affective category to remove potentially disturbing imagery for the subjects in our study.

As with the original implementation of the presentation effect, our system uses 640x480 background images. Since we must resize our selected images to fit this size, we consider only images with a 4:3 aspect ratio to prevent distortion. For eye-tracking purposes all images in the CAT2000 dataset were superimposed onto a 1920x1080 grey background image, creating grey image borders. We remove these borders from the images and from their associated saliency map. From this set of images we select only those with a resolution of 1440x1080 so that image sizes are the same for clustering.

4.3. Feature Extraction

We next extract six features from the saliency maps of the candidate filtered images to construct image feature vectors. Our image feature vectors are specifically designed to capture the number, spread, and density of saliency regions within each image. For each saliency map, our features include: (i) Salient Proportion: fraction of non-zero valued pixels; (ii) All Pixel Variance: variance of all pixels; (iii) Salient Pixel Variance: variance of all non-zero valued pixels; (iv) Number of Saliency Regions: number of unconnected regions of salient pixels when we threshold the saliency map using Otsu’s method (Otsu1979); (v) Distance Between Saliency Regions: distance between unconnected regions of salient pixels after Otsu’s thresholding; and (vi) Proportion of High Saliency Pixels: fraction of non-zero valued pixels after Otsu’s thresholding. These features were selected from a larger set of candidates as they provided the most reasonable clustering result during manual inspection.
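As a rough illustration of how such features could be computed, the sketch below derives five of the six features from a saliency map given as a list of rows of integers in 0..255. It is not the authors' code. Several choices are assumptions: population variance is used, regions are found with 4-connectivity, and the distance-between-regions feature is omitted because the text does not say how that distance is measured. All function names are hypothetical.

```python
from collections import deque
from statistics import pvariance


def otsu_threshold(pixels):
    """Return the grey level t maximising between-class variance, where
    pixels <= t form the background and pixels > t the foreground."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def count_regions(mask):
    """Count 4-connected regions of True cells (connectivity is an assumption)."""
    rows, cols = len(mask), len(mask[0])
    seen = [[False] * cols for _ in range(rows)]
    regions = 0
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            regions += 1
            seen[r][c] = True
            queue = deque([(r, c)])
            while queue:  # breadth-first flood fill of one region
                y, x = queue.popleft()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return regions


def saliency_features(saliency):
    """Compute features (i)-(iv) and (vi) for one saliency map."""
    pixels = [p for row in saliency for p in row]
    salient = [p for p in pixels if p > 0]
    threshold = otsu_threshold(pixels)
    mask = [[p > threshold for p in row] for row in saliency]
    high = sum(cell for row in mask for cell in row)
    return {
        "salient_proportion": len(salient) / len(pixels),
        "all_pixel_variance": pvariance(pixels),
        "salient_pixel_variance": pvariance(salient) if salient else 0.0,
        "num_regions": count_regions(mask),
        "high_saliency_proportion": high / len(pixels),
    }
```

In practice the features would likely need to be standardised before clustering, since a region count and a pixel variance live on very different scales; the text does not state whether this was done.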

4.4. Clustering

We cluster the image feature vectors using the k-means algorithm (macqueen1967) (initialized with k-means++ (Arthur2007)) with k=3 as determined by the Kneedle algorithm (Satopaa2011). This yields three clusters that are largely stable across different random initializations of k-means. For each cluster, we select the image with the shortest distance to its cluster’s centroid as the representative for that cluster. As k-means is non-deterministic, mainly due to random initialization, the representative image might sometimes be different for different initializations. We select the representatives that were the most common in 1000 different runs of k-means with random initializations. In our case, all three cluster representatives were consistently selected in all 1000 runs.
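The representative-selection step can be sketched as follows. The sketch assumes the clustering itself has already been run (k-means is not reimplemented here) and that each cluster is available as a mapping from an image name to its feature vector. Euclidean distance and the function names are assumptions made for illustration.

```python
from collections import Counter
from math import dist


def centroid(points):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def representative(cluster):
    """Return the name of the image whose feature vector lies closest to
    the cluster centroid. `cluster` maps image name -> feature vector."""
    centre = centroid(list(cluster.values()))
    return min(cluster, key=lambda name: dist(cluster[name], centre))


def most_common_representative(picks):
    """Given the representative chosen in each k-means run, return the
    one selected most often across runs."""
    return Counter(picks).most_common(1)[0][0]
```

With 1000 runs, `picks` would hold 1000 names per cluster; when every run agrees, as reported above, the majority vote is trivial.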

The three detected clusters exhibit relatively distinct characteristics of saliency region distributions. The Compact cluster contains images that have a small, dense, and typically center-biased, region of salient visual information within a largely non-salient image (see Figure 2(d) for an example). The Highway image in Figure 2(a) is the selected representative of this cluster. The Diffuse cluster contains images where saliency is spread more uniformly throughout the image in moderately dense saliency regions. The Barn image in Figure 2(c) is this cluster representative. The Outside cluster contains images that fall outside of the other two clusters. Typically images in this cluster have several saliency regions that are too large or too distant from each other to be comparable to the compact images, but not large or widely spread enough to be clustered with the diffuse images. This cluster’s representative is the Fan image in Figure 2(b). As images in the Outside cluster typically present a blend between characteristics found in the Compact and Diffuse clusters, we ignore this cluster and perform our user studies using only the Highway and Barn images. We expect that images in the Diffuse cluster will present more possible click-point locations to the user, leading to more entropy in the password space, and therefore higher resiliency to guessing attacks.

Figure 2. Selected background images (top row: (a) Highway, (b) Fan, (c) Barn) and their associated saliency maps (bottom row: (d) Highway, (e) Fan, (f) Barn). Each image is the representative of its associated cluster: Highway (Compact cluster), Fan (Outside cluster), and Barn (Diffuse cluster).

5. User Study

Our user study is split into 3 sessions across 8 days. Users are directed to interact with a graphical password authentication scheme deployed on our website. They create a graphical password, and log in with it several times over 8 days to simulate average reported time between logins for email and online banking platforms (Hayashi2011). We ask users to complete a questionnaire to record demographic and usability information. This user study, questionnaire, and an exit survey were approved by our institution’s research ethics board. Our study employs a between-subject design where we compare users exposed to a priming effect to users in a control group. We detail how each session proceeds, our recruitment, and demographics below.

5.1. Sessions and Procedures

For each selected image, we run a separate user study with the sessions and procedures discussed below.

Session 1 (day 1). Users are recruited from the crowd-sourcing website Amazon Mechanical Turk (MTurk) and directed to visit our website. Users are told they will be participating in a usability test of a graphical password system, but are not told about the priming. Users visiting the website with mobile devices are automatically detected and filtered out. Mobile devices are excluded from the study in order to normalize user screen sizes and the mode of interaction with the system. Our 640x480 images are sized to fit cleanly, without distortion or scrolling, on all but the smallest screens. Users then enter their MTurk ID number as a username and are randomly assigned to one of three groups: left-to-right (LTR), right-to-left (RTL), or control. Users in the LTR and RTL groups are given a drawing-the-curtain image presentation that begins revealing the image from the left side and the right side, respectively. Users in the control group are not primed.

Users then watch a short instruction video detailing how to create a password and log in using our graphical password system. Next, they create a practice password on a practice background image to familiarize themselves with the system. For consistency, this practice background image is revealed using the same type of curtain drawing method (i.e., LTR, RTL, or none) that they will be exposed to when they create their real password. Users are then instructed to select a password to log in with for this and other sessions on a specific background image primed according to their group.

Users then fill out a demographics questionnaire on their sex, age, first language, field of work or study, level of experience with computers, and level of experience with computer security. Users are then asked if they have seen their password’s background image prior to the study, and if they used a touch screen device for the study. Users are then prompted with their background image and asked to click the first object that drew their attention. Users then select from a drop down menu their strategy for creating their password, and are asked to provide further details in text. For users in a treatment group with a priming effect, we ask if they watched the entire curtain draw, or were distracted during the effect.

Users are then shown their background image again and asked to log in with their selected password. If users cannot successfully log in, they can reset their password and create a new one on the same background image and with the same priming effect. Upon a successful login, this session ends.

Session 2 (day 2–3). Session 2 takes place 24–48 hours after Session 1 for each user, in order to simulate self-reported frequency of logging into email accounts (Hayashi2011). Users were notified about the second session through the MTurk platform 24 hours after completing Session 1.

Users are directed to our website and start by entering their ID. After watching a video that instructs them to log in with their password from Session 1, they are prompted to log in and shown their background image. Users who have forgotten their passwords can reset them, which returns them to Session 1 to create a new password. They must then wait an additional 24 hours to attempt Session 2 a second time. Users who reset their password are placed in the same priming group and shown the same original background image. After users successfully log in with their password, Session 2 ends.

Session 3 (day 7–8). Session 3 takes place five days after Session 2 to simulate average reported time between logins for online banking platforms (Hayashi2011). Users were notified about this session through MTurk 5 days after completing Session 1. Users first enter their ID, then watch a short video explaining the session, and are then prompted to log in using their background image. After logging in successfully, users fill out an exit survey. The survey asks users if they used a touch screen for any sessions and if they recorded their password externally during the study. If they were in a group with a priming effect, they are also asked whether they noticed the effect was only used during creation. The exit survey also includes a System Usability Scale (SUS) survey to collect usability information about the system (brooke1996). Users who cannot successfully log in during this session are not allowed to reset and are given the exit survey to complete.
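For readers unfamiliar with SUS, the sketch below shows the standard scoring rule for the ten-item questionnaire: odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5 to give a score from 0 to 100. The function name `sus_score` is hypothetical, and the paper does not describe its own scoring code.

```python
def sus_score(responses):
    """Compute a System Usability Scale score from ten responses,
    each an integer from 1 (strongly disagree) to 5 (strongly agree)."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS needs ten responses, each between 1 and 5")
    total = 0
    for item, response in enumerate(responses, start=1):
        if item % 2 == 1:
            total += response - 1   # odd items are positively worded
        else:
            total += 5 - response   # even items are negatively worded
    return total * 2.5
```

A participant who answers 3 (neutral) to every item scores 50, which is a handy sanity check when processing survey exports.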

5.2. Participant Recruitment

All participants in our studies are recruited through Amazon’s Mechanical Turk platform. An advertisement was listed for the first session and users were notified of the subsequent sessions through the platform as they became available. Users could only complete Sessions 2 and 3 if they had completed the previous sessions. Users were paid $1.50 USD for Session 1, $0.25 USD for Session 2 and $0.75 USD for Session 3. Throughout the advertisements, consent forms, and system itself, participants are told they will be participating in a usability test of a graphical password system, but are not told about the priming. Our study and compensation structure were approved by the ethics board of our institute.

For the Grid image, Sessions 1-3 are run with 436, 188, and 124 participants respectively. For the Highway and Barn images, we conduct only Session 1, with 216 and 213 participants respectively, as our primary goal was security analyses for those images.

5.3. Demographics

Here we detail the self-reported demographic data collected in Session 1 for each image. For the Grid image, we had 436 participants (with 188 completing session 2 and 124 completing session 3). 293 (67.2%) of the users identified as male and 143 (32.8%) as female. 391 (89.7%) of our users reported English as their first language. 341 (65.4%) of the users were 35 years of age or younger. 102 (23.4%) of the users were students. 422 (96.8%) of the users reported their computer skill as being a 3 or above and 355 (81.4%) reported their computer security skill as 3 or above. We found similar demographic breakdowns in our studies of the Highway and Barn images (detailed demographic information can be found in Table 1), with the exception that the Barn group contained more students. We found no change to our results when we performed comparisons across demographic segments.

Participant Demographics

Grid
  Age                  Gender                  Computer Skills        Occupation
  <20    32 (7.3%)     Male    293 (67.2%)     1-2   14 (3.2%)        Student      102 (23.4%)
  20-25  47 (10.8%)    Female  143 (32.8%)     3-5   422 (96.8%)      Non-Student  334 (76.6%)
  25-30  114 (26.1%)   Language                Security Skills        Work/Study Major
  30-35  94 (21.6%)    EN      391 (89.7%)     1-2   81 (18.6%)       CS/IT        108 (24.8%)
  >35    151 (34.6%)   OTH     45 (10.3%)      3-5   355 (81.4%)      OTH          328 (75.2%)

Highway
  Age                  Gender                  Computer Skills        Occupation
  <20    20 (9.3%)     Male    143 (66.2%)     1-2   7 (3.2%)         Student      63 (29.2%)
  20-25  42 (19.4%)    Female  73 (33.8%)      3-5   209 (96.8%)      Non-Student  153 (70.8%)
  25-30  53 (24.5%)    Language                Security Skills        Work/Study Major
  30-35  33 (15.3%)    EN      197 (91.2%)     1-2   34 (15.7%)       CS/IT        55 (25.5%)
  >35    68 (31.5%)    OTH     19 (8.8%)       3-5   182 (84.3%)      OTH          161 (74.5%)

Barn
  Age                  Gender                  Computer Skills        Occupation
  <20    8 (3.8%)      Male    135 (63.4%)     1-2   17 (8.0%)        Student      95 (44.6%)
  20-25  44 (20.7%)    Female  78 (36.6%)      3-5   196 (92.0%)      Non-Student  119 (55.9%)
  25-30  74 (34.7%)    Language                Security Skills        Work/Study Major
  30-35  36 (16.9%)    EN      184 (86.4%)     1-2   34 (16.0%)       CS/IT        46 (21.6%)
  >35    51 (23.9%)    OTH     29 (13.6%)      3-5   179 (84.0%)      OTH          167 (78.4%)

Table 1. Demographic information for all studies over the three background images: Grid, Highway, and Barn.

6. Results

We report the results of our user studies with regard to point selection biasing, usability analyses, and security analyses.

6.1. Outliers

During analysis, we discovered one notable group of outliers who selected their click-points in a different manner than most participants. These users placed all of their click-points at the same location, or at locations very close to each other (i.e., within the error tolerance region). We include these users in our analysis as we find it likely that this behavior would be present in a real-world application of this system, similar to the poor password creation behaviors observed in text passwords.

6.2. Selection Point Biasing

We first attempt to replicate the statistical findings of the presentation effect demonstration (Thorpe2014). We examined whether the x-coordinates for each of the five click-points come from the same distribution, when RTL and LTR presentation groups are compared.

We extend the original paper’s tests by including a control group whose members create their passwords without any priming effect. For each background image, we compare the passwords generated by each presentation treatment group with the passwords generated by that image’s control group, yielding two sets of tests for each image: RTL vs. Control and LTR vs. Control. We let $x^{I}_{u,i}$ represent the x-coordinate of the $i$-th click-point selected by user $u$ on image $I$. Given this notation, and to be consistent with the original paper (Thorpe2014), we formulate a class of null hypotheses of the form:

$H_0$: the two samples $\{x^{I}_{u,i} : u \in C\}$ and $\{x^{I}_{u,i} : u \in G\}$ come from the same distribution.

Here $C$ refers to the control group and $G$ can be either priming treatment group (LTR or RTL). From the previous work (Thorpe2014), one expects that users’ first click-points (i.e., $i = 1$) will be biased towards the edge from which the curtain drawing began (e.g., right for the RTL group and left for the LTR group). We therefore test 5 null hypotheses for each primed treatment group against a control group for each image. For these tests we compare the click-points of all users who completed Session 1 of the study on a particular image. In order to be comparable with the results of the original paper, we begin by testing each of these hypotheses using a one-sided Mann-Whitney-U test (with $\alpha = 0.05$).
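To make the per-click-point comparison concrete, the following is a minimal sketch of a one-sided Mann-Whitney-U test on two samples of x-coordinates. It is not the authors' analysis code: the function name `mann_whitney_u_greater`, the normal approximation with tie correction (no continuity correction), and the sample coordinates are all assumptions made for illustration.

```python
import math
from collections import Counter


def mann_whitney_u_greater(treatment, control):
    """One-sided Mann-Whitney-U test via a tie-corrected normal approximation.

    Alternative hypothesis: values in `treatment` tend to be larger than
    values in `control` (e.g., x-coordinates pushed towards the right edge).
    Returns (U statistic of the treatment sample, one-sided p-value).
    """
    n1, n2 = len(treatment), len(control)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")

    # Assign midranks so that tied values share their average rank.
    counts = Counter(list(treatment) + list(control))
    midrank = {}
    next_rank = 1
    for value in sorted(counts):
        c = counts[value]
        midrank[value] = next_rank + (c - 1) / 2.0
        next_rank += c

    rank_sum = sum(midrank[v] for v in treatment)
    u = rank_sum - n1 * (n1 + 1) / 2.0

    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    tie_term = sum(c ** 3 - c for c in counts.values())
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # every value is identical, so there is no evidence of a shift

    z = (u - mean_u) / math.sqrt(var_u)
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)
    return u, p_value


# Hypothetical first-click x-coordinates (pixels) on a 640-pixel-wide image.
rtl_first_clicks = [500, 520, 540, 560]
control_first_clicks = [100, 120, 140, 160]
u_stat, p = mann_whitney_u_greater(rtl_first_clicks, control_first_clicks)
print(u_stat, p)
```

For an LTR group, where a bias towards smaller x-coordinates is expected, the same function can be called with the two arguments swapped.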

Figure 3. Each background image is divided into two equally sized bins for the Chi Square test.
Image: Test Pair | Click-point 1 | Click-point 2 | Click-point 3 | Click-point 4 | Click-point 5
Grid: RTL vs CTRL | 0.0624 (0.09) | 0.2321 (0.04) | 0.2794 (0.03) | 0.2121 (0.05) | 0.3303 (0.03)
Grid: LTR vs CTRL | 0.3442 (0.02) | 0.3125 (0.03) | 0.228 (0.04) | 0.0871 (0.08) | 0.1804 (0.05)
Barn: RTL vs CTRL | 0.6981 (0.04) | 0.8929 (0.10) | 0.9466 (0.13) | 0.8405 (0.08) | 0.8189 (0.07)
Barn: LTR vs CTRL | 0.7652 (0.06) | 0.5678 (0.01) | 0.227 (0.06) | 0.3604 (0.03) | 0.6944 (0.04)
Highway: RTL vs CTRL | 0.6289 (0.03) | 0.2855 (0.05) | 0.1918 (0.07) | 0.6501 (0.03) | 0.2525 (0.06)
Highway: LTR vs CTRL | 0.3524 (0.03) | 0.9089 (0.11) | 0.8513 (0.08) | 0.1915 (0.07) | 0.7584 (0.06)
Table 2. Mann-Whitney-U test results for different hypotheses with various control/treatment pairs and images. Values are p-values without multiple-test correction, with effect sizes in parentheses.
Image: Test Pair | Click-point 1 | Click-point 2 | Click-point 3 | Click-point 4 | Click-point 5
Grid: RTL vs CTRL | 0.0018 (0.19) | 0.0044 (0.17) | 0.0083 (0.16) | 0.0081 (0.16) | 0.0077 (0.16)
Grid: LTR vs CTRL | 0.2258 (0.07) | 0.8348 (0.01) | 0.0764 (0.10) | 0.0251 (0.13) | 0.7269 (0.02)
Barn: RTL vs CTRL | 0.212 (0.10) | 0.1032 (0.13) | 0.0325 (0.18) | 0.2158 (0.10) | 0.0719 (0.15)
Barn: LTR vs CTRL | 0.1436 (0.12) | 0.1019 (0.14) | 0.0096 (0.22) | 0.1413 (0.12) | 0.1436 (0.12)
Highway: RTL vs CTRL | 0.6032 (0.04) | 0.5607 (0.05) | 0.1025 (0.14) | Invalid | 0.2676 (0.10)
Highway: LTR vs CTRL | 0.1664 (0.11) | 0.2291 (0.10) | 0.2113 (0.10) | Invalid | 0.1323 (0.12)
Table 3. Chi Square test results for different hypotheses with various control/treatment pairs and images. Values are p-values without multiple-test correction, with effect sizes in parentheses.

For both the RTL vs. Control and LTR vs. Control test pairs on all three images (i.e., Grid, Highway, and Barn), we fail to reject any of our null hypotheses with the Mann-Whitney-U test (with or without a correction), suggesting that the presentation effect does not bias the x-coordinates of click-points in a statistically significant way on any of these images.

We also replicated the RTL vs. LTR test performed in other work (Thorpe2014), which suggested that the x-coordinates for the first click-points of the two opposing groups (i.e., RTL and LTR) were statistically different without multiple-testing correction, but not after a Bonferroni correction for the five tests. Similar to that work, we find a significant result on the first click-points between RTL and LTR, which becomes non-significant with a Bonferroni correction for the five tests.

To determine if the priming effects have an impact at a coarser level of granularity, we also test each hypothesis with a Chi Square test. For these tests, we divide each image into two equally sized bins that span the height of the image and half of the width (see Fig. 3), and record the distribution of points over the bins for the RTL, LTR and Control groups, for each click-point.
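The binning step can be sketched as follows. This is an illustrative implementation only: the function name `chi_square_two_bins`, the use of the phi coefficient as the effect size, the absence of a continuity correction, and the sample coordinates are assumptions, not details taken from the study. The 640-pixel width matches the image resolution reported for the attack alphabet.

```python
import math


def chi_square_two_bins(group_a, group_b, image_width=640):
    """Chi-square test on a 2x2 table of (group) x (left half / right half).

    `group_a` and `group_b` are lists of x-coordinates for one click-point
    index. Returns (chi2 statistic, p-value for 1 degree of freedom, phi).
    """
    half = image_width / 2.0
    table = [
        [sum(1 for x in group if x < half), sum(1 for x in group if x >= half)]
        for group in (group_a, group_b)
    ]
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected == 0:
                # An empty bin or group leaves the statistic undefined.
                raise ValueError("expected count of zero; test is not valid")
            chi2 += (table[i][j] - expected) ** 2 / expected

    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    phi = math.sqrt(chi2 / n)  # assumed effect-size measure for a 2x2 table
    return chi2, p_value, phi


# Hypothetical data: 30 of 40 primed users click in the left half,
# compared with 20 of 40 control users.
primed = [100] * 30 + [500] * 10
control = [100] * 20 + [500] * 20
print(chi_square_two_bins(primed, control))
```

Collapsing each click-point to a left/right bin discards fine-grained position, which is why this test can detect coarse shifts that a rank-based test on raw coordinates may miss.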

When applying the Chi Square test to our hypotheses for the Grid image, we fail to reject any null hypotheses for the LTR vs. Control test pair, but can reject all hypotheses for the RTL vs. Control test pair. After a Bonferroni correction, we can still reject the null hypothesis for the 1st click-points (uncorrected $p = 0.0018$ in Table 3, effect size 0.19) and 2nd click-points (uncorrected $p = 0.0044$ in Table 3, effect size 0.17). This suggests that the RTL priming effect was able to change the left/right distribution of the first two click-points in a coarse manner. When testing our hypotheses for the Highway and Barn images with the Chi Square test, we fail to reject any of the null hypotheses, suggesting priming had no impact on the distribution of click-points for these images.

While both statistical tests we employed can capture a change in distribution along the x axis of the image, neither captures the impact of priming on the security of the generated passwords. It is possible that the priming effects alter the security of the generated passwords without changing the distribution of x values in a way that is significant. This has motivated our security analyses below.

6.3. Security Analysis

We test the security of each group’s passwords against three classes of well-known purely automated click-order based attacks (vanOorschot2010). Each attack first creates an alphabet based on the background image’s resolution and error tolerance $T$ (see footnote 2). Then, the attack deploys a series of click-order heuristics to construct an attack dictionary. Our focus on these classes of attacks is motivated by how easy they are to mount, their minimal requirements for attackers, and their guessing ability, especially for relatively large dictionary sizes (see footnote 3).

Footnote 2: As our images have a resolution of 640x480 and a fixed error tolerance $T$, our alphabet includes 731 (x,y)-coordinates.

Footnote 3: Human-seeded attacks (Thorpe2007) could have been alternatives for our analyses. However, those attacks usually have comparable guessing ability for relatively large dictionary sizes while requiring more information to mount, such as the actual background image, the system error tolerance, and sample password data collected on the same background image. This motivates us to focus on purely automated click-order based attacks, which only require an image’s dimensions and the system error tolerance, both of which are fixed for a deployed PassPoints system.

There are three general classes of click-order based attacks. The LINE class of attacks attempts to crack passwords that form a horizontal or vertical line across the background image. The DIAG attack class guesses passwords with a dictionary of all possible straight lines, which are not necessarily vertical or horizontal. Note that for a fixed alphabet (i.e., fixed image and error tolerance), the DIAG dictionary always includes the LINE dictionary. The LOD attack class attempts to crack passwords with a dictionary composed of passwords where each click-point is within a particular distance from its predecessor and successor.

Each class of attacks has a relaxation parameter $r$ (different from the error tolerance $T$) controlling the extent to which each click-order pattern can be relaxed. The lower $r$ is, the more restrictive the pattern is. For example, LINE with $r = 0$ generates those passwords exactly following a straight horizontal or vertical line, whereas LINE with $r > 0$ allows two sequential click-points in a guessed password to deviate a maximum of $r$ pixels from a straight line. Letting $D_r$ be the dictionary of an attack with relaxation parameter $r$, one can easily observe that, for $r_1 \le r_2$, $D_{r_1} \subseteq D_{r_2}$. In our experiments, we have used all three classes of attacks while varying the relaxation parameter $r$.
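As a rough illustration of how the relaxation parameter widens a dictionary, the sketch below checks whether a five-click password matches a LINE-style pattern. It is a simplification under stated assumptions: the names `in_line_dictionary` and `fraction_cracked` and the parameter name `r` are hypothetical, the check works on raw pixel coordinates rather than on the discretized alphabet, it does not enforce any click direction, and it reads "deviation" as the difference between sequential click-points along the axis perpendicular to the line. The attacks in (vanOorschot2010) may differ in these details.

```python
def in_line_dictionary(points, r):
    """Return True if `points` follows a horizontal or vertical line.

    `points` is an ordered list of (x, y) click-points. With r = 0 the
    points must share exactly the same y (horizontal) or x (vertical);
    with r > 0, sequential points may drift by up to r pixels.
    """
    steps = list(zip(points, points[1:]))
    horizontal = all(abs(y2 - y1) <= r for (_, y1), (_, y2) in steps)
    vertical = all(abs(x2 - x1) <= r for (x1, _), (x2, _) in steps)
    return horizontal or vertical


def fraction_cracked(passwords, r):
    """Fraction of passwords that fall inside the LINE-style dictionary."""
    return sum(in_line_dictionary(p, r) for p in passwords) / len(passwords)


# Hypothetical passwords on a 640x480 image.
exact_line = [(10, 100), (50, 100), (90, 100), (130, 100), (170, 100)]
wobbly_line = [(10, 100), (50, 110), (90, 95), (130, 105), (170, 100)]
scattered = [(10, 10), (300, 400), (50, 200), (600, 30), (320, 240)]
passwords = [exact_line, wobbly_line, scattered]

for r in (0, 21, 42):
    print(r, fraction_cracked(passwords, r))
```

Because each threshold comparison only becomes easier to satisfy as `r` grows, any password accepted at a smaller `r` is also accepted at a larger one, which mirrors the nesting of dictionaries described above.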

Table 4 shows the percentage of guessed passwords for each group, image, and attack. The relaxation parameter has been varied over {0, 21, 42} in our experiments. For the Grid image, we observe that our primed groups exhibit lower attack resistance compared to the control group. There are a few exceptions, such as RTL against DIAG(0) and RTL against LOD(0). For the Highway image, both RTL and LTR demonstrate considerable security improvements (compared to the control group) across nearly all attacks; the one exception in Table 4 is LTR against DIAG(0). In some attacks, such as LINE(0), LOD(0), and LOD(21), the passwords in the primed groups are (almost) twice as secure. The Barn image exhibits an interesting security pattern when exposed to LTR and RTL: security is degraded for RTL while generally improving for LTR (with the exception of LOD(42), where security is almost comparable).

Group | LINE0 | LINE21 | LINE42 | DIAG0 | DIAG21 | DIAG42 | LOD0 | LOD21 | LOD42
Grid:CTL | 14.65% | 18.47% | 19.11% | 23.57% | 30.57% | 31.85% | 8.28% | 10.82% | 11.46%
Grid:RTL | 15.32% | 24.19% | 27.41% | 20.97% | 31.45% | 37.90% | 8.06% | 11.29% | 16.93%
Grid:LTR | 16.13% | 21.29% | 23.23% | 25.81% | 33.55% | 34.84% | 10.32% | 10.32% | 12.90%
Highway:CTL | 10.00% | 25.71% | 47.14% | 12.86% | 45.71% | 61.43% | 8.57% | 11.43% | 21.43%
Highway:RTL | 4.54% | 13.64% | 33.33% | 12.12% | 37.88% | 48.48% | 4.55% | 6.06% | 10.61%
Highway:LTR | 5.00% | 17.50% | 33.75% | 16.25% | 38.75% | 53.75% | 5.00% | 5.00% | 10.00%
Barn:CTL | 11.39% | 13.92% | 18.99% | 12.66% | 26.58% | 31.65% | 8.86% | 10.13% | 10.13%
Barn:RTL | 16.18% | 19.12% | 19.12% | 19.12% | 32.35% | 38.24% | 11.76% | 13.24% | 14.71%
Barn:LTR | 6.06% | 10.61% | 10.61% | 9.09% | 21.21% | 24.24% | 6.06% | 9.09% | 10.61%
Table 4. Percentage of passwords cracked with each attack class for relaxation parameter values $r$ = 0, 21, and 42. The blue and orange colors encode settings with stronger and weaker security, respectively, compared to the corresponding control group’s security.
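The "(almost) twice as secure" observation for the Highway image can be checked with a few lines of arithmetic on the Table 4 values. The snippet below is only a convenience sketch; the dictionary layout and the function name `control_to_primed_ratio` are illustrative choices, while the percentages are copied from the Highway rows of Table 4.

```python
# Percentage of passwords cracked, copied from the Highway rows of Table 4.
highway = {
    "CTL": {"LINE0": 10.00, "LOD0": 8.57, "LOD21": 11.43},
    "RTL": {"LINE0": 4.54, "LOD0": 4.55, "LOD21": 6.06},
    "LTR": {"LINE0": 5.00, "LOD0": 5.00, "LOD21": 5.00},
}


def control_to_primed_ratio(table, group, attack):
    """How many times more control passwords were cracked than primed ones."""
    return table["CTL"][attack] / table[group][attack]


for group in ("RTL", "LTR"):
    for attack in ("LINE0", "LOD0", "LOD21"):
        ratio = control_to_primed_ratio(highway, group, attack)
        print(f"{group} {attack}: {ratio:.2f}x")
```

All six ratios fall between roughly 1.7 and 2.3, which is consistent with the description of the primed groups as about twice as resistant to these three attacks.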

6.4. Usability

We collected user-reported usability data from those users who completed all three sessions of the study for the Grid image. We examine these data for all three groups (RTL, LTR, and Control) to determine if the priming techniques impact the perceived usability of the system. For these comparisons, we combine the RTL and LTR groups into a single Primed group, and compare their usability results to those of the Control group. We also compare login times for each session, and memorability metrics such as the number of password resets and incorrect login attempts.

6.4.1. System Usability Scale

To compare usability between these groups, we use the System Usability Scale (SUS) (brooke1996). The SUS is a survey with 10 short Likert Scale questions, and asks the user to score them from one to five, where one is strongly disagree and five is strongly agree.

The tone of the questions alternates: for odd-numbered questions, a score of 5 is an indicator of good usability whereas for even-numbered questions, 5 is an indicator of bad usability. This tone alternation discourages users from answering all questions with 1 or 5, and aids in detecting careless all-one or all-five answers to the questions. The SUS score for each user is calculated by subtracting 1 from each odd-numbered question response, subtracting each even-numbered question response from 5, summing all ten resulting values, and then multiplying by 2.5. This yields a score between 0 and 100 for each user. Scores above 68 indicate above average usability.
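The scoring rule above translates directly into code. The following is a small sketch with an assumed function name, `sus_score`, that takes the ten responses in question order.

```python
def sus_score(responses):
    """Compute a System Usability Scale score from ten 1-5 Likert responses.

    `responses[0]` is the answer to question 1, `responses[9]` to question 10.
    """
    if len(responses) != 10 or any(r not in range(1, 6) for r in responses):
        raise ValueError("expected ten responses, each between 1 and 5")

    total = 0
    for number, response in enumerate(responses, start=1):
        if number % 2 == 1:
            total += response - 1  # odd-numbered: 5 indicates good usability
        else:
            total += 5 - response  # even-numbered: 5 indicates bad usability
    return total * 2.5


print(sus_score([5, 1] * 5))  # best possible answers: 100.0
print(sus_score([5] * 10))    # careless all-five answers: 50.0
print(sus_score([1] * 10))    # careless all-one answers: 50.0
```

The last two calls show why the alternating tone helps: a respondent who answers every question identically lands at the midpoint rather than at either extreme.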

124 users completed Session 3 using the Grid image and completed the exit survey. 44 (35%) of these users were in the control group and 80 (65%) were in one of the two primed treatment groups. The control group reported an average SUS score of 75.91 and the primed treatment group reported an average of 76.56. An independent t-test fails to reject the null hypothesis that both samples come from the same distribution. This suggests that the priming techniques do not have a negative impact on user perception of usability.

The exit survey also asks users to report if they recorded their passwords at any point during the study. Our implementation attempts to mitigate password recording with two approaches: never displaying a user’s password to them, in part or in full, and providing no visual feedback as to the location of click-points during creation. In the control group, 5 (11.4%) of 44 users reported recording their passwords. In the treatment group, 6 (7.5%) of 80 users reported recording. This result suggests that password recording is no more prevalent in graphical passwords than in traditional text passwords. Komanduri et al. (Komanduri2011) found self-reported password recording rates of 17-50% across 5 distinct 1000-participant groups who were given different password creation policies to follow. The lowest recording rate, 17%, was observed in the group with the least stringent password policy, where users were only required to create passwords 8 characters long. Groups with more complex policies, where requirements included adding numbers or special characters to a user’s password or increased length, experienced notably higher rates of recording. The recording rate in graphical passwords was therefore observed to be comparable to text passwords with simple policies, and lower than that of text passwords with complex policies.

Users in treatment groups were also asked if they noticed that the priming effect was present during creation, but not login time. Users who answered yes were asked to rate the following Likert Scale statement on a scale of one (strongly disagree) to five (strongly agree): “I found the image revealing annoying”. 31 (38.8%) of 80 users reported noticing the effect during creation, but not during logins. Of these 31 users, 14 (45.2%) of the users agreed or strongly agreed with the statement, while 9 (29.0%) of the users disagreed or strongly disagreed and 8 (25.8%) of the users responded neutrally.

6.4.2. Login Time

To compare users’ login times in the Primed vs. Control groups, we perform a t-test between each group’s login times for each session. We record their login time as starting from when the image is first displayed until they enter their password correctly (i.e., the time for successful login). For users who reset their password during the study, login time is recorded for their latest password. We find no statistically significant difference in login times between the two groups in Session 1 (p=0.839). Likewise, we find no statistically significant difference in their login times during Session 2 (p=0.702) or Session 3 (p=0.330). Furthermore, we found no statistically significant differences between the login times of the RTL and LTR groups, or between Control and either RTL or LTR in any session.

6.4.3. Memorability

We compare password memorability between the Primed and Control groups by considering the number of password resets and incorrect login attempts for each group.

We count the number of users who reset their passwords during the study and those who did not in both the Primed and Control groups. We compare the two groups using Fisher’s exact test where the expected values come from the Control group, and find no significant difference. We also perform this test on RTL vs. Control and LTR vs. Control separately and find no significant results (with p=1.000 and p=0.172 respectively). This suggests that neither priming effect caused users to forget their passwords more often than users in the control group.
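For readers unfamiliar with the test, a two-sided Fisher's exact test on a 2x2 table of reset/no-reset counts can be sketched with the standard library alone. The function name `fisher_exact_two_sided` and the counts in the example are hypothetical; the study's actual reset counts are not reproduced here, and the two-sided rule used (summing all tables no more probable than the observed one) is one common convention that is assumed for this sketch.

```python
import math


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows could be groups (e.g., Primed, Control) and columns outcomes
    (e.g., reset password, did not reset). Returns the p-value.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        # when all row and column totals are held fixed.
        return math.comb(row1, x) * math.comb(row2, col1 - x) / math.comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Sum every table that is no more probable than the observed one;
    # the small tolerance guards against floating-point ties.
    total = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )
    return min(1.0, total)


# Hypothetical counts: 8 of 10 users in one group reset their password,
# compared with 1 of 6 users in the other group.
print(fisher_exact_two_sided(8, 2, 1, 5))
```

An exact test is a natural fit when some cells are small, as is typical for reset counts, because it does not rely on a large-sample approximation.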

Users are asked to log in with their passwords three times during the study, once per session. During these logins, we record the number of times a user enters their password incorrectly. We compare the average number of times a user must re-enter their password before logging in successfully between the Primed and Control groups. For these tests, we consider each login task in each session of the study separately. We compare the average number of login attempts needed using a Mann-Whitney-U test and find no significant difference between the Primed and Control groups during the first, second, or third session login tasks. We also perform this test between all pairs of groups separately and find no significant results. Again, this suggests that the priming effects do not have a significant effect on the memorability of a selected password.

7. Limitations

Although the use of MTurk comes with many advantages, it may also have some limitations. While studies performed on MTurk can have comparable results to those performed in lab settings (crump2013), users might have less incentive to perform as well as they would in a natural setting. Even if this were the case, our results would simulate a worst-case scenario wherein users do not pay attention to the image presentations used in our study, in which case our results under-report the effect of the image presentations we study. However, such issues (if present) would apply equally to all conditions studied, and thus should not impact the validity of the study’s comparisons. This means that we should be cautious in interpreting the absolute values of our results, and instead focus on the comparisons.

For the Grid image, we retained 42% of users by the end of Session 3—a retention rate consistent with similar studies performed on MTurk (e.g., 55% (Shay12) and 42-51% (Joudaki18)). This drop-off may have inflated the usability results, if users who disliked the system chose to drop out. However, such inflation should apply equally to all groups, so it should not impact comparisons.

The duration of our study was approximately one week (chosen to approximate the login frequency of online bank accounts (Hayashi2011)); therefore, our memorability findings should not be generalized to authentication systems with long delays between logins (e.g., fallback/secondary authentication). However, the duration of the study does not affect the results on password choice, which is the main goal of image presentations in this work.

8. Discussion

In terms of usability, our findings indicate that drawing the curtain has no negative impact. There were no statistically significant differences in the overall perception of the system between the control group and the treatment groups (as measured by SUS scores). Additionally, the priming methods we tested were not found to have any significant impact on login times or on memorability as observed through password resets and incorrect login attempts.

In terms of security, the click-order attack results revealed a complex interplay between the priming techniques employed, the background images, and the resulting graphical passwords that users choose. In particular, our low-complexity image (Highway) fared much better against these attacks for the primed groups than for the control group. The primed groups had mixed results for the high-complexity image (Barn): RTL fared worse, but LTR fared better than the control group. These results suggest that the presentation effect is indeed image dependent, and furthermore that only some image presentations produce an effect on some images.

In terms of the impact of the image presentations on individual click-point selections, when we compared each primed group against a control group, we found no significant changes in the click-point distribution with a Mann-Whitney-U test (for all three images studied), in contrast to the results of a small-scale study on the Grid image (Thorpe2014). While it is possible that the smaller study’s results were caused by Type I error, it is also possible that our changes to the methodology, namely our use of MTurk, could have hindered replication of this particular statistic.

When we compare the click-points (for all 5 selections) of the RTL and LTR groups for each image (Figure 4), we notice that for the Highway (Figure 4d-f) and Barn (Figure 4g-i) images, the click-point locations of the primed groups do not vary greatly from the control. While different locations are selected, or selected more frequently, across the passwords for those groups, the overall effect is subtle, as demonstrated by the low brightness of the hot-spots in Figure 4. This suggests that changes in security against click-order attacks might be at least partly the result of changes in the ordering of click-points rather than altered hot-spot locations.

Figure 4. Heat maps generated from the click-points of each group when superimposed onto the background image used during password creation. Panels: (a) Grid LTR, (b) Grid Control, (c) Grid RTL; (d) Highway LTR, (e) Highway Control, (f) Highway RTL; (g) Barn LTR, (h) Barn Control, (i) Barn RTL.

In interpreting the sum of these results, it is important to consider that the purpose of the statistical tests is to detect that the image presentations have an impact on user choice of individual click-points. The statistical tests simply test to see whether, for each click taken in isolation, the distribution has changed. However, the click-order attacks reveal changes to the percentage of easily guessed entire passwords (the ordered set of 5 click-points). This is arguably a more meaningful test of a change in behaviour, especially in relation to security outcomes.

We stress that our results for drawing the curtain do not necessarily apply to other priming techniques. For example, Katsini et al. (katsini2018) found that image presentations based on saliency maps produced stronger passwords, especially when the image presentations were tailored to the user’s cognitive style.

Without a controlled lab environment, it is difficult to know whether users have actually viewed the image presentation. As such, our results may underestimate the impact of image presentations. In general, identifying methods to ensure user engagement during priming will be important for the future success of priming-based approaches.

9. Concluding Remarks and Future Directions

We performed a large-scale study on drawing-the-curtain priming techniques on PassPoints. Our findings include: (i) These priming techniques do not impact usability, and (ii) There are security benefits offered by the priming techniques employed, but these security benefits are dependent on both the background image and priming technique used. The results indicate that these priming techniques need to be carefully designed.

We found that drawing-the-curtain priming effects can improve the security of passwords selected on background images with highly concentrated saliency (i.e., in the compact cluster). This may assist in increasing the number and types of images that can be safely used for graphical password backgrounds. Future work should also seek to develop and test different image priming techniques. In the realm of curtain drawing, top-to-bottom, bottom-to-top, and oblique curtain draws offer clear extensions to this class of priming methods.

Future work might also focus on determining a method to automatically select or produce priming techniques that are tailored to a specific background image, and consider all possible security implications of the image presentation (i.e., hot-spots in the click-point distribution and click-order patterns). Examining structural properties of background images as well as saliency maps, especially those generated by saliency predictors, may be a useful starting point.

Automatically generated priming techniques should also be carefully constructed in order to resist attackers with knowledge of both the image and the applied priming technique. There are two general approaches to resist such attacks. The first is to apply priming such that it increases the entropy of the password space, rather than simply biasing users to select points in a different ordering or location. The second is to apply a non-deterministic priming technique that is selected from a large pool of candidates.

References