When cybercriminals compromise a user credential database and release its contents into the public arena, a number of different interested parties might seek to obtain and use the data it contains, with varying goals in mind. These might include, for instance, other groups of cybercriminals seeking to employ the data in credential stuffing attacks, and security researchers seeking to understand user password choice on the system concerned [19, 14, 18]. In particular, the latter group may be concerned with the password composition policy the passwords in the database were created under, in order to better understand how these rules around user password creation affect the distribution of user password choices.
Security researchers may find themselves confounded in this endeavour, however, because when the breached user credential database is released to the public, information about the password composition policy in place at the time of the breach is often not included. This could be because the party behind the breach does not think it relevant, wishes to keep their methods as secret as possible, or never sought this information out in the first place—after all, the password composition policy is of comparatively little interest to malicious actors seeking to directly employ the credentials in the database to criminal ends. The only other party known to have this information is the organisation that was the victim of the data breach in the first place, who by this point may be unable or unwilling to disclose any information regarding their security practices. Reasons for this might include, for example:
- The organisation may have ceased to exist entirely by the time the research in question is conducted. There are several real-world examples of this, such as the now-defunct Christian dating site singles.org, which ceased to exist some time after 2009, when its entire user credential database was compromised in plaintext.
- The organisation might be understandably reluctant to disclose any information regarding its security practices, for fear of being targeted further or of incriminating itself by confessing to having taken inadequate measures to safeguard user data. This is especially the case in Europe, where tightening legislation around data protection [5] might make the latter point of particular concern.
If we cannot obtain a description of the password composition policy from any of the organisations involved in the breach, this information has been lost in disclosure—that is, lost somewhere in the process of the transfer of data between parties. We are therefore forced to turn to the data that we do have to attempt to infer as much of that lost information as we can.
There is no shortage of breached user credential databases available online. Arguably the most well-known of these, the RockYou set [4], like many others (e.g. the Yahoo [10] or 000webhost [15] sets), contains passwords that do not comply with the password composition policy in place when the breach happened (see Tables I and II). Reasons for this “noise” vary, but include:
- Multiple password composition policies per dump: the RockYou set, for example, is an aggregate made up of at least two tables, one containing passwords to the main web application and one containing passwords used to log in to “partner services” (e.g. MySpace), which may enforce different policies. Passwords created under old policies may also be present. RockYou, for instance, changed their policy after their data breach in 2009 from a minimum length of 5 characters to a stronger policy [13, 7]. In this case, our methodology gives the password composition policy that the majority of passwords were created under, though there is scope for improving upon this in future work (see Section VI).
- Formatting errors: when the raw data is being processed by the exfiltrating party, errors may be introduced if their data processing scripts are not robust. For example, passwords containing spaces may be read as two separate data points.
- Intentional padding: if cybercriminals initially offer the data for sale, the price that they are able to obtain is often contingent on the number of records it contains. It is therefore possible that the dataset may be intentionally padded with extra records, some of which might contain non-compliant passwords.
| Dataset | Compliant passwords | Non-compliant passwords (% of total) |
|---|---|---|
| RockYou | 32524461 | 78587 (0.24%) |
| Yahoo | 444942 | 8550 (1.89%) |
| 000webhost | 14936872 | 334336 (2.19%) |
| LinkedIn | 172409689 | 18549 (0.01%) |
With “noisy” data like this, we cannot, for example, simply check for the shortest password in the database to determine the minimum password length constraint specified by the policy. In fact, the authors of one published work mention that the presence of “non-password artifacts” in the RockYou dataset factored into their choice of research methods, at least in part due to the difficulty of filtering these out. This motivates us to search for a simple, easy-to-implement method to infer password composition policy rules from a password dataset, which would make filtering out at least some of these artifacts trivial. The remainder of this work outlines an approach that we have found success with.
We make the following concrete contributions in this work:
(i) for the first time, we draw attention to the problem of “noise” in publicly-available breached password datasets in the form of passwords that do not comply with the password composition policy in place when the breach occurred;
(ii) we suggest an easy-to-implement approach to filtering out this noise by converting the problem to one of outlier detection, without consulting any organisation involved in the breach;
(iii) we make pol-infer [11], the tool used to produce the data and visualisations in our results (Section IV and Section V), available for download at https://sr-lab.github.io/pol-infer/.
We have introduced and motivated the work in this Section I. We describe related work in Section II. In Section III we describe our approach in detail, showing the results we are able to obtain from the four password datasets shown in Table II in Section IV. In Section V we apply our methodology to datasets created to simulate both intentional padding and processing with error-prone data processing scripts. We conclude in Section VI, discussing the limitations of our approach and potential future work.
II Related Work
We are not aware of any existing published work that explores the automation of password composition policy inference from large datasets. Previous research has involved determining the password composition policies used by active services. A study by Florêncio and Herley [7] gathered password composition policy information by creating an account on the service, where possible, and performing web searches otherwise. This study was later replicated by Mayer et al. [13]. Golla and Dürmuth [8] make extensive use of password data dumps where the password composition policy is known.
Our approach is applicable to any numerically-typed password attribute, that is, any function that maps a password to a number by extracting some property of it (e.g. length). By default, pol-infer supports the password attributes in Table III, which are sufficient to capture the policies used in the study by Shay et al. [17], with the exception of the dictionary check in the comprehensive8 policy, which cannot be expressed as an attribute of this type.
| Attribute | Description |
|---|---|
| length | The number of characters in the password (i.e. its length). |
| words | The number of words in the password. We define “words” in the same way as in [17], as “letter sequences separated by a nonletter sequence”. |
| lowers | The number of lowercase letters in the password. |
| uppers | The number of uppercase letters in the password. |
| digits | The number of digits in the password. |
| symbols | The number of non-alphanumeric characters in the password. |
| classes | The number of character classes in the password. We recognise the four character classes of the popular LUDS scheme: lowercase, uppercase, digits and symbols. |
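Each attribute in Table III can be written as a small function from a password to a non-negative integer. The Python sketch below is one possible rendering and is not taken from pol-infer itself: the function names simply mirror the attribute names, and it assumes ASCII character classes, so any character that is not an ASCII letter or digit is counted as a symbol.

```python
import re
import string


def length(pw: str) -> int:
    """Number of characters in the password."""
    return len(pw)


def words(pw: str) -> int:
    """Number of letter sequences separated by non-letter sequences."""
    return len(re.findall(r"[A-Za-z]+", pw))


def lowers(pw: str) -> int:
    """Number of ASCII lowercase letters."""
    return sum(c in string.ascii_lowercase for c in pw)


def uppers(pw: str) -> int:
    """Number of ASCII uppercase letters."""
    return sum(c in string.ascii_uppercase for c in pw)


def digits(pw: str) -> int:
    """Number of ASCII digits."""
    return sum(c in string.digits for c in pw)


def symbols(pw: str) -> int:
    """Number of characters that are not ASCII letters or digits (assumption)."""
    alphanumeric = string.ascii_letters + string.digits
    return sum(c not in alphanumeric for c in pw)


def classes(pw: str) -> int:
    """Number of LUDS classes (lower, upper, digit, symbol) present."""
    return sum(1 for f in (lowers, uppers, digits, symbols) if f(pw) > 0)


# Hypothetical registry mapping attribute names to extractor functions.
ATTRIBUTES = {
    "length": length, "words": words, "lowers": lowers, "uppers": uppers,
    "digits": digits, "symbols": symbols, "classes": classes,
}
```

Because every attribute has the same shape, an inference routine can be written once and applied to any entry in `ATTRIBUTES`.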
For instance, let us suppose we wish to infer the minimum length constraint specified by the policy under which the 000webhost set [15] was created (that is, the attribute of interest is length). Previous research [8] has established what this minimum is, and yet the data in Table IV would seem to contradict it: passwords shorter than the established minimum are present in the data.
It is readily apparent how the data in Table IV may be used to determine the minimum length constraint in the 000webhost policy. One value in the table is a clear outlier, so we now have an outlier detection problem.
We can infer the minimum password length enforced by the password composition policy under which this data was created by looking for the outlying “sudden increase” across successive lengths and taking the length at which it occurs.
For the 000webhost data, this gives us the correct answer. By examining the number of digits in a password, as opposed to password length (that is to say, taking digits as the attribute), we are also able to determine that the 000webhost policy demands that passwords contain at least one digit (see Section IV).
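The idea of treating minimum-constraint inference as outlier detection can be sketched in a few lines of Python. The statistic used here is an assumption made for illustration and is not necessarily the one pol-infer computes: for each attribute value, it divides the number of passwords with that value by the number of passwords with any smaller value, and reports the value with the largest such jump. The function name `infer_minimum` and the default `threshold` of 2.0 are likewise hypothetical.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def infer_minimum(
    passwords: Iterable[str],
    attribute: Callable[[str], int],
    threshold: float = 2.0,  # hypothetical cutoff, not a value from the paper
) -> Optional[int]:
    """Return the attribute value showing the largest sudden increase in
    frequency, or None if no increase exceeds the threshold."""
    counts = Counter(attribute(pw) for pw in passwords)

    best_value: Optional[int] = None
    best_ratio = threshold
    seen_below = 0  # passwords whose attribute value is smaller than v

    for v in sorted(counts):
        if seen_below > 0:
            # Assumed jump statistic: frequency at v relative to everything below v.
            ratio = counts[v] / seen_below
            if ratio > best_ratio:
                best_value, best_ratio = v, ratio
        seen_below += counts[v]

    return best_value
```

With a handful of short "noise" passwords followed by a large mass at some length, the jump at that length dominates and is returned as the inferred minimum. When frequencies only decline as the attribute grows (as with uppercase letters in a dataset with no uppercase requirement), no jump exceeds the threshold and the sketch returns `None`, meaning no constraint was inferred.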
By setting a lower threshold on this measure, we are able to specify a cutoff point below which we assume there is no constraint in place on the attribute in question. We have found success using a fixed default value for this threshold. For example, consider that the 000webhost policy does not demand that any uppercase letters be present in passwords.
As no value in Table V is outlying above the default cutoff point, we conclude that there was likely no constraint on the minimum number of uppercase letters in the password policy when the dataset was created.
IV Results: Real Data
We present a set of results demonstrating the success of our approach when used to infer the minimum password length specified by the policy under which four different datasets were created.
- LinkedIn: breached from the professional social networking site of the same name circa 2012. In 2016, the breach was revealed to be much larger than was initially made public [2]. Unsalted SHA-1 password hashes were extracted, a portion of which have since been cracked. It is these cracked passwords we use in this work. The policy in place at the time of the breach enforced a minimum length of 6 characters with no other requirements [13]. Contains 172,428,238 passwords.
IV-A The RockYou Set (2009)
IV-B The Yahoo Set (2012)
The outlying point in Figure 2 indicates the minimum length enforced by the password composition policy under which most of the passwords in the set were created. This aligns with existing literature [13].
IV-B1 Inferring the Absence of Constraints
IV-C The 000webhost Set (2015)
Previous research has established that the majority of the 000webhost set [15] was created under a policy enforcing a minimum length, with the additional requirement that passwords must contain at least one digit [8].
The outlying point in Figure 4 indicates the minimum length enforced by the password composition policy under which most of the passwords in the set were created. This aligns with existing literature [8].
The outlying point in Figure 5 indicates that the password composition policy under which most of the passwords in the set were created requires at least 1 digit in passwords.
IV-D The LinkedIn Set (2016)
V Results: Synthetic Data
In order to simulate the effect of some of the circumstances mentioned in Section I that could potentially create non-compliant “noise” in real-world password datasets, we created the following synthetic datasets:
- 2word12_linkedin_padded: the LinkedIn dataset [2] filtered according to a 2word12 policy (at least 12 characters long, at least 2 letter sequences separated by a non-letter sequence) to leave 1,511,786 passwords. This has then been combined with the singles.org dataset (16,248 passwords), the elitehacker dataset (1000 passwords), the hak5 dataset (2987 passwords), and the faithwriters dataset (9709 passwords). This is designed to simulate intentional padding of a dataset created under one policy with several other smaller datasets in order to increase its resale value.
- 2class8_linkedin_errors: the LinkedIn dataset [2] filtered according to a 2class8 policy (at least 8 characters long, at least 2 character classes present from lowercase, uppercase, digits and symbols) to leave 65,271,156 passwords. Every password in this dataset containing either a space or a comma has then been split into two or more separate strings along these tokens, leading to the creation of 404,547 additional records. This simulates the type of formatting error that might be introduced by processing scripts after the dataset has been exfiltrated.
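The construction of a dataset like 2class8_linkedin_errors can be sketched as follows. This is an illustrative Python sketch rather than the script used to build the dataset: the function names are hypothetical, character classes are assumed to be ASCII-based, and it assumes that a split password is replaced by its non-empty fragments (the text above does not say whether empty fragments or the original string were retained).

```python
import re
import string
from typing import Iterable, List


def class_count(pw: str) -> int:
    """Number of LUDS classes present, using ASCII classes (assumption)."""
    alphanumeric = string.ascii_letters + string.digits
    return sum([
        any(c in string.ascii_lowercase for c in pw),
        any(c in string.ascii_uppercase for c in pw),
        any(c in string.digits for c in pw),
        any(c not in alphanumeric for c in pw),
    ])


def is_2class8(pw: str) -> bool:
    """At least 8 characters and at least 2 character classes."""
    return len(pw) >= 8 and class_count(pw) >= 2


def simulate_formatting_errors(passwords: Iterable[str]) -> List[str]:
    """Keep 2class8-compliant passwords, then split any that contain a
    space or comma, mimicking a fragile delimiter-based parser."""
    records: List[str] = []
    for pw in passwords:
        if not is_2class8(pw):
            continue
        if " " in pw or "," in pw:
            # Assumption: empty fragments are dropped.
            records.extend(part for part in re.split(r"[ ,]", pw) if part)
        else:
            records.append(pw)
    return records
```

For example, `simulate_formatting_errors(["password1", "hello world 99"])` keeps `password1` unchanged and replaces the second entry with three short fragments, none of which complies with 2class8. These fragments are the kind of noise the inference step has to look past.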
V-A Intentional Padding
Figure 7 and Table VI show the use of our methodology to recover the original password composition policy of 2word12_linkedin_padded (2word12). The outlying points give us a minimum length and word count of 12 and 2 respectively.
V-B Formatting Errors
Figure 8 and Table VII show the use of our methodology to recover the original password composition policy of 2class8_linkedin_errors (2class8). The outlying points give us a minimum length and class count of 8 and 2 respectively.
In this work, we have demonstrated a simple, easy-to-implement methodology for inferring the password composition policy under which a password data dump was created without the need to interact with any of the parties involved in its disclosure. Once we have done this, we are able to trivially filter out non-compliant passwords if we so wish. We make pol-infer, the tool implementing this methodology that we used to produce the results in Sections IV and V, freely available [11]. We show that results obtained by this tool agree with existing literature on several real-world password datasets, and that it is effective on datasets generated to mimic those that might arise as a result of intentional padding or buggy data processing.
While our approach is capable of approximately inferring password composition policies that place constraints on specific password attributes, it cannot offer a guarantee that the inferred policy is accurate or complete. As an example of a password composition policy rule that would be very difficult to infer, consider a rule that limits password length to a maximum of 1024 characters. As very few user-chosen passwords would be in violation of this rule even in its absence, its impact on user password choice would be very limited, making its inference very difficult.
Where the time and date of account creation are available in password data dumps, it may be possible to detect with some accuracy the date and time of any password composition policy changes, offering new insight into the organisation’s internal security practices. This may require pol-infer to become more modular, acting as a framework capable of hosting different inference algorithms. Work on pol-infer is planned to make policy inference more automated and comprehensive (e.g. inference of dictionary checks), with an option to generate password composition policy names in the style used by Shay et al. [17]. We plan to make use of pol-infer and the methodology we propose in this work to help prepare password data for use in research into other aspects of password security, such as formally verified password composition policy enforcement software [6].
[1] (2012-01) Revisiting defenses against large-scale online password guessing attacks. IEEE Transactions on Dependable and Secure Computing 9 (1), pp. 128–141.
[2] (2016-05) Check if your linkedin account was hacked - wired uk. https://www.wired.co.uk/article/linkedin-data-breach-find-out-included (Accessed on 07/26/2019)
[3] (2009-07) Security gurus 0wned by black hats. https://news.softpedia.com/news/Security-Gurus-0wned-by-Black-Hats-117934.shtml (Accessed on 05/10/2019)
[4] (2009-12) RockYou hack: from bad to worse - techcrunch. https://techcrunch.com/2009/12/14/rockyou-hack-security-myspace-facebook-passwords/ (Accessed on 04/10/2019)
[5] (2016) Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation). Official Journal of the European Union 59, pp. 1–88.
[6] (2017) Certified password quality - A case study using Coq and Linux pluggable authentication modules. In Integrated Formal Methods - 13th International Conference, IFM 2017, Turin, Italy, September 20-22, 2017, Proceedings, pp. 407–421.
[7] (2010) Where do security policies come from?. In Proceedings of the Sixth Symposium on Usable Privacy and Security, SOUPS ’10, New York, NY, USA, pp. 10:1–10:14.
[8] (2018) On the accuracy of password strength meters. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, CCS ’18, New York, NY, USA, pp. 1567–1582.
[9] (2010-08) Researcher creates clearinghouse of 14 million hacked passwords. https://www.forbes.com/sites/andygreenberg/2010/08/26/researcher-creates-clearinghouse-of-14-million-hacked-passwords/#7bacb64318fd (Accessed on 05/10/2019)
[10] (2012-07) Yahoo hacked, 450,000 passwords posted online - cnn. https://edition.cnn.com/2012/07/12/tech/web/yahoo-users-hacked (Accessed on 04/10/2019)
[11] (2019-04) Sr-lab/pol-infer: inferring password composition policies from breached user credential databases. https://github.com/sr-lab/pol-infer (Accessed on 04/12/2019)
[12] (2012-05) Guess again (and again and again): measuring password strength by simulating password-cracking algorithms. In 2012 IEEE Symposium on Security and Privacy, pp. 523–537.
[13] (2017) A second look at password composition policies in the wild: comparing samples from 2010 and 2016. In Thirteenth Symposium on Usable Privacy and Security (SOUPS 2017), Santa Clara, CA, pp. 13–28.
[14] (2013) Measuring password guessability for an entire university. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, CCS ’13, New York, NY, USA, pp. 173–186.
[15] (2015-10) 000webhost hacked, 13 million customers exposed - zdnet. https://www.zdnet.com/article/000webhost-hacked-13-million-customers-exposed/ (Accessed on 04/10/2019)
[16] (2009-02) Exposed web site a reminder for use of multiple passwords - network world. https://www.networkworld.com/article/2263760/exposed-web-site-a-reminder-for-use-of-multiple-passwords.html (Accessed on 07/25/2019)
[17] (2016-05) Designing password policies for strength and usability. ACM Trans. Inf. Syst. Secur. 18 (4), pp. 13:1–13:34.
[18] (2016) Do users’ perceptions of password security match reality?. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI ’16, New York, NY, USA, pp. 3748–3760.
[19] (2010) Testing metrics for password creation policies by attacking large sets of revealed passwords. In Proceedings of the 17th ACM Conference on Computer and Communications Security, CCS ’10, New York, NY, USA, pp. 162–175.