Towards a Rigorous Statistical Analysis of Empirical Password Datasets

by   Jeremiah Blocki, et al.

In this paper we consider the following problem: given N independent samples from an unknown distribution 𝒫 over passwords pwd_1, pwd_2, …, can we generate high-confidence upper and lower bounds on the guessing curve λ_G ≐ ∑_{i=1}^{G} p_i, where p_i = Pr[pwd_i] and the passwords are ordered such that p_i ≥ p_{i+1}? Intuitively, λ_G represents the probability that an attacker who knows the distribution 𝒫 can guess a random password pwd ← 𝒫 within G guesses. Understanding how λ_G increases with the number of guesses G can help quantify the damage of a password cracking attack and inform password policies. Despite an abundance of large (breached) password datasets, upper and lower bounding λ_G remains a challenging problem. We introduce several statistical techniques to derive tighter upper and lower bounds on the guessing curve λ_G which hold with high confidence. We apply our techniques to analyze 9 large password datasets, finding that our new lower bounds dramatically improve upon prior work. Our empirical analysis shows that even state-of-the-art password cracking models are significantly less guess efficient than an attacker who knows the distribution. When G is not too large, we find that our upper and lower bounds on λ_G are both very close to the empirical distribution, which justifies the use of the empirical distribution in such settings, i.e., when G ≪ N the empirical distribution closely approximates λ_G. The analysis also highlights regions of the curve where we can, with high confidence, conclude that the empirical distribution significantly overestimates λ_G. Our new statistical techniques yield substantially tighter upper and lower bounds on λ_G, though there are still regions of the curve where the best upper and lower bounds diverge significantly.
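To make the definition of the guessing curve concrete, the following is a minimal sketch of the empirical guessing curve computed from a sample, i.e., the curve obtained by substituting observed frequencies for the unknown probabilities p_i. It is not the paper's bounding technique; the function name `empirical_guessing_curve` and the toy sample are assumptions made for illustration only.

```python
from collections import Counter
from itertools import accumulate


def empirical_guessing_curve(samples):
    """Return the empirical guessing curve as a list.

    Entry G-1 is the fraction of the sample covered by the G most
    frequent passwords, i.e., an empirical stand-in for lambda_G.
    (Hypothetical helper, not taken from the paper.)
    """
    n = len(samples)
    if n == 0:
        return []
    # Observed counts, ordered from most to least frequent (p_i >= p_{i+1}).
    counts = sorted(Counter(samples).values(), reverse=True)
    # Running sum of counts, normalized by the sample size N.
    return [c / n for c in accumulate(counts)]


# Toy sample of N = 6 passwords (illustrative data only).
samples = ["123456", "password", "123456", "qwerty", "123456", "password"]
curve = empirical_guessing_curve(samples)
for g, value in enumerate(curve, start=1):
    print(f"G = {g}: empirical lambda_G = {value:.3f}")
```

Note that the empirical curve always reaches 1 once G equals the number of distinct passwords in the sample, whereas the true λ_G need not. This is one intuition for why the empirical distribution can overestimate λ_G when G is large relative to N.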



Exact lower and upper bounds for shifts of Gaussian measures

Exact upper and lower bounds on the ratio 𝖤w(𝐗-𝐯)/𝖤w(𝐗) for a centered G...

On the Interrelation between Dependence Coefficients of Extreme Value Copulas

For extreme value copulas with a known upper tail dependence coefficient...

Lower Bounds for Prior Independent Algorithms

The prior independent framework for algorithm design considers how well ...

Bitcoin's Latency–Security Analysis Made Simple

Closed-form upper and lower bounds are developed for the security of the...

On the Statistical Complexity of Sample Amplification

Given n i.i.d. samples drawn from an unknown distribution P, when is it ...

Generating Random Samples from Non-Identical Truncated Order Statistics

We provide an efficient algorithm to generate random samples from the bo...

Educational Timetabling: Problems, Benchmarks, and State-of-the-Art Results

We propose a survey of the research contributions on the field of Educat...
