Accurately measuring password strength is essential to guarantee the security of password-based authentication systems. Even more critical, however, is training users to select secure passwords in the first place.
One common approach is to rely on password policies that list a series of requirements for a strong password. This approach is limited or even harmful [passwordExhaustion]. Alternatively, Password Strength Meters (PSMs) have been shown to be useful and are witnessing increasing adoption in commercial solutions [measure_up, ontheaccuracy].
The first instantiations of PSMs were based on simple heuristic constructions. Password strength was estimated via either handcrafted features such as LUDS (which counts lower and uppercase letters, digits, and symbols) or heuristic entropy definitions. Unavoidably, given their heuristic nature, this class of PSMs failed to accurately measure password security [empirical, testing].
More recently, thanks to an active academic interest, PSMs based on more sound constructions and rigorous security definitions have been proposed. In the last decade, indeed, a considerable research effort gave rise to more precise meters capable of accurately measuring password strength [FLA, fuzzyPSM, MM].
However, meters have also become proportionally more opaque and inherently hard to interpret due to the increasing complexity of the employed approaches. State-of-the-art solutions base their estimates on black-box parametric probabilistic models [FLA, MM] that leave no room for interpretation of the evaluated passwords; they do not provide any feedback to users on what is wrong with their password or how to improve it. We advocate for explainable approaches in password meters, where users receive additional insights and become cognizant of which parts of their passwords could be straightforwardly improved. This makes the password selection process less painful, since users can keep their passwords of choice mostly unchanged while ensuring they are secure.
In the present work, we show that the same rigorous probabilistic framework capable of accurately measuring password strength can also fundamentally describe the relation between password security and password structure. By rethinking the underlying mass estimation process, we create the first interpretable probabilistic password strength meter. Here, the password probability measured by our meter can be decomposed and used to further estimate the strength of every single character of the password. This explainable approach allows us to assign a security score to each atomic component of the password and determine its contribution to the overall security strength. This evaluation is, in turn, returned to the users, who can tweak a few "weak" characters and keep their favorite passwords essentially unchanged. Figure 1 illustrates the selection process.
In devising the proposed mass estimation process, we found it ideally suited for being implemented via a deep learning architecture. In the paper, we show how that can be cast as an efficient client-side meter employing deep convolutional neural networks. The major contributions of our work are: (i) We formulate a novel password probability estimation framework based on undirected probabilistic models. (ii) We show that such a framework can be used to build a precise and sound password feedback mechanism. (iii) We implement the proposed meter via an efficient and lightweight deep learning framework ideally suited for client-side operability.
2 Background and preliminaries
In this section, we offer an overview of the fundamental concepts that are important to understand our contribution. Section 2.1 covers Probabilistic Password Strength Meters. Section 2.2 briefly reviews neural networks and related topics. Next, in Section 2.3, we cover structured probabilistic models, which will be fundamental in the interpretation of our approach. Finally, Section 2.4 briefly discusses relevant previous works within the PSM context.
2.1 Probabilistic Password Strength Meters
Probabilistic password strength meters (PPSMs) are PSMs that base their strength measure on an explicit estimate of password probability. In the process, they resort to probabilistic models to approximate the probability distribution behind a set of known passwords, typically instances of a password leak. Given an approximation of the mass function, strength estimation is then derived by leveraging adversarial reasoning. Here, password robustness is estimated in consideration of an attacker who knows the underlying password distribution and who aims at minimizing the guess entropy [gue_and_en] of their guessing attack. To that purpose, the attacker performs an optimal guessing attack, where guesses are issued in decreasing probability order (i.e., high-probability passwords first). More formally, given a probability mass function $P(\mathbf{x})$ defined on the key-space $\mathbb{X}$, the attacker creates an ordering $\mathbb{X}_{P(\mathbf{x})}$ of $\mathbb{X}$ such that:
$$\mathbb{X}_{P(\mathbf{x})} = [x^0, x^1, \dots, x^n] \quad \text{where} \quad \forall i \in [0, n-1]:\ P(x^i) \geq P(x^{i+1}).$$
During the attack, the adversary produces guesses by traversing the list $\mathbb{X}_{P(\mathbf{x})}$. Under this adversarial model, passwords with high probability are considered weak, as they will be quickly guessed. Low-probability passwords, instead, are assessed as secure, as they will be matched by the attacker only after a considerable, possibly infeasible, number of guesses.
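The adversarial model above can be sketched in a few lines of Python. This is only an illustrative toy: the probability values in `toy_pmf` and the helper names `guess_order` and `guess_number` are assumptions introduced here, not part of the original work.

```python
def guess_order(pmf):
    """Return the key-space sorted by decreasing probability.

    `pmf` maps each password to its (estimated) probability. Ties are
    broken alphabetically so that the ordering is deterministic.
    """
    return sorted(pmf, key=lambda pw: (-pmf[pw], pw))


def guess_number(pmf, password):
    """1-based number of guesses an optimal attacker needs to hit `password`."""
    return guess_order(pmf).index(password) + 1


# Hypothetical toy distribution (values chosen only for illustration).
toy_pmf = {"123456": 0.40, "password": 0.30, "qwerty": 0.20, "T7#kQ!z": 0.10}

if __name__ == "__main__":
    for pw in guess_order(toy_pmf):
        print(pw, guess_number(toy_pmf, pw))
```

Under this toy distribution, the low-probability string is reached last, which is exactly why a PPSM would assess it as the strongest of the four.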
2.2 Neural Networks
A neural network is a differentiable, non-linear function (linear neural networks are possible, but they are typically less interesting than non-linear ones) defined over a family of parametric functions indexed by the set of parameters/weights of the network. The family of functions is defined by the so-called architecture of the network, which is specified as a sequence of logic partitions called layers, i.e., non-linear parametric functions on their own. Deep neural networks, in turn, are functions defined by the composition of many layers.
Deep neural networks are powerful function approximators capable of accurately describing relations among high-dimensional spaces. Once a target function is chosen, a differentiable loss function is defined and used to guide the approximation. This process, often named learning, consists of finding the configuration of parameters that minimizes the discrepancy between the target function and the neural network by relying on a gradient-descent-based optimization technique.
Furthermore, deep neural networks have widely demonstrated the peculiar capability of generalizing over the input domain. This mainly relates to the smoothness of the data representation learned during training [representationlearning]. Such generalization is particularly useful within the context of password meters, as it helps in the correct evaluation of unobserved passwords.
Our meter is implemented via a Convolutional Neural Network (CNN); that is, a network formed from convolutional layers. A convolutional layer is a neural layer that exhibits an infinitely strong bias over its weights; it leverages spatial parameter sharing both to reduce the number of learnable parameters of the network (this property allows the construction of deep architectures with a limited memory footprint) and to enforce spatial invariance over the input domain, which can further help generalization. In particular, we chose a family of architectures called residual neural networks [resnet] (or resnet in brief). A resnet is generally composed of blocks of layers presenting skip connections among them; that is, layers that are not adjacent can be connected via an additional connection. Residual skip connections have been shown to improve both the gradient flow and the depth/accuracy ratio of networks [resnet].
2.3 Structured Probabilistic Models
Very often, the probabilistic models used by PPSMs are structured probabilistic models (also known as graphical models). Those describe password distributions by leveraging a graph notation to illustrate the conditional dependencies among a set of random variables. Here, a random variable $\mathbf{x}_i$ is depicted as a vertex, and an edge between $\mathbf{x}_i$ and $\mathbf{x}_j$ exists if $\mathbf{x}_i$ and $\mathbf{x}_j$ are statistically dependent. Structured probabilistic models are classified according to the orientation of edges. A directed acyclic graph (DAG) defines a directed graphical model (or Bayesian Network). In such formalism, an edge asserts a cause-effect relationship between two variables; that is, the state assumed by a variable $\mathbf{x}_i$ is intended as a direct consequence of those assumed by its parents in the graph. Under such a description, a topological ordering among all the random variables can be asserted and used to factorize the joint probability distribution of the random variables effectively. An undirected graph, on the other hand, defines an undirected graphical model, also known as Markov Random Field (MRF). In this description, the causality interpretation of edges is relaxed, and connected variables influence each other symmetrically. Undirected models permit the formalization of stochastic processes where causality among factors cannot be asserted. However, this comes at the cost of losing easy factorization of the joint distribution.
Generally speaking, in defining a structured model, the designer can introduce arbitrary prior knowledge in the description of the stochastic process. That is, independence among random variables can be asserted a priori and used to simplify the learning process. Such independence assertions can serve to drastically reduce the number of parameters of the model and, if verified in practice, they can help to better represent the target distribution.
2.4 Related Works
Here, we briefly review early approaches to the definition of PSMs. We limit the discussion to the most influential works as well as to the ones most related to ours.
Originally conceived for guessing attacks [mm_first], Markov model approaches have found natural application in the password strength estimation context. Castelluccia et al. [MM] use a stationary, finite-state Markov chain as a direct password mass estimator. Their model computes the joint probability by separately measuring the conditional probability of each pair of $n$-grams in the observed passwords. In particular, the model learns an $n$-gram co-occurrence matrix directly from the passwords chosen by users of the service. In order to avoid information leaks, the estimation process is performed purely server-side. Additionally, differential privacy is applied to the matrix.
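To make the Markov-model idea concrete, the following is a minimal sketch of a character-level Markov estimator with additive smoothing. It is not the implementation of [MM]: the boundary markers, the smoothing constant, the `alphabet_size` default, and the toy training list are all assumptions made for illustration, and neither the server-side deployment nor the differential-privacy step is modeled.

```python
import math
from collections import Counter, defaultdict

START, END = "\x02", "\x03"  # hypothetical boundary markers (assumption)


def train_markov(passwords, order=2):
    """Count, for every context of `order` characters, the characters that follow it."""
    counts = defaultdict(Counter)
    for pw in passwords:
        padded = START * order + pw + END
        for i in range(order, len(padded)):
            counts[padded[i - order:i]][padded[i]] += 1
    return counts


def log_prob(counts, password, order=2, alphabet_size=96, smoothing=1.0):
    """Log of the Markov factorization: sum of log P(x_i | previous `order` chars)."""
    padded = START * order + password + END
    total = 0.0
    for i in range(order, len(padded)):
        context = counts.get(padded[i - order:i], Counter())
        numerator = context[padded[i]] + smoothing
        denominator = sum(context.values()) + smoothing * alphabet_size
        total += math.log(numerator / denominator)
    return total


# Hypothetical toy "leak" used only to exercise the code.
leak = ["password", "password1", "passw0rd", "letmein"]
model = train_markov(leak, order=2)

if __name__ == "__main__":
    for candidate in ["password", "zq#Tk9!x"]:
        print(candidate, round(log_prob(model, candidate), 2))
```

In this sketch a password built from frequently observed transitions receives a higher log-probability (and is therefore judged weaker) than a string whose transitions never appear in the training list.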
Melicher et al. [FLA] extended the Markov model approach by leveraging a character/token-level Recurrent Neural Network (RNN) for modeling the probability of passwords. In the process, no Markovian property is assumed. Their model is implemented and carefully optimized to be a client-side meter. Given the recurrent nature of the neural network, the probability estimation process requires a number of network inferences linear in the number of characters in the password. Our approach requires the same number of network inferences, but these can be parallelized at batch level.
As discussed in the Introduction, probabilistic approaches are not capable of any natural form of feedback. In order to partially cope with this shortcoming, a hybrid approach has been investigated in [FLA2]. Here, the model of Melicher et al. [FLA] is combined with a series of heuristic, hand-crafted feedback mechanisms, such as detection and reporting of leeting behaviors or common tokens (e.g., keyboard walks).
Although it harnesses a substantially different form of feedback, our framework merges these solutions into a single and jointly learned model. Additionally, in contrast with [FLA2], our feedback has a concrete probabilistic interpretation as well as complete freedom from any form of human bias. Interestingly enough, our model autonomously learns some of the heuristics hardwired in [FLA2]. For instance, our model learned that capitalizing characters in the middle of the string can consistently improve password strength.
Token look-up PSMs:
Another relevant class of meters is the one based on the token look-up approach. Generally speaking, these are non-parametric solutions that base their strength estimation on collections of sorted lists of tokens, such as leaked passwords and word dictionaries. Here, a password is modeled as a combination of tokens, and the relative security score is derived from the ranking of the tokens in the known dictionaries. Unlike probabilistic solutions, token-based PSMs are able to return feedback to the user, such as an explanation for the weakness of a password or hints on how to improve it. Such feedback is mainly based on the semantics attributed to the tokens composing the password. A leading member of token look-up meters is zxcvbn [zxcvbn], a client-side password strength meter based on a token look-up / pattern matching paradigm. It assumes every password to be produced from a template obtained by the concatenation of different pre-defined patterns. For each pattern (i.e., token, reversed, sequence, repeat, keyboard, and date), a list of candidate strings is maintained. These are sorted in decreasing order of probability and used to model the attacker’s guessing strategy. The meter scores passwords based on a heuristic characterization of the guess-number [gue_and_en]. This score is described as the number of combinations of tokens necessary to match the tested password by traversing the sorted pattern lists. As mentioned above, zxcvbn is able to provide consistent feedback to the user; when a weak password is detected, the meter outputs a motivation for the believed weakness of the password based on the semantics of the tokens composing it. For instance, if one of the identified tokens lies in the "repeat" list, zxcvbn will suggest that the user avoid repeated characters in the password. Naturally, this kind of feedback mechanism inherently lacks generality and addresses just a few human-chosen scenarios.
zxcvbn is available through a lightweight implementation. As discussed by the authors themselves, zxcvbn suffers from various limitations. By assumption, it is unable to model the relation among different patterns occurring in the same password. Additionally, like other token look-up based approaches, it fails to coherently model unobserved patterns and tokens.
Another example of the token look-up approach is the one proposed in [telepathwords]. Telepathwords discourages users from choosing weak passwords by predicting the next most probable characters while the password is being typed. In particular, predicted characters are shown to the users in order to dissuade them from choosing those characters as the next ones in the password. These are reported together with an explanation of why those characters were predicted. As for zxcvbn, such feedback solely accounts for hardwired scenarios, for instance, the use of profanity in the password. Telepathwords is server-side only.
3 Meter foundations
In this section, we introduce the theoretical foundations of the proposed estimation process as well as of the character-level feedback mechanism deriving from it. First, in Section 3.1, we introduce and motivate the probabilistic character-level feedback mechanism. Later, in Section 3.2, we describe how that can be obtained using undirected probabilistic models.
3.1 Character-level strength estimation via probabilistic reasoning
As introduced in Section 2.1, PPSMs employ probabilistic models to approximate the probability mass function of an observed password distribution, say $P(\mathbf{x})$. Estimating $P(\mathbf{x})$, however, could be particularly challenging, and suitable estimation techniques must be adopted in order to make the process feasible. In this direction, a general solution is to factorize the domain of the mass function (i.e., the key-space); that is, passwords are modeled as a concatenation of smaller factors, typically decomposed at character level ($n$-grams or word segmentation are also common). Afterwards, the password distribution is estimated by modeling stochastic interactions among these simpler components. More formally, every password is assumed to be a realization $x = [x_1, \dots, x_\ell]$ of a random vector of the kind $\mathbf{x} = [\mathbf{x}_1, \dots, \mathbf{x}_\ell]$, where each disjoint random variable $\mathbf{x}_i$ represents the character at position $i$ in the string. Then, $P(\mathbf{x})$ is described by means of structured probabilistic models that formalize the relations among those random variables, eventually defining a joint probability distribution. In the process, every random variable is associated with a local conditional probability distribution (here referred to as $Q$) that describes the stochastic behavior of $\mathbf{x}_i$ in consideration of the conditional independence properties asserted by the underlying structured model, i.e., $Q(\mathbf{x}_i) = P(\mathbf{x}_i \mid \mathrm{parents}(\mathbf{x}_i))$. Eventually, the joint measurement of probability is derived from the aggregation of the marginalized local conditional probability distributions, typically in the form $P(x) = \prod_{i=1}^{\ell} Q(\mathbf{x}_i = x_i)$. For instance, the Markov model approach [mm_first, MM] factorizes the joint probability as $P(x) = \prod_{i=1}^{\ell} P(x_i \mid x_{i-n}, \dots, x_{i-1})$, where $n$ is the order of the believed Markov property.
As introduced in Section 2.1, the joint probability can be employed as a good proxy for password strength. However, such a global assessment unavoidably hides much fine-grained information that can be extremely valuable to the interests of a password meter. In particular, the joint probability offers us an atomic interpretation of the password strength, but it fails at disentangling the relation between password strength and password structure. That is, it does not clarify which factors of an evaluated password are making that password insecure. However, as widely demonstrated by non-probabilistic approaches [zxcvbn, FLA2, telepathwords], users benefit from the awareness of which part of the chosen password is easily predictable and which is not. In this direction, we argue that the local conditional probabilities that naturally appear in the estimation of the joint one, if correctly shaped, can offer detailed insights into the strength or the weakness of each factor of a password.
Such character-level probability assignments are an explicit interpretation of the relation between the structure of a password and its security.
The main intuition here is that high values of $Q(x_i)$ tell us that $x_i$ (i.e., the character at position $i$ in the string) has a high impact on increasing the password probability and must be changed to make the password stronger. Characters with low conditional probability, instead, are pushing the password to have low probability and must be left unchanged.
Figure 2 reports some visual representations of such probabilistic reasoning. In the figures, the background color of each segment renders the value of the local conditional probability of the character. Red describes high probability values, whereas green describes low probability assignments.
This fine-grained estimation can be used as a sound guide for the user at composition time, taking the form of active feedback. Such a mechanism can naturally identify weak password components and explicitly guide the user to change them.
For instance, the local conditional probabilities can spot the presence of predictable tokens in the password without the explicit use of dictionaries (Figure 1(a)). These measurements are able to automatically describe common password patterns like those manually modeled by other approaches [FLA2]; see Figures 1(b), 1(c) and 1(d). More importantly, they can potentially describe latent composition patterns that have never been observed and modeled by human beings. In doing this, neither supervision nor human reasoning is required.
Unfortunately, existing PPSMs, by construction, leverage arbitrarily designed structured probabilistic models that fail to produce the required estimates. Those assume independence properties and causality relations among characters that are not strictly verified in practice. As a result, their conditional probability measurements fail to correctly model a coherent character-level estimation that can be used to provide the required feedback mechanism.
Hereafter, we show that relaxing these biases from the mass estimation process will allow us to implement the feedback mechanism described above. To that purpose, we have to build a new probabilistic estimation framework based on complete and undirected models.
3.2 An undirected description of password distribution
To simplify the understanding of our method, we start with a description of the probabilistic reasoning of previous approaches. Then, we fully motivate our solution by comparison with them. In particular, for the comparison, we chose the state-of-the-art neural approach proposed in [FLA] (henceforth referred to as FLA) as a representative instance, since it is the least biased as well as the most accurate among the existing PPSMs.
FLA uses a recurrent neural network (RNN) to estimate password mass at the character level. That model assumes a stochastic process represented by a Bayesian network like the one depicted in Figure 2(a). As previously anticipated, such a density estimation process bears the burden of bold assumptions on its formalization. Namely, the description derived from a Bayesian Network implies the existence of (1) a causality order and (2) forward independence among password characters. The first assumed property asserts that the causality flows in a single and specific direction in the generative process, i.e., from the start of the string to its end. Therefore, characters influence their probabilities only asymmetrically; that is, the probability of $\mathbf{x}_{i+1}$ is conditioned on $\mathbf{x}_i$ but not vice versa. In practice, this implies that the observation of the value assumed by $\mathbf{x}_{i+1}$ does not affect our belief in the value expected from $\mathbf{x}_i$, yet the opposite does. In a similar way, the second believed property - forward independence - asserts that a character $\mathbf{x}_i$ is independent of the characters that follow it in the string, i.e., $\mathbf{x}_{i+1}, \dots, \mathbf{x}_\ell$.
Assuming that the underlying stochastic process verifies these properties eventually simplifies the estimation of both the single local conditional probabilities and the joint one. In particular, the local conditional probability of each character can be computed as $Q(\mathbf{x}_i) = P(\mathbf{x}_i \mid \mathbf{x}_1, \dots, \mathbf{x}_{i-1})$, where $Q(\mathbf{x}_i)$ makes explicit that the $i$'th character solely depends on the characters that precede it in the string. Just as easily, the joint probability factorizes as:
$$P(x) = \prod_{i=1}^{\ell} P(x_i \mid x_1, \dots, x_{i-1})$$
by chain rule.
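The forward (directed) factorization can be illustrated with a small counting-based sketch. This is not FLA, which uses a recurrent neural network; it is only a toy empirical estimator, and the end marker `END`, the helper names, and the toy `leak` list are assumptions introduced here.

```python
import math
from collections import Counter

END = "\n"  # hypothetical end-of-password marker (assumption)


def prefix_counts(passwords):
    """Count how many passwords (with END appended) start with each prefix."""
    counts = Counter()
    for pw in passwords:
        marked = pw + END
        for i in range(len(marked) + 1):
            counts[marked[:i]] += 1
    return counts


def forward_conditionals(counts, password):
    """Empirical P(x_i | x_1 .. x_{i-1}) for each character of password + END."""
    marked = password + END
    probs = []
    for i in range(len(marked)):
        seen_prefix = counts[marked[:i]]
        # An unseen prefix gets probability 0 in this unsmoothed toy estimator.
        probs.append(counts[marked[:i + 1]] / seen_prefix if seen_prefix else 0.0)
    return probs


# Hypothetical toy "leak" used only for illustration.
leak = ["aaaa", "aaaa", "abcd", "a#$%", "zzzz"]
counts = prefix_counts(leak)

q_in_repeated = forward_conditionals(counts, "aaaa")
q_in_mixed = forward_conditionals(counts, "a#$%")
joint_repeated = math.prod(q_in_repeated)  # chain rule: product of the conditionals

if __name__ == "__main__":
    print(q_in_repeated[0], q_in_mixed[0], joint_repeated)
```

Two things are visible in this toy setting: the product of the forward conditionals recovers the empirical frequency of the password, and the score of the first character is identical in both strings, because a forward model never looks at what follows.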
Unfortunately, although those assumptions do simplify the estimation process, the conditional probability $Q(x_i)$, per se, fails to give a direct and coherent estimate of the security contribution of the single character in the password. This is particularly true for characters in the first positions of the string, up to the point that the first character, i.e., $x_1$, is assumed to be independent of any other symbol in the password; its probability is the same for any possible configuration of the remaining random variables $[\mathbf{x}_2, \dots, \mathbf{x}_\ell]$. Yet, in the context of a sound character-level feedback mechanism, the symbol $x_1$ must be defined as “weak” or “strong” strictly according to the context defined by the entire string. For instance, given two passwords $y$ and $z$ that start with the same character, the probability of that character should be different if measured on $y$ or on $z$. More precisely, we expect $Q(x_1 \mid y)$ to be much higher than $Q(x_1 \mid z)$ whenever the remaining characters of $y$ drastically change our expectations about the possible values assumed by the first character in the string, whereas the remaining characters of $z$ tell us little about it. Yet, this interaction cannot be described through the Bayesian network reported in Figure 2(a), where $Q(x_1 \mid y)$ eventually results equal to $Q(x_1 \mid z)$. The same reasoning applies to trickier cases, for instance, a password enclosed in brackets. Here, arguably, the security contribution of the first character '(' strongly depends on the presence or absence of the last character, i.e., ')' (even if not so common, strings enclosed in brackets or other special characters often appear in password leaks). The symbol '(', indeed, can be either a good choice (as it introduces entropy in the password) or a poor one (as it implies a predictable template in the password), but this solely depends on the value assumed by another character in the string (the last one in this example). We argue that a good meter should be able to model similar templates and encourage the user to break them at composition time (e.g., by advising the user to remove the first or the last bracket). Yet, again, such a relation is a priori excluded from existing structured models.
It should be apparent that the assumed structural independence prevents the resulting local conditional probabilities from being sound descriptors of the real character probability as well as of their security contribution. Consequently, such measures cannot be used to build the fine-grained feedback mechanism suggested in Section 3.1. The same conclusion applies to other classes of PPSMs [MM, PCFG], which add even more structural biases on top of those illustrated by the model in Figure 2(a).
Under a broader view, directed models are intended to be used in contexts where the causality relationships among random variables can be fully understood. Unfortunately, even if passwords are physically written character after character by the users, it is not possible to assert either independence or cause-effect relations among the symbols that compose them. Unlike plain dictionary words, passwords are built on top of much more complex structures and articulated interactions among characters that cannot be fully described without relaxing many of the assumptions leveraged by existing PPSMs. In relaxing such assumptions, we base our estimation on an undirected and complete (in the graph-theoretic sense) graphical model, as this represents the most general description of the password generative distribution. That is, neither independence nor causality among random variables is a priori implied. Figure 2(b) depicts the respective Markov Random Field (MRF) for passwords of length four. According to that description, the probability of the character $\mathbf{x}_i$ directly depends on any other character in the string, i.e., the full context. In other words, we model each variable as a stochastic function of all the others. This intuition is better captured by the consequent evaluation of the local conditional probability (Eq. 2):
$$Q(\mathbf{x}_i) = P(\mathbf{x}_i \mid \mathbf{x}_1, \dots, \mathbf{x}_{i-1}, \mathbf{x}_{i+1}, \dots, \mathbf{x}_\ell) \tag{2}$$
The measurement asserts that the probability of a character is potentially influenced by the configuration of all the other nodes in the graph. Nevertheless, if needed, independence or causality relations can be autonomously ascribed as context-specific independence (i.e., independence that is verified only for specific configurations of the random variables) by the probabilistic model itself. That is, the model is free to capture independence from the observed data and assert it as true at inference time. Henceforth, we use the notation $Q(\mathbf{x}_i)$ to refer to the local conditional distribution of the $i$'th character given the password $x$. When $x$ is not clear from the context, we write $Q(\mathbf{x}_i \mid x)$ to make it explicit. The notation $Q(\mathbf{x}_i = s)$ or $Q(s)$, instead, refers to the marginalization of the distribution according to the symbol $s$.
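As a rough illustration of a full-context local conditional, the sketch below estimates it by brute-force counting over a toy list of passwords. This is purely didactic: a tabular estimator of this kind does not scale to real password distributions, and the function name `local_conditional`, the smoothing constant, and the toy `leak` are assumptions made here rather than anything taken from the meter itself.

```python
from collections import Counter


def local_conditional(leak, password, i, alphabet, smoothing=0.1):
    """Estimate Q(x_i = s | every other character of `password`) for each s.

    Only leak entries that agree with `password` on all positions except `i`
    contribute; additive smoothing keeps every symbol at non-zero probability.
    """
    counts = Counter()
    for pw in leak:
        same_context = (
            len(pw) == len(password)
            and pw[:i] == password[:i]
            and pw[i + 1:] == password[i + 1:]
        )
        if same_context and pw[i] in alphabet:
            counts[pw[i]] += 1
    total = sum(counts.values()) + smoothing * len(alphabet)
    return {s: (counts[s] + smoothing) / total for s in alphabet}


# Hypothetical toy "leak": 'a' is a common first character before "aaa",
# but only one of several first characters seen before "#$%".
leak = ["aaaa", "aaaa", "aaaa", "baaa", "a#$%", "b#$%", "c#$%", "d#$%"]
alphabet = sorted(set("".join(leak)))

q_repeated = local_conditional(leak, "aaaa", 0, alphabet)
q_mixed = local_conditional(leak, "a#$%", 0, alphabet)

if __name__ == "__main__":
    print(round(q_repeated["a"], 3), round(q_mixed["a"], 3))
```

Unlike a forward model, the same symbol in the same position now receives a different score depending on the rest of the string, which is the behavior the undirected formulation is meant to allow.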
Eventually, such an undirected formalization intrinsically weeds out all the limitations observed for the previous estimation process (i.e., the Bayesian network in Figure 2(a)). Now, every local measurement is computed within the context offered by any other symbol in the string. Therefore, relations that were previously assessed as impossible can be naturally described. This statement becomes apparent as soon as we reconsider the examples made above in the discussion. In the first example, indeed, the local conditional probability of the first character can now be backward-influenced by the context offered by the subsequent part of the string. This is clearly observable from the output of an instance of our meter (whose implementation is discussed in Section 4) reported in Figure 3(a), where the value of $Q(x_1)$ drastically varies between the two passwords of that example and the expected relation between the two measurements is verified. A similar intuitive result is reported in the right column of Figure 3(b), where the bracket example is considered. Here, the meter first scores the string without the closing bracket and then scores the complete password. In this case, we expect that the presence of the last character ')' would consistently influence the conditional measurement of the first bracket in the string. Such expectation is perfectly captured by the reported output, where appending the symbol ')' at the end of the string markedly increases the probability of the first bracket.
However, obtaining these improvements does not come for free. Indeed, under the MRF construction, the product over the local conditional probabilities (better defined as potential functions or factors within this context) does not strictly lead to the joint probability distribution of $\mathbf{x}$. Instead, such a product results in an unnormalized version of it, shown in Equation 3:
$$\tilde{P}(x) = \prod_{i=1}^{\ell} Q(\mathbf{x}_i = x_i) = Z \cdot P(x) \tag{3}$$
In the equation, $Z$ is the intractable partition function. This result follows from the Hammersley–Clifford theorem [hamme], as our conditional measurements (i.e., $Q(\mathbf{x}_i)$) are strictly positive (more information about those will be given in Section 4). Nevertheless, the unnormalized joint distribution $\tilde{P}(x)$ preserves the core properties needed for the meter functionality. Most importantly, we have that:
$$\forall x, x' \in \mathbb{X}:\ P(x) \geq P(x') \Leftrightarrow \tilde{P}(x) \geq \tilde{P}(x')$$
That is, if we sort a list of guesses according to the true joint $P(x)$ or according to the unnormalized version $\tilde{P}(x)$, we obtain the same ordering. Consequently, no deviation from the adversarial interpretation of PPSMs described in Section 2.1 is implied. Indeed, we have $\mathbb{X}_{P(\mathbf{x})} = \mathbb{X}_{\tilde{P}(\mathbf{x})}$ for every password distribution, key-space, and suitable sorting function. Furthermore, the joint probability distribution, if needed, can be approximated using suitable approximation methods, as discussed in Section 9 in the Appendix. A more detailed analysis of the probabilistic description of our meter is reported in Section 8 in the Appendix.
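The ordering argument can be checked numerically with a short sketch. The per-character values and the constant standing in for the partition function are hypothetical numbers chosen for illustration; the real partition function is intractable and is never computed by the meter.

```python
import math


def unnormalized_score(local_probs):
    """Product of the local conditional probabilities of a password."""
    return math.prod(local_probs)


def ranking(scores):
    """Passwords sorted by decreasing score (ties broken alphabetically)."""
    return sorted(scores, key=lambda pw: (-scores[pw], pw))


# Hypothetical per-character local conditionals for three passwords.
local_conditionals = {
    "123456": [0.9, 0.8, 0.9, 0.9, 0.9, 0.9],
    "qwerty": [0.6, 0.7, 0.8, 0.8, 0.8, 0.7],
    "T7#kQ!": [0.05, 0.02, 0.01, 0.03, 0.04, 0.01],
}

unnormalized = {pw: unnormalized_score(q) for pw, q in local_conditionals.items()}

# Stand-in for the partition function: any positive constant gives the same ranking.
Z = 3.7
normalized = {pw: score / Z for pw, score in unnormalized.items()}

if __name__ == "__main__":
    print(ranking(unnormalized))
    print(ranking(normalized))
```

Because dividing every score by the same positive constant cannot swap two passwords, the attacker's guessing order, and hence the strength assessment, is the same with or without normalization.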
3.2.1 Details on the password feedback mechanism
Joint probability can be understood as a compatibility score assigned to a specific configuration of the MRF; it tells us the likelihood of observing a sequence of characters $x$ during the interaction with the password generative process. On a smaller scale, a local conditional probability measures the impact that a single character has on the final security score. Namely, it tells us how much the character $x_i$ contributes to the probability of observing a certain password $x$. Within this interpretation, low-probability characters push the joint of $x$ to be closer to zero (secure), whereas high-probability characters (i.e., $Q(x_i) \approx 1$) offer almost no contribution to lowering the password probability (insecure). Therefore, users can strengthen their candidate passwords by substituting high-probability characters with suitable lower-probability ones (e.g., Figure 1).
Unfortunately, users’ perception of password security has been shown to diverge from the real one [dousers], and, without an explicit guide, it would be difficult for them to select suitable lower-probability substitutes. To address this limitation, one can add a mechanism capable of suggesting secure substitute symbols to the users. Interestingly, our local conditional distributions are naturally suitable for that purpose. Indeed, the distributions $Q(\mathbf{x}_i)$ are able to clarify which symbol is a secure substitute and which is not for each character $x_i$ of $x$. In particular, a distribution $Q(\mathbf{x}_i)$, defined on the whole alphabet $\Sigma$, assigns a probability to every symbol $s$ that the character $x_i$ can potentially assume. For a symbol $s \in \Sigma$, the probability $Q(\mathbf{x}_i = s)$ measures how probable the event $\mathbf{x}_i = s$ is, knowing all the observable characters in $x$. Under this interpretation, a candidate secure substitution for $x_i$ is a symbol $s$ with very low $Q(\mathbf{x}_i = s)$ (as this will lower the joint probability of $x$). In particular, every symbol $s$ s.t. $Q(\mathbf{x}_i = s) < Q(\mathbf{x}_i = x_i)$ given $x$ is a secure substitution for $x_i$. Table 1 better depicts this intuition. The table reports the alphabet sorted by $Q(\mathbf{x}_i)$ for each $x_i$ in an example password. The bold symbols between parentheses indicate $x_i$. Within this representation, all the symbols below the respective $x_i$ for each $\mathbf{x}_i$ are suitable substitutions capable of improving password strength. This intuition will be empirically proven in Section 5.2. It is important to note that such a suggestion mechanism must be randomized to avoid promoting bias in the final password distribution (i.e., if weak passwords are always perturbed in the same way, they will be easily guessed by an aware attacker). To this end, one can present to the user just random symbols among the pool of secure substitutes, i.e., $\{s \mid Q(\mathbf{x}_i = s) < Q(\mathbf{x}_i = x_i)\}$.
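The randomized suggestion step can be sketched as follows, assuming the local conditional distribution of one position is already available as a dictionary. The distribution values, the function names, and the parameter `k` (how many suggestions to show) are hypothetical choices made for this example.

```python
import random


def secure_substitutes(q_i, current):
    """Symbols whose probability at this position is lower than the current character's."""
    return sorted(s for s, p in q_i.items() if p < q_i[current])


def suggest(q_i, current, k=3, rng=None):
    """Pick up to `k` substitutes uniformly at random from the secure pool.

    Randomizing the choice avoids always perturbing weak passwords in the
    same, predictable way.
    """
    if rng is None:
        rng = random.Random()
    pool = secure_substitutes(q_i, current)
    return rng.sample(pool, min(k, len(pool)))


# Hypothetical local conditional distribution for one position of a password.
q_position = {"a": 0.60, "e": 0.20, "@": 0.10, "#": 0.05, "Z": 0.05}

if __name__ == "__main__":
    print(secure_substitutes(q_position, "e"))
    print(suggest(q_position, "e", k=2))
```

Passing a seeded `random.Random` instance makes the sketch reproducible for testing, while a deployed meter would presumably rely on an unseeded source of randomness.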
In summary, in this section, we presented and motivated an estimation process able to unravel the feedback mechanism described in Section 3.1. Maintaining a purely theoretical focus, no information about the implementation of such methodology has been offered to the reader. Next, in Section 4, we describe how such a meter can be shaped via an efficient deep learning framework.
4 Meter implementation
In this section, we present a deep-learning-based implementation of the estimation process introduced in Section 3.2. Here, we describe the model and its training process. Then, we explain how the trained network can be used as a building block for the proposed password meter.
As can be easily understood from the discussion carried out in Section 3.2, our procedure requires the parametrization of an exponentially large number of interactions among random variables. Thus, any tabular approach, such as the one used by Markov Chains or PCFG [PCFG], is excluded a priori. To make such a meter feasible, we reformulate the underlying estimation process so that it can be approximated with a neural network. In our approach, we simulate the Markov Random Field described in Section 3.2 using a deep convolutional neural network trained to compute (Eq. 2) for each possible configuration of the structured model.
In doing so, we train our network to solve an inpainting-like task defined over the textual domain (footnote 8: we use the inpainting problem as a proxy task to train our network to describe the underlying password distribution; a similar approach is used in [FLA], where a “guess the next character” problem is used as the proxy task). Broadly speaking, inpainting is the task of reconstructing missing information from mangled inputs, mostly images with missing or damaged patches [inpainting]. A good inpainting model must be able to infer missing content by leveraging the context provided by the observable data. Under the probabilistic perspective, the model is asked to return a probability distribution over all of the unobserved elements of , explicitly measuring their conditional probability given the observable context [deeplearningbook]. Consequently, the model learns a good approximation of the data probability distribution describing the underlying domain. In particular, the network has to disentangle and model the semantic relations occurring among all the factors describing the data (e.g., pixels for the image domain or characters in a string) to reconstruct input instances correctly.
Generally, the architecture and the training process used for inpainting tasks resemble an auto-encoding structure. That is, an autoencoder network [AE] is trained to learn a form of reconstruction function over the domain of interest. In the general case, this kind of model is trained to revert self-induced damage carried out on instances of a train-set . At each training step, an instance is artificially mangled with an information-destructive transformation to create a mangled variation . Then, the network, receiving as input, is optimized to produce an output that most resembles the original ; that is, the network is trained to reconstruct from .
In our approach, we train a network to infer missing characters in a mangled password by modeling a “guess the missing character” problem. In particular, we iterate over a password leak (i.e., our train-set), creating mangled passwords and training the network to recover the originals. The mangling operation is carried out by removing a randomly selected character from the string. For example, the train-set entry is transformed into "ilovyou" if the ’th character is selected for deletion, where the symbol ’’ represents the "empty character". A compatible proxy task has been previously used in [IPGVRL] to learn a suitable password representation for guessing attacks.
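The mangling operation can be sketched as follows. The placeholder chosen for the empty character (`EMPTY`) and the function names are assumptions made for the illustration; the sketch keeps the string length unchanged by writing the placeholder at the selected position.

```python
import random

EMPTY = "\u2022"  # hypothetical placeholder standing in for the "empty character"

def mangle(password, rng=None):
    """Replace one randomly selected character with the empty character.

    Returns the mangled string and the index of the removed character.
    """
    rng = rng or random.Random()
    i = rng.randrange(len(password))
    return password[:i] + EMPTY + password[i + 1:], i

def training_pairs(passwords, rng=None):
    """Yield (mangled input, original target) pairs from a list of passwords."""
    for pw in passwords:
        if pw:  # skip empty strings, which cannot be mangled
            mangled, _ = mangle(pw, rng)
            yield mangled, pw
```

Each pair gives the network a mangled input and the original string as reconstruction target.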
We chose to model our network with a deep residual structure arranged to create an autoencoder. The network follows the same general Context Encoder [CAE] architecture defined in [IPGVRL] with a few modifications. The encoder and the decoder are each composed of the same number of deep residual bottleneck blocks [resnet]. To create an information bottleneck, the encoder connects to the decoder through a latent space junction obtained through two fully connected layers. Although not strictly necessary for our purposes, we observed that enforcing a latent space, and a prior on it, consistently increases the effectiveness of the meter. For that reason, we maintained the same regularization proposed in [IPGVRL]: a maximum mean discrepancy regularization that forces a standard normal distributed latent space. The final loss function of our model is reported in Eq. 4. In the equation, and refer to the encoder and decoder network respectively, is the softmax function applied row-wise (footnote 9: the decoder outputs estimations, one for each input character; therefore, we apply the softmax function separately on each of those to create probability distributions), the distance function is the cross-entropy, and refers to the maximum mean discrepancy.
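As an illustration of how a loss of this kind can be assembled, the NumPy sketch below combines a row-wise softmax cross-entropy with a maximum mean discrepancy term against samples from a standard normal prior. The Gaussian kernel, its bandwidth `sigma`, the weight `alpha`, and all function names are assumptions of this sketch and are not taken from the paper.

```python
import numpy as np

def softmax_rows(logits):
    """Softmax applied separately to each character position (last axis)."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, targets):
    """Mean cross-entropy; logits: (batch, length, alphabet), targets: (batch, length)."""
    probs = softmax_rows(logits)
    picked = np.take_along_axis(probs, targets[..., None], axis=-1)[..., 0]
    return float(-np.log(picked + 1e-12).mean())

def rbf_kernel(a, b, sigma=1.0):
    """Gaussian kernel matrix between the rows of a and b (assumed kernel choice)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd(z, prior):
    """Biased estimate of the squared maximum mean discrepancy."""
    return float(rbf_kernel(z, z).mean() + rbf_kernel(prior, prior).mean()
                 - 2.0 * rbf_kernel(z, prior).mean())

def total_loss(logits, targets, latent, alpha=1.0, seed=0):
    """Reconstruction term plus alpha times the MMD to a standard normal prior."""
    prior = np.random.default_rng(seed).standard_normal(latent.shape)
    return cross_entropy(logits, targets) + alpha * mmd(latent, prior)
```

Here `logits` stands for the decoder output before the softmax and `latent` for the encoder output; both would come from the trained networks in a real setting.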
Henceforth, we refer to the composition of the encoder and the decoder as . We train the model on the widely studied RockYou leak [rockyou] considering an train-test split. From it, we filter passwords presenting fewer than characters. We train different networks considering different maximum password lengths, namely, , , and . In our experiments, we report results obtained with the model trained on a maximum length equal to , as no substantial performance variation has been observed among the different networks. In the end, we produce three neural nets with different architectures: a large network requiring of disk space, a medium-size model requiring , and a smaller version of the second that requires . These models can be further compressed using the same quantization and compression techniques harnessed in [FLA]. Fine-grained information about the architectures and hyper-parameters used is reported in Appendix 10. We implement our approach using the TensorFlow framework. All the experiments have been carried out on an NVIDIA DGX-2 machine.
Model inference process:
Once the model is trained, we can use it to compute the conditional probability (Eq. 2) for each and each possible configuration of the MRF. This is done by querying the network using the same mangling trick performed during the training. The procedure used to compute for can be summarized in the following steps:
We substitute the ’th character of with the empty character ’’, obtaining a mangled password .
Then, we feed to a network that outputs a probability distribution over of the unobserved random variable i.e., .
Given , we marginalize out , obtaining the probability of our interest.
For instance, if we want to compute the local conditional probability of the character ’e’ in the password , we first create "ilovyou" and use it as input for the net, obtaining , then we marginalize that (i.e., ) getting the probability . From the probabilistic point of view, this process is equivalent to fixing the observable variables in the MRF and querying the model for an estimation of the single unobserved character.
At this point, to obtain both the feedback mechanism defined in Section 3.1 and the unnormalized joint probability of the string, we have to measure for each character of the tested password. This is easily achieved by repeating the inference operation described above for each character of the input string. A graphical representation of this process is depicted in Figure 5. It is important to highlight that the required inferences are independent, and their evaluation can be performed in parallel (i.e., batch-level parallelism), introducing almost negligible overhead over a single inference. Additionally, by using a feed-forward network, we avoid the sequential computation that is intrinsic to recurrent networks (e.g., the issue afflicting [FLA]) and that can be excessive for a reactive client-side implementation. Furthermore, the convolutional structure allows the construction of very deep neural nets with a limited memory footprint.
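The inference procedure can be sketched in a few lines. The sketch below uses a stand-in model (`toy_model`, returning uniform distributions) in place of the trained network; the alphabet, the empty-character placeholder, and all function names are assumptions of the illustration. The unnormalized joint is computed as the product of the local conditional probabilities, in log space for numerical stability.

```python
import numpy as np

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"  # assumed toy alphabet
EMPTY = "\u2022"  # hypothetical placeholder for the "empty character"

def toy_model(batch):
    """Stand-in for the trained network: for each mangled string, return an
    array of shape (length, |alphabet|) holding one distribution per position.
    This toy version is uniform; a real model would depend on the context."""
    n = len(ALPHABET)
    return [np.full((len(s), n), 1.0 / n) for s in batch]

def local_conditionals(password, model=toy_model):
    """Local conditional probability of each character given all the others."""
    # One mangled copy per position, all evaluated in a single batched call.
    mangled = [password[:i] + EMPTY + password[i + 1:] for i in range(len(password))]
    outputs = model(mangled)
    probs = []
    for i, ch in enumerate(password):
        dist = outputs[i][i]  # keep only the distribution of the masked position
        probs.append(float(dist[ALPHABET.index(ch)]))
    return probs

def unnormalized_log_joint(password, model=toy_model):
    """Log of the product of the local conditional probabilities."""
    return float(np.sum(np.log(local_conditionals(password, model))))
```

Because every mangled copy is independent of the others, the whole password is scored with one forward pass over a batch whose size equals the password length.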
In conclusion, leveraging the trained neural network, we can compute the potential of each factor/vertex in the Markov Random Field (defined as local conditional probabilities in our construction). As a consequence, we are now able to cast a PPSM featuring the character-level feedback mechanism discussed in Section 3.1. Finally, in Section 5, we empirically evaluate the soundness of the proposed meter.
In this section, we empirically validate the proposed estimation process as well as its deep learning implementation. First, in Section 5.1, we evaluate the ability of the meter to accurately assess password strength at string-level. Next, in Section 5.2, we demonstrate the intrinsic ability of the local conditional probabilities to be sound descriptors of password strength at character-level.
5.1 Measuring meter accuracy
In this section, we evaluate the accuracy of the proposed meter at estimating password probabilities. To that purpose, following the adversarial reasoning introduced in Section 2.1, we compare the password ordering imposed by the meter with the one imposed by the ground-truth password distribution. In doing so, we rely on the baseline defined in [ontheaccuracy] for our evaluation. In particular, given a test-set (i.e., a password leak), we consider a weighted rank correlation coefficient between the ground-truth ordering and the ordering imposed by the meter. The latter is obtained by applying the meter to each password of the test-set and sorting the passwords according to the (unnormalized) joint probability. The ground-truth ordering, instead, is obtained by sorting the unique entries of the test-set according to the frequency of the password observed in the leak. In the process, we compare with other fully probabilistic meters, namely, Markov Models and the neural approach reported in [FLA]. A detailed description of the evaluation process follows.
For modeling the ground-truth password distribution, we rely on the password leak discovered by 4iQ in the Dark Web [BC_leak] on 5th December 2017. It consists of the aggregation of leaks, including well-known entries such as Linkedin, Myspace, and RockYou, as well as novel breaches. In total, the set counts billion pairs of plain-text passwords and email addresses. In the cleaning process, we collect and count the frequency of all the unique ASCII passwords with length in the interval , obtaining a set of unique passwords that we sort in decreasing frequency order. As previously observed in [Science], and following the same approach of [ontheaccuracy], we filter out all the passwords with a frequency lower than from the test-set, as rare passwords could lead to erroneous measurements. Finally, we obtain a test-set composed of unique passwords that we refer to as . Given both the large number of entries and the heterogeneity of the sources composing it, we believe is a good description of the real-world password distribution.
In the evaluation process, we compare our approach with other probabilistic meters. In particular:
The Markov model [omen] implemented in [nemo_git] (the same used in [ontheaccuracy]). We investigate different -gram configurations, namely, -grams, -grams and -grams, which we refer to as , and respectively. For their training, we employ the same train-set used for our meter (i.e., RockYou with length in ). In the end, we obtain three models , and requiring , and of disk space respectively.
The neural approach of Melicher et al. [FLA]. We use the implementation available at [FLA_git] to train the main architecture advocated in [FLA], i.e., an RNN composed of three LSTM layers of cells each and two fully connected layers. The training is carried out on the same train-set used for our meter. In the end, we obtain a model consisting of parameters that we refer to as FLA.
Our meter. We report results from the three networks with different sizes described in Section 4.
We follow the guidelines defined by Golla and Dürmuth [ontheaccuracy] in the evaluation of the meters. We use the weighted Spearman correlation coefficient (ws) to measure the accuracy of the orderings produced by the tested meters, as this has been demonstrated to be the most reliable correlation metric within this context [ontheaccuracy]. The metric is defined as
where and are the sequences of ranks assigned to the test-set by the ground-truth distribution and the tested meter, respectively, and where the bar notation (e.g., ) expresses the weighted mean with respect to the sequence of weights . The weights are computed as the normalized inverse of the ground-truth ranks (Eq. 5).
In this application, the weights increase the relevance of weak passwords (i.e., the ones with small ranks) in the computation of the metric; that is, the erroneous placing of weak passwords (i.e., asserting a weak password as strong) is highly penalized. Unlike [ontheaccuracy], we directly use the ranking imposed by the password frequencies in as ground-truth. Here, passwords with the same frequency value receive the same rank.
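A weighted Spearman coefficient of this kind can be sketched as a weighted Pearson correlation computed on ranks, with weights given by the normalized inverse of the ground-truth ranks as described above. The function names are hypothetical, and the dense tie-handling used to derive ranks from frequencies (equal frequencies share a rank, and ranks start at 1) is an assumption of this sketch.

```python
import numpy as np

def dense_ranks_by_frequency(freqs):
    """Rank 1 goes to the highest frequency; equal frequencies share a rank."""
    freqs = np.asarray(freqs)
    uniq = np.unique(freqs)[::-1]  # distinct frequencies, descending
    lookup = {v: r for r, v in enumerate(uniq, start=1)}
    return np.array([lookup[v] for v in freqs], dtype=float)

def weighted_spearman(true_ranks, meter_ranks):
    """Weighted correlation between ground-truth ranks and meter ranks."""
    r = np.asarray(true_ranks, dtype=float)
    q = np.asarray(meter_ranks, dtype=float)
    w = 1.0 / r          # inverse of the ground-truth ranks (ranks start at 1)
    w = w / w.sum()      # normalization
    r_bar, q_bar = np.sum(w * r), np.sum(w * q)  # weighted means
    cov = np.sum(w * (r - r_bar) * (q - q_bar))
    var_r = np.sum(w * (r - r_bar) ** 2)
    var_q = np.sum(w * (q - q_bar) ** 2)
    return float(cov / np.sqrt(var_r * var_q))
```

With these weights, a disagreement on the most frequent passwords moves the coefficient far more than the same disagreement on rare ones.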
Table 2 reports the measured correlation coefficient for each tested meter. In the table, we also report the required storage as an auxiliary metric.
Our meters, even the smallest, achieve a higher score than the best-performing Markov Model, i.e., . On the other hand, our largest model cannot directly exceed the accuracy of the state-of-the-art estimator FLA, obtaining only comparable results. However, FLA requires more disk space than ours. Indeed, interestingly, our convolutional implementation permits the creation of remarkably lightweight meters (footnote 10: lightweight implementations are critical for the construction of suitable client-side meters). As a matter of fact, our smallest network shows a comparable result with while requiring more than an order of magnitude less disk space.
In conclusion, the results confirm that the probability estimation process defined in Section 3.2 is indeed sound and capable of accurately assessing password mass at string-level. The proposed meter shows effectiveness comparable with the state-of-the-art [FLA], whereas it outperforms standard approaches such as Markov Chains. Nevertheless, we believe that even more accurate estimation can be achieved by investigating deeper architectures and/or by performing hyper-parameter tuning over the model.
|Required Disk Space|1.1MB|94MB|8.8GB|60MB|36MB|18MB|6.6MB|
5.2 Analysis of the relation between local conditional probabilities and password strength
In this section, we test the ability of the proposed meter to correctly model the relation between password structure and password strength. In particular, we investigate the ability of the measured local conditional probabilities to identify insecure components of the tested passwords.
Our evaluation procedure follows three main steps. Starting from a set of weak passwords :
We perform a guessing attack on in order to estimate the guess-number of each entry of the set.
For each password , we substitute characters of according to the estimated local conditional probabilities (i.e., we substitute the characters with highest ), producing a perturbed password .
We repeat the guessing attack on the set of perturbed passwords and measure the variation in the attributed guess-numbers.
Hereafter, we provide a detailed description of the evaluation procedure.
The evaluation is carried out considering a set of weak passwords. In particular, we consider the first most frequent passwords of the leaks collection
In the evaluation, we consider three types of password perturbation:
The first acts as a baseline and consists in the substitution of randomly positioned characters in the passwords with randomly selected symbols. Such a general strategy is used by [FLA2] and [persuasion] to improve the user’s password at composition time (footnote 11: [FLA2] also features more sophisticated variations of the random perturbation aimed at ensuring password usability). The perturbation is applied by randomly selecting different characters from and substituting them with symbols sampled from a predefined character pool. The pool consists of the most frequent symbols in (i.e., mainly lowercase letters and digits). Restricting the character pool aims at preventing the creation of artificially complex passwords that would not be accepted as passwords by the users (e.g., passwords containing extremely uncommon Unicode symbols). We refer to this perturbation procedure as Baseline.
The second perturbation partially leverages the local conditional probabilities induced from our meter. Given a password , we compute the conditional probability for each character in the string. Then, we select and substitute the character with maximum probability i.e., . The symbol we use in the substitution is randomly selected from the same pool used for the baseline perturbation (i.e., top- frequent symbols). When is greater than one, the procedure is repeated sequentially by using the perturbed password obtained from the previous iteration as input for the next step. We refer to this procedure as Semi-Meter.
The third perturbation extends the second one by fully exploiting the local conditional distributions. Here, as in Semi-Meter, we substitute the character in with the highest probability. However, rather than choosing a substitute symbol from the pool at random, we select it based on the distribution , where is the position of the character to be substituted. In particular, we choose the symbol that minimizes , i.e., , where is the allowed pool of symbols. We refer to this method as Fully-Meter. Examples of perturbed passwords are reported in the Appendix.
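The three perturbation strategies can be summarized with the following sketch. It assumes a hypothetical callable `dists(password)` that returns, for each position, a dictionary from symbols to local conditional probabilities (in practice, the output of the meter); the function names and the argument `n` for the number of perturbed characters are likewise assumptions of the illustration.

```python
import random

def perturb_baseline(password, pool, n, rng):
    """Baseline: replace n randomly chosen positions with random pool symbols."""
    chars = list(password)
    for i in rng.sample(range(len(chars)), n):
        chars[i] = rng.choice(pool)
    return "".join(chars)

def _most_probable_position(password, d):
    """Index of the character with the highest local conditional probability."""
    return max(range(len(password)), key=lambda j: d[j][password[j]])

def perturb_semi_meter(password, pool, n, rng, dists):
    """Semi-Meter: the meter picks the position, the substitute is random."""
    for _ in range(n):
        i = _most_probable_position(password, dists(password))
        password = password[:i] + rng.choice(pool) + password[i + 1:]
    return password

def perturb_fully_meter(password, pool, n, dists):
    """Fully-Meter: the meter picks both the position and the substitute."""
    for _ in range(n):
        d = dists(password)
        i = _most_probable_position(password, d)
        best = min(pool, key=lambda s: d[i].get(s, 0.0))  # least probable symbol
        password = password[:i] + best + password[i + 1:]
    return password
```

In both meter-based variants the distributions are recomputed after every substitution, matching the sequential procedure described for more than one perturbed character.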
|Perturbation (metric)|1 character|2 characters|3 characters|
|Baseline (PNP)|0.022|0.351|0.549|
|Semi-Meter (PNP)|0.036|0.501|0.674|
|Fully-Meter (PNP)|0.066|0.755|0.884|
|Baseline (AGI)||||
|Semi-Meter (AGI)||||
|Fully-Meter (AGI)||||
|Semi-Meter / Baseline (AGI)|1.530|1.413|1.222|
|Fully-Meter / Baseline (AGI)|2.768|2.110|1.588|
Table 3: Measurements of password strength improvement caused by different perturbations. The last two rows of the table report the AGI ratio between the two meter-based approaches and the baseline.
We evaluate password strength using the min-guess metric described in [min-guess]. Here, guessing attacks are simultaneously performed with different guessing tools, and the guess-number of a password is considered as the minimum among the attributed guess-numbers. In performing such attacks, we rely on the combination of three widely adopted solutions, namely, HashCat [hashcat], PCFG [PCFG, PCFG_git] and the Markov chain approach proposed in [omen, omen_git]. For tools requiring a training phase, i.e., OMEN and PCFG, we use the same train-set used for our model (i.e., 80% of RockYou). Similarly, for HashCat, we use the same data set as the input dictionary (footnote 12: in this case, passwords are unique and sorted in decreasing frequency) and generated2 as the rule set. During the generation of guesses, we maintain the default settings of each implementation. We limit each tool to produce guesses. The total size of the generated guesses is TB.
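The min-guess combination amounts to taking, for each password, the smallest guess-number reported by any tool. A minimal sketch follows, where the function name and the convention of using `None` for a tool that never guesses the password are assumptions of the illustration.

```python
def min_guess(guess_numbers):
    """Combine per-tool guess-numbers into a single min-guess value.

    guess_numbers -- dict {tool name: guess-number, or None if not guessed}
    Returns None when no tool guesses the password.
    """
    found = [g for g in guess_numbers.values() if g is not None]
    return min(found) if found else None
```

For instance, `min_guess({"hashcat": 120, "pcfg": None, "omen": 45})` evaluates to `45`.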
In the evaluation, we are interested in measuring the increment of password strength caused by an applied perturbation. We estimate that value by considering the Average Guess-number Increment (henceforth referred to as AGI); that is, the average delta between the guess-number of the original password and the guess-number of the perturbed password:
where is the guess-number and refers to the perturbed version of the ’th password in the test set. During the computation of the guess-numbers, it is possible that we fail to assign a guess-number to a password (i.e., we do not guess it). In these cases, we attribute an artificial guess-number equal to to the un-guessed passwords. Additionally, we consider the average number of un-guessed passwords as an ancillary metric; we refer to it as Percentage Non-Guessed Passwords (PNP) and compute it as:
where when is not guessed during the guessing attack.
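The two metrics can be sketched as follows, reading AGI as the mean difference between perturbed and original guess-numbers and PNP as the fraction of perturbed passwords left un-guessed. The function names, the use of `None` for un-guessed passwords, and the parameter `unguessed_value` (standing for the artificial guess-number, whose actual value is not assumed here) are hypothetical.

```python
def agi(original, perturbed, unguessed_value):
    """Average Guess-number Increment over paired guess-numbers.

    original, perturbed -- equally long lists of guess-numbers (None = not guessed)
    unguessed_value     -- artificial guess-number assigned to un-guessed passwords
    """
    def fill(g):
        return unguessed_value if g is None else g
    deltas = [fill(gp) - fill(g) for g, gp in zip(original, perturbed)]
    return sum(deltas) / len(deltas)

def pnp(perturbed):
    """Fraction of perturbed passwords that were not guessed."""
    return sum(g is None for g in perturbed) / len(perturbed)
```

The fraction returned by `pnp` is on the same 0 to 1 scale as the PNP values reported in Table 3.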
We perform the tests over three values of (i.e., number of perturbed characters), namely, , , and . Results are summarized in Table 3. The AGI caused by the two meter-based solutions is always greater than that produced by random perturbations. On average, it is twice as effective as the baseline for the Fully-Meter and about greater for the Semi-Meter.
The largest relative benefit is observable when , i.e., a single character is modified. Focusing on the Fully-Meter approach, indeed, the guidance of the local conditional probabilities permits a guess-number increment times bigger than the one caused by a random substitution in the string. This advantage drops to when , since, after two perturbations, passwords tend to be already outside the dense zone of the distribution. Indeed, at about of the passwords perturbed with the Fully-Meter approach cannot be guessed during the guessing attack (i.e., PNP). This value is only for the baseline. More interestingly, the results tell us that substituting two () characters following the guidance of the local conditional probabilities causes a guess-number increment greater than the one obtained from three () random perturbations. As a matter of fact, the AGI for the Fully-Meter perturbation is for , whereas it is for the baseline when .
Overall, these results confirm that the local conditional distributions are indeed sound descriptors of password security at the structural level.
In this paper, we showed that it is possible to construct interpretable probabilistic password meters by fundamentally rethinking the underlying password mass estimation. We presented an undirected probabilistic interpretation of the password generative process that can be used to build precise and sound password feedback mechanisms. Moreover, we demonstrated that such an estimation process could be instantiated via a lightweight deep learning implementation. We validated our undirected description and deep learning solution by showing that our meter achieves comparable accuracy with other existing approaches while providing a unique character-level feedback mechanism.
The code, pre-trained models, and other materials related to our work are publicly available at: https://github.com/pasquini-dario/InterpretablePPSM/.
8 Details on the probabilistic interpretation
The Hammersley-Clifford theorem asserts that the joint probability distribution of a set of random variables described by an undirected model with graph can be represented as the product of non-negative factors (or potential functions), one for each maximal clique of :
where is the set of maximal cliques in and the notation outlines the set of variables composing a clique . Since our graph is complete, we have just one maximal clique that covers all the nodes in the graph; therefore, there is a single term in the product of Eq. 6. In our construction, we define the potential function to be a non-linear function of the parameters of the neural network. In particular, it is the product of the local conditional probabilities of each character (with given in Eq. 2); that is:
However, it is important to note that the potential function does not have a natural probabilistic interpretation per se, as it is intended to represent a measure of relative compatibility among the involved random variables. Nevertheless, we imply such an interpretation through our construction. Additionally, being a product of probabilities, the potential function is guaranteed to be non-negative, as required by the Hammersley-Clifford theorem.
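For reference, the factorization discussed in this appendix can be written compactly as below. The notation is an assumption of this sketch rather than the original one: $\mathbf{x} = x_1 \dots x_\ell$ denotes a password of length $\ell$, $\mathrm{cl}(G)$ the set of maximal cliques of the graph $G$, $\psi$ the potential function, and $Z$ the partition function.

```latex
% General factorization over maximal cliques
p(\mathbf{x}) = \frac{1}{Z} \prod_{c \in \mathrm{cl}(G)} \psi_c(\mathbf{x}_c)

% Complete graph: a single maximal clique, with the potential defined as
% the product of the local conditional probabilities
p(\mathbf{x}) = \frac{1}{Z}\, \psi(\mathbf{x}),
\qquad
\psi(\mathbf{x}) = \prod_{i=1}^{\ell}
  P\left(x_i \mid x_1, \dots, x_{i-1}, x_{i+1}, \dots, x_\ell\right)
```

Each factor in the second expression lies in $[0, 1]$, so $\psi(\mathbf{x})$ is non-negative, which is the property required by the theorem.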
9 Estimating guess-numbers
Within the context of PPSMs, a common solution to approximate guess-numbers [gue_and_en] is to use the Monte Carlo method proposed in [montecarlo_g]. With a few adjustments, the same approach can be applied to our meter. In particular, we have to find an approximation of the partition function . This can be done by leveraging the Monte Carlo method as well. For instance, we can estimate the partition function as follows:
where is the number of possible configurations of the MRF (i.e., the cardinality of the key-space) and is a sample from the posterior of the model. Samples from the model can be obtained in three ways: (1) sampling from the latent space of the autoencoder (as done in [IPGVRL]), (2) performing Gibbs sampling from the autoencoder, or (3) using a dataset of passwords that follow the same distribution of the model.
Once we have an approximation of , we can use it to normalize every joint probability, i.e., , and seamlessly apply [montecarlo_g]. Alternatively, a more sophisticated solution can be used in place of Eq. 8, as in [pmlr-v31-ma13a].
It is important to note that the estimation is performed offline and must be computed only once for the lifespan of the meter.
10 Model Architectures and hyper-parameters
In this section we detail the technical aspects of our deep learning implementation.
As previously described, we base our networks on a ResNet structure. We use a bottleneck residual block composed of three one-dimensional convolutional layers as the atomic building block of the networks. A graphical description of the block is depicted in Figure 6. We construct three different networks with different sizes (intended as the number of trainable parameters). We determine the size of the networks by varying the number of residual blocks, the kernel size of the convolutional layers in the blocks, and the number of filters. The three architectures are reported in Tables 5, 6 and 7.
Table 4 reports the hyper-parameters used. During the training, we apply label smoothing, which is controlled by the parameter . We found that our models benefit particularly from large batch sizes. We limit the batch size to due to technical limitations; however, we believe that bigger batches could further improve the quality of the password estimation.
|cov1d[3, 128, same, linear]|
|cov1d[5, 128, same, linear]|
|cov1d[5, 128, same, linear]|
|ResblockBneck1D[, ] Flatten|
11 Supplementary resources
This section reports additional resources useful for understanding our contribution.
Table 8 reports examples of password perturbation performed using the Fully-Meter method on the three values of . The example passwords (first column) are sampled from .
Figure 7 reports additional examples of the feedback mechanism. The depicted passwords have been randomly sampled from the tail of the RockYou leak. The inner figures are sorted based on the joint probability assigned by the meter.