Official repository for "Improving Password Guessing via Representation Learning"
Learning useful representations from unstructured data is one of the core challenges, as well as a driving force, of modern data-driven approaches. Deep learning has demonstrated the broad advantages of learning and harnessing such representations. In this paper, we introduce a GAN-based representation learning approach for password guessing. We show that an abstract password representation naturally offers compelling and versatile properties that can be used to open new directions in the extensively studied, and yet presently active, password guessing field. These properties can establish novel password generation techniques that are neither feasible nor practical with the existing probabilistic and non-probabilistic approaches. Based on these properties, we introduce: (1) A framework for password guessing for practical scenarios where partial knowledge about target passwords is available and (2) an Expectation Maximization-inspired framework that can dynamically adapt the estimated password distribution to match the distribution of the attacked password set, leading to an optimal guessing strategy.
Text-based passwords remain the most common form of authentication, as they are both easy to implement and familiar to users. However, text-based passwords are vulnerable to guessing attacks. These attacks have been extensively studied, and they remain an active area of research. Modern password guessing attacks are founded on the observation that human-chosen passwords are not uniformly distributed in the password space (i.e., the set of all possible strings). This is due to the natural preference for choosing (easily-)memorable passwords, which cover only a small fraction of the exponentially large password space. As a consequence, real-world password distributions are typically composed of a few dense zones that can be feasibly estimated by an adversary to perform password-space reduction attacks [yampolskiy2006analyzing]. Along that line, several probabilistic approaches have been proposed [fla, durmuth2015omen, pcfg]. These techniques, under different assumptions, try to directly estimate the probability distribution behind a set of observed passwords. Such an estimation is then used to generate suitable guesses and perform efficient password guessing attacks.
Orthogonal to the current lines of research, we demonstrate that an adversary can further expand the attack possibilities by leveraging representation learning techniques [bengio2013representation]. Representation learning aims at learning useful and explanatory representations [bengio2013representation] from a massive collection of unstructured data. By applying this general approach to a corpus of leaked passwords [rockyou], we demonstrate the advantages that an adversary can gain by learning a suitable representation of the observed password distribution, rather than directly estimating it. In this paper, we show that such a password representation indeed permits an attacker to establish novel password guessing techniques that further jeopardize password-based authentication systems.
Inspired by the recently introduced deep learning-based approach for password guessing, i.e., PassGAN [PassGAN], we choose to model the representation of passwords in the latent space of the generator of a Generative Adversarial Network (GAN) [goodfellow2014generative]. This representation, thanks to its inherent smoothness [bengio2013representation], is able to enforce a semantic organization over the high-dimensional password space. Such an organization mainly implies that the respective representations of semantically related passwords are closer in the latent space of the generator. As a consequence, geometric relations in the latent space directly translate to semantic relations in the data space. A representative example of this phenomenon is loosely depicted in Figure 1, where we show some latent points (with their respective plain-text passwords) localized in a small section of the induced latent space.
We can exploit such geometric relations to perform a peculiar form of conditional password generation. In the study of such relations, we characterize two main properties, namely, password strong locality and password weak locality. These locality principles enforce different forms of password organization that allow us to design two novel password guessing frameworks: Substring Password Guessing (SSPG) and Dynamic Password Guessing (DPG). We highlight that the state-of-the-art approaches are either unable to perform such advanced attacks or can do so only inefficiently. The major contributions of our work are as follows:
We are the first to demonstrate the potential of using fully unsupervised representation learning in the domain of password guessing.
We introduce a probabilistic and completely unsupervised form of template-based password generation. Using this technique, we build a practical framework that is able to efficiently perform targeted password guessing in the presence of partial knowledge. We call this framework SSPG. SSPG can be used: (1) by an adversary to increase the impact of side-channel and similar password attacks [ali2015keystroke, marquardt2011sp, vuagnoux2009compromising, balzarotti2008clearshot]; or (2) by a legitimate user to recover his/her password. We show the efficiency of SSPG with respect to its direct competitors via experimental evaluations.
We introduce the concept of DPG: a password guessing approach that dynamically adapts the guessing strategy based on the feedback received from the interaction with the attacked password set. We build an Expectation Maximization-inspired DPG implementation based on the principle of password weak locality. DPG shows that an attacker can consistently increase the impact of the attack by leveraging the passwords guessed during a running attack.
It is important to highlight that these properties, and their distinctive capabilities, come practically for free with the latent representation learned by the underlying deep generative model. In addition, the ongoing continuous developments in the GAN framework would naturally further improve our approaches.
Organization: Section 2 gives an overview of the fundamental concepts related to our work. Here, we also present our model improvements and the tools upon which our core work is based. We present password strong locality along with SSPG in Section 3, and password weak locality along with DPG in Section 4. The evaluation of our proposed techniques is presented in their respective sections. Section 5 briefly discusses relevant previous works. Section 6 concludes the paper; supplementary information is provided in the Appendices.
In Section 2.1, we explain GAN and related concepts that are important to understand our work. Section 2.2 briefly discusses the technical aspects of PassGAN, which is the closest work to ours. In Section 2.3, we present our model improvements and the tool that is a fundamental building block in our approach.
GAN is a framework to train a parametric probabilistic model to perform implicit estimation of an unknown target data distribution p(x), for a given observed random variable x [goodfellow2014generative, goodfellow2016nips].
In contrast to the common prescribed probabilistic models [diggle1984monte], implicit probabilistic models do not explicitly estimate the probability distribution; they instead approximate the stochastic procedure that actually generates the data [mohamed2016learning]. In other words, we can sample data points from the model as if they were sampled from a random variable following p(x), although we cannot directly compute the probability of a given state of x.
This class of models is capable of successfully representing data distributions defined in very high-dimensional spaces, such as in the case of images [brock2018large]. GAN generators have established the new state of the art in several generative tasks [brock2018large, zhu2017unpaired, ledig2017photo].
The parametric function used for the estimation is a deep neural network trained following an adversarial approach. The latter process is guided by a second network (i.e., the critic/discriminator), which provides a density estimation-by-comparison [mohamed2016learning] loss function to the generative model (i.e., the generator). The adversarial training bypasses the necessity of defining an explicit likelihood function and allows us to obtain a good estimation of very sharp distributions [goodfellow2016nips].
GAN generators are latent variable models. They assume that each observable data instance can be modeled by a set of latent variables. The learned generator acts as a deterministic mapping function between the latent space and the data space (i.e., where the observed data is defined). The assumed latent space is continuous, and its points (which we refer to as latent points) are distributed following a simple uninformative prior distribution p(z), which we refer to as the prior latent distribution (the adjective "prior" refers to the fact that we assume the latent variables are initially distributed according to p(z)). The semantic aspects of the latent variables are completely entrusted to the generator, which learns them in an unsupervised way.
The probability distribution represented by the generator has the following form:

p(x; θ) = ∫ p(x | z; θ) p(z) dz,

where θ is the set of learnable parameters of the generator (i.e., primarily the weights of the neural network). Both the latent space and the prior p(z) can be arbitrarily chosen and fixed before the training; they can be treated as hyper-parameters of the model. Typical choices for p(z) are the standard normal or uniform distributions [goodfellow2016nips].
Sampling points from the latent space according to p(z) and then mapping them into the data space through the generator is equivalent to sampling data points from the data space according to p(x; θ), where p(x; θ) is the approximation of the target probability distribution estimated by the generator. During this operation, we can generally also consider an arbitrary latent distribution different from p(z) (at the cost of representing a distribution different from p(x; θ)). In the rest of this paper, we will refer to the probability density function over the latent space with the general term latent distribution.
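The sampling pipeline above can be sketched in a few lines. This is a hedged, self-contained illustration: `toy_generator` is a hypothetical stand-in for a trained G (a real G is a deep network decoding a latent vector into a character matrix), and the latent dimensionality is an assumption.

```python
import random
import string

# Sketch of implicit sampling: draw latent points from the prior
# p(z) = N(0, I) and map each one through the generator to obtain a guess.
LATENT_DIM = 128  # assumed latent dimensionality
CHARSET = string.ascii_lowercase + string.digits

def toy_generator(z):
    """Stand-in for G: deterministically map a latent vector to a string."""
    return "".join(CHARSET[int(abs(v) * 1000) % len(CHARSET)] for v in z[:8])

def sample_guesses(n, rnd):
    guesses = []
    for _ in range(n):
        z = [rnd.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]  # z ~ p(z)
        guesses.append(toy_generator(z))                      # x = G(z)
    return guesses

guesses = sample_guesses(5, random.Random(0))
```

Swapping the prior for a different latent distribution at this step is exactly the degree of freedom the paper exploits later for conditional generation.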
To accomplish the generative task, the latent representation is modeled so as to capture the posterior distribution of the underlying explanatory factors of the observed data [radford2015unsupervised]. Similar to feature embedding techniques [goyal2018graph, li2018word], the latent representations of semantically bound data points show strong geometric relations in the latent space [radford2015unsupervised]. As a result of these properties, such unsupervisedly learned representations are often reused for several other external tasks [IGAN, schlegl2017unsupervised, samangouei2018defense].
Hitaj et al., in their work PassGAN [PassGAN], demonstrated the application of deep generative models as implicit estimators of the password distribution. The capabilities of such models are a result of: (1) their adversarial training process; and (2) the high-capacity deep neural networks used for function approximation. These characteristics enable the model to capture the long-tailed distribution of a real-world password leak and outperform other state-of-the-art tools in expressivity [PassGAN]. PassGAN harnesses a Wasserstein GAN with gradient penalty [IWGAN] and a residual-block-based architecture [resnet]. It assumes a latent space with a standard normal distribution as its prior latent distribution and a fixed dimensionality. The model is trained on an 80-20% split of the well-known RockYou [rockyou] password leak, considering only passwords up to a fixed maximum length. The final test-set is obtained by removing duplicate passwords as well as the passwords occurring in both the train-set and the test-set.
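The data preparation just described can be sketched as follows. This is a hedged illustration: `leak` stands in for the RockYou list, and `max_len` is an assumed cutoff (the exact bound is not restated here).

```python
import random

# Sketch of PassGAN-style data preparation: length filter, 80-20% split,
# then a test-set deduplicated and purged of passwords seen in training.
def prepare_splits(leak, max_len, seed=0):
    usable = [p for p in leak if len(p) <= max_len]
    random.Random(seed).shuffle(usable)
    cut = int(0.8 * len(usable))                  # 80-20% split
    train, test = usable[:cut], usable[cut:]
    train_set = set(train)
    # unique test passwords not occurring in the train-set
    test_final = [p for p in dict.fromkeys(test) if p not in train_set]
    return train, test_final
```

Deduplicating the test-set this way prevents trivially inflated guessing rates from passwords memorized during training.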
PassGAN, due to its inherent training instability, does not exploit the full potential of the deep generative models in the context of password guessing. We propose a series of improvements in Section 2.3 to overcome these limitations. We will use the improved model (PassGAN+) as the basis for our encoder.
In Section 2.3.1, we propose our model improvements, which allow us to outperform PassGAN in the task of password guessing. In Section 2.3.2, we present the encoder network that we use as a tool to learn the inverse mapping. Our core contributions are founded upon these improvements and tools.
The password guessing approach presented in PassGAN suffers from an inherent training instability [PassGAN]. Hence, the generator and the discriminator may not perform a sufficient number of training iterations. This may lead to a limited approximation of the target data distribution and reduced accuracy in the password guessing task. Training instability is a common hurdle for GAN frameworks [WGAN]. The discrete representation of the strings (i.e., passwords) in the train-set (each string is represented as a binary matrix obtained by concatenating the one-hot encoded characters) introduces strong instability for two main reasons: (1) the discrete data format is very hard for the generator to reproduce because of the final softmax activation function, which can easily cause numeric instability and a low-quality gradient; and (2) the inability of the generator to fully mimic the discrete nature of the train-set makes it very easy for the critic to distinguish between real and generated data (here we refer to the original GAN formulation, where the critic discriminates between true and fake data). Hence, the critic can assign the correct "class" easily, leaving no room for the improvement of the generator, especially in the final stages of the training.
To tackle the problems above, we apply a form of stochastic smoothing over the representation of the strings contained in the train-set. Moreover, Sønderby et al. [AddNoiseDiscriminator] showed that adding noise to the input of the critic benefits the training process. Hence, in contrast to the work in [IGAN], we smooth the input of the critic instead of the output prediction. The smoothing operation consists of applying additive noise of small magnitude over the one-hot encoded representation of each character. The smoothing operation is governed by a hyper-parameter γ, which defines the upper bound of the noise's magnitude. We empirically choose a small value of γ and re-normalize each character distribution after the application of the noise. This smoothing operation has a significant impact on the dynamics of the training, allowing us to perform 30 times more training iterations without training collapse [brock2018large]. We keep the general GAN framework mostly unchanged because of the excellent performance of the gradient penalty in WGAN [IWGAN].
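The smoothing step can be sketched as below. The value of GAMMA is an assumption standing in for the paper's empirically chosen bound; the re-normalization keeps each character vector a valid distribution.

```python
import random

# Sketch of stochastic smoothing over one-hot character encodings:
# add noise in [0, gamma] to every entry, then re-normalize to sum to 1.
GAMMA = 0.01  # assumed upper bound on the noise magnitude

def one_hot(ch, charset):
    vec = [0.0] * len(charset)
    vec[charset.index(ch)] = 1.0
    return vec

def smooth(vec, gamma, rnd):
    noisy = [v + rnd.uniform(0.0, gamma) for v in vec]
    total = sum(noisy)
    return [v / total for v in noisy]  # re-normalize to a distribution

charset = "abc123"
row = smooth(one_hot("a", charset), GAMMA, random.Random(0))
```

Because γ is small, the most likely character is preserved while the critic's input becomes continuous, which is the point of the technique.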
With our improvements in the training process, we can exploit a deeper architecture for both the generator and the critic. We substitute the plain residual blocks with deeper residual bottleneck blocks [resnet], leaving their number intact. We find the use of batch normalization in the generator essential to successfully increase the number of layers of the networks. For precise information about the architecture, please refer to [resnet]. Additionally, we reduce the dimensionality of the latent space, inducing an additional small increment in performance for the password guessing task (we speculate that this increment is induced by the more compact, and thus smarter, latent representation).
The new architecture and training process are collectively referred to as PassGAN+. With it, we are able to learn a better approximation of the target password distribution and, consequently, obtain a significant improvement in the number of guessed passwords. This observation is supported by the results reported in Table 1, where PassGAN and our model PassGAN+ (both trained on an 80-20% train-test split) are compared over the RockYou test-set.
In this paper, we use all of the improved settings described in this section for our GAN model.
To fully exploit the properties offered by the learned latent representation of passwords, we need a way to efficiently explore the latent space. Therefore, our first interest is to understand the relation between the observed data (i.e., passwords) and their respective latent representations; in particular, their position within the latent space. A direct way to model this relation is to learn the inverse of the generator function. GANs, by default, do not need to learn this inverse, as the requirement is bypassed by the adversarial training approach; obtaining it requires framework variations [donahue2016adversarial, dumoulin2016adversarially] or additional training phases [luo2017learning].
To avoid any source of instability in the original training procedure, we opt to learn the inverse mapping only after the training of the generator is complete. This is accomplished by training a third encoder network E that has an architecture identical to the critic's, except for the size of the output layer. The network is trained to simultaneously map both the real data (i.e., coming from the train-set) and the generated data (i.e., coming from the generator) to the latent space. Specifically, the loss function of E is mainly defined as the sum of two cyclic reconstruction errors over the data space, as shown in the following:
In Eq. (2), the distance function is the cross-entropy, whereas the two inputs are sampled from the train-set and the prior latent distribution, respectively. The temperature variable refers to the temperature of the final softmax layer of the generator; when no temperature is specified in the generator notation, it is assumed to remain fixed during the training. The combination of these two reconstruction errors aims at forcing the encoder to learn a general function capable of correctly inverting both the true and the generated data. As discussed in Section 2.3.1, the discrepancy between the representations of true and generated data (i.e., discrete vs. continuous data) is potentially harmful to the training process. To deal with this issue, we anneal the temperature of the second loss term during the training, so as to slowly collapse the continuous representations of the generated data (i.e., the output of the generator) towards the same discrete representation as the real data (i.e., coming from the dataset). Next, an additional loss term, shown in Eq. (3), is added, forcing the encoder to map the data space into a dense zone of the latent space (dense with respect to the prior latent distribution).
Our final loss function for E is reported in Eq. (4). During the encoder training, we use the same train-set that we used to train the generator, but in this case we consider only the unique passwords.
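Since Eqs. (2)-(4) are referenced but not reproduced in this text, the structure of the encoder objective can be sketched as below. Everything here is a toy stand-in chosen only to make the loss structure concrete: G, E, the single-character "data points", the L2 form of the latent regularizer, and the weight `lambda_reg` are all assumptions, not the paper's exact formulation.

```python
import math

def softmax(v):
    m = max(v)
    e = [math.exp(x - m) for x in v]
    s = sum(e)
    return [x / s for x in e]

def cross_entropy(p, q, eps=1e-12):
    return -sum(pi * math.log(qi + eps) for pi, qi in zip(p, q))

def encoder_loss(x, z, G, E, lambda_reg=0.1):
    # cyclic reconstruction of real data: x -> E(x) -> G(E(x))
    loss_real = cross_entropy(x, G(E(x)))
    # cyclic reconstruction of generated data: z -> G(z) -> G(E(G(z)))
    gz = G(z)
    loss_fake = cross_entropy(gz, G(E(gz)))
    # keep encoded points in a dense region of the prior (toy L2 penalty)
    reg = lambda_reg * sum(v * v for v in E(x))
    return loss_real + loss_fake + reg

# toy generator/encoder that are (approximate) inverses of each other
G = softmax
E = lambda p: [math.log(pi + 1e-12) for pi in p]

x = [0.85, 0.05, 0.05, 0.05]   # smoothed one-hot for one character
z = [0.3, -0.1, 0.2, 0.0]
loss = encoder_loss(x, z, G, E)
```

The two cross-entropy terms mirror the two cyclic reconstruction errors described above, and the last term plays the role of the Eq. (3) regularizer that pulls encoded points toward a dense zone of the prior.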
In this Section, we present the first major contribution of our paper, i.e., the password strong locality concept and its possible applications for password guessing. In Section 3.1, we introduce the concept of password strong locality with the help of different practical examples. In Section 3.2, we demonstrate the practical application of strong locality by introducing a technique that we call “password template inversion” for closely-related passwords generation. Finally, we propose a possible attack scenario using strong locality and password template inversion, i.e., SSPG, in Section 3.3.
As we briefly introduced in Section 2.1, the latent representation learned by the generator enforces geometric relations among latent points that share semantic relations in the data space. As a consequence, the latent representation keeps "similar" instances closer to each other in the latent space (for any given metric; neural-based representations tend to be scale-invariant and are typically compared using the cosine distance).
In general, the concept of similarity harnessed in the latent space of a deep generative model solely depends on the modeled data domain (e.g., images, text). In the case of our password latent representation, this concept of similarity mainly relies on a few key factors, such as the structure of the password, the occurrence of common substrings, and the classes of characters. Figure 2 (obtained by t-SNE [maaten2008visualizing]) depicts this observation by showing a 2D representation of small portions around three latent points (corresponding to the three sample passwords "jimmy91", "abc123abc", and "123456") in the latent space. Looking at the area with the password "jimmy91" as its center, we can observe how the surrounding passwords share the same general structure (5L2D, i.e., 5 letters followed by 2 digits) and tend to maintain the substring "jimmy" with minor variations. Likewise, the area around the string "abc123abc" exhibits a similar phenomenon, even though this string was not present in the selected train-set and does not represent a common password template.
As this property of the latent space forces passwords in the same vicinity to share very specific characteristics, such as identical substrings, we refer to it as password strong locality. The password strong locality property asserts that the latent representations of passwords that share specific characteristics are organized close to each other in the latent space.
One of the direct consequences of the geometric ordering imposed by strong locality is that it provides a natural way to generate a specific class of passwords. Hence, if our aim is to generate passwords strictly related to a chosen prototype password, we simply have to fetch latent points around the latent representation of that password. By strong locality, the obtained latent points should be valid latent representations of passwords with an arbitrarily strong relation to the prototype. In this context, we refer to the chosen password (or its corresponding latent representation) with the term pivot. The three dark red boxes in Figure 2 are the pivot points in the latent space for their corresponding passwords. However, exploiting this property of the latent space for password guessing requires us to solve two non-trivial challenges:
For a chosen pivot password, we must efficiently obtain its corresponding latent representation.
We have to define a technique to explore the latent space surrounding the obtained latent point.
As described in Section 2.3.2, we solve the first challenge by exploiting an additional network that is trained to approximate the inverse function of the generator. The second challenge of exploring the latent space is solved by restricting the generator to sample from a confined area of the latent space (loosely represented by the small dashed circles in Figure 2). To that purpose, we consider a new latent distribution for the generator. The new distribution has the latent representation of the pivot password as its center and an arbitrarily small scale. To remain coherent with the prior latent distribution and partially avoid distribution mismatch for the sampled points [white2016sampling], we choose a Gaussian distribution N(E(x), σI), where the latent representation of the pivot x is obtained through the encoder E.
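This conditional sampling step can be sketched as follows. The pivot's latent vector is a hypothetical encoder output here (a real pipeline would obtain it from E and decode each sample with G); the σ values are illustrative only.

```python
import random

# Sketch of conditional generation around a pivot: sample latent points
# from a Gaussian centered at the pivot's latent representation.
def sample_around_pivot(z_pivot, sigma, n, rnd):
    return [[zi + rnd.gauss(0.0, sigma) for zi in z_pivot]
            for _ in range(n)]

rnd = random.Random(0)
z_pivot = [rnd.gauss(0.0, 1.0) for _ in range(16)]      # stand-in for E(pivot)
tight = sample_around_pivot(z_pivot, 0.05, 100, rnd)    # strongly related guesses
loose = sample_around_pivot(z_pivot, 0.35, 100, rnd)    # more diverse guesses
```

The two calls mirror the role of σ described next: a small σ confines samples to the pivot's immediate neighborhood, while a larger σ explores farther away.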
According to the concept of password strong locality, the strength of the semantic relation between a sampled latent point and its pivot should be proportional to the spatial distance between them. As a consequence, the chosen value of σ (i.e., the standard deviation) offers us a direct way to control the level of semantic binding present in the generated passwords. This intuition is better explained by Table 2, where passwords obtained with different values of σ for the same pivot password are reported.
Lower values of σ produce highly aligned passwords, while larger values of σ permit exploring farther from the pivot and producing different types of "similar" passwords. As shown in Table 2, all the passwords generated with the smallest σ retained not only the structure of the pivot (i.e., 5L2D) but also showed minor variations coherent with the underlying password distribution. On the other hand, passwords generated with larger values of σ tend to escape the password template imposed by the pivot and reach related-but-dissimilar password structures (e.g., "jimmy91992" and "j144988").
As briefly discussed in Section 3.1, the password strong locality property offers a natural way to generate a very specific/confined class of passwords for a chosen pivot, a task accomplished by exploiting the encoder network E. This encoder is trained to approximate the inverse of the generator function, and it is the only tool we have to explore the latent space meaningfully. The default behavior of the encoder is to take a string as input and precisely localize its corresponding latent representation in the latent space. As shown in Table 2, sampling from a distribution centered on the obtained latent point allows us to generate a set of related passwords. However, this approach alone finds limited application in the password guessing scenario.
In this Section, we show that it is possible to "trick" the encoder network into localizing general classes of passwords in the latent space. Users can arbitrarily define these classes via a minimal template, which expresses a fine- or coarse-grained definition of the target password. This general approach offers straightforward applications in real-world scenarios, some of which will be discussed and empirically evaluated in the next sections.
The encoder network can be forced to work around a specific password definition by introducing a wildcard character into its alphabet. The wildcard character (represented with the symbol '*' in this paper) can be used as a placeholder to express the presence of an unspecified character. For instance, the template "jimmy**" expresses the class of passwords starting with the string "jimmy" followed by two undefined characters. When the encoder inverts this string, the obtained latent point represents the center of the cluster of passwords in the latent space with a total length of seven characters and the prefix "jimmy". Therefore, sampling around this latent point allows us to generate good realizations (according to the learned distribution) of the input template. Column A of Table 3 shows a specimen of the passwords obtained for the template "jimmy**". In practice, we implement this behavior by mapping each wildcard character to an empty one-hot encoded vector when the matrix corresponding to the input string is given to the encoder. Wildcard characters can be placed at any position and in any quantity to define arbitrarily complex password templates; some examples are reported in the second row of Table 3.
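The wildcard-to-empty-vector mapping can be sketched as follows. Here '*' is an assumed rendering of the wildcard symbol, and the charset is illustrative.

```python
# Sketch of template encoding for the inversion step: each template is a
# character matrix in which wildcards map to empty (all-zero) one-hot rows.
def encode_template(template, charset, wildcard="*"):
    matrix = []
    for ch in template:
        row = [0.0] * len(charset)
        if ch != wildcard:          # wildcards stay all-zero
            row[charset.index(ch)] = 1.0
        matrix.append(row)
    return matrix

charset = "abcdefghijklmnopqrstuvwxyz0123456789"
m = encode_template("jimmy**", charset)
```

The all-zero rows give the encoder no evidence about those positions, so the resulting latent point sits at the center of the cluster of compatible passwords.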
An intuitive and interesting aspect of this approach is that the wildcards are substituted with the most probable characters according to the probability distribution modeled by the generator. This phenomenon can be observed in the generated samples (Column A of Table 3): wildcards in most of the generated passwords have been substituted with digits, reproducing the very common password pattern 'lower_case_string+digits' [ur2015added]. In contrast, passwords generated from the template "*****91" are reported in Column E of Table 3. In this example, we are asking the generator to find 7-character passwords whose last two characters are digits. Here, the generated passwords tend to lie in the two most likely password classes for this case, i.e., 'lower_case_string+digits' and all_digits. On the other hand, templates with more observed digits (e.g., Column F of Table 3) end up generating all_digits passwords with higher probability.
Template-based password generation with state-of-the-art tools:
To the best of our knowledge, state-of-the-art tools cannot perform this type of password generation. Probabilistic approaches such as Markov Model (MM)-based tools and RNN-based tools (e.g., FLA [fla]) are unable to match the expressiveness offered by our wildcard-based approach. In the case of MM, the assumed order-n Markov property limits the prediction of a character to the n preceding characters, but no more. More importantly, the forward-directionality of the process eliminates the possibility of an efficient estimation of wildcards occurring before a given substring (e.g., the case in Column C of Table 3). The issue of forward-directionality also affects the character-level RNN used by FLA, where the probabilities of an exponential number of passwords must be computed before the characters in the template can be used to prune the password tree. This is the case for the template reported in Column E, where the required computational cost for FLA is not far from computing all the passwords within the chosen probability threshold and filtering the ones coherent with the template. On the contrary, with our approach the generation of a password from any template costs a single network inference (two, if we also count the template inversion, which is performed once). Additionally, this issue persists even in the case of a bi-directional RNN [schuster1997bidirectional], where the model would still fail to efficiently handle "occasionally-punctured" templates (i.e., plain characters between wildcards), such as the ones reported in Columns D and F of Table 3.
The tools that can potentially match the efficiency of our approach are the non-probabilistic ones. In particular, such approaches can take a single string as input and derive millions of passwords by applying mangling rules. Usually, the language used to write these rules is very expressive and can match the expressivity of our approach. However, such state-of-the-art tools use human-crafted rules, whereas our approach is totally unsupervised. Moreover, the mangling rules are fixed before the password generation process and are plainly applied to every dictionary entry. In other words, there is no relation between the input string and the applied rule (intuitively, not every mangling rule has equal reason to be applied to every dictionary entry). Our approach overcomes this limitation: the generated passwords are dependent on, and univocally induced by, the used template. This means that only passwords coherent with both the template and the approximated password distribution are eventually generated.
An interesting scenario in the domain of password guessing is when the target password is partially known. There are at least two practical situations in which the target password is indeed partially available: (1) targeted attacks, by which an adversary targets particular users via side-channels [ali2015keystroke, marquardt2011sp, vuagnoux2009compromising, balzarotti2008clearshot] and other similar approaches to infer the victim's password; such attacks often reveal only part of the password/text correctly, owing to the attacks' limited accuracy [chen2010side]; and (2) when a user forgets his/her own password but remembers it partially, a common situation given the characteristics of human memory.
Formally, we consider a scenario where an attacker possesses a non-empty set of arbitrarily small substrings of an unknown target password. The attacker has no knowledge about the correct position of the substring(s) in the target password, nor about its length; however, we assume that the available substrings are error-free. The attacker aims at recovering the full password using the information offered by the known substrings. We refer to this scenario as SubString Password Guessing (SSPG). Thanks to the attributes of strong locality, SSPG can also be intuitively performed with erroneous/noisy substrings; we discuss the performance of SSPG with noisy substrings in Appendix B.
As discussed in Section 3.2, the strong locality property enforces a geometric relation among passwords having common substrings. Consequently, passwords sharing at least one substring are likely to be located in a few specific clusters in the latent space. A natural solution to perform a smart SSPG is to localize these clusters and sample latent points from them. By mapping the sampled latent points into the data space, we would indeed be able to generate good candidate passwords that contain the required substrings with high probability. The first obstacle in this procedure arises from the way these clusters are distributed in the latent space. Passwords of different lengths tend to be organized in different sections of the latent space (we will exploit this property in Section 4.1 for a different type of password guessing). The reason for such an organization is that the length of a password is modeled as one of the core explanatory factors [bengio2013representation] by the latent representation. Consequently, passwords with different lengths are distributed far from each other. For instance, the password “123456” will be sufficiently far from the password “123456789” even though these two passwords share the significantly large substring “123456” (in other words, it is unlikely that we can reach the latent representation of “123456789” using “123456” as a pivot). A similar hindrance results from the latent representation of a substring's position inside the password. For instance, the latent representation of “jimmy91” will be far from that of “91jimmy”.
In other words, passwords containing a given substring are distributed over more than one cluster in the latent space. Therefore, to correctly assess every possible password, we have to cover all such clusters during the password generation process. Accordingly, the challenge is to correctly localize those clusters in the latent space. Fortunately, we already know an efficient way to do so: our template inversion (discussed in Section 3.2). Consider the following example: to localize the clusters of passwords of a given length that start with “jimmy”, we invert the template obtained by padding “jimmy” with the required number of unconstrained positions. The same can be done for every admissible position of the substring within the password, i.e., templates where “jimmy” is shifted by one or more positions. Therefore, for any substring, we can easily spot all the valid zones of the latent space by enumerating all the possible password templates containing that substring. As a representative example, we obtain all the possible templates (shown in Table A.2 in Appendix A) for passwords up to a maximum length that contain the substring “jimmy”. The same operation can be performed for multiple disjoint substrings just by computing the possible valid templates. The pseudo-code for our SSPG approach is shown in Algorithm 1, where the Encoder and the Generator are used together with the routine enumerateTemplates, a function that returns the list of valid templates for a given set of substrings and a maximum password length. A final clause checks that each generated password complies with its template.
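The enumeration step described above can be sketched as follows. This is a minimal illustration under our own assumptions: templates are represented as strings in which a wildcard character marks the unconstrained positions (the wildcard symbol and function name are ours, not necessarily the paper's notation).

```python
from typing import List

WILDCARD = "*"  # assumed placeholder for an unconstrained character position

def enumerate_templates(substring: str, max_len: int) -> List[str]:
    """Enumerate every template that places `substring` at each valid
    offset, for every password length up to `max_len`."""
    templates = []
    for length in range(len(substring), max_len + 1):
        # the substring can start at any offset that keeps it inside the password
        for offset in range(length - len(substring) + 1):
            left = WILDCARD * offset
            right = WILDCARD * (length - offset - len(substring))
            templates.append(left + substring + right)
    return templates
```

For the substring “jimmy” and a maximum length of 10, this enumeration yields 21 templates, from “jimmy” itself up to “*****jimmy”.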
In this section, we evaluate our approach against the state-of-the-art mangling-rules-based approaches, i.e., JTR [jtr], HashCat [hashcat], and PCFG [pcfg], in the SSPG scenario. We already discussed in Section 3.2 that, due to their inefficiency, MM-based and RNN-based tools are not a practical choice for this type of attack. We are aware that the state-of-the-art tools are not fully suitable for, or designed to perform, SSPG. For a fair comparison, we limit our approach to generate the same number of valid passwords as the best-performing competitor tool and focus on a qualitative comparison of the generated passwords. To further increase fairness, we choose the most suitable SSPG scenario for the competitor tools, i.e., a substring set containing just one single entry.
We use the LinkedIn password leak [linkedin_leak] as the base dataset for the performance evaluation of the different tools. From this set of unique passwords, we keep only passwords up to a maximum length, obtaining a filtered set of unique passwords larger than the RockYou train-set used to train our model (details in Section 2.3.1). For a holistic evaluation, we created two sub-datasets from this filtered LinkedIn dataset. These sub-datasets are created by selecting passwords that contain peculiar substrings. In other words, these substrings model two different levels of attacker knowledge and two attack scenarios:
The substrings used to create the first sub-dataset are taken from a list of English first names [english_names]. The aim of this evaluation set is to model a scenario where the attacker knows the first name of the target. First names are usually public and commonly used as part of passwords; indeed, many passwords in the LinkedIn leak contain English first names [english_names]. Moreover, first names are also frequently used as part of common/classic password templates that can be easily reproduced by mangling rules. To create the sub-dataset, we proceed as follows: for each name in the list, we collect the set of all passwords in the filtered-LinkedIn dataset that contain the name as a proper substring. Next, we retain only those sets with a minimum cardinality in the filtered-LinkedIn dataset. The selected substrings are composed of lower-case characters only.
The second sub-dataset represents a more general case, and it also better fulfills the preconditions discussed in Section 3.3. In this case, we select the most common substrings with a minimum length of 3 that are present in the passwords of the LinkedIn leak. As for the sub-dataset of English first names, we collect, for each substring, the set of passwords containing it; since the average length of the selected substrings differs, we use a different minimum cardinality in the filtered-LinkedIn dataset. A small sample of the selected substrings is shown in Table 4.
Given an evaluation set, we evaluate each tool in the task of password guessing over each subset of passwords (i.e., the passwords sharing a given substring) separately. In the process, every tool is given the substring, which it exploits for generating guesses. Then, the intersection between the set of generated guesses and the subset is computed, i.e., the number of generated passwords that match the passwords in the subset. Finally, the overall number of matched passwords for the entire evaluation set against the total number of generated passwords for the entire evaluation set is used as the evaluation criterion for each tool.
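The evaluation criterion amounts to a set intersection; a minimal sketch follows (function and variable names are illustrative, not from the paper):

```python
def match_rate(guesses, targets):
    """Return the number and fraction of target passwords hit by the guesses."""
    matched = set(guesses) & set(targets)
    return len(matched), len(matched) / max(len(set(targets)), 1)

# toy example: two of the three target passwords are matched
hits, rate = match_rate(["jimmy91", "jimmy123", "1jimmy1"],
                        {"jimmy91", "jimmy007", "1jimmy1"})
```

Summing the per-substring counts over the whole evaluation set gives the numbers plotted against the total guesses generated.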
For our approach, we generate guesses using Algorithm 1, with the algorithm's parameters chosen empirically for each substring. The number of requested passwords was chosen to match the maximum number of passwords produced by one of the state-of-the-art tools.
For the mangling-rules-based tools (i.e., HashCat and JTR), we implement SSPG by keeping the substring as the only entry in the dictionary and applying the rules to this single-entry dictionary to generate passwords. For each of these two tools, we chose the largest set of rules available (to the best of our knowledge): KoreLogic [korelogic_rules] for JTR and Dive [dive_rules] for HashCat. We generate every possible password using these rule-sets and consider as valid only those generated passwords that contain the given substring as a proper substring.
For PCFG, we train its grammar using the same dataset (i.e., the RockYou train-set) used to train our generator, with the default parameters [pcfg_github]. During the password generation process, we use the given substring as the only valid entry for the Alpha variables of PCFG [pcfg], whereas the entries for the other types of variables (e.g., Capitalization and Digits) are kept unaltered. We can evaluate PCFG only on the first evaluation set because it cannot model variables that represent strings composed of mixed character classes (such as “a19” and “a20”, reported in Table 4 for the second evaluation set). Here, we generate passwords for each substring and retain only the unique passwords containing that substring.
Figure 3 (a) and Figure 3 (b) show the results for the different tools over the two evaluation sets, respectively. As mentioned in Section 3.3.1, each line depicts the sum of the matched passwords against the total number of generated valid passwords for the entire evaluation set. HashCat, JTR, and PCFG (when applicable) generate a heterogeneous number of valid passwords. For HashCat and JTR, that value is directly influenced by the number and types of word-mangling rules. JTR with the KoreLogic rule-set produces the highest number of valid passwords; therefore, we tune our Algorithm 1 to produce passwords of the same magnitude. At the bottom, PCFG is able to produce only a limited number of valid passwords despite the higher limit of requested passwords. The reason for such a low number is the small size of the grammar produced after training, whose output cannot be increased.
On the first evaluation set, HashCat generates the smallest number of valid passwords. PCFG generates more passwords than HashCat but matches slightly fewer of them. On the other side, JTR generates the highest number of valid passwords and matches more passwords than both HashCat and PCFG. In this experiment, our GAN model matches over 240% more passwords than JTR for an equal number of guesses. It is important to highlight that, in contrast to the other tools, our GAN model can continue to generate (depending on the number of requested passwords in Algorithm 1) and efficiently match more passwords.
On the second evaluation set, HashCat again generates the smallest number of valid passwords, whereas JTR again generates the highest number. Nevertheless, both tools match a significantly smaller number of passwords compared to our GAN model in SSPG mode. Our approach matches over 900% and 400% more passwords than HashCat and JTR, respectively. The reason why both HashCat and JTR perform poorly is that the mangling rule-sets are designed to match the most probable passwords and are not intended to work with arbitrary strings. Hence, in such a complex situation, our approach, which is not biased toward a specific scenario, performs significantly better than the other state-of-the-art tools.
In this section, we present our second major contribution, i.e., the password weak locality concept and its possible applications in the field of password guessing. In Section 4.1, we introduce the concept of password weak locality with the help of different practical examples. Section 4.2 presents DPG from a theoretical (Section 4.2.1) as well as a practical (Section 4.2.2) point of view. Finally, we mention potential applications of DPG.
The embedding properties of the latent representation map passwords with similar characteristics close to each other in the latent space. We called this property strong locality, and we exploited it to generate variations of a chosen pivot password or template (discussed in Section 3.1). In that case, the adjective “strong” highlights the strict semantic relation among the generated set of passwords. However, the same dynamics that enables the strong locality also allows a more generic and broad form of semantic bonding among passwords. This latter property seems able to partially capture the general features of the whole password distribution, such as the average password length or the character distribution induced by password policies. We refer to this observed property as password weak locality, in contrast with the strong locality.
As a representative example, Figure 4 depicts the 2D representation of passwords from myspace [myspace], hotmail [hotmail], and phpbb [phpbb] in the latent space learned by a generator on the RockYou train-set (these graphical depictions are obtained with a dimension-reduction algorithm; hence, they do not depict the latent space accurately and merely serve as a representative illustration. We will verify our assumption empirically later in the paper). We can observe that the passwords coming from the same dataset tend to be concentrated in the latent space and do not scatter across the whole spectrum. This can be traced back to the fact that passwords sharing very general features (e.g., those coming from the same password distribution) are mapped close to each other in wide but bounded zones of the latent space.
The fraction of the latent space covered by an entire password set (the red parts in Figures 4 (a), (b), and (c)) clearly depends on the heterogeneity of its passwords. Passwords from smaller sets (e.g., myspace) are concentrated in restricted and dense zones of the latent space, whereas passwords from larger sets (e.g., phpbb) tend to cover a bigger section while still remaining closely knit.
In the following sections, we will present evidence of the weak locality property, and we will show how to exploit this property of the latent space to improve password guessing.
Probabilistic password guessing tools implicitly or explicitly attempt to capture the data distribution behind a set of observed passwords, i.e., the train-set. This modeled distribution is then used to generate new and coherent guesses during a password guessing attack. A train-set is usually composed of passwords that were previously leaked, and, by assumption, every leaked password set is characterized by a specific password distribution. When we train the probabilistic model, we implicitly assume the train-set's distribution to be general enough to well represent the entire class of password distributions. This generality is needed because real-world password guessing attacks are performed over sets of passwords that potentially come from completely different password distributions. As a matter of fact, we typically do not have any information about the distribution of the attacked set, which can be completely different from the one used for model training. As a representative example, different password policies or users' predominant languages can cause the test-set's distribution to drastically differ from the train-set's distribution. This discrepancy between the distributions of the train-set and the test-set is a well-known issue in machine learning, and it is referred to as covariate shift [covariate_shift].
As stated above, we typically do not know anything about the distribution of the attacked set. However, once we crack the first password, we can start to observe and model the attacked distribution. Every new successful guess provides valuable information that we can leverage to improve the quality of the attack, i.e., to reduce the covariate shift. This iterative procedure recalls a Bayesian-like approach, since there is continuous feedback between observations and the probability distribution. However, we highlight that in our case we use neither a prior nor a posterior probability distribution.
For fully data-driven approaches, a naive solution to incorporate such new information is to fine-tune the model so as to change the learned password distribution. However, prescribed probabilistic models such as FLA directly estimate the password distribution using a parametric function p(x; θ),
where θ is the set of weights of a neural network. In this case, the only way to modify the distribution in a meaningful way is to act on θ through a learning process. However, this is not an easy or attractive solution, mainly because the newly guessed passwords are potentially not representative enough (a few cracked passwords against a dataset of millions of uncracked ones) to force the model to generalize over the new information. Additionally, the computational cost of fine-tuning the network is considerable, and sound results cannot be ensured due to the sensitivity of the process.
Similar to FLA, our generative model also exploits a neural network as an estimator. However, the modeled distribution is obtained by marginalizing a joint distribution over passwords and latent points, shown in Eq. 6: p(x) = ∫ p(x | z) p(z) dz,
where p(z) is referred to as the latent distribution.
As introduced in Section 2.1, when p(z) is the prior latent distribution used during training, p(x) is provably a good approximation of the target data distribution (i.e., the distribution followed by the train-set). Nevertheless, p(z) can be arbitrarily chosen and used to indirectly change the probability distribution modeled by the generator. The right-hand side of Eq. 6 clearly shows that the generator is not the only free parameter affecting the final password distribution. Indeed, p(z) is completely independent of the generator, and so it can be modified arbitrarily without acting on the parameters of the neural network.
This possibility, along with the password weak locality of the latent space, allows us to correctly and efficiently generalize over newly guessed passwords, leading the pre-trained network to model a password distribution closer to the attacked one. It is noteworthy that this capability of generalizing over new points is achieved via the weak locality and not by the neural network itself. The intuition here is that when we change the latent distribution to assign more density to a specific guessed password, we are also increasing the probability of its neighboring passwords, which, due to the weak locality property, are the passwords with similar characteristics. This, in turn, makes it possible to highlight the general features of the guessed passwords (e.g., structure, length, character set), instead of focusing on their more fine-grained and specific aspects, which do not give us hints on the attacked password distribution.
Thus, by controlling the latent distribution, we can choose to increase the probabilities of the zones that are potentially covered by the passwords coming from the attacked distribution. We call this technique Dynamic Password Guessing (DPG). In the case of a homogeneous distribution (e.g., myspace), we can narrow down the solution space around the dense zones and avoid exploring the whole latent space. On the other hand, for password sets sampled from distributions far from the one modeled by the generator, we can focus on zones of the latent space that would otherwise have been poorly explored. In both cases, we can reduce the covariate shift and improve the performance of the password guessing attack.
In this section, we explain DPG from a practical point of view; Algorithm 2 summarizes it.
Here, the algorithm takes as input the target set of passwords, maintains the collection of all the passwords guessed so far by the generator, and uses a hot-start parameter of the attack, an ingredient that we describe later in this section. At every step, a latent point is sampled from the latent space according to the current latent distribution. The procedure makeLatentDistribution returns the latent distribution induced by the group of passwords guessed up to that step. Leveraging the maximum-likelihood framework, we choose this distribution so as to maximize the probability of the set of observed passwords according to Eq. 7, using the latent distribution as the only free parameter.
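The main loop of Algorithm 2 can be sketched as follows. This is a simplified sketch under our own assumptions: `generator` and `encoder` stand in for the paper's pre-trained networks, and sampling from the mixture is inlined as "pick a guessed latent point, then add Gaussian noise":

```python
import numpy as np

def dynamic_password_guessing(generator, encoder, targets, alpha, sigma,
                              latent_dim, n_iter, seed=0):
    """Sketch of DPG (Algorithm 2): sample latent points from the prior
    until `alpha` passwords have been guessed (hot-start), then from
    Gaussians centred on the latent points of previously guessed passwords."""
    rng = np.random.default_rng(seed)
    guessed = set()   # all passwords guessed so far
    latents = []      # latent points of the guessed passwords
    for _ in range(n_iter):
        if len(guessed) < alpha:
            z = rng.standard_normal(latent_dim)              # prior p(z)
        else:
            center = latents[rng.integers(len(latents))]     # pick a component
            z = center + sigma * rng.standard_normal(latent_dim)
        x = generator(z)
        if x in targets and x not in guessed:
            guessed.add(x)
            latents.append(encoder(x))                       # update mixture
    return guessed
```

A toy `generator`/`encoder` pair is enough to exercise the control flow; the real networks are drop-in replacements.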
This is accomplished by considering a latent distribution conditioned on the set of passwords guessed at each step. The final password distribution represented by the generator during DPG is reported in Eq. 8.
As a natural extension of the proximity password generation harnessed in Section 3.2, we choose to represent the conditional latent distribution as a finite mixture of isotropic Gaussians. In particular, the mixture contains one Gaussian per latent point of the guessed passwords: each Gaussian is centered at the corresponding latent point and has a fixed standard deviation σ.
When the probability of a guessed password is known (i.e., we know its frequency in the attacked set of passwords; in an offline attack, this is usually the case), we weight the corresponding mixture component proportionally to it; otherwise, a uniform weighting among the Gaussians is assumed. Equation 9 defines the probability density function over the latent space.
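A sketch of how makeLatentDistribution could build and sample the resulting mixture (the routine name comes from Algorithm 2; the implementation details below are our assumptions, not the paper's code):

```python
import numpy as np

def make_latent_distribution(latent_points, sigma, weights=None, seed=None):
    """Return a sampler for the mixture of isotropic Gaussians N(z_i, sigma^2 I),
    one component per guessed latent point. `weights` are the (unnormalised)
    frequencies of the guessed passwords; uniform weights are assumed when
    frequencies are unknown."""
    Z = np.asarray(latent_points, dtype=float)
    k = len(Z)
    if weights is None:
        p = np.full(k, 1.0 / k)                 # uniform mixture weights
    else:
        w = np.asarray(weights, dtype=float)
        p = w / w.sum()                         # frequency-proportional weights
    rng = np.random.default_rng(seed)

    def sample(n):
        idx = rng.choice(k, size=n, p=p)        # pick a component per sample
        return Z[idx] + sigma * rng.standard_normal((n, Z.shape[1]))

    return sample
```

With uniform weights, each guessed latent point is equally likely to seed the next sample; supplying frequencies biases sampling toward the zones of the most frequent passwords.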
Every newly guessed password adds to the mixture a new Gaussian centered at its latent point. As a consequence, every new guess changes the latent distribution by moving density toward the zone of the latent space where it lies. Figure 5 visualizes this phenomenon.
Figure 6 depicts the performance comparison between a static attack (e.g., PassGAN) and DPG over the three password sets. Adaptively changing the latent distribution allows us to boost the number of guessed passwords per unit of time. On the phpbb set, we match additional passwords with respect to the static attack technique. Importantly, this improvement comes without any additional information or assumption about the attacked password set. In addition, the computational overhead due to the new sampling technique is negligible. The steep improvement in performance obtained with DPG gives additional support to our assumption on the weak locality of the latent space. Furthermore, it confirms that reducing the covariate shift has a direct and concrete impact on the number of guessed passwords.
The sudden growth in the number of guessed passwords in DPG (shown in Figure 6) is due to the hot-start parameter. In other words, we use the prior latent distribution until a predetermined number of passwords has been guessed; after that, we switch to the conditional latent distribution. The reason is that, if DPG starts with the very first guessed password, the latent distribution can get stuck in a small area of the latent space. However, launching DPG only after guessing a sufficient number of passwords (i.e., after finding a set of unbiased latent points in the latent space) gives us the possibility of matching a heterogeneous set of passwords, which correctly localizes the dense zones of the latent space where the attacked passwords are likely to lie. These observations are also evident in our empirical results shown in Figure 7, which depicts a comparison among the static attack, a DPG with hot-start, and a DPG without hot-start. These results confirm that the absence of hot-start indeed degrades the performance of DPG.
The final hyper-parameter of our attack is the standard deviation σ assigned to every Gaussian in the mixture. This value defines how far we want to sample from the clusters of observed passwords. A larger value of σ allows us to be less biased and explore a wider zone around the guessed passwords, whereas a smaller value permits a more focused inspection of those zones. Therefore, the value of σ can be interpreted as the parameter controlling the exploration-exploitation trade-off (a trade-off often occurring in reinforcement learning) in the attack. Figure 8 depicts the effect of different values of σ on the performance of DPG. Smaller values of σ yield better overall results. This outcome suggests that it is not necessary to sample too far from the dense zones imposed by the guessed passwords; rather, a focused exploration of those zones is beneficial. This observation is perfectly coherent with the concept of weak locality, giving further support to the speculated ability of the latent space to capture and translate the general features of an entire password distribution into geometric relations.
Applications and conclusions: We demonstrated that the DPG framework offers a direct way to deal with the covariate shift phenomenon that naturally occurs in real-world password guessing scenarios. Furthermore, the potential of DPG is not limited to this specific phenomenon. Following are a few potential applications of DPG:
DPG can be easily extended to support a form of “bootstrapping” of the password guessing attack. As an example, consider a situation in which the attacker knows a small set of passwords from the targeted password set. The attacker can utilize this additional information to bias the password guessing attack and boost its performance even before the attack starts. More precisely, these known passwords can be modeled in the latent distribution of the generator by moving density around the new latent points obtained with the encoder network. In practice, this can be easily achieved by initializing the set of guessed passwords (Algorithm 2) with the known ones. Moreover, these latent points could also be obtained using the template inversion technique presented in Section 3.2; this covers the case in which the attacker somehow retrieves only partial information rather than entire passwords.
Hitaj et al. [PassGAN] demonstrated that GAN-based models are able to produce a class of passwords significantly different from (disjoint with) the passwords obtained using other state-of-the-art tools. Therefore, combining such different password guessing approaches improves the final outcome [PassGAN]. Our DPG technique offers a direct way to further enhance the performance of such combinations. As a matter of fact, DPG allows us to focus on the zones of the latent space that were not covered (due to its design/bias) by the first tool in the pipeline. Consequently, the DPG technique can significantly increase the probability of generating guesses that were not generated by the previous tool.
In conclusion, the building blocks of the DPG, i.e., the malleable latent density and dynamic approximation of the attacked password distribution, are generalized approaches that can be used to increase the attack’s performance under various assumptions. These concepts can be easily extended and are naturally open to various applications.
Password guessing is a classical attack in which an attacker tries to guess the right password by repeatedly testing candidate passwords. Systematic studies of password guessing date back to 1979 [morris1979password], and password guessing attacks have probably existed since the inception of the concept of passwords [bidgoli2006handbook]. Since a vast number of works have been proposed in this active area of research, we limit the discussion in this section to state-of-the-art tools and techniques used for password cracking.
Narayanan et al. [narayanan2005fast] proposed to use standard Markov modeling techniques from natural language processing to generate password guesses. Their approach requires manual intervention to define password rules that describe the structure of the generated passwords. Weir et al. [pcfg] extended this technique via Probabilistic Context-Free Grammars (PCFGs); in particular, they showed a technique to “learn” the password rules from a given set of passwords. Durmuth et al. [durmuth2015omen] and Ma et al. [ma2014study] have also proposed enhancements in this direction of password guessing.
John The Ripper (JTR) [jtr] and HashCat [hashcat] are the two most widely used password guessing tools. Both JTR and HashCat have demonstrated their effectiveness at guessing/recovering passwords from several leaked password datasets [cracking_passwords_101]. Both tools support a number of password guessing strategies, including: (1) classical brute-force attacks; (2) dictionary-based attacks; (3) rule-based (also called mangled-wordlist) attacks [hashcat_rules, korelogic_rules], one of the most exploited techniques; and (4) Markov-model-based attacks [hashcat_markov, jtr_markov].
Ciaramella et al. [ciaramella2006neural] introduced neural networks for password guessing in their seminal work. In the same line of development, Melicher et al. [fla] proposed FLA (Fast, Lean, and Accurate), which uses recurrent neural networks [graves2013generating, sutskever2011generating] to estimate the password distribution, which is then used to estimate the strength of a password. Hitaj et al. [PassGAN] presented PassGAN, which uses a GAN to autonomously learn the distribution of real passwords from actual password leaks and to generate password guesses.
Similarly to our SSPG framework, different works have focused on creating password variations for a given starting password [pal2019beyond, das2014tangled], primarily with the aim of modeling credential tweaking attacks. Credential tweaking is a targeted attack where the adversary knows the targeted user's credentials for one or more services and aims to compromise accounts of the same user on other services. Unlike credential stuffing, here the user's passwords are supposed to be “tweaked” versions of the known ones (users create such variations, for instance, to accommodate the password composition policies of different services). In this direction, Pal et al. [pal2019beyond] proposed novel attack/defense techniques for credential tweaking. Both the attack and the defense techniques are built on top of a password similarity concept. They model a specific form of semantic similarity by using a supervised dataset of user-password pairs. They assume the distributional hypothesis to hold for passwords, and define two passwords to be “similar” if they are often chosen together by users. The proposed attack technique is founded on a probabilistic neural model, and it aims to produce tweaked variations of an input password. The produced variations are then used as suitable guesses for the targeted tweaking attack. More interestingly, their defensive technique offers a first glance at the application of supervised representation learning to password guessing. Their technique is based on constructing an embedding space learned with an off-the-shelf word-embedding tool. In such a space, the geometric relation between passwords is used to estimate the similarity between chosen passwords. This similarity measure is then used to build a “Personalized Password Strength Meter” that aims to spot the use of a tweaked password at password creation time.
In contrast to our password representation, their embedding space does not support a sampling operation and, therefore, cannot be used for password generation.
Orthogonal to current research directions, we propose a complete paradigm shift in the task of password guessing. We demonstrate that the locality principles imposed by the latent representation of a GAN generator open new practical and theoretical possibilities in this field. Based on these properties, we propose two new password guessing frameworks, i.e., SSPG and DPG. SSPG, along with its underlying foundation, i.e., template-based password generation, is useful in several real-world scenarios, and we empirically demonstrated its efficiency over its potential competitors. DPG demonstrates that the knowledge from freshly guessed passwords can be successfully generalized and used to reduce the covariate shift phenomenon. We believe that these properties can also be used to efficiently estimate password guessability; we will explore this possibility in future work.
The code, pre-trained models, validation sets, and other materials related to our work are publicly available at: https://tinyurl.com/yyqbv7n2
Table A.1 lists the values of the hyper-parameters used to train the encoder network. Table A.2 shows the templates for passwords with a maximum length of 10 that contain the substring “jimmy”. Figure A.1 shows the distribution of matched passwords over each substring for our approach against the best-performing competitor tool in the task of error-free SSPG.
Temperature decay step: 250000
SSPG naturally comes in handy in scenarios where partial information (a substring) of the target password is known. SSPG assumes that the known substring is correct. But what if it is not completely accurate? In other words, how does SSPG perform if the substring is erroneous (e.g., "min" instead of "man")? This situation can arise, for the two cases we mentioned in Section 3.3, due to: (1) inaccuracy of the side-channel attack; and (2) incorrect remembrance of the password by the user. Formally, we now consider an attack scenario where the available substring s' is a noisy version of the true substring s that is actually present in the target password.
The password strong locality property offers us a way to deal with this kind of scenario as well. As shown in Section 3.1, it allows us to explore passwords sharing a common substring as well as passwords with variations of that substring. The samples reported in Table 2 present some such cases. Single-character variations, e.g., 's' in "simmy91" or 'm' in "mimmy91", are reachable from "jimmy91" with a lower value of σ (i.e., 0.05). On the other side, reaching variations of more characters, e.g., "sirsy91" from "jimmy91", requires a higher value of σ.
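The locality mechanism can be sketched as sampling latent points from an isotropic Gaussian centred on the pivot, with σ controlling how far from the pivot we explore. A minimal sketch under the assumption of a 128-dimensional latent space; the trained generator G that decodes latent points back into passwords is omitted, and all names are illustrative:

```python
import numpy as np

def sample_around_pivot(z_pivot, sigma, n, rng):
    """Sample n latent points from an isotropic Gaussian centred on the
    pivot; sigma controls how far from the pivot we explore."""
    return rng.normal(loc=z_pivot, scale=sigma, size=(n,) + z_pivot.shape)

rng = np.random.default_rng(0)
z_pivot = rng.normal(size=128)  # latent pivot, e.g. induced by "jimmy91"

near = sample_around_pivot(z_pivot, sigma=0.05, n=1000, rng=rng)
far = sample_around_pivot(z_pivot, sigma=0.15, n=1000, rng=rng)

# The mean distance from the pivot grows with sigma: a small sigma keeps
# decoded guesses close to single-character variations, a larger sigma
# reaches multi-character variations.
d_near = np.linalg.norm(near - z_pivot, axis=1).mean()
d_far = np.linalg.norm(far - z_pivot, axis=1).mean()
assert d_near < d_far
```

In a full pipeline, each sampled point would be decoded with the generator and the resulting string emitted as a guess.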
As shown in our results (Section 3.3.1), the state-of-the-art tools either do not work at all or, at best, perform SSPG inefficiently even with an error-free substring (Section 3.3), whereas none of them is designed to model SSPG with a noisy substring. To be specific, enumerating passwords containing s' with a character-level RNN [fla] does not provide any information about the passwords containing s. Likewise, applying word mangling rules on s' is unlikely to produce passwords with s as a substring. On the contrary, our approach, thanks to strong locality, maps passwords containing s or s' close in the latent space, provided that s and s' are similar to each other (e.g., "man" and "min", as opposed to "abc" and "9$t"). Therefore, sampling around the pivots induced by s' can also cover the passwords related to s, depending on the similarity of s and s' as well as on the chosen value of σ. Furthermore, the chosen value of σ can be used to reflect the confidence that the attacker has in the eavesdropped s'. The only modification required in our Algorithm 1 to perform SSPG with a noisy substring is to remove the clause that filters out guesses not containing the known substring.
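The modification amounts to dropping the substring filter from the guess loop. The sketch below uses a toy stand-in decoder purely for illustration (the real Algorithm 1 operates on templates and a trained GAN generator; all names here are our own):

```python
import numpy as np

def sspg_guesses(decode, pivots, sigma, n_per_pivot, substring=None, rng=None):
    """Sample around each latent pivot and decode to candidate passwords.
    If `substring` is given (error-free SSPG), keep only guesses that
    contain it; pass None for the noisy-substring variant."""
    rng = rng if rng is not None else np.random.default_rng(0)
    guesses = []
    for z in pivots:
        for zi in rng.normal(z, sigma, size=(n_per_pivot,) + z.shape):
            p = decode(zi)
            if substring is None or substring in p:
                guesses.append(p)
    return guesses

# Toy stand-in for the trained generator G (illustration only).
vocab = ["jimmy91", "simmy91", "mimmy91", "sirsy91"]
decode = lambda z: vocab[int(abs(z.sum())) % len(vocab)]

pivots = [np.zeros(4)]
strict = sspg_guesses(decode, pivots, 0.1, 50, substring="jimmy")
noisy = sspg_guesses(decode, pivots, 0.1, 50, substring=None)
# Removing the clause can only widen the produced guess set.
assert set(strict) <= set(noisy)
```

With an erroneous substring, the strict variant would discard exactly the guesses that matter, which is why the clause is removed.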
To give empirical support to our claim, we repeat the SSPG experiment of Section 3.3.1, but now with a noisy version of the substrings. In particular, we compute the noisy substring s' for each s by applying a distortion function on s. This distortion function selects a random character in s and substitutes it with another random character of the same character class. Then, we apply the SSPG algorithm (Algorithm 1) with s' and the same number of guesses. We chose two different values of σ. The passwords obtained from these new experiments are evaluated against the results obtained with the exact substring, i.e., we take the passwords matched with s as the ground truth instead of the entire set. Figure B.1 depicts the proportion of matched passwords with respect to the experiment with exact substrings for the two new experiments.
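The distortion function described above can be sketched as follows; the character-class definitions and helper names are our own, introduced for illustration:

```python
import random
import string

# Character classes used to constrain the substitution.
CLASSES = [string.ascii_lowercase, string.ascii_uppercase,
           string.digits, string.punctuation]

def char_class(c):
    """Return the character class of c, or None if it has no class."""
    for cls in CLASSES:
        if c in cls:
            return cls
    return None

def distort(s, rng=random):
    """Replace one random character of s with another random character
    of the same character class (e.g. 'man' -> 'min')."""
    i = rng.randrange(len(s))
    cls = char_class(s[i])
    if cls is None:  # character of unknown class: leave s untouched
        return s
    c = rng.choice([x for x in cls if x != s[i]])
    return s[:i] + c + s[i + 1:]

noisy = distort("jimmy")
# Exactly one position changed, within the same character class.
assert len(noisy) == 5 and sum(a != b for a, b in zip(noisy, "jimmy")) == 1
```

Substituting within the same class keeps s' plausible for the side-channel and mistyping scenarios, where errors rarely cross class boundaries.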
To present a clear picture, we report only a random subset of entries. With the same value of σ used for the experiment with the exact substring, we are able to cover part of the previous results. Hence, thanks to the dynamics described above, we can match passwords even if the substring used to localize the zones of the latent space is partially erroneous. However, the obtained results are heterogeneous: in a few cases we match a high number of passwords, while in others we match fewer. We believe this is intrinsically related to the organization of the latent space, where not all perturbations are treated equally. Nevertheless, increasing the value of σ enables us to explore farther from the pivots induced by the noisy substring s', which increases the probability of covering the areas of the latent space dedicated to the clean substring s. As evident in our results, increasing the value of σ uniformly improves the overall performance.