1 Introduction
In recent years, we have seen many exciting developments in applied machine learning and, in particular, in its application to the fundamental problem of language modeling Sutskever et al. (2011); Jozefowicz et al. (2016) in the field of natural language processing (NLP). However, these advancements can be exploited by computationally resourceful entities, such as a surveillance state, to effectively monitor citizens’ ostensibly private communications at scale.
We are motivated to study the communication privacy problem of concealing sensitive messages in monitored channels. In order to avoid raising the suspicion of the monitoring party, we want to hide the intended message inside a fluent message, known as a stegotext, that is indistinguishable from what is expected in such channels. This problem is studied primarily in steganography, and steganography researchers have a keen interest in linguistic steganography because it presents fundamental challenges Chang and Clark (2014): the linguistic channel carries few bits per symbol on average Shannon (1951); Brown et al. (1992), making it hard to hide a message. In contrast, images and sound recordings have high information-theoretic entropy compared to a written message, making it relatively easy to embed a message in the noise floor of the channel.
This problem of hiding secret messages in plain sight might evoke spy stories of concealing messages in newspaper advertisements during the Cold War. Such manual methods have been superseded by algorithmic approaches. Classic methods, prior to the advance of applied machine learning in this domain, typically try to produce grammatical English with a generative grammar Chapman and Davida (1997). However, such generation methods fall short in terms of statistical imperceptibility Meng et al. (2008), which makes them vulnerable to automated detection. Generating fluent text (fluency is often referred to as “naturalness” in the linguistic steganography literature) at scale is at the heart of the steganography problem, and language models (LMs) studied in NLP provide a natural solution by letting us draw samples of fluent texts.
At the working heart of an LM-based stegosystem lies an encoding algorithm that encodes a ciphertext (a random string indistinguishable from a series of fair coin flips) into a fluent stegotext using an LM. From the communication standpoint, this encoding must be uniquely decodable, i.e. different ciphertexts are encoded into different stegotexts; otherwise the receiver will not be able to decode and recover the ciphertext. Instead of sampling according to the LM, an encoding algorithm effectively induces a new language model by providing a non-standard way to draw samples from the LM. Thus, from the language modeling standpoint, in order to achieve statistical imperceptibility, extra care is needed to ensure the resulting LM is close to the original LM (Sec. 2.2). Various uniquely decodable algorithms have been devised by recent pioneering works Fang et al. (2017); Yang et al. (2018) leveraging recurrent neural network-based LMs, and the high-quality stegotexts generated show tremendous promise in terms of both fluency and information hiding capacity. However, these methods do not explicitly provide guarantees on imperceptibility. Instead, their imperceptibility, as we will argue, relies on implicit assumptions on the statistical behaviors of the underlying LM and, ultimately, of fluent texts (Sec. 3). We will empirically evaluate these assumptions and show that they are problematic (Sec. 3.1). In response, we will propose an improved encoding algorithm, patient-Huffman, that explicitly maintains imperceptibility (Sec. 3.3).
To see that imperceptibility crucially depends on the statistics of fluent texts, consider plausible continuations of the following two prefixes, “I like your” and “It is on top.” In the first case, there are many likely next words such as “work”, “style”, “idea”, “game”, “book”, whereas in the latter, there are few such as “of”, “,”, “and”, “.” with “of” being overwhelmingly likely. Intuitively speaking, the distribution over next tokens in fluent texts is sometimes close to uniform and sometimes highly concentrated. (Under the estimates of GPT-2-117M, the first continuation has an entropy of 11.2 bits and the latter, 0.43 bits. The most likely next tokens shown are also drawn from this model.) When it is concentrated, if we choose the next token by flipping fair coins, we will be sampling from a very different distribution and risk being detected after a few samples. In patient-Huffman, we actively evaluate how different the encoding distribution and the LM distribution are, and avoid encoding at steps that can expose us.
The highlights of this work are the following:

We quantify statistical imperceptibility with total variation distance (TVD) between language models. We study the TVD of several encoding algorithms and point out the implicit assumptions required for them to be near-imperceptible.

We use a state-of-the-art transformer-based, subword-level LM, GPT-2-117M Radford et al. (2019), to empirically evaluate the plausibility of assumptions implicitly made by different encoding methods.

We propose an encoding algorithm, patient-Huffman, with strong relative statistical imperceptibility.
2 Formalism
Suppose Alice (sender) wants to send Bob (receiver) a sensitive message (plaintext) through a channel monitored by Eve (adversary). This channel may be shared by many other communicating parties. Furthermore, Eve expects to see fluent natural language texts in this channel. Alice wants to avoid raising Eve’s suspicion by sending non-fluent texts through this channel, while ensuring that only Bob can read the sensitive message.
One class of solutions proceeds as follows:

1. Alice encrypts the plaintext message into a ciphertext with a key shared with Bob. (Public key encryption can also work: in that case, Alice encrypts the plaintext with Bob’s public key and Bob decrypts with his private key.)

2. Alice hides the ciphertext, which has the statistics of random coin flips, in a fluent stegotext.

3. Alice sends the stegotext through a channel monitored by Eve.

4. Bob receives the stegotext and seeks the ciphertext from it.

5. Bob decrypts the ciphertext with the shared key and obtains the plaintext message.
Linguistic stegosystems are concerned with steps 2 (hide) and 4 (seek), i.e. encoding a random bitstring into a fluent stegotext and extracting the original bitstring from such a fluent stegotext, respectively.
A vocabulary $\Sigma$ of size $|\Sigma|$ is a finite set of tokens. (Tokens can be characters, subword units or words depending on the modeling choices. We use subword units based on byte pair encoding in our experiments.) An extended vocabulary $\Sigma^*$ is the set of all finite sequences of tokens from $\Sigma$. We call its elements texts. A language model $\ell$ is a measure over some extended vocabulary $\Sigma^*$. Furthermore, we assume that we have access to the conditional distribution over the next token given a prefix, $\ell(w_{t+1} \mid w_1, \dots, w_t)$, and the distribution of the initial token, $\ell(w_1)$. An LM specified in this way allows us to draw samples easily. We can draw a sample text $w_1, w_2, \dots$ by drawing each token $w_t$ one at a time for $t = 1, 2, \dots$ according to the LM. We call the random sample text an $\ell$-fluent text.
Total variation distance (TVD) between two measures $\mu$ and $\nu$ over the same events, denoted by the $\sigma$-algebra $\mathcal{F}$, is $d(\mu, \nu) = \sup_{E \in \mathcal{F}} |\mu(E) - \nu(E)|$ (see A.1 for more facts).
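On a countable space, TVD equals half the $L^1$ distance between the two probability mass functions (see A.1), which makes it straightforward to compute for next-token distributions. The following is a minimal sketch, assuming distributions are stored as dictionaries from outcomes to probabilities; the function name `tvd` and the toy numbers are illustrative and not taken from the released implementation.

```python
def tvd(p, q):
    """Total variation distance between two discrete distributions.

    p and q map outcomes to probabilities; missing outcomes have mass 0.
    """
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)


# Toy next-token distributions (made-up numbers for illustration only).
lm = {"of": 0.9, ",": 0.05, "and": 0.03, ".": 0.02}
fair = {"of": 0.25, ",": 0.25, "and": 0.25, ".": 0.25}

print(tvd(lm, lm))    # identical distributions
print(tvd(lm, fair))  # concentrated distribution vs. uniform sampling
```

The second call illustrates the concern raised in Sec. 1: choosing uniformly among tokens when the LM is concentrated yields a large per-step distance.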
A ciphertext $b$ of length $n$ is a random variable distributed uniformly over $\{0, 1\}^n$. An encoding algorithm $f$ is an injective map from ciphertexts to distributions over texts, which may depend on the LM $\ell$. Injectivity ensures that the stegotexts are uniquely decodable.
2.1 Near-imperceptibility
Instead of using the informal notion of imperceptibility common in the steganography literature, which relies on a human eavesdropper (playing Eve) judging the quality, we consider a formal statistical notion of near-imperceptibility. We say a measure $q$ over texts, i.e. an LM, is $\delta$-imperceptible with respect to a language model $\ell$ if $d(\ell, q) < \delta$, where $d$ denotes TVD. This formalization is motivated by the fact that for any algorithm, it takes at least $\Omega(1/\delta)$ many samples to tell whether the samples come from $\ell$ or $q$ with high confidence, since by subadditivity the TVD between $n$ independent samples grows at most as $n\delta$. (This is a basic result from information theory. See for example Tulsiani (2014).) The smaller $\delta$ is, the more samples are required for Eve to discover the presence of our steganographic communication, regardless of her computational resources. Therefore, we want to find encoding algorithms that are near-imperceptible with respect to the true LM of the monitored channel.
2.2 Decomposition of TVD
Suppose the true LM of the monitored channel is $\ell^*$ and we have access to a base LM $\ell$; then running an encoding algorithm $f$ induces an effective LM $q$. By the triangle inequality, the TVD between the effective LM and the true LM satisfies
$$d(\ell^*, q) \le d(\ell^*, \ell) + d(\ell, q).$$
The first term on the right hand side corresponds to how good our LM is, which is limited by the advancement in LM research. The second term is the gap due to our encoding algorithm and it is the focus of this study. Without knowing how large the first term is, we can still pursue a meaningful relative imperceptibility guarantee of the form, “it will be as hard to detect the presence of the steganographic communication as detecting the presence of fluent texts.”
We can further decompose the second term on the right hand side over the generation steps. Suppose $h_t$ is the prefix at step $t$. We can use Pinsker’s inequality Tulsiani (2014) and the additivity of Kullback–Leibler divergence (KL divergence; we consistently compute KL divergence in base 2, i.e. we measure entropy in bits) over product measures to obtain a bound via the KL divergence at each step:
$$d(\ell, q) \le \sqrt{\frac{\ln 2}{2}\, D_{\mathrm{KL}}(\ell \,\|\, q)} = \sqrt{\frac{\ln 2}{2} \sum_{t} D_{\mathrm{KL}}\big(\ell(\cdot \mid h_t) \,\|\, q(\cdot \mid h_t)\big)}.$$
Hence, in order to obtain relative near-imperceptibility, it is sufficient to ensure that at each generation step, the effective LM $q(\cdot \mid h_t)$ is close to the base LM $\ell(\cdot \mid h_t)$. (See an analogous decomposition in terms of per-step TVD in A.1.)
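To get a feel for how quickly per-step divergences accumulate, the sketch below evaluates the Pinsker-style bound on total TVD from a list of per-step KL divergences in bits. It assumes the standard form of Pinsker's inequality, TVD at most the square root of half the KL divergence in nats; the function name `pinsker_tvd_bound` and the per-step values are hypothetical and chosen only for illustration.

```python
import math


def pinsker_tvd_bound(per_step_kl_bits):
    """Upper bound on total TVD from per-step KL divergences (in bits).

    Sums the per-step KL divergences (additivity), then applies Pinsker's
    inequality TVD <= sqrt(KL_nats / 2) with KL_nats = ln(2) * KL_bits.
    The result is capped at 1 because TVD never exceeds 1.
    """
    total_bits = sum(per_step_kl_bits)
    return min(1.0, math.sqrt(math.log(2) / 2.0 * total_bits))


# Hypothetical per-step KL values, not measurements from the paper.
small_gap = pinsker_tvd_bound([0.01] * 10)
large_gap = pinsker_tvd_bound([1.0] * 10)
print(small_gap)  # informative bound, below 1
print(large_gap)  # vacuous bound, capped at 1
```

Under this form of the bound, ten steps are enough to make it vacuous once the average per-step divergence exceeds roughly 0.29 bits.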
3 Analysis
Suppose $h$ is a prefix (the tokens generated up to the current step), and the base LM is $\ell$.
3.1 Bins
Fang et al. (2017) divide the vocabulary $\Sigma$ into $2^k$ disjoint bins of equal sizes, $B_1, \dots, B_{2^k}$, that is, $\Sigma = \bigcup_j B_j$ and $B_i \cap B_j = \emptyset$ for $i \neq j$. The partition is randomly chosen and shared between Alice and Bob. Then we split a ciphertext of length $n$ into $n/k$ many length-$k$ blocks $b_1, \dots, b_{n/k}$. We encode the ciphertext by encoding each block $b_i$. To encode a random block $b_i$, we pick a token from the $b_i$-th bin, i.e. $B_{b_i}$, according to $\ell(\cdot \mid h)$ restricted to that bin. Since the block is uniformly random, for a token $w$ that falls in the bin $B_j$ we effectively sample according to
$$q(w \mid h) = \frac{1}{2^k} \cdot \frac{\ell(w \mid h)}{\ell(B_j \mid h)}, \qquad \ell(B_j \mid h) = \sum_{w' \in B_j} \ell(w' \mid h),$$
and the KL divergence is
$$D_{\mathrm{KL}}\big(\ell(\cdot \mid h) \,\|\, q(\cdot \mid h)\big) = k - H(B \mid h), \qquad H(B \mid h) = -\sum_{j} \ell(B_j \mid h) \log_2 \ell(B_j \mid h).$$
(See A.3 for a detailed derivation.) The last term is the entropy of the partition at the current step, which is bounded between zero and $k$. Hence, the KL divergence is at most $k$ bits at each step. However, if the probability mass is roughly evenly distributed over the $2^k$ bins, then the KL divergence is close to zero. This is the implicit assumption about fluent texts that Bins makes.
We empirically examine how well this assumption holds. We use GPT-2-117M as the base LM and sample from it 50 prefixes with 40 steps each, saving 2K steps of conditional distributions. We fix a randomly chosen partition of $2^3$ bins. The computed KL divergence concentrates in the low-bit region, with a second mode near 3 bits, the maximum (Fig. 1). The mean of the distribution is high enough that within ten steps, i.e. after encoding about 30 bits of ciphertext, the KL bound on TVD becomes vacuous.
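The per-step divergence of Bins can be sketched in a few lines, assuming it equals $k$ minus the entropy of the bin masses as described in this section. The toy vocabulary of 64 tokens, the two hand-made distributions and the helper name `bins_kl_bits` are assumptions for illustration; the experiments in this section use GPT-2-117M conditional distributions instead.

```python
import math
import random


def bins_kl_bits(probs, bin_of, k):
    """KL(base LM || Bins-induced LM) in bits at one step: k - H(bin masses).

    probs[i] is the base LM probability of token i; bin_of[i] in
    range(2**k) is the bin that token i was assigned to.
    """
    mass = [0.0] * (2 ** k)
    for p, b in zip(probs, bin_of):
        mass[b] += p
    entropy = -sum(m * math.log2(m) for m in mass if m > 0)
    return k - entropy


k = 3
vocab_size = 64  # toy vocabulary, far smaller than a real subword vocabulary
rng = random.Random(0)
bin_of = [i % (2 ** k) for i in range(vocab_size)]
rng.shuffle(bin_of)  # random partition into equal-sized bins

uniform = [1.0 / vocab_size] * vocab_size
peaked = [0.97] + [0.03 / (vocab_size - 1)] * (vocab_size - 1)

print(bins_kl_bits(uniform, bin_of, k))  # mass spread evenly over the bins
print(bins_kl_bits(peaked, bin_of, k))   # one token dominates
```

The uniform case sits at the low end of the range and the peaked case close to the maximum of $k = 3$ bits, mirroring the two regimes discussed above.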
3.2 Variable-length coding (VLC)
Instead of using a fixed-length coding (one stegotext token always encodes $k$ bits in Bins), VLC encodes one or more bits per generated token Yang et al. (2018). VLC constructs a Huffman coding of $\Sigma$ at each step according to $\ell(\cdot \mid h)$ (this takes $O(|\Sigma| \log |\Sigma|)$ time). Then we sample a token from the constructed Huffman tree by following the bits in the ciphertext starting at the root, taking the left subtree if the bit is zero and the right subtree otherwise, until reaching a leaf. The resulting Huffman distribution assigns probability mass $2^{-r}$ to a token at depth $r$. Being a minimum-redundancy code, the corresponding Huffman distribution has the minimum KL divergence among binary prefix-free codes Huffman (1952), which is at most 1 bit. But will there be steps with large KL divergence, like the example “It is on top” in Sec. 1? We computed the KL divergence of Huffman codes for the same 2K samples (Fig. 2). The mean of 0.12 bits is significantly lower than that of Bins, but the distribution still has a second mode near 1 bit, the maximum.
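A single VLC step can be sketched as follows: build a Huffman code for the next-token distribution, measure the KL divergence between the base distribution and the induced Huffman distribution, and map ciphertext bits to a token by walking from the root. This is a minimal sketch with hypothetical helper names (`huffman_code`, `huffman_kl_bits`, `encode_token`) and a made-up four-token distribution; it is not the code of Yang et al. (2018) or of the released implementation.

```python
import heapq
import itertools
import math


def huffman_code(probs):
    """Build a binary Huffman code for a dict {token: probability}.

    Returns {token: bitstring}. Assumes at least two tokens.
    """
    counter = itertools.count()  # tie-breaker so the heap never compares dicts
    heap = [(p, next(counter), {tok: ""}) for tok, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)   # least likely subtree gets bit 0
        p1, _, right = heapq.heappop(heap)  # next least likely gets bit 1
        merged = {t: "0" + c for t, c in left.items()}
        merged.update({t: "1" + c for t, c in right.items()})
        heapq.heappush(heap, (p0 + p1, next(counter), merged))
    return heap[0][2]


def huffman_kl_bits(probs, code):
    """KL(base || Huffman distribution), where depth r carries mass 2**-r."""
    return sum(p * (math.log2(p) + len(code[t]))
               for t, p in probs.items() if p > 0)


def encode_token(code, bits):
    """Follow ciphertext bits from the root; return (token, bits consumed)."""
    by_codeword = {c: t for t, c in code.items()}
    for r in range(1, len(bits) + 1):
        if bits[:r] in by_codeword:
            return by_codeword[bits[:r]], r
    raise ValueError("ran out of ciphertext bits before reaching a leaf")


# Made-up concentrated distribution, in the spirit of "It is on top".
probs = {"of": 0.9, ",": 0.05, "and": 0.03, ".": 0.02}
code = huffman_code(probs)
print(code)
print(huffman_kl_bits(probs, code))
print(encode_token(code, "0110"))
```

For a dyadic distribution (all probabilities powers of two) the divergence is exactly zero, while the concentrated example lands well inside the zero-to-one-bit range.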
3.3 patient-Huffman
We improve VLC further by explicitly checking if the TVD (or the KL divergence) between the base LM distribution and the Huffman distribution is small enough (Algorithm 1). Computing the TVD or the KL divergence takes $O(|\Sigma|)$ time. If the TVD is greater than a specified threshold $\delta$ at the current encoding step, instead of sampling from the Huffman distribution, we sample from the base LM distribution and patiently wait for another opportunity.
Clearly, this ensures that each step incurs no more additional TVD than the specified threshold $\delta$. In principle, if we set a threshold $\delta_t$ for the $t$-th step such that the sum $\sum_t \delta_t$ stays bounded, then by subadditivity (A.1) we can bound the total TVD, guaranteeing the relative near-imperceptibility of the generated stegotext.
However, in practice, getting any meaningful bound (total TVD $< 1$) will require setting $\delta$ very small, and this translates to an empirical assumption that many fluent texts’ next-token distributions lie arbitrarily close to the Huffman distributions. Examining Fig. 2, we see that there are many steps with KL divergence close to zero. This assumption, though empirically more benign than VLC’s or Bins’s, is hard to establish theoretically for fluent text.
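Since Algorithm 1 is not reproduced in this text, the following is only a sketch of the patient step as described in this section: encode with the Huffman code when its TVD to the base distribution is within the threshold, otherwise emit an ordinary sample and consume no ciphertext. All function names, the fixed list of toy per-step distributions standing in for an LM, and the threshold value are assumptions made for illustration.

```python
import heapq
import itertools
import random


def huffman_code(probs):
    """Binary Huffman code {token: bitstring} for {token: probability}."""
    counter = itertools.count()  # tie-breaker so dicts are never compared
    heap = [(p, next(counter), {tok: ""}) for tok, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, left = heapq.heappop(heap)
        p1, _, right = heapq.heappop(heap)
        merged = {t: "0" + c for t, c in left.items()}
        merged.update({t: "1" + c for t, c in right.items()})
        heapq.heappush(heap, (p0 + p1, next(counter), merged))
    return heap[0][2]


def huffman_tvd(probs, code):
    """TVD between the base distribution and the Huffman one (2**-depth)."""
    return 0.5 * sum(abs(p - 2.0 ** -len(code[t])) for t, p in probs.items())


def encode_step(probs, bits, threshold, rng):
    """One patient step. Returns (token, number of ciphertext bits consumed)."""
    code = huffman_code(probs)
    if huffman_tvd(probs, code) > threshold:
        # Too far from the base LM: emit an ordinary sample, encode nothing.
        tokens, weights = zip(*probs.items())
        return rng.choices(tokens, weights=weights)[0], 0
    by_codeword = {c: t for t, c in code.items()}
    for r in range(1, len(bits) + 1):
        if bits[:r] in by_codeword:
            return by_codeword[bits[:r]], r
    raise ValueError("not enough ciphertext bits left for this step")


def decode_step(probs, token, threshold):
    """Bits carried by an observed token (empty if the step was skipped)."""
    code = huffman_code(probs)
    return "" if huffman_tvd(probs, code) > threshold else code[token]


# Toy per-step next-token distributions standing in for a real LM.
steps = [
    {"work": 0.3, "style": 0.25, "idea": 0.25, "game": 0.2},  # spread out
    {"of": 0.9, ",": 0.05, "and": 0.03, ".": 0.02},           # concentrated
    {"the": 0.5, "a": 0.25, "my": 0.125, "our": 0.125},       # dyadic
]
ciphertext, threshold, rng = "100", 0.1, random.Random(0)

sent, pos = [], 0
for probs in steps:
    token, used = encode_step(probs, ciphertext[pos:], threshold, rng)
    sent.append(token)
    pos += used

recovered = "".join(decode_step(p, t, threshold) for p, t in zip(steps, sent))
print(sent, recovered)
```

In this toy run the concentrated middle step carries no bits, and the receiver skips it as well because it evaluates the same TVD check with the same distributions and threshold. A practical system would also need a convention for a ciphertext that ends partway through a codeword, which this sketch simply reports as an error.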
4 Discussion
We focus on the encoding algorithm in our analysis, but it is not hard to see that Bob can correctly decode the ciphertext from the stegotext. He runs the same algorithm with the same LM and the same ciphertext block size (and other parameters, if any) as Alice, e.g. patient-Huffman with the same threshold, and extracts the unique (Huffman) code corresponding to each observed token as ciphertext.
The generic approach studied in this paper, embedding a ciphertext into a stegotext that follows some anticipated distribution, can also apply to other channels such as images or audio, where we can access the marginal distribution via a (deep) generative model.
Formal notions of steganographic secrecy have been studied in the cryptography community. In particular, Hopper et al. (2008) develop a complexity theoretic notion and characterize its necessary conditions and its maximum bandwidth under a perfect sampling oracle. This is stronger than our setting where a trained LM provides us an approximate access to the marginal distribution. The information theoretic notion of imperceptibility we proposed independently is most similar to the notion of steganographic security in Cachin (2004). Further study connecting these results is needed. Of particular interest is an extension called robust steganography, where an active adversary may alter messages, e.g. by injecting typographical errors. The stegosystems studied here are vulnerable to such attacks.
OpenAI’s decision to make GPT-2-117M publicly available enables our empirical studies, and it likely will enable other studies. However, this released trained version is inferior to the full GPT-2 model Radford et al. (2019). While we appreciate OpenAI’s general precaution and specific arguments against its release, we want to note, with this work, that its release can also offer social good by enhancing communication privacy. We advocate for the public release of strong trained models as a way to mitigate the disparity in access to both data and computational resources.
Lastly, the full implementation of the stegosystem proposed in this work is made open-source under a permissive license (https://github.com/falcondai/lmsteganography). We also include generated samples and illustrative examples.
Acknowledgments
We thank the anonymous reviewers for their suggestions. We thank David McAllester for a helpful discussion on Huffman coding. We thank Avrim Blum for bringing related works in the cryptography community to our belated attention.
References
 Brown et al. (1992) Peter F Brown, Vincent J Della Pietra, Robert L Mercer, Stephen A Della Pietra, and Jennifer C Lai. 1992. An estimate of an upper bound for the entropy of English. Computational Linguistics, 18(1):31–40.
 Cachin (2004) Christian Cachin. 2004. An information-theoretic model for steganography. Information and Computation, 192(1):41–56.
 Chang and Clark (2014) Ching-Yun Chang and Stephen Clark. 2014. Practical linguistic steganography using contextual synonym substitution and a novel vertex coding method. Computational Linguistics, 40(2):403–448.
 Chapman and Davida (1997) Mark Chapman and George Davida. 1997. Hiding the hidden: A software system for concealing ciphertext as innocuous text. In International Conference on Information and Communications Security, pages 335–345. Springer.
 Fang et al. (2017) Tina Fang, Martin Jaggi, and Katerina Argyraki. 2017. Generating steganographic text with LSTMs. In Proceedings of ACL 2017, Student Research Workshop, pages 100–106.
 Hopper et al. (2008) Nicholas Hopper, Luis von Ahn, and John Langford. 2008. Provably secure steganography. IEEE Transactions on Computers, 58(5):662–676.
 Huffman (1952) David A Huffman. 1952. A method for the construction of minimum-redundancy codes. Proceedings of the IRE, 40(9):1098–1101.
 Jozefowicz et al. (2016) Rafal Jozefowicz, Oriol Vinyals, Mike Schuster, Noam Shazeer, and Yonghui Wu. 2016. Exploring the limits of language modeling. arXiv preprint arXiv:1602.02410.
 Meng et al. (2008) Peng Meng, Liusheng Huang, Zhili Chen, Wei Yang, and Dong Li. 2008. Linguistic steganography detection based on perplexity. In 2008 International Conference on MultiMedia and Information Technology, pages 217–220. IEEE.
 Radford et al. (2019) Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. (Accessed on 2019-4-23).
 Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1715–1725.
 Shannon (1951) Claude E Shannon. 1951. Prediction and entropy of printed English. Bell System Technical Journal, 30(1):50–64.
 Sutskever et al. (2011) Ilya Sutskever, James Martens, and Geoffrey E Hinton. 2011. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024.
 Tulsiani (2014) Madhur Tulsiani. 2014. Pinsker’s inequality and its applications to lower bounds. Lecture Notes.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008.
 Yang et al. (2018) Zhongliang Yang, Xiaoqing Guo, Ziming Chen, Yongfeng Huang, and Yu-Jin Zhang. 2018. RNN-Stega: Linguistic steganography based on recurrent neural networks. IEEE Transactions on Information Forensics and Security.
Appendix A Appendices
A.1 Basic facts about total variation distance
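Over a countable space $\Omega$ and its discrete $\sigma$-algebra $2^{\Omega}$, TVD is related to the $L^1$ metric, $d(\mu, \nu) = \frac{1}{2} \|\mu - \nu\|_1 = \frac{1}{2} \sum_{x \in \Omega} |\mu(x) - \nu(x)|$. A few useful basic facts to recall are

TVD is a metric, thus it obeys the triangle inequality.

TVD is upper bounded by Kullback–Leibler (KL) divergence via Pinsker’s inequality, $d(\mu, \nu) \le \sqrt{\frac{\ln 2}{2}\, D_{\mathrm{KL}}(\mu \,\|\, \nu)}$, where $D_{\mathrm{KL}}$ is the KL divergence measured in bits.

TVD is subadditive over product measures, $d(\mu_1 \otimes \mu_2, \nu_1 \otimes \nu_2) \le d(\mu_1, \nu_1) + d(\mu_2, \nu_2)$, and relatedly, KL divergence is additive, $D_{\mathrm{KL}}(\mu_1 \otimes \mu_2 \,\|\, \nu_1 \otimes \nu_2) = D_{\mathrm{KL}}(\mu_1 \,\|\, \nu_1) + D_{\mathrm{KL}}(\mu_2 \,\|\, \nu_2)$.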
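As an alternative to the upper bound due to KL divergence, we can also bound TVD via its subadditivity under product measures, summing the per-step distances between the base LM $\ell$ and the effective LM $q$ over the prefixes $h_t$:
$$d(\ell, q) \le \sum_{t} d\big(\ell(\cdot \mid h_t),\, q(\cdot \mid h_t)\big).$$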
In fact, this can cover more general cases, such as the analogous analysis of FLC Yang et al. (2018) which zeros out everything except for the most likely tokens. We omit it due to page limit.
A.2 GPT-2 Language Model
The GPT-2 language model we used is a general purpose language model from OpenAI trained on WebText Radford et al. (2019), which contains millions of web pages covering diverse topics. Citing concerns of malicious use, OpenAI only publicly released a small trained model with 117 million parameters. That is the particular language model we use for the empirical study in this work, GPT-2-117M.
We choose to use GPT-2 as the base language model in our work for several reasons. First, GPT-2 is trained on a large amount of data that we do not have access to. Second, it empirically achieves state-of-the-art performance across seven challenging semantics tasks, which include question answering, reading comprehension, summarization and translation. Third, its architecture contains many recent innovations such as the transformer Vaswani et al. (2017), instead of a recurrent neural network, and byte pair encoding for its vocabulary Sennrich et al. (2016).
A.3 Derivation of Sec. 3.1
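Let $\ell(\cdot \mid h)$ be the base LM given the prefix $h$, let $B_1, \dots, B_{2^k}$ be the bins, and write $\ell(B_j \mid h) = \sum_{w' \in B_j} \ell(w' \mid h)$ for the mass of bin $B_j$. For a token $w \in B_j$, the effective LM is equal to
$$q(w \mid h) = \frac{1}{2^k} \cdot \frac{\ell(w \mid h)}{\ell(B_j \mid h)}. \qquad (1)$$
The KL divergence follows as
$$D_{\mathrm{KL}}\big(\ell(\cdot \mid h) \,\|\, q(\cdot \mid h)\big) = \sum_{j} \sum_{w \in B_j} \ell(w \mid h) \log_2 \frac{\ell(w \mid h)}{q(w \mid h)} = \sum_{j} \ell(B_j \mid h) \log_2 \big(2^k\, \ell(B_j \mid h)\big) = k - H(B \mid h),$$
where $H(B \mid h) = -\sum_{j} \ell(B_j \mid h) \log_2 \ell(B_j \mid h)$ is the entropy of the partition $\{B_j\}$.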