FontCode: Embedding Information in Text Documents using Glyph Perturbation

07/28/2017 ∙ by Chang Xiao, et al.

We introduce FontCode, an information embedding technique for text documents. Provided a text document with specific fonts, our method embeds user-specified information in the text by perturbing the glyphs of text characters while preserving the text content. We devise an algorithm that chooses unobtrusive yet machine-recognizable glyph perturbations, leveraging a recently developed generative model that alters the glyphs of each character continuously on a font manifold. We then introduce an algorithm that embeds a user-provided message in the text document and produces an encoded document whose appearance is minimally perturbed from the original. We also present a glyph recognition method that recovers the embedded information from an encoded document stored as a vector graphic or pixel image, or even printed on paper. In addition, we introduce a new error-correction coding scheme that rectifies a certain number of recognition errors. Lastly, we demonstrate that our technique enables a wide array of applications, using it as a text document metadata holder, an unobtrusive optical barcode, a cryptographic message embedding scheme, and a text document signature.




1. Introduction

Information embedding, the technique of embedding a message into host data, has numerous applications: digital photographs have metadata embedded to record such information as capture date, exposure time, focal length, and the camera’s GPS location. Watermarks embedded in images, videos, and audio have long been one of the most important means in digital production to claim copyright against piracy [Bloom et al., 1999]. Indeed, the idea of embedding information in light signals has grown into the emerging field of visible light communication (e.g., see [Jo et al., 2016]).

In all these areas, information embedding techniques meet two desiderata: (i) the host medium is minimally perturbed, implying that the embedded message must be minimally intrusive; and (ii) the embedded message can be robustly recovered by the intended decoder even in the presence of some decoding errors.

Information embedding for text documents, in both digital and physical form, remains comparatively undeveloped. Despite many previous works on digital text steganography, information embedding for text documents is considered more challenging to meet the aforementioned desiderata than its counterparts for images, videos, and audio [Agarwal, 2013]. This is because the “pixels” of a text document are individual letters, which, unlike image pixels, cannot be changed into other letters without causing noticeable differences. Consequently, existing techniques have limited information capacity or work only for specific digital file formats (such as PDF or Microsoft Word).

We propose FontCode, a new information embedding technique for text documents. Instead of changing text letters into different ones, we alter the glyphs (i.e., the particular shape designs) of their fonts to encode information, leveraging the recently developed concept of the font manifold [Campbell and Kautz, 2014] in computer graphics. Thereby, the readability of the original document is fully retained. We carefully choose glyph perturbations such that they have a minimal effect on the typeface appearance of the text document, while ensuring that they can be recognized by convolutional neural networks (CNNs). To recover the embedded information, we develop a decoding algorithm that recovers the information from an input encoded document, whether it is represented as a vector graphics file (such as a PDF) or a rasterized pixel image (such as a photograph).

Exploiting the features specific to our message embedding and retrieval problems, we further devise an error-correction coding scheme that fully recovers the embedded information in the presence of up to a certain number of recognition errors, turning a smartphone into a robust FontCode reader (see Fig. 1).


As a result, FontCode is not only an information embedding technique for text documents but also an unobtrusive tagging mechanism, finding a wide array of applications. We demonstrate four of them. (i) It serves as a metadata holder in a text document, which can be freely converted to different file formats or printed on paper without loss of the metadata; across various digital and physical forms, the metadata is always preserved. (ii) It enables unobtrusive optical codes to be embedded in text, replacing optical barcodes (such as QR codes) in artistic designs such as posters and flyers to minimize the visual distraction the barcodes cause. (iii) By construction, it offers a basic cryptographic scheme that not only embeds but also encrypts messages, without resorting to any additional cryptosystem. And (iv) it offers a new text signature mechanism, one that allows verification of document authenticity and integrity, regardless of the document’s digital format or physical form.

Technical contributions

We propose an algorithm to construct a glyph codebook, a lookup table that maps a message into perturbed glyphs and ensures that the perturbations are unobtrusive to our eyes. We then devise a recognition method that recovers the embedded message from an encoded document. Both the codebook construction and the glyph recognition leverage CNNs. Additionally, we propose an error-correction coding scheme that rectifies recognition errors due to factors such as camera distortion. Built on the Chinese Remainder Theorem, a number-theoretic result some 1,700 years old, and on a probabilistic decoding model, our coding scheme can correct more errors than block codes based on Hamming distance, outperforming their theoretical error-correction upper bound.

2. Related Work

We begin by clarifying a few typographic terms [Campbell and Kautz, 2014]: the typeface of a character refers to a set of fonts, each composed of glyphs that represent the specific design and features of the character. With this terminology, our method embeds messages by perturbing the glyphs of the fonts of text letters.

Font manipulation

While our method perturbs glyphs using the generative model by Campbell and Kautz [2014], other methods create fonts and glyphs either with computer-aided tools [Ruggles, 1983] or by automatic generation. The early system by [Knuth, 1986] creates parametric fonts and was used to create most of the Computer Modern typeface family. Later, Shamir and Rappoport [1998] proposed a system that generates fonts using high-level parametric features and constraints to adjust glyphs. This idea was extended to parameterize glyph shape components [Hu and Hersch, 2001]. Other approaches generate fonts by deriving from examples and templates [Lau, 2009; Suveeranont and Igarashi, 2010]. More recent works have focused on structuring fonts based on similarity [Loviscach, 2010] or crowdsourced attributes [O’Donovan et al., 2014].

Font recognition

Automatic font recognition from a photo or image has been studied [Avilés-Cruz et al., 2005; Jung et al., 1999; Ramanathan et al., 2009]. These methods identify fonts by extracting statistical and/or typographical features of the document. Recently, Chen et al. [2014] proposed a scalable solution leveraging supervised learning. Wang et al. [2015] then improved font recognition using convolutional neural networks; their algorithm runs without resorting to character segmentation or optical character recognition. In our system, we use existing algorithms to recognize the text fonts of the input document, but further devise an algorithm that recognizes glyph perturbations to recover the embedded information. Unlike existing font recognition methods, which identify fonts from a text of many letters, our algorithm aims to identify the glyph perturbation of individual letters.

Text steganography

Our work is related to digital steganography (such as digital watermarks for copyright protection), which has been studied for decades, mostly focusing on videos, images, and audio. However, digital text steganography is much more challenging [Agarwal, 2013], and thus much less developed. We categorize existing methods based on their features (see Table 1):

Methods based on cover text generation (CTG) hide a secret message by generating an ad-hoc cover text that looks lexically and syntactically convincing [Wayner, 1992, 2009]. However, this type of steganography cannot embed messages in existing text documents. Thus, these methods fail to meet the attribute that we call cover text preservation (CP), and have limited applications.

The second type of method exploits format-specific features (FSF). For example, specifically for Microsoft Word documents, Bhaya et al. [2013] assign each text character a different but visually similar font available in Word to conceal messages. Others hide messages by changing the character scale and color or by adding underline styles in a document [Stojanov et al., 2014; Panda et al., 2015], although those changes are generally noticeable. More recent methods exploit special ASCII and Unicode characters that are displayed as an empty space in a PDF viewer [Chaudhary et al., 2016; Rizzo et al., 2016]. These methods are not format independent (FI); they are bound to a specific file format (such as Word or PDF) and text viewer. The concealed messages would be lost if the document were converted to a different format or even opened with a different version of the same viewer. This approach also fails to preserve the concealed message when the document is printed on paper (PP) and photographed later.

More relevant to our work is the family of methods that embed messages via what we call structural perturbations (SP). Line-shift methods [Alattar and Alattar, 2004; Brassil et al., 1995] hide information by perturbing the space between text lines: reducing or increasing the space represents a 0 or 1 bit. Similarly, word-shift methods [Kim et al., 2003; Brassil et al., 1995] perturb the spaces between words. Others perturb the shape of specific characters, such as raising or dropping the positions of the dots above “i” and “j” [Brassil et al., 1995]. These methods cannot support fine granularity coding (FGC), in the sense that they encode 1 bit in every line break, word break or special character that appears sparsely. Our method provides fine granularity coding by embedding information in individual letters, and thereby has a much larger information capacity. In addition, all these methods demonstrate retrieval of hidden messages from digital document files only. It is unclear to what extent they can decode from real photos of text.

Category | Previous work | Attributes
CTG | [Wayner, 2009; Agarwal, 2013] | lacks CP
SP | [Brassil et al., 1995; Kim et al., 2003; Alattar and Alattar, 2004; Gutub and Fattani, 2007] | CP; lacks FGC
FSF | [Panda et al., 2015; Bhaya et al., 2013; Chaudhary et al., 2016; Rizzo et al., 2016] | CP; lacks FI and PP
Our work | (this paper) | CP, FGC, FI, PP
Table 1. A summary of related text steganographic methods. CP indicates cover text preservation; FGC indicates fine granularity coding; FI indicates format independence; PP indicates robustness to printing on paper.

Table 1 summarizes these related works and their main attributes. Like most of them, our method aims to perturb the appearance of the text in an unobtrusive, albeit not fully imperceptible, way. But our method is advantageous in providing more of the desired attributes.

3. Overview

Our FontCode system embeds any type of information in a text document as a bit string. For example, an arbitrary text message can be coded into a bit string using standard ASCII or Unicode (see the character set coding standards by the Internet Assigned Numbers Authority). We refer to such a bit string as a plain message.

In a text document, the basic elements of embedding a plain message are the letters, appearing in a particular font. Our idea is to perturb the glyph of each letter to embed a plain message. To this end, we leverage the concept of font manifold proposed by Campbell and Kautz [2014]. Taking as input a collection of existing font files, their method creates a low-dimensional font manifold for every character—including both alphabets and digits—such that every location on this manifold generates a particular glyph of that character. This novel generative model is precomputed once for each character. Then, it allows us to alter the glyph of each text letter in a subtle yet systematic way, and thereby embed messages.

In this paper, when there is no confusion, we refer to a location on a font manifold and its resulting glyph interchangeably.

Figure 2. Embedding and extraction.

Here we sample 5 points around Times New Roman on the manifold (left), generating the perturbed glyphs used to embed integers (top right). We embed “1” in the letter “a” using the second glyph (in orange) in the perturbation list. In the retrieval step, our CNNs evaluate a probability value (the inverse of a distance) for each glyph (bottom right), and we extract the integer whose glyph yields the highest probability.

Figure 3. Overview. (left) Our embedding method takes an input message and a text document. It encodes the message into a series of integers and divides the letters into blocks. The integers are assigned to each block and embedded in individual letters. (right) To recover the message, we extract integers by recognizing the glyphs of individual letters. Then, the integers are decoded into the original plain message.

3.1. Message Embedding

Our message embedding method consists of two steps, (i) precomputation of a codebook for processing all documents and (ii) runtime embedding of a plain message in a given document.

During precomputation, we construct a codebook of perturbed glyphs for typically used fonts. Consider a font such as Times New Roman. Each character in this font corresponds to a specific location, ū, on the font manifold. We identify a set of m locations on the manifold as the perturbed glyphs of ū and denote them as {u_0, ..., u_{m-1}} (see Fig. 2). Our goal in this step is to select the perturbed glyphs such that their differences from the glyph of the original font are almost unnoticeable to the naked eye but recognizable to a computer algorithm (detailed in §4). The sets of perturbed glyphs for all characters in typically used fonts form our codebook.

At runtime, provided a text document (or a text region or paragraphs), we perturb the glyphs of the letters in the document to embed a given plain message. Consider a letter with original glyph ū in the given document. Suppose that, in the precomputed codebook, this letter has m perturbed glyphs, namely u_0, ..., u_{m-1}. We embed in the letter an integer i in the range [0, m-1] by changing its glyph from ū to u_i (see Fig. 2-top). A key algorithmic component of this step is to determine the embedded integers for all letters such that together they encode the plain message.
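The per-letter lookup just described can be sketched as follows. The codebook structure, glyph identifiers, and font name here are illustrative stand-ins, not the paper's actual data format.

```python
# Hypothetical codebook: for each (character, font), a list of perturbed
# glyph identifiers; index i in the list encodes the integer i.
codebook = {
    ("a", "TimesNewRoman"): ["a_glyph_0", "a_glyph_1", "a_glyph_2"],
    ("b", "TimesNewRoman"): ["b_glyph_0", "b_glyph_1"],
}

def embed_integer(char, font, i):
    """Embed integer i in a letter by selecting the i-th perturbed glyph."""
    glyphs = codebook[(char, font)]
    if not 0 <= i < len(glyphs):
        raise ValueError(f"letter {char!r} can only embed 0..{len(glyphs) - 1}")
    return glyphs[i]

def capacity(char, font):
    """Number of distinct integers this letter can carry."""
    return len(codebook[(char, font)])

print(embed_integer("a", "TimesNewRoman", 1))  # the second perturbed glyph
```

Note that each letter's capacity depends on how many perturbed glyphs survived the codebook construction, which is why different letters embed integers over different ranges.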

In addition, we propose a new error-correcting coding scheme to encode the plain message. This coding scheme adds certain redundancy (i.e., some extra data) to the coded message, which, at decoding time, can be used to check for consistency of the coded message and recover it from errors (see §5).

3.2. Message Retrieval

To retrieve information from a coded text document, the first step is to recover an integer from each letter. For each letter in the document, suppose again that this letter has m perturbed glyphs in the codebook, whose manifold locations are u_0, ..., u_{m-1}. We recognize its glyph u in the current document as one of the perturbed glyphs, and extract an integer i if u is recognized as u_i (see Fig. 2-bottom). Our recognition algorithm works not only with vector graphics documents (such as those stored as PDFs) but also with rasterized documents stored as pixel images. For the latter, the recognition leverages convolutional neural networks.

The retrieved integers are then fed into our error correction coding scheme to reconstruct the plain message. Because of the data redundancy, even if some glyphs are mistakenly recognized (e.g., due to poor image quality), those errors will be rectified and we will still be able to recover the message correctly (see §5).

4. Glyph Recognition

We start by focusing on our basic message embedding blocks: individual letters in a text document. Our goal in this section is to embed an integer number in a single letter by perturbing its glyph, and later retrieve that integer from a vector graphics or pixel image of that letter. In the next section, we will address what integers to assign to the letters in order to encode a message.

As introduced in §3.1, we embed an integer in a letter through glyph perturbation, by looking up a precomputed codebook. Later, when extracting an integer from a letter, we compute a “distance” metric between the extracted glyph u and each perturbed glyph u_i in the codebook; we obtain the integer i if the “distance” to glyph u_i is the smallest.

Our recognition algorithm supports input documents stored as vector graphics and pixel images. While it is straightforward to recognize vector graphic glyphs, pixel images pose significant challenges due to camera perspectives, rasterization noise, blurriness, and so forth. In this section, we first describe our algorithm that decodes integers from rasterized (pixel) glyphs. This algorithm leverages convolutional neural networks (CNNs), which also allow us to systematically construct the codebook of perturbed glyphs. Then, we describe the details of embedding and extracting integers, as well as a simple algorithm for recognizing vector graphic glyphs.

4.1. Pixel Image Preprocessing

When a text document is provided as a pixel image, we use an off-the-shelf optical character recognition (OCR) library to detect individual letters. Our approach does not depend on any particular OCR library; in practice, we use Tesseract [Smith, 2007], one of the most popular open-source OCR engines. In addition to detecting letters, the OCR also identifies the bounding box of every letter in the pixel image.

To recognize the glyph perturbation of each letter using CNNs, we first preprocess the image. We crop the region of each letter using its bounding box detected by the OCR. We then binarize the image region using the classic algorithm of Otsu [1975]. This step helps to eliminate the influence of varying lighting conditions and background colors. Lastly, we resize the image region to 200×200 pixels. This 200×200, black-and-white image for each letter is the input to our CNNs (Fig. 4-left).
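The binarize-and-resize step can be sketched in pure Python. This is a minimal sketch: a real pipeline would use an image library, and here an image is simply a list of rows of 8-bit grayscale values, with nearest-neighbor resizing standing in for proper resampling.

```python
def otsu_threshold(pixels):
    """Return the threshold maximizing between-class variance (Otsu, 1975)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0.0
    for t in range(256):
        w0 += hist[t]           # pixels at or below t (background class)
        sum0 += t * hist[t]
        if w0 == 0 or w0 == total:
            continue
        w1 = total - w0
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

def binarize_and_resize(image, size=200):
    """Binarize a grayscale image (list of rows) and resize to size x size."""
    flat = [p for row in image for p in row]
    t = otsu_threshold(flat)
    h, w = len(image), len(image[0])
    return [[1 if image[r * h // size][c * w // size] > t else 0
             for c in range(size)] for r in range(size)]
```

The point of Otsu's method here is that the threshold adapts to each crop, so uneven lighting across the page does not change which pixels count as ink.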

Figure 4. Network architecture. Our CNN takes as input a 200×200, black-and-white image containing a single letter (left). It consists of 3 convolutional layers (middle) and 2 fully connected layers (right). It outputs a vector in which each element is the estimated probability of recognizing the glyph in the input image as a perturbed glyph in the codebook.

4.2. Network Structure

We treat glyph recognition as an image classification problem: provided an image region of a letter that has a list of perturbed glyphs in the codebook, our goal is to classify the input glyph of that letter as one from the list. Therefore, we train a CNN for each letter in a particular font.

Thanks to the image preprocessing, we propose to use a simple CNN structure (as illustrated in Fig. 4), which can be quickly trained and evaluated. The input is a 200×200, black-and-white image containing a letter. The CNN consists of three convolutional layers, each followed by a ReLU activation layer (i.e., max(0, x)) and a 2×2 max-pooling layer. The kernel sizes of the three convolutional layers are 8×8 (with 32 filters), 5×5 (with 64 filters), and 3×3 (with 32 filters), respectively. The convolutional layers are connected to two fully connected (FC) layers, arranged in the following form (see Fig. 4):

FC → Dropout → FC_m

Here “Dropout” indicates a 0.5-dropout layer, and FC_m is an m-dimensional FC layer, where m is the number of perturbed glyphs in the codebook for the letter in a certain font. After passing through a softmax layer, the output is an m-dimensional vector indicating the probabilities (or the inverse of “distance”) of classifying the glyph of the input letter as each of the perturbed glyphs in the list. The glyph of the letter is recognized as u_i if its corresponding probability in the output vector is the largest.

4.3. Network Training

Training data

Our CNNs will be used to recognize text document images that are either directly synthesized or captured by digital cameras. Correspondingly, the training data of the CNNs consist of synthetic images and real photos. Consider a letter whose perturbed glyphs from a standard font are u_0, ..., u_{m-1}. We print all the glyphs on paper and take 10 photos under different lighting conditions and slightly different camera angles. While taking the photos, the camera is almost front-facing the paper, mimicking how the embedded information in a document would be read by a camera. In addition, we include synthetic images by rasterizing each glyph (whose vector graphics path is generated using [Campbell and Kautz, 2014]) into a 200×200 image.

Data augmentation

To train the CNNs, each training iteration randomly draws a certain number of images from the training data. We ensure that half of the images are synthetic and the other half are from real photos. To reduce overfitting and improve the robustness, we augment the selected images before feeding them into the training process. Each image is processed by the following operations:


  • Adding a small Gaussian noise with zero mean and standard deviation 3.

  • Applying a Gaussian blur whose standard deviation is uniformly distributed between 0 (no blur) and 3px.

  • Applying a randomly-parameterized perspective transformation.

  • Adding a constant border with a width randomly chosen between 0 and 30px.

We note that similar data augmentations have been used in [Wang et al., 2015; Chen et al., 2014] to enrich their training data. Lastly, we apply the preprocessing routine described in §4.1, obtaining a binary image whose pixel values are either 0 or 1.
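Two of the augmentations above (Gaussian pixel noise and a random constant border) can be sketched in pure Python on a grayscale image stored as a list of rows; the blur and perspective warps would normally come from an image library and are omitted here. This is an illustrative sketch, not the paper's implementation.

```python
import random

def add_gaussian_noise(image, sigma=3.0):
    """Perturb every pixel with zero-mean Gaussian noise, clamped to [0, 255]."""
    return [[min(255, max(0, round(p + random.gauss(0.0, sigma))))
             for p in row] for row in image]

def add_border(image, max_width=30, value=255):
    """Pad the image on all sides with a constant border of random width."""
    w = random.randint(0, max_width)
    width = len(image[0]) + 2 * w
    top = [[value] * width for _ in range(w)]
    body = [[value] * w + list(row) + [value] * w for row in image]
    bottom = [[value] * width for _ in range(w)]
    return top + body + bottom

augmented = add_border(add_gaussian_noise([[128] * 8 for _ in range(8)]))
```

Randomizing the border width mimics the loose letter crops an OCR bounding box produces, so the CNN does not overfit to exact letter placement.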

Implementation details

In practice, our CNNs are optimized using the Adam algorithm [Kingma and Ba, 2015] with a mini-batch size of 15; the network was trained for a fixed number of update iterations at a constant learning rate. Our implementation also leverages the existing libraries TensorFlow [Abadi et al., 2015] and Keras [Chollet et al., 2015].

4.4. Constructing The Glyph Codebook

The CNNs not only help to recognize glyphs at runtime, but also enable us to systematically construct the glyph codebook, a lookup table that includes a set of perturbed glyphs for every character in commonly used fonts (such as Times New Roman, Helvetica, and Calibri). Our codebook construction aims to satisfy three criteria: (i) the included glyph perturbations must be perceptually minimal; their differences from the original fonts should be hardly noticeable to the eye. (ii) The CNNs must be able to distinguish the perturbed glyphs reliably. (iii) Meanwhile, we wish to include as many perturbed glyphs as possible for each character, in order to increase the capacity of embedding information (see §5).

To illustrate our algorithm, consider a single character in a given font. We use its location ū on the character’s font manifold to refer to the original glyph of the font. Our construction algorithm first finds a large set S of glyph candidates that are perceptually similar to ū. This step uses crowdsourced perceptual studies following a standard user-study protocol; we therefore defer its details until §6. The resulting set S typically has hundreds of glyphs. Next, we reduce the size of S by discarding glyphs that may confuse our CNNs in recognition tests. This is an iterative process, in which each iteration takes the following two steps (see Algorithm 1 in Appendix):

Confusion test. We randomly select pairs of glyphs in S (100 pairs in practice). For each pair, we check whether our CNN structure (introduced in §4.2) can reliably distinguish the two glyphs. In this case, we consider only the two glyphs, and thus the last FC layer (recall §4.2) has two dimensions. We train this network with only synthetic images processed by the data augmentations, and use synthetic images generated by different augmentations to test the accuracy of the resulting CNN. If the accuracy is less than 95%, we record this pair of glyphs in a set D.

Updating the glyph candidate set. Next, we update the glyph candidates in S to avoid the recorded glyph pairs that may confuse the CNNs, while retaining as many glyph candidates as possible. To this end, we construct a graph in which every node represents a glyph in S. Initially, this graph is completely connected. Then, the edge between two nodes is removed if the corresponding glyph pair is listed in D, meaning that the two glyphs are hardly distinguishable by our CNN. With this graph, we find the maximum set of glyphs that can be distinguished from each other. This amounts to the classic maximum clique problem on graphs. Although the maximum clique problem has been proven NP-hard (see page 97 of [Karp, 1972]), our graph is small (with up to 200 nodes) and its edges are sparse. Consequently, many existing algorithms suffice to find the maximum clique efficiently; in practice, we use the method of [Konc and Janezic, 2007]. Lastly, the glyphs corresponding to the nodes of the maximum clique form the updated glyph candidate set S, and we move on to the next iteration.
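The candidate-set update can be sketched as follows: remove edges for confusable glyph pairs, then keep a maximum clique of the remaining graph. The paper uses the dedicated solver of Konc and Janezic [2007]; this brute-force Bron-Kerbosch version, which enumerates maximal cliques and keeps the largest, only illustrates the idea on small graphs.

```python
def bron_kerbosch(r, p, x, adj, out):
    """Enumerate maximal cliques (no pivoting) into out."""
    if not p and not x:
        out.append(r)
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, out)
        p.remove(v)
        x.add(v)

def max_distinguishable_set(glyphs, confused_pairs):
    """Largest subset of glyphs containing no confusable pair."""
    adj = {g: set() for g in glyphs}
    for a in glyphs:                 # start fully connected ...
        for b in glyphs:
            if a != b and (a, b) not in confused_pairs \
                    and (b, a) not in confused_pairs:
                adj[a].add(b)        # ... minus edges the CNN confuses
    cliques = []
    bron_kerbosch(set(), set(glyphs), set(), adj, cliques)
    return max(cliques, key=len)

# Five candidate glyphs; the confusion test flagged two pairs.
keep = max_distinguishable_set([0, 1, 2, 3, 4], {(0, 1), (2, 3)})
```

With pairs (0, 1) and (2, 3) confusable, any valid subset keeps at most one glyph from each pair, so the result has three glyphs.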

This iterative process stops when S no longer changes; in our experiments, this takes up to 5 iterations. Because only synthetic images are used as training data, this process is fully automatic and fast, but it narrows down the glyph candidates conservatively: two glyphs in S might still be hardly distinguishable in real photos. Therefore, in a last step, we train a CNN as described in §4.3 using both synthetic images and real photos to verify that the CNN can reliably recognize all glyphs in S. Here the last FC layer of the CNN has a dimension equal to the size of S. At this point, S is small (typically around 25 glyphs), so the CNN can be easily trained. Afterwards, we evaluate the recognition accuracy of the CNN for each glyph in S, and discard the glyphs whose recognition accuracy is below 90%. The remaining glyphs in S are added to the codebook as the perturbed glyphs of the character.

4.5. Embedding and Retrieving Integers


Now we can embed integers into a text document in either vector graphic or pixel image format. First, we extract from the input document the text content and the layout of the letters. Our method also needs to know the original font of the text. This can be specified by the user or automatically obtained from the metadata of a vector graphic document (such as a PDF). If the document is provided as a pixel image, recent methods [Chen et al., 2014; Wang et al., 2015] can be used to recognize the text font.

In order to embed an integer i in a letter of a given font, we look up the letter’s perturbed glyph list in the precomputed codebook and generate the letter shaped by the glyph u_i [Campbell and Kautz, 2014]. We then scale the glyph to fit the bounding box of the original letter (detected by the OCR library for pixel images), and use it to replace the original letter in the input document.

Figure 5. A fragment of codebook. A complete codebook for all alphabetic characters in lower case for Times New Roman can be found in the supplemental document.

To retrieve integers from a pixel image, we extract the text content and identify the regions of individual letters using the OCR tool. After we crop the letter regions, they are processed in parallel by the recognition algorithm: each is preprocessed as described in §4.1 and then fed into the CNNs. An integer i is extracted from a letter if the i-th element of the output vector of its CNN is the largest (recall §4.2).

If the input document is provided as a vector graphics image (e.g., a PDF file), glyph recognition of a letter is straightforward. In our codebook, we store the outlines of the perturbed glyphs as polylines. In the integer embedding stage, these polylines are scaled to replace the paths of the original glyphs. In the retrieval stage, we compute the “distance” between the letter’s glyph polyline in the document, denoted as P, and every glyph u_i in the perturbed glyph list of the codebook: we first scale the polyline of u_i to match the bounding box of P, and then compute the distance between the polyline vertices of P and u_i. We recognize P as u_i (and thus extract the integer i) if the distance between P and u_i is the minimum.
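The vector-graphics branch can be sketched in a few lines: scale each codebook polyline to the query's bounding box, then pick the glyph with the smallest summed squared vertex distance. This sketch assumes all polylines of one letter share the same number and ordering of vertices, as they come from the same manifold parameterization.

```python
def scale_to_bbox(poly, bbox):
    """Affinely map a polyline (list of (x, y) points) into a bounding box."""
    xs, ys = [p[0] for p in poly], [p[1] for p in poly]
    x0, y0, x1, y1 = bbox
    sx = (x1 - x0) / (max(xs) - min(xs))
    sy = (y1 - y0) / (max(ys) - min(ys))
    return [((x - min(xs)) * sx + x0, (y - min(ys)) * sy + y0)
            for x, y in poly]

def recognize(query, codebook_polys):
    """Return the index i of the perturbed glyph closest to the query."""
    xs, ys = [p[0] for p in query], [p[1] for p in query]
    bbox = (min(xs), min(ys), max(xs), max(ys))

    def dist(poly):
        scaled = scale_to_bbox(poly, bbox)
        return sum((px - qx) ** 2 + (py - qy) ** 2
                   for (px, py), (qx, qy) in zip(scaled, query))

    return min(range(len(codebook_polys)), key=lambda i: dist(codebook_polys[i]))
```

Scaling to the query's bounding box makes the match invariant to the size at which the glyph appears in the document, which is the only degree of freedom the embedding stage introduced.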

5. Error Correction Coding

Having presented the embedding and retrieval of integer values in individual letters, we now focus on a plain message represented as a bit string and describe our coding scheme, which encodes the bit string into a series of integers that will in turn be embedded in individual letters. We also introduce an error-correction decoding algorithm that decodes the bit string from a series of integers, even when some integers are incorrect.

Algorithms presented in this section are built on the coding theory for noise channels, founded by Shannon’s landmark work [Shannon, 1948]. We refer the reader to the textbook [Lin and Costello, 2004] for a comprehensive introduction. Here, we start with a brief introduction of traditional error-correcting codes, to point out their fundamental differences from our coding problem.

5.1. Challenges of the Coding Problem

The most common error-correction coding scheme is block coding, in which an information sequence is divided into blocks of k symbols each. We denote a block using a k-tuple, (c_1, ..., c_k). For example, a 128-bit binary string can be divided into 16 blocks of 8-bit binary strings; in this case, k = 8, and each c_i is either 0 or 1. In general, the c_i can range over other sets of discrete symbols. A fundamental requirement of error-correction coding is that (i) the number of possible symbols for each c_i must be the same, and (ii) this number must be a prime power q = p^s, where p is a prime number and s is a positive integer. In abstract algebraic language, the blocks are required to be over a Galois field, GF(q). For example, the aforementioned 8-bit strings are over GF(2). Most current error-correction coding schemes (e.g., Reed-Solomon codes [Reed and Solomon, 1960]) are built on algebraic operations in the polynomial ring over GF(q). At encoding time, they transform k-tuple blocks into n-tuple blocks over the same Galois field GF(q), where n is larger than k. At decoding time, some symbols in an n-tuple block may be incorrect; but as long as the total number of incorrect symbols in the block is no more than ⌊(n − k)/2⌋, regardless of the locations of the errors, the coding scheme can fully recover the original k-tuple block.
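For Reed-Solomon-style codes, the number of correctable symbol errors is ⌊(n − k)/2⌋, where n − k is the added redundancy. A quick numeric check, with illustrative (n, k) values:

```python
def max_correctable_errors(n, k):
    """Symbol errors an (n, k) Reed-Solomon-style code can always correct."""
    return (n - k) // 2

# Expanding k = 8 data symbols to n = 12 code symbols tolerates
# 2 wrong symbols anywhere in the block.
print(max_correctable_errors(12, 8))

# The widely used RS(255, 223) code corrects up to 16 symbol errors.
print(max_correctable_errors(255, 223))
```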

In our problem, a text document consists of a sequence of letters. At first glance, we could divide the letter sequence into blocks of n letters each. Unfortunately, these blocks are ill-suited for traditional block coding schemes. For example, consider a five-letter block, (s_1, ..., s_5), in an original font. Every letter has a different capacity for embedding integers: in the codebook, letter s_j has m_j glyphs, so it can embed integers in the range [0, m_j − 1]. However, in order to use block coding, we need to find a prime power q that is no more than any m_j (to construct a Galois field GF(q)). Then every letter can embed only q integers, which can be significantly smaller than m_j. Perhaps a seemingly better approach is to find t_j = ⌊log2 m_j⌋ for every s_j, and use the 5-letter block to represent a t-bit binary string, where t = t_1 + ... + t_5. In this case, the binary string is over GF(2), valid for block coding. Still, this approach wastes much of the letters’ embedding capacity. For example, if a letter has 30 perturbed glyphs, this approach can embed integers only in [0, 15]. As experimentally shown in §8.2, it significantly reduces the amount of information that can be embedded. In addition, traditional block coding methods are usually concerned with a noisy communication channel, where errors occur at individual bits. In contrast, our recognition errors occur at individual letters: when the glyph of a letter is mistakenly recognized, a whole chunk of bits becomes incorrect, and the number of incorrect bits depends on the specific letter. It is thus harder to set a proper relative redundancy (i.e., n − k in block coding) to guarantee successful error correction.
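The capacity argument above can be checked numerically. The per-letter glyph counts m_j below are illustrative, not taken from the paper's codebook: truncating each letter to a power of two loses several bits per block compared with using the full glyph lists.

```python
from math import floor, log2, prod

# Hypothetical five-letter block: number of perturbed glyphs per letter.
m = [30, 25, 28, 30, 27]

# Power-of-two truncation: each letter carries floor(log2(m_j)) bits.
bits_truncated = sum(floor(log2(mj)) for mj in m)

# Using the full glyph lists, the block distinguishes prod(m_j) values,
# i.e. log2(prod(m_j)) bits of information.
bits_full = log2(prod(m))

print(bits_truncated, round(bits_full, 2))
```

Here truncation yields 20 bits, while the full mixed-radix capacity is slightly over 24 bits, a loss of roughly four bits in a single five-letter block.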

5.2. Chinese Remainder Codes

To address these challenges, we introduce a new coding scheme based on a 1,700-year-old number-theoretic result, the Chinese remainder theorem (CRT) [Katz and Imhausen, 2007]. Error-correction coding based on the Chinese remainder theorem has been studied in the field of theoretical computing [Goldreich et al., 1999; Boneh, 2000], known as Chinese Remainder Codes (CRC). Building on CRC, we propose a coding scheme offering improved error-correction ability (§5.3). We start our presentation by defining our coding problem formally.

Problem definition

Given a text document, we divide its letter sequence into blocks of n letters each. Consider a block, denoted as (z_1, ..., z_n), in an original font u. Our goal for now is to embed an integer m in this n-letter block, where m is in the range [0, M). We will map the plain message into a series of such integers later in this section. Formally, we seek an encoding function f(m) = γ, where γ = (γ_1, ..., γ_n) is an n-vector of integers. Following the coding theory terminology, we refer to γ as a codeword. The function f needs to ensure that every γ_i can be embedded in the letter z_i. Meanwhile, f must be injective, so γ can be uniquely mapped back to m through a decoding function g. Additionally, the computation of f and g must be fast, to encode and decode in a timely fashion.

Figure 6. Analogy of Euclidean distance on a 2D plane to gain intuition of Theorem 5.2. Here γ_1 and γ_2 represent two codewords having the minimum Hamming distance d. Suppose γ_1 is recognized as a code vector r. If the distance D(r, γ_1) <= ⌊(d-1)/2⌋, it is safe to decode r into γ_1, as no other codeword is closer to r. If D(r, γ_1) > ⌊(d-1)/2⌋, there might be multiple codewords having the same distance to r, and r may be mistakenly decoded as γ_2.

5.2.1. Hamming Distance Decoding

We now introduce Hamming Distance decoding, a general decoding framework for block codes, to pave the way for our coding scheme.

Definition 5.1 (Hamming Distance).

Given two codewords γ_a and γ_b, the Hamming distance D(γ_a, γ_b) measures the number of pair-wise elements in which they differ.

For example, given two codewords, γ_a = (1, 2, 3, 4, 5) and γ_b = (1, 0, 3, 0, 5), we have D(γ_a, γ_b) = 2.
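A minimal implementation of this distance:

```python
def hamming_distance(a, b):
    """Number of positions at which two equal-length codewords differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance((1, 2, 3, 4, 5), (1, 0, 3, 0, 5)))  # → 2
```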

Now we consider glyph recognition errors. If the integer retrieval from a letter's glyph is incorrect, then the input r of the decoding function may not be a valid codeword. In other words, because of the recognition errors, it is possible that no m satisfies f(m) = r. To distinguish it from a codeword, we refer to r as a code vector.

Hamming distance decoding uses a decoding function g based on the Hamming distance: g(r) returns the integer m such that the Hamming distance D(f(m), r) is minimized over all valid m. Intuitively, although r may not be a valid codeword, we decode it as the integer whose codeword is closest to r under the measure of Hamming distance. Let d denote the minimum Hamming distance among all pairs of valid codewords, namely,

d = min_{m_1 ≠ m_2} D(f(m_1), f(m_2)).

The error-correction ability of Hamming distance decoding is bounded by the following theorem [Lin and Costello, 2004],

Theorem 5.2 ().

Let e denote the number of incorrect symbols in a code vector r. Then, Hamming distance decoding can correctly recover the integer m if e <= ⌊(d-1)/2⌋; it can detect the errors in r but may not recover the integer correctly if ⌊(d-1)/2⌋ < e < d.

An illustrative understanding of this theorem using an analogy of Euclidean distance is shown in Fig. 6. This theorem holds regardless of the encoding function f, as long as it is a valid injective function.

5.2.2. Chinese Remainder Codes

We now introduce the following version of the Chinese Remainder Theorem, whose proof can be found in [Rosen, 2011].

Theorem 5.3 (Chinese Remainder Theorem).

Let p_1, ..., p_k denote positive integers which are mutually prime, and let P = p_1 p_2 ⋯ p_k. Then, there exists an injective function,

f: Z_P → Z_{p_1} × Z_{p_2} × ⋯ × Z_{p_k},

defined as f(m) = (m mod p_1, ..., m mod p_k), such that for all m in [0, P), m can be uniquely recovered from f(m).

This theorem indicates that given k pairs of integers (c_i, p_i) with all p_i being mutually prime, there exists a unique non-negative integer m < P satisfying m ≡ c_i (mod p_i) for all i. Indeed, m can be computed using the formula,

m = (Σ_{i=1}^{k} c_i M_i N_i) mod P,    (1)

where M_i = P / p_i, and N_i is computed by solving the modular equation

M_i N_i ≡ 1 (mod p_i)

using the classic Euclidean algorithm [Rosen, 2011].
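As a sketch, the reconstruction formula (1) can be implemented directly with Python's built-in modular inverse; the moduli and message below are illustrative:

```python
from math import prod

def crt_reconstruct(residues, moduli):
    """Recover m from pairs (c_i, p_i) with mutually prime moduli:
    m = (sum_i c_i * M_i * N_i) mod P, where M_i = P / p_i and
    N_i is the modular inverse of M_i modulo p_i."""
    P = prod(moduli)
    m = 0
    for c, p in zip(residues, moduli):
        M = P // p
        N = pow(M, -1, p)  # Euclidean-algorithm-based modular inverse
        m += c * M * N
    return m % P

# Example: m = 1000 with moduli (7, 11, 13); the residues are m mod p_i.
residues = [1000 % p for p in (7, 11, 13)]
print(crt_reconstruct(residues, (7, 11, 13)))  # → 1000
```

The three-argument `pow(M, -1, p)` form requires Python 3.8 or later.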

If we extend the list of p_i to n mutually prime numbers (n > k) such that the additional numbers are all larger than the first k (i.e., p_j > p_i for any j > k and i <= k), then we have an encoding function for a non-negative integer m < p_1 ⋯ p_k:

f(m) = (m mod p_1, m mod p_2, ..., m mod p_n).    (2)

This encoding function already adds redundancy: because m is smaller than the product of any k numbers chosen from p_1, ..., p_n, we can compute m from any k of the n pairs (c_i, p_i), according to the Chinese Remainder Theorem. Indeed, as proved in [Goldreich et al., 1999], the minimum Hamming distance of the encoding function (2) is d = n - k + 1. Thus, the Hamming decoding function of (2) can correct up to ⌊(n-k)/2⌋ errors by Theorem 5.2.


We now describe our encoding algorithm based on the encoding function in (2).


  • Computing p_i. Suppose that the letter sequence of a document has been divided into blocks, denoted as B_j, each with n letters. Consider a block (z_1, ..., z_n), where z_i indicates a letter with its original font, whose integer embedding capacity is s_i (i.e., z_i's font has s_i perturbed glyphs in the codebook). We depth-first search n mutually prime numbers p_1, ..., p_n, such that each p_i <= s_i and the product of the k minimal p_i is maximized. At the end, we obtain the p_i and the product of the k minimal p_i, denoted as M_j, for each block B_j. Note that if we cannot find n mutually prime numbers for a block B_j, we simply ignore the letter whose embedding capacity is smallest among all and include the next letter in this block. We repeat this process until we find a valid set of mutually prime numbers.

  • Determining m_j. Given the plain message represented as a bit string, we now split the bits into a sequence of chunks, each of which is converted into an integer m_j and assigned to a block B_j. We assign to each block B_j an integer m_j with ⌊log_2 M_j⌋ bits, which is sequentially cut from the bit string (see Fig. 7).

    Figure 7. An example of determining m_j: (top) The bit string representation of the plain message; (bottom) Each block of letters is highlighted by a color; (middle) The integers m_j assigned to each block.
  • Embedding.  For every block B_j, we compute the codeword using the CRT encoding function (2), obtaining γ = f(m_j). Each γ_i is then embedded in the glyph of the i-th letter in the block as described in §4.5.
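The first and last steps can be sketched as follows; the search below is a simplified, unpruned stand-in for the depth-first search described above, and the per-letter capacities are hypothetical:

```python
from math import gcd, prod

def choose_moduli(capacities, k):
    """Backtracking search for pairwise-coprime p_i <= s_i maximizing the
    product of the k smallest p_i (the block capacity M). A simplified,
    exhaustive stand-in for the paper's depth-first search."""
    n = len(capacities)
    best, best_M = None, -1

    def dfs(i, chosen):
        nonlocal best, best_M
        if i == n:
            M = prod(sorted(chosen)[:k])
            if M > best_M:
                best, best_M = list(chosen), M
            return
        for p in range(capacities[i], 1, -1):  # try larger moduli first
            if all(gcd(p, q) == 1 for q in chosen):
                dfs(i + 1, chosen + [p])

    dfs(0, [])
    return best, best_M

def encode(m, moduli):
    """CRT encoding function (2): gamma_i = m mod p_i."""
    return [m % p for p in moduli]

# Hypothetical per-letter capacities for one 5-letter block, with k = 3.
moduli, M = choose_moduli([12, 11, 10, 9, 8], k=3)
print(moduli, M)
print(encode(123, moduli))
```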


At decoding time, we recognize the glyphs of the letters in a document and extract integers from them, as detailed in §4.5. Next, we divide the letter sequence into blocks, and repeat the algorithm of computing p_i and M_j as in the encoding step, for every block. Given a block B_j, the extracted integers from its letters form a code vector r = (r_1, ..., r_n). We refer to it as a code vector because some of the r_i may be incorrectly recognized. To decode r, we first compute m' using the CRT formula (1) with all n pairs (r_i, p_i) in the block. If m' < M_j, then m' is the decoding result m_j, because the Hamming distance D(f(m'), r) = 0. Otherwise, we decode r using the Hamming decoding function: concretely, since we know the current block can encode an integer in the range [0, M_j), we decode r into the integer m_j by finding

m_j = argmin_{m ∈ [0, M_j)} D(f(m), r).    (3)
As discussed above, this decoding function can correct up to ⌊(n-k)/2⌋ incorrectly recognized glyphs in each block. Lastly, we convert each m_j into a bit string and concatenate the strings from all blocks sequentially to recover the plain message.
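The decoding step can be sketched with a brute-force search over the block's integer range, using hypothetical moduli; with n = 5 and k = 3, a single simulated recognition error is corrected:

```python
from math import prod

def encode(m, moduli):
    # CRT encoding function (2): gamma_i = m mod p_i.
    return [m % p for p in moduli]

def decode(r, moduli, k):
    """Hamming-distance decoding (3): among all m in [0, M), where M is
    the product of the k smallest moduli, return the m whose codeword is
    closest to the extracted code vector r."""
    M = prod(sorted(moduli)[:k])
    def dist(a, b):
        return sum(x != y for x, y in zip(a, b))
    return min(range(M), key=lambda m: dist(encode(m, moduli), r))

moduli = [11, 9, 8, 7, 5]          # hypothetical pairwise-coprime moduli
m = 123
r = encode(m, moduli)
r[2] = (r[2] + 1) % moduli[2]      # simulate one glyph recognition error
print(decode(r, moduli, k=3))      # → 123
```

Since the minimum Hamming distance here is d = n - k + 1 = 3, one error per block is always correctable.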

Implementation details

A few implementation details are worth noting. First, oftentimes the letters in a document can carry a bit string much longer than the given plain message. To indicate the end of the message, we attach a special chunk of bits (end-of-message bits) at the end of each plain message, much akin to the end-of-line (newline) character used in digital text systems. Second, in practice, blocks should be relatively short (i.e., n should be small). If n is large, it becomes much harder to find n mutually prime numbers that are no more than each letter's embedding capacity. In practice, we choose n = 5 and k = 3, which allows one mistaken letter in every 5-letter block. The small n and M_j also make the brute-force search of m_j in (3) sufficiently fast, although there also exists a method solving (3) in polynomial time [Goldreich et al., 1999; Boneh, 2000].

5.3. Improved Error Correction Capability

Using Chinese Remainder Codes, the error-correction capacity is upper bounded: Theorem 5.2 indicates that at most ⌊(n-k)/2⌋ mistakenly recognized glyphs are allowed in every n-letter block. We now break this theoretical upper bound to further improve our error-correction ability, by exploiting specific properties of our coding problem. To this end, we propose a new algorithm based on maximum likelihood decoding [Lin and Costello, 2004].

In coding theory, maximum likelihood decoding is not an algorithm. Rather, it is a decoding philosophy, a framework that models the decoding process from a probabilistic rather than an algebraic point of view. Consider a letter block and the code vector r formed by the extracted integers of its letters. We treat the true codeword γ encoded by the block as a latent variable (in statistical language), and model the probability of γ given the extracted code vector r, namely P(γ | r). With this likelihood model, our decoding process finds a codeword γ* that maximizes the probability P(γ | r), and decodes γ* into an integer using the Chinese Remainder Theorem formula (1).

When there are at most ⌊(n-k)/2⌋ errors in a block, the Hamming decoding function (3) is able to decode. In fact, it can be proved that when the number of errors is under this bound, there exists a unique m that minimizes D(f(m), r) [Lin and Costello, 2004], and that when the number of errors exceeds it, there may be multiple m reaching the minimum (see Fig. 6 for an intuitive explanation). The key idea of breaking this bound is to use a likelihood model to choose an m when ambiguity occurs for the Hamming decoding function (3).

Consider a block with errors, and suppose in (3) we find L different integers m_1, ..., m_L, all of which lead to the same minimal Hamming distance to the code vector r. Let γ_1, ..., γ_L denote the corresponding codewords. We use another subscript j to index the integer elements in code vectors and codewords. For every γ_l, some of its integer elements γ_{l,j} differ from the corresponding elements r_j of the code vector. Here γ_{l,j} can be interpreted as the index of a perturbed glyph of the j-th letter. If m_l is indeed the embedded integer, then the j-th letter's glyph is mistakenly recognized, resulting in a different glyph number r_j.

We first model the probability of this occurrence, denoted as P(r_j | γ_{l,j}), using our "distance" metric. Intuitively, the closer the two glyphs are, the more likely one is mistakenly recognized as the other. Then the probability of recognizing a codeword γ_l as a code vector r accounts for all inconsistent element pairs in γ_l and r, namely,

P(r | γ_l) = ∏_{j: γ_{l,j} ≠ r_j} P(r_j | γ_{l,j}).    (4)

Here P(r_j | γ_{l,j}) is (inversely) related to our distance metric between the two glyphs: for pixel images, P(r_j | γ_{l,j}) is the probability of recognizing the glyph indexed by γ_{l,j} as the extracted glyph, which is the corresponding softmax output from the CNN of the j-th letter, given the input glyph image (recall §4.2); for vector graphics, P(r_j | γ_{l,j}) is proportional to the inverse of the vector graphic glyph distance (defined in §4.5), with a normalizing denominator that iterates through all perturbed glyphs of the j-th letter in the codebook. For pixel images, the denominator is always 1 because of the unitary property of the softmax function. Lastly, the likelihood

P(γ_l | r) needed for decoding is computed by Bayes' theorem,

P(γ_l | r) = P(r | γ_l) P(γ_l) / P(r).    (5)

Here, P(r) is fixed, and so is P(γ_l), as all codewords are equally likely to be used. As a result, we find the γ_l that maximizes (5) among all ambiguous codewords, and decode r using the corresponding m_l.
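The tie-breaking idea can be sketched as follows; the code vector, candidate codewords, and confusion probabilities are all made-up illustrations of Eq. (4):

```python
def likelihood(candidate, r, confusion):
    """P(r | gamma) per Eq. (4): the product of per-letter confusion
    probabilities over the positions where the candidate codeword and
    the code vector r disagree. confusion[j] maps a codebook glyph
    index to the (hypothetical) probability of recognizing that glyph
    as the extracted glyph r[j]."""
    p = 1.0
    for j, (g, e) in enumerate(zip(candidate, r)):
        if g != e:
            p *= confusion[j][g]
    return p

# Two codewords tie under Hamming distance (each differs from r in one
# position); the likelihood model breaks the tie.
r = (2, 6, 4, 4, 3)
candidates = [(2, 6, 3, 4, 3), (2, 6, 4, 0, 3)]
confusion = [
    {}, {},
    {3: 0.30},   # glyph 3 of letter 3 is often misread as glyph 4
    {0: 0.05},   # glyph 0 of letter 4 is rarely misread as glyph 4
    {},
]
best = max(candidates, key=lambda c: likelihood(c, r, confusion))
print(best)  # → (2, 6, 3, 4, 3)
```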

In our experiments (§8.3), we show that the proposed decoding scheme indeed improves the error-correction ability. For our implementation, wherein n = 5 and k = 3, in each block we are able to correct up to 2 errors, as opposed to the 1 indicated by Theorem 5.2, thereby increasing the error tolerance from 20% to 40%.

Figure 8. MTurk result. (left) The font manifold of the letter “a” color-mapped using the perceptual similarity value of each point to the standard Times New Roman font. (right) Sampled glyphs that correspond to the manifold locations indicated by red circles. The colors indicate their perceptual similarity to Times New Roman.

6. Perceptual Evaluation

We now describe our crowd-sourced perceptual studies on Mechanical Turk (MTurk). The goal of the studies is, for each character in a particular font, to find a set of glyphs that are perceptually similar to the character's original glyph. When constructing the glyph codebook, we use these sets of glyphs to initialize the candidates of perturbed glyphs (recall §4.4).

The user studies adopt a well-established protocol: we present the MTurk raters multiple questions, and each question uses a two-alternative forced choice (2AFC) scheme, i.e., the MTurk raters must choose one of two options. We model a rater's response using a logistic function (known as the Bradley-Terry model [1952]), which allows us to learn a perceptual metric (in this case the perceptual similarity of glyphs) from the raters' responses. We note that this protocol has been used previously in other perceptual studies (e.g., [O’Donovan et al., 2014; Um et al., 2017]), and that we will later refer back to this protocol in §8.3 when we evaluate the perceptual quality of our results. Next, we describe the details of the studies.


We first choose on the font manifold a large region centered at the location of the original font, and sample locations densely in this region. In practice, all font manifolds are in 2D, and we choose 400 locations uniformly in a square region centered at the original font's location. Let G denote the set of glyphs corresponding to the sampled manifold locations. Then, in each MTurk question, we present the rater a pair of glyphs randomly selected from G, and ask which of the two glyphs looks closer to the glyph of the original font. An example is shown in Fig. 9. We assign 20 questions to each MTurk rater. Four of them are control questions designed to detect untrustworthy raters: they ask the rater to compare the same pair of glyphs presented in different orders. We reject raters whose answers have more than one inconsistency among the control questions. In the end, about 200 raters participated in the studies for each character.

Figure 9. MTurk user interface.

From the raters' responses, we estimate a scalar value v_g, the perceptual similarity of each glyph g ∈ G to the original glyph. Let a set of tuples (g_{i,1}, g_{i,2}, a_i, b_i) record the user study responses, where g_{i,1} and g_{i,2} are the two randomly selected glyphs being compared, a_i is the MTurk rater's ID, and the binary value b_i is the rater's choice: b_i = 1 indicates the rater judges g_{i,1} perceptually closer to the original glyph, and b_i = 0 otherwise. We model the likelihood of the rater's choice using a logistic function,

P(b_i = 1) = 1 / (1 + exp(-r_{a_i} (v_{g_{i,1}} - v_{g_{i,2}}))),    (6)

where the scalar r_a indicates rater a's reliability. With this model, all the similarity values and rater reliability values are obtained by minimizing the negative log-likelihood objective function,

-Σ_i log P(b_i),

where i indexes the MTurk responses, r stacks the raters' reliability values, and v stacks the similarity values v_g for all glyphs in G. The values v_g are then normalized into the range [0, 1].
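A toy version of this fit on synthetic responses (with rater reliabilities held fixed for simplicity, unlike the full model) might look like:

```python
import math

# Synthetic 2AFC responses: (g1, g2, rater, choice); choice = 1 means the
# rater judged g1 perceptually closer to the original glyph.
responses = [(0, 1, 0, 1), (0, 2, 0, 1), (1, 2, 0, 1),
             (0, 1, 1, 1), (0, 2, 1, 1), (2, 1, 1, 0)]
n_glyphs, n_raters = 3, 2

v = [0.0] * n_glyphs       # perceptual similarity values (to be learned)
a = [1.0] * n_raters       # rater reliabilities (fixed in this sketch)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def nll():
    """Negative log-likelihood of all responses under the logistic
    (Bradley-Terry) model P(choose g1) = sigmoid(a_r * (v_g1 - v_g2))."""
    loss = 0.0
    for g1, g2, rtr, b in responses:
        p = sigmoid(a[rtr] * (v[g1] - v[g2]))
        loss -= math.log(p if b == 1 else 1.0 - p)
    return loss

# Plain gradient descent on the similarity values v.
lr = 0.1
for _ in range(500):
    grad = [0.0] * n_glyphs
    for g1, g2, rtr, b in responses:
        p = sigmoid(a[rtr] * (v[g1] - v[g2]))
        g = a[rtr] * (p - (1 if b == 1 else 0))
        grad[g1] += g
        grad[g2] -= g
    v = [vi - lr * gi for vi, gi in zip(v, grad)]

print([round(x, 2) for x in v])  # glyph 0 wins every comparison, so it ranks highest
```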

We learn the similarity values for the glyphs of every character of a font independently. Fig. 8 visualizes the perceptual similarity of glyphs near the standard Times New Roman for the character “a”. Lastly, we iterate through all the pre-sampled glyphs in G, and add those whose similarity value is larger than a threshold (0.85 in practice) into the set of perceptually similar glyphs used for constructing the codebook (in §4.4).

7. Applications

Our method finds many applications. In this section, we discuss four of them, while referring to the supplemental video for their demonstrations and to §8.1 for implementation summaries.

7.1. Application I: Format-Independent Metadata

Many digital productions carry metadata that provide additional information, resources, and digital identification [Greenberg, 2005]. Perhaps most well-known is the metadata embedded in photographs, providing information such as camera parameters and copyright. PDF files can also contain metadata. (If you are reading this paper with Adobe Acrobat Reader, you can view its metadata by choosing “File → Properties” and clicking “Additional Metadata” under the “Description” tab.) In fact, metadata has been widely used by numerous tools to edit and organize digital files.

Currently, the storage of metadata is ad hoc, depending on the specific file format. For instance, a JPEG image stores metadata in its EXIF header, while an Adobe PDF stores its metadata in XML format with Adobe’s XMP framework [Adobe, 2001]. Consequently, metadata is lost whenever one converts an image from JPEG to PNG format, or rasterizes a vector graphics image into a pixel image. Although it is possible to develop a careful converter that painstakingly preserves the metadata across all file formats, the metadata is still lost whenever the image or document is printed on paper.

Our FontCode technique can serve as a means to host text document metadata. More remarkably, the metadata storage in our technique is format-independent, because format conversion and rasterization of text documents, as well as printing on paper, all preserve the glyphs of letters. Once information is embedded in a document, one can freely convert it to a different file format, or rasterize it into a pixel image (as long as the image is not severely downsampled), or print it on a piece of paper. Throughout, the metadata is retained (see video).

7.2. Application II: Imperceptible Optical Codes

Our FontCode technique can also be used as an optical barcode embedded in a text document, akin to QR codes [Denso, 2011]. Barcodes have numerous applications in advertising, sales, inventory tracking, robotics, augmented reality, and so forth. Similar to QR codes, which embed a certain level of redundancy to correct decoding errors, FontCode also supports error-correction decoding. However, all existing barcodes require printing black-and-white blocks and bars, which can be visually distracting and aesthetically imperfect. Our technique, in contrast, enables not only an optical code but an unobtrusive optical code, as it only introduces subtle changes to the text appearance. Our retrieval algorithm is sufficiently fast to provide point-and-shoot message decoding. It can be particularly suitable as a replacement for QR codes in an artistic work such as a poster or flyer design, where visual distraction needs to be minimized. As a demonstration, we have implemented an iPhone application to read coded text (see video).

7.3. Application III: Encrypted Message Embedding

Our technique can further encrypt a message when embedding it in a document, even if the entire embedding and retrieval algorithms are made public. Recall that when embedding an integer in a letter, we replace the letter's original glyph with a glyph chosen from the letter's list of perturbed glyphs in the codebook. Even though the perturbed glyphs for every character in a particular font are precomputed, the order of the glyphs in each list can be arbitrarily user-specified. The particular orders of all the lists together can serve as an encryption key.

For example, when Alice and Bob (here we follow the convention in cryptography, using Alice and Bob as placeholder names for the convenience of presenting algorithms) communicate through encoded documents, they can use a publicly available codebook, but agree on a private key, which specifies a glyph permutation for each character's list. If a letter's original list is permuted by the key, then Alice uses the glyph at the permuted position, rather than the original position, to embed an integer in the letter, and Bob deciphers the message using the same permuted codebook. For the codebook that we precomputed for Times New Roman (see the supplemental document), if we only consider the lowercase English alphabet, there exist about 2^837 different keys (which amount to an 837-bit key); if we also include the uppercase English alphabet, there exist about 2^1470 keys (which amount to a 1470-bit key). Thus, without resorting to any existing cryptographic algorithm, our method already offers a basic encryption scheme. Even if others carefully examine the text glyphs and discover that a document is indeed embedding a message, the message can still be protected from leaking.

7.4. Application IV: Text Document Signature

Leveraging existing cryptographic techniques, we augment FontCode to propose a new digital signature technique, one that can authenticate the source of a text document and guarantee its integrity (thereby protecting it from tampering). This technique has two variations, which work as follows:

Scheme 1

When Alice creates a digital document, she maps the document content (e.g., including letters, digits, and punctuation) into a bit string through a cryptographic hash function such as MD5 [Rivest, 1992] or SHA [Eastlake and Jones, 2001]. We call this bit string the document’s hash string. Alice then chooses a private key to permute the codebook as described in §7.3, and uses the permuted codebook to embed the hash string into her document. If Bob tries to tamper with this document, any change leads to a different hash string. Without knowing Alice’s private key, he cannot embed the new hash string in the document, and thus cannot tamper with the document successfully. Later, Alice can check the integrity of her document by extracting the embedded hash string and comparing it against the hash string of the current document.
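A minimal sketch of the hash-string step using Python's standard hashlib (the glyph embedding itself is omitted):

```python
import hashlib

def document_hash_bits(text: str) -> str:
    """Map a document's content to a 128-bit hash string via MD5,
    one of the cryptographic hash functions named above."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return "".join(f"{byte:08b}" for byte in digest)

original = "FontCode embeds information by perturbing glyphs."
tampered = "FontCode embeds information by perturbing glyphs!"

h1 = document_hash_bits(original)
h2 = document_hash_bits(tampered)
print(len(h1), h1 == h2)  # → 128 False : any edit changes the hash string
```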

Scheme 2

The above algorithm allows only Alice to check the integrity of her document, as only she knows her private key. By combining with asymmetric cryptography such as the RSA algorithm [Rivest et al., 1978], we allow everyone to check the integrity of a document but no one to tamper with it. Now, the codebook is public and not permuted. After Alice generates the hash string of her document, she encrypts the hash string using her private key under an asymmetric cryptosystem, obtaining an encrypted string. Then, she embeds the encrypted string in the document using the FontCode method, while making her public key available to everyone. In this case, Bob cannot tamper with the document, as he does not have Alice’s private key to encrypt the hash string of an altered document. But everyone in the world can extract the encrypted string using the FontCode method, and decipher the hash string using Alice’s public key. If the deciphered hash string matches the hash string of the current document, it proves that (i) the document is indeed sourced from Alice, and (ii) the document has not been modified by others.


In comparison to existing digital signatures such as those in Adobe PDFs, our method is format-independent. In contrast to PDF files, whose signatures are lost when the files are rasterized or printed on physical paper, our FontCode signature is preserved under file format conversion, rasterization, and physical printing.

Further extension

In digital text forensics, it is often desired to not only detect tampering but also locate where in the text the tampering occurs. In this regard, our method can be extended for more detailed tampering detection. As shown in our analysis (§8.2), in a typical English document, we only need about 80 letters to embed (with error correction) a string of 128 bits, which is the length of a hash string resulting from a typical cryptographic hash function (e.g., MD5). Given a document, we divide its text into a set of segments, each with at least 80 letters. We then compute a hash string for each segment and embed the encrypted strings in the individual segments. This creates a fine granularity of text signatures, allowing the user to check which text segment has been modified, and thereby locate tampering occurrences more precisely. For example, in the current text format of this paper, every two-column line consists of around 100 letters, meaning that our method can localize tampering down to individual two-column lines of this paper. To our knowledge, digital text protection with such a fine granularity has not been realized.

Discussion about the storage

We close this section with a remark on the memory footprint of our steganographic documents. If a document is stored as a pixel image, it consumes no additional memory. If it is in a vector graphics format, our current implementation stores the glyph contour polylines of all letters. A document with 376 letters and 639 bits encoded consumes 1.2 MB in compressed SVG form and 371 KB in compressed JPEG format. PDF files can embed glyph shapes in the file and refer to those glyphs in the text. Using this feature, we can also embed the entire codebook in a PDF, introducing about 1.3 MB of storage overhead, regardless of the text length. In the future, if all glyphs in the codebook were pre-installed on the operating system, like the current standardized fonts, the memory footprint of vector graphic documents could be further reduced.

Figure 10. (top) An input text document. (bottom) The output document that embeds a randomly generated message.

8. Results and Validation

We now present results and experiments that analyze the performance of our technique and validate our algorithmic choices. Here we consider text documents with the English alphabet, including both lower- and uppercase letters, although the same method applies directly to digits and other special characters. We first present our main results (§8.1), followed by the numerical (§8.2) and perceptual (§8.3) evaluation of our method.

8.1. Main Results

We implemented the core coding scheme on an Intel Xeon E5-1620 8-core 3.60GHz CPU with 32GB of memory. The CNNs are trained on an NVIDIA GeForce TITAN X GPU. Please see our accompanying video for the main results. A side-by-side comparison of an original document with a coded document is shown in Fig. 10.

Metadata viewer

We implemented a simple text document viewer that loads a coded document in vector graphics or pixel image format. The viewer displays the document. Meanwhile, it extracts the embedded metadata with our decoding algorithm and presents it in a side panel.

Unobtrusive optical codes

We also implemented an iPhone application (see Fig. 1), by which the user can take a photo of an encoded text displayed on a computer screen or printed on paper. The iPhone application interface allows the user to select a text region to capture. The captured image is sent through the network to a decoding server, which recovers the embedded message and sends it back to the smartphone.

Embedding encrypted message

Our implementation allows the user to load an encryption key file that specifies the permutations for all the lists of perturbed glyphs in the codebook. A permutation can be manually edited, or randomly generated: given a glyph list of length s, one can randomly sample a permutation from the symmetric group S_s [Seress, 2003] and attach it to the key.

Text document signature

We use our technique to generate an MD5 signature as described in §7. Since the MD5 checksum has only 128 bits, we always embed it in the letters at the beginning of the text. Our text document viewer can check the signature and alert the user if the document appears to have been tampered with.

8.2. Validation

Figure 11. (left) The accuracy of our CNN decoder as the resolution (in height) of the letters increases. (right) Theoretical improvement of maximum likelihood decoding (green) over Chinese remainder coding (orange).
Information capacity

As described in §5, we use n = 5 and k = 3 in our error-correction coding scheme. In every 5-letter block, if the mutually prime numbers are p_1, ..., p_5, then this block can encode integers in [0, p_a p_b p_c), where p_a, p_b, and p_c are the smallest three numbers among all p_i. Thus, this block can encode at most ⌊log_2(p_a p_b p_c)⌋ bits of information.

To estimate the information capacity of our scheme for English text, we randomly sample 5 characters from the alphabet to form a block. The characters are sampled according to the widely known English letter frequencies (e.g., “e” is the most frequently used while “z” is the least used) [Ferguson and Schneier, 2003]. We then compute the average number of bits that can be encoded per character. The result is 1.77 bits, suggesting that on average we need 73 letters to encode a 128-bit MD5 checksum for the digital signature application (§7).

Next, we compare our coding scheme with the simple approach discussed in §5.1. Recall that our method can correct at least one error in a block of 5 letters, which amounts to correcting all of the bits carried by one letter out of the bits carried by the whole block. The simple approach in §5.1 can store t = t_1 + ... + t_5 bits of information per block. But in order to correct one recognition error of a letter with standard linear block codes, it needs to spend a corresponding number of bits on redundancy, leaving fewer bits to store information. We compare it with our method using The Lord of the Rings, Chapter 1 as our input text, which contains in total 41682 useful letters (9851 words). As shown in the adjacent figure, as the number of letters increases, our method can embed significantly more information.

Decoding accuracy

We first evaluate the glyph recognition accuracy of our CNNs. For every character, we print it repeatedly on paper with randomly chosen glyphs from the codebook and take five photos under different lighting conditions. Each photo has regions of around 220px × 220px containing a character. We use these photos to test the CNN recognition accuracy; for all characters, the accuracy is above 90%.

We also evaluate the decoding accuracy of our method. We observe that decoding errors are mainly caused by image rasterization. If the input document is in a vector graphics format, the decoding result is fully accurate, as we know the glyph outlines precisely. Thus, we evaluate the decoding accuracy on pixel images. We again use The Lord of the Rings, Chapter 1 to encode a random bit string. We rasterize the resulting document into images of different resolutions, and measure how many letters and blocks can be decoded correctly. Fig. 11-left shows the testing result.

We also theoretically estimate the decoding robustness of our maximum likelihood decoding method (§5.3). Suppose the probability of correctly recognizing the glyph of a single letter is a constant p. The probability P_b of correctly decoding a single 5-letter block can then be derived analytically: if we only use the Chinese remainder decoding algorithm (§5.2), which corrects one error per block, P_b is p^5 + 5p^4(1-p). With the maximum likelihood decoding (§5.3), which corrects up to two errors per block, P_b becomes p^5 + 5p^4(1-p) + 10p^3(1-p)^2. The improvement is visualized in Fig. 11-right.
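Both estimates are instances of a binomial sum over the number of tolerated errors, assuming a block decodes correctly exactly when at most t glyphs are misrecognized:

```python
from math import comb

def block_success(p, n=5, t=1):
    """Probability of correctly decoding an n-letter block when each
    glyph is recognized with probability p and up to t errors can be
    corrected: sum over e = 0..t of C(n, e) * p^(n-e) * (1-p)^e."""
    return sum(comb(n, e) * p ** (n - e) * (1 - p) ** e for e in range(t + 1))

p = 0.9
crt_only = block_success(p, t=1)   # Chinese remainder decoding (§5.2)
with_ml = block_success(p, t=2)    # maximum likelihood decoding (§5.3)
print(round(crt_only, 4), round(with_ml, 4))
```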


We use TensorFlow [Abadi et al., 2015] with GPU support to train the CNNs and decode the input. It takes 0.89 seconds to decode a text sequence with 176 letters (30 blocks). Our encoding algorithm runs in a single CPU thread, taking 7.28 seconds for the same number of letters.

Figure 12. Decoding probability. (top) A small region of a photo to be decoded, where red boxes indicate recognition errors. (bottom) Each row visualizes the probabilities of recognizing input glyphs (from the image) as the perturbed glyphs. Along the x-axis, every bar corresponds to a perturbed glyph in the codebook. The values on the y-axis are the outputs from our CNNs. When an error occurs, the blue and red bars indicate the discrepancy between the correctly and incorrectly recognized glyphs.
Error correction improvement

In §5.3, we hypothesize that the probability of recognizing an input pixel glyph as a glyph in the codebook is proportional to the softmax output of the CNNs. We validated this hypothesis experimentally. As shown in the example of Fig. 12, when an input glyph is mistakenly recognized as a glyph g′ as opposed to its true glyph g, the softmax values for g and g′ are both high (although the value for g′ is higher) and close to each other, indicating that the input may be recognized as g or g′ with close probabilities. Thus, we conclude that using a likelihood model proportional to the softmax output is reasonable.

Additionally, we extensively tested our decoding algorithm on photos of text documents, and verified that it can successfully decode all 5-letter blocks that have at most 2 errors. A small snapshot of our tests is shown in Fig. 13.

Figure 13. Correcting 2 errors. We decode information from an input photo. Red boxes indicate where recognition errors occur. Although one block contains two errors, our decoding algorithm still decodes successfully.

8.3. Perceptual Evaluation

To evaluate the subjective distortion introduced by perturbing the glyphs of a document, we conducted two user studies on MTurk. Both studies follow the standard two-alternative forced choice (2AFC) protocol, which is described in §6 and has been used in other perceptual studies.

Figure 14. User Study A. Categories (a)–(f) are: the original font glyphs; perturbed glyphs selected with thresholds 0.95, 0.85 (the value we use), 0.75, and 0.65; and a different font, respectively.

Study A assesses the perceptual distortion of a perturbed glyph with respect to a standard font. We prepare a set of paragraphs, and the glyphs of each paragraph are from one of six categories: (1) the standard Times New Roman; (2-5) the perturbed glyphs from four glyph codebooks, in which the thresholds used to select the perceptually similar glyph candidates are 0.95, 0.85 (i.e., the value we use in §6), 0.75, and 0.65, respectively; and (6) a different font (Helvetica). The font size of each paragraph ranges from 25pt to 60pt, so the number of letters in the paragraphs varies. In each 2AFC question, we present the MTurk rater three short paragraphs: one is in standard Times New Roman, and the other two are randomly chosen from two of the six categories. We ask the rater to select from the latter two paragraphs the one whose font is closest to standard Times New Roman (shown in the first paragraph). We assign 16 questions of this type to each rater; 169 raters participated. An example question is included in the supplemental document.

After collecting the responses, we use the same model (6) to quantify the perceptual difference of the paragraphs in each of the six categories with respect to the paragraphs in standard Times New Roman. In this case, the score in Eq. (6) measures the perceptual difference of a category of paragraphs relative to the paragraphs in standard Times New Roman. As shown in Fig. 14, the results suggest that the glyphs in our codebook (generated with a threshold of 0.85) lead to paragraphs that are perceptually close to the paragraphs in original glyphs: much closer than the glyphs selected by a lower threshold, and almost as close as the glyphs selected by a higher threshold (i.e., 0.95).
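Model (6) is not reproduced here; for 2AFC data of this kind, a standard choice following Bradley and Terry [1952] is the logistic pairwise-comparison model, in which each category i receives a latent score s_i and P(i preferred over j) = σ(s_i − s_j). The sketch below fits such scores by gradient ascent on the log-likelihood; the function name and optimization settings are illustrative assumptions, not our actual fitting procedure.

```python
import numpy as np

def fit_bradley_terry(wins, n_items, iters=2000, lr=0.1):
    """Fit latent scores s from pairwise 2AFC outcomes under the
    Bradley-Terry model P(i beats j) = sigmoid(s_i - s_j), by gradient
    ascent on the log-likelihood. `wins` is a list of (winner, loser)
    index pairs, one per rater response."""
    s = np.zeros(n_items)
    for _ in range(iters):
        grad = np.zeros(n_items)
        for i, j in wins:
            p = 1.0 / (1.0 + np.exp(-(s[i] - s[j])))  # P(i beats j)
            grad[i] += 1.0 - p   # winner pulled up by the surprise term
            grad[j] -= 1.0 - p   # loser pushed down by the same amount
        s += lr * grad
        s -= s.mean()  # scores are only defined up to a shift; pin the mean
    return s
```

Higher fitted scores correspond to categories more often judged closer to the reference (Study A) or more pleasing (Study B).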

Study B assesses how much the use of perturbed glyphs in a paragraph affects the aesthetics of its typeface. We prepare a set of paragraphs whose glyphs are from one of 12 categories: we consider four different fonts (see Fig. 15), and for each font we generate the glyphs in three ways: (1) the unperturbed standard glyphs; (2) the perturbed glyphs from our codebook, using a perceptual threshold of 0.85; and (3) the perturbed glyphs from a codebook using a threshold of 0.7. In each 2AFC question, we present the MTurk rater two short paragraphs randomly chosen from two of the 12 categories, and ask the rater to select the paragraph whose typeface is aesthetically more pleasing. We assign 16 questions of this type to each rater; 135 raters participated. An example question is also included in the supplemental document.

Again, using the logistic model (6), we quantify the aesthetics of the typeface for paragraphs in the aforementioned three categories of glyphs. Fig. 15 shows the results, indicating that, while the aesthetics are different across the four fonts, for each font paragraphs using glyphs from our codebook (a threshold of 0.85) are aesthetically comparable to the paragraphs in standard glyphs, whereas using glyphs selected by a lower perceptual threshold significantly depreciates the typeface aesthetics.

We note that in order to identify untrustworthy raters, in both user studies we include four control questions in each task assigned to the raters, in a way similar to those described in §6.

Figure 15. User Study B. We conduct user studies for four different fonts simultaneously, and learn the scores of aesthetics when different types of glyphs are used: Green bars indicate the original glyphs, orange bars indicate the perturbed glyphs from our codebook, and purple bars indicate the perturbed glyphs chosen by a lower perceptual threshold (i.e., 0.7).

9. Conclusion and Future Work

We have introduced a new technique for embedding additional information in text documents. Provided a user-specified message, our method assigns each text letter an integer and embeds the integer by perturbing the glyph of each letter according to a precomputed codebook. Our method is able to correct a certain number of errors in the decoding stage, through a new error-correction coding scheme built on two algorithmic components: the Chinese Remainder coding and the maximum likelihood decoding. We have demonstrated our method with four applications, including text document metadata storage, unobtrusive optical codes on text, symmetric-key encryption, and format-independent digital signatures.

Currently we only consider standard fonts such as regular Times New Roman but not their variants such as Times New Roman Bold Italic. We can treat those variants separately and include their perturbed glyphs in the codebook. Then, our method will work with those font variants.

When extracting messages from a rasterized text document, we rely on the OCR library to detect and recognize the characters, but we cannot recover any OCR detection error. If a character is mistakenly recognized by the OCR, the integer embedded in that character is lost, and our error-correction decoding may not be able to recover the plain message since different characters have different embedding capacities. However, in our experiments, the OCR library always recognizes characters correctly.

If a part of the text is completely occluded from the camera, our decoding algorithm will fail, as it needs to know how the text is split into blocks. In the future, we hope to improve our coding scheme so that it is able to recover from missing letters in the text as well.

In general, the idea of perturbing the glyphs for embedding messages can be applied to any symbolic system, such as other languages, mathematical equations, and music notes. It would be interesting to explore similar embedding methods and their applications for different languages and symbols.

Lastly, while our method is robust to format conversion, rasterization, as well as photograph and scan of printed papers, it suffers from the same drawback that almost all text steganographic methods have: if a text document is completely retyped, the embedded information is destroyed.


  • Abadi et al. [2015] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. (2015). Software available from
  • Adobe [2001] A Adobe. 2001. Manager’s Introduction to Adobe eXtensible Metadata Platform. The Adobe XML Metadata Framework (2001).
  • Agarwal [2013] Monika Agarwal. 2013. Text steganographic approaches: a comparison. International Journal of Network Security & Its Applications (IJNSA) 5, 1 (2013).
  • Alattar and Alattar [2004] Adnan M Alattar and Osama M Alattar. 2004. Watermarking electronic text documents containing justified paragraphs and irregular line spacing. In Electronic Imaging 2004. International Society for Optics and Photonics, 685–695.
  • Avilés-Cruz et al. [2005] Carlos Avilés-Cruz, Risto Rangel-Kuoppa, Mario Reyes-Ayala, A Andrade-Gonzalez, and Rafael Escarela-Perez. 2005. High-order statistical texture analysis: font recognition applied. Pattern Recognition Letters 26, 2 (2005), 135–145.
  • Bhaya et al. [2013] Wesam Bhaya, Abdul Monem Rahma, and AL-Nasrawi Dhamyaa. 2013. Text steganography based on font type in ms-word documents. Journal of Computer Science 9, 7 (2013), 898.
  • Bloom et al. [1999] Jeffrey A Bloom, Ingemar J Cox, Ton Kalker, J-PMG Linnartz, Matthew L Miller, and C Brendan S Traw. 1999. Copy protection for DVD video. Proc. IEEE 87, 7 (1999), 1267–1276.
  • Boneh [2000] Dan Boneh. 2000. Finding smooth integers in short intervals using CRT decoding. In Proceedings of the 32nd ACM Symposium on Theory of Computing. 265–272.
  • Bradley and Terry [1952] Ralph Allan Bradley and Milton E Terry. 1952. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika 39, 3/4 (1952), 324–345.
  • Brassil et al. [1995] Jack T Brassil, Steven Low, Nicholas F. Maxemchuk, and Lawrence O’Gorman. 1995. Electronic marking and identification techniques to discourage document copying. IEEE Journal on Selected Areas in Communications 13, 8 (1995), 1495–1504.
  • Campbell and Kautz [2014] Neill D. F. Campbell and Jan Kautz. 2014. Learning a Manifold of Fonts. ACM Trans. Graph. 33, 4 (July 2014), 91:1–91:11.
  • Chaudhary et al. [2016] Sunita Chaudhary, Meenu Dave, and Amit Sanghi. 2016. Text Steganography Based on Feature Coding Method. In Proceedings of the International Conference on Advances in Information Communication Technology & Computing. ACM, 7.
  • Chen et al. [2014] Guang Chen, Jianchao Yang, Hailin Jin, Jonathan Brandt, Eli Shechtman, Aseem Agarwala, and Tony X Han. 2014. Large-scale visual font recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 3598–3605.
  • Chollet et al. [2015] François Chollet et al. 2015. Keras. (2015).
  • Denso [2011] ADC Denso. 2011. Qr code essentials. Denso Wave 900 (2011).
  • Eastlake and Jones [2001] D. Eastlake, 3rd and P. Jones. 2001. US Secure Hash Algorithm 1 (SHA1). (2001).
  • Ferguson and Schneier [2003] Niels Ferguson and Bruce Schneier. 2003. Practical cryptography. Vol. 23. Wiley New York.
  • Goldreich et al. [1999] Oded Goldreich, Dana Ron, and Madhu Sudan. 1999. Chinese remaindering with errors. In Proceedings of the 31st ACM symposium on Theory of computing. 225–234.
  • Greenberg [2005] Jane Greenberg. 2005. Understanding metadata and metadata schemes. Cataloging & classification quarterly 40, 3-4 (2005), 17–36.
  • Gutub and Fattani [2007] Adnan Gutub and Manal Fattani. 2007. A novel Arabic text steganography method using letter points and extensions. World Academy of Science, Engineering and Technology 27 (2007), 28–31.
  • Hu and Hersch [2001] Changyuan Hu and Roger D Hersch. 2001. Parameterizable fonts based on shape components. IEEE Computer Graphics and Applications 21, 3 (2001), 70–85.
  • Jo et al. [2016] Kensei Jo, Mohit Gupta, and Shree K. Nayar. 2016. DisCo: Display-Camera Communication Using Rolling Shutter Sensors. ACM Trans. Graph. 35, 5 (July 2016).
  • Jung et al. [1999] Min-Chul Jung, Yong-Chul Shin, and Sargur N Srihari. 1999. Multifont classification using typographical attributes. In Document Analysis and Recognition, 1999. ICDAR’99. Proceedings of the Fifth International Conference on. IEEE, 353–356.
  • Karp [1972] Richard M Karp. 1972. Reducibility among combinatorial problems. In Complexity of computer computations. Springer, 85–103.
  • Katz and Imhausen [2007] Victor J Katz and Annette Imhausen. 2007. The Mathematics of Egypt, Mesopotamia, China, India, and Islam: A Sourcebook. Princeton University Press.
  • Kim et al. [2003] Young-Won Kim, Kyung-Ae Moon, and Il-Seok Oh. 2003. A Text Watermarking Algorithm based on Word Classification and Inter-word Space Statistics.. In ICDAR. 775–779.
  • Kingma and Ba [2015] Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR) (2015).
  • Knuth [1986] DE Knuth. 1986. The METAFONT book. Addison-Welsey. Reading Mass (1986).
  • Konc and Janezic [2007] Janez Konc and Dušanka Janezic. 2007. An improved branch and bound algorithm for the maximum clique problem. proteins 4, 5 (2007).
  • Lau [2009] Vincent M. K. Lau. 2009. Learning by Example for Parametric Font Design. In ACM SIGGRAPH ASIA 2009 Posters (SIGGRAPH ASIA ’09). 5:1–5:1.
  • Lin and Costello [2004] Shu Lin and Daniel J Costello. 2004. Error control coding. Pearson Education India.
  • Loviscach [2010] Jörn Loviscach. 2010. The universe of fonts, charted by machine. In ACM SIGGRAPH 2010 Talks. ACM, 27.
  • O’Donovan et al. [2014] Peter O’Donovan, Jānis Lībeks, Aseem Agarwala, and Aaron Hertzmann. 2014. Exploratory Font Selection Using Crowdsourced Attributes. ACM Trans. Graph. 33, 4 (July 2014), 92:1–92:9.
  • Otsu [1975] Nobuyuki Otsu. 1975. A threshold selection method from gray-level histograms. Automatica 11, 285-296 (1975), 23–27.
  • Panda et al. [2015] Jeebananda Panda, Nishant Gupta, Parag Saxena, Shubham Agrawal, Surabhi Jain, and Asok Bhattacharyya. 2015. Text Watermarking using Sinusoidal Greyscale Variations of Font based on Alphabet Count. (2015).
  • Ramanathan et al. [2009] R Ramanathan, KP Soman, L Thaneshwaran, V Viknesh, T Arunkumar, and P Yuvaraj. 2009. A novel technique for English font recognition using support vector machines. In Advances in Recent Technologies in Communication and Computing, 2009. ARTCom'09. International Conference on. IEEE, 766–769.
  • Reed and Solomon [1960] Irving S Reed and Gustave Solomon. 1960. Polynomial codes over certain finite fields. Journal of the society for industrial and applied mathematics 8, 2 (1960), 300–304.
  • Rivest [1992] R. Rivest. 1992. The MD5 Message-Digest Algorithm. (1992).
  • Rivest et al. [1978] Ronald L Rivest, Adi Shamir, and Leonard Adleman. 1978. A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM 21, 2 (1978), 120–126.
  • Rizzo et al. [2016] Stefano Giovanni Rizzo, Flavio Bertini, and Danilo Montesi. 2016. Content-preserving Text Watermarking through Unicode Homoglyph Substitution. In Proceedings of the 20th International Database Engineering & Applications Symposium. ACM, 97–104.
  • Rosen [2011] Kenneth H Rosen. 2011. Elementary number theory. Pearson Education.
  • Ruggles [1983] Lynn Ruggles. 1983. Letterform design systems. Technical Report. Stanford University.
  • Seress [2003] Ákos Seress. 2003. Permutation group algorithms. Vol. 152. Cambridge University Press.
  • Shamir and Rappoport [1998] Ariel Shamir and Ari Rappoport. 1998. Feature-based design of fonts using constraints. In Electronic Publishing, Artistic Imaging, and Digital Typography. Springer, 93–108.
  • Shannon [1948] Claude Elwood Shannon. 1948. A mathematical theory of communication. The Bell System Technical Journal 27 (1948), 379–423, 623–656.
  • Smith [2007] R. Smith. 2007. An Overview of the Tesseract OCR Engine. In Proceedings of the Ninth International Conference on Document Analysis and Recognition - Volume 02 (ICDAR ’07). IEEE Computer Society, Washington, DC, USA, 629–633.
  • Stojanov et al. [2014] Ivan Stojanov, Aleksandra Mileva, and Igor Stojanovic. 2014. A New Property Coding in Text Steganography of Microsoft Word Documents. (2014).
  • Suveeranont and Igarashi [2010] Rapee Suveeranont and Takeo Igarashi. 2010. Example-based automatic font generation. In International Symposium on Smart Graphics. Springer, 127–138.
  • Um et al. [2017] Kiwon Um, Xiangyu Hu, and Nils Thuerey. 2017. Perceptual Evaluation of Liquid Simulation Methods. ACM Trans. Graph. 36, 4 (2017).
  • Wang et al. [2015] Zhangyang Wang, Jianchao Yang, Hailin Jin, Eli Shechtman, Aseem Agarwala, Jonathan Brandt, and Thomas S. Huang. 2015. DeepFont: Identify Your Font from An Image. In Proceedings of the 23rd ACM International Conference on Multimedia (MM ’15). ACM, New York, NY, USA, 451–459.
  • Wayner [1992] Peter Wayner. 1992. Mimic functions. Cryptologia 16, 3 (1992), 193–214.
  • Wayner [2009] Peter Wayner. 2009. Disappearing cryptography: information hiding: steganography & watermarking. Morgan Kaufmann.