1 Introduction
Scholars interested in understanding the production and provenance of historical documents rely on methods of analysis ranging from the study of orthographic differences and stylometrics to visual analysis of layout, font, and printed characters. Recently developed tools like Ocular (Berg-Kirkpatrick et al., 2013) for OCR of historical documents have helped automate and scale some textual analysis methods for tasks like compositor attribution (Ryskina et al., 2017) and digitization of historical documents (Garrette et al., 2015). However, researchers often need to go beyond textual analysis to establish the provenance of historical documents. For example, Hinman's (1963) study of typesetting in Shakespeare's First Folio relied on the discovery of pieces of damaged or distinctive type through manual inspection of every glyph in the document. More recently, Warren et al. (2020) examine pieces of distinctive type across several printers of the early modern period to posit the identity of the clandestine printers of John Milton's Areopagitica (1644). In such work, researchers frequently aim to determine whether a book was produced by a single printer or by multiple printers (Weiss, 1992; Malcolm, 2014; Takano, 2016). Hence, to aid these visual methods of analysis, we propose a novel probabilistic generative model for analyzing extracted images of individual printed characters in historical documents. We draw from work on both deep generative modeling and interpretable models of the printing press to develop an approach that is both flexible and controllable; the latter is a critical requirement for such analysis tools.
As depicted in Figure 1, we are interested in identifying clusters of subtly distinctive glyph shapes, as these correspond to distinct metal stamps in the typecases used by printers. However, other sources of variation (inking, for example, also depicted in Figure 1) are likely to dominate conventional clustering methods. Powerful models like the variational autoencoder (VAE) (Kingma and Welling, 2014) capture the more visually salient variance in inking rather than typeface, while more rigid models (e.g., the emission model of Ocular (Berg-Kirkpatrick et al., 2013)) fail to fit the data. The goal of our approach is to account for these confounding sources of variance while isolating the variables pertinent to clustering.
Hence, we propose a generative clustering model that introduces a neural editing process to add expressivity, but includes interpretable latent variables that model well-understood variance in the printing process: biaxial translation, shear, and rotation of canonical type shapes. To make our model controllable, and to prevent deep latent variables from explaining all variance in the data, we introduce a restricted inference network. By only allowing the inference network to observe the visual residual of the observation after interpretable modifications have been applied, we bias the posterior approximation on the neural editor (and thus the model itself) to capture residual sources of variance in the editor, for example, inking levels, ink bleeds, and imaging noise. This approach is related to recently introduced neural editor models for text generation (Guu et al., 2018).
In experiments, we compare our model with rigid interpretable models (Ocular) and powerful generative models (VAE) at the task of unsupervised clustering of subtly distinct typefaces in scanned images of early modern documents sourced from Early English Books Online (EEBO).
2 Model
Our model reasons about the printed appearances of a symbol (say, majuscule F) in a document via a mixture model whose components correspond to the different metal stamps used by the printer of the document. During the various stages of printing, random transformations result in varying printed manifestations of a metal cast on the paper. Figure 2 depicts our model. We denote an observed image of an extracted character by X. We denote the choice of typeface by the latent variable z (the mixture component) with prior π. We represent the shape of the z-th stamp by a template T_z, a square matrix of parameters. We denote the interpretable latent variables corresponding to spatial adjustment of the metal stamp by Φ, and the editor latent variable responsible for residual sources of variation by ε. As illustrated in Fig. 2, after a cluster component z is selected, the corresponding template T_z undergoes a transformation to yield a final template T̃. This transformation occurs in two stages: first, the interpretable spatial adjustment variables Φ produce an adjusted template T_z^Φ (Section 2.1), and then the neural latent variable ε transforms the adjusted template into T̃ (Section 2.2).
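The two-stage generative story above can be sketched concretely. In the numpy snippet below, all names are hypothetical, the spatial adjustment and the editor are passed in as stub functions, and the shapes are toy-sized; it illustrates the sampling process, not the actual implementation:

```python
import numpy as np

def sample_image(templates, mixture_weights, adjust_fn, edit_fn, rng):
    """One draw from the generative story: choose a stamp z, spatially adjust
    its template with Phi, edit it with epsilon, then emit binary pixels from
    independent per-pixel Bernoulli distributions."""
    z = rng.choice(len(templates), p=mixture_weights)    # mixture component
    phi = rng.normal(0.0, 0.1, size=6)                   # rotation/offset/shear/scale
    eps = rng.normal(0.0, 1.0, size=8)                   # residual editor variable
    adjusted = adjust_fn(templates[z], phi)              # T_z -> adjusted template
    probs = np.clip(edit_fn(adjusted, eps), 0.0, 1.0)    # adjusted -> final template
    return z, rng.binomial(1, probs)                     # observed binary image X

# Toy usage: two flat 8x8 "templates" and identity adjust/edit functions.
rng = np.random.default_rng(0)
templates = [np.full((8, 8), 0.2), np.full((8, 8), 0.8)]
z, img = sample_image(templates, [0.5, 0.5],
                      lambda t, phi: t, lambda t, eps: t, rng)
```

In the full model, `adjust_fn` would be the differentiable spatial warp of Section 2.1 and `edit_fn` the neural editor of Section 2.2.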
The marginal probability under our model can be written as

  p(X) = Σ_z π_z ∫_Φ ∫_ε p(X | T̃_{z,Φ,ε}) p(Φ) p(ε) dε dΦ,

where p(X | T̃) refers to the distribution over the binary pixels of X, in which each pixel has a Bernoulli distribution parametrized by the value of the corresponding pixel entry in T̃.
2.1 Interpretable spatial adjustment
Early typesetting was noisy, and the metal pieces were often arranged with slight variations, which resulted in the printed characters being positioned with small amounts of offset, rotation, and shear. These real-valued spatial adjustment variables are denoted by Φ, where r represents the rotation variable, (o_h, o_v) represent offsets along the horizontal and vertical axes, and (s_h, s_v) denote shear along the two axes. A scale factor, a, accounts for minor scale variations arising from the archiving and extraction processes. All variables in Φ are generated from a Gaussian prior with zero mean and fixed variance, as the transformations due to these variables tend to be subtle.
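As a concrete illustration of applying such subtle adjustments differentiably, the sketch below warps a template with per-output-pixel Gaussian attention maps whose means come from an affine transform. The function name, the 2x3 affine parametrization, and the bandwidth are our assumptions, not the paper's exact construction:

```python
import numpy as np

def gaussian_attention_warp(template, theta, sigma=0.7):
    """Warp `template` so that each output pixel is a Gaussian-weighted average
    of template pixels, centered at the pixel's affine-transformed source
    coordinate. `theta` is a 2x3 matrix mapping output (x, y) to template (x, y)."""
    H, W = template.shape
    ys, xs = np.mgrid[0:H, 0:W].astype(float)
    src_x = theta[0, 0] * xs + theta[0, 1] * ys + theta[0, 2]
    src_y = theta[1, 0] * xs + theta[1, 1] * ys + theta[1, 2]
    ty, tx = np.mgrid[0:H, 0:W].astype(float)     # template pixel grid
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            # Template-sized attention map for output pixel (i, j): everything
            # here is smooth in theta, so gradients flow through the warp.
            w = np.exp(-((tx - src_x[i, j]) ** 2 + (ty - src_y[i, j]) ** 2)
                       / (2.0 * sigma ** 2))
            out[i, j] = (w * template).sum() / (w.sum() + 1e-8)
    return out

# A pure translation by +2 pixels along x (all other terms identity).
shift_x = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 0.0]])
```

Because every operation is smooth in `theta`, the same construction supports gradient-based learning of the spatial variables.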
To incorporate these deterministic transformations in a differentiable manner, we map Φ to a template-sized attention map for each output pixel position in the adjusted template, as depicted in Figure 3. The attention map for each output pixel is formed so as to attend to the correspondingly shifted (or scaled, or sheared) portion of the input template, and is shaped according to a Gaussian distribution whose mean is determined by an affine transform. This approach provides a strong inductive bias, in contrast with related work on spatial-VAE (Bepler et al., 2019), which learns arbitrary transformations.
2.2 Residual sources of variation
Apart from spatial perturbations, other major sources of deviation in early printing include random inking perturbations caused by inconsistent application of the stamps, unpredictable ink bleeds, and noise associated with the digital archiving of the documents. Unlike spatial perturbations, which can be handled by deterministic affine transformation operators, these sources do not admit an analytically defined transformation operator. Hence we introduce a non-interpretable real-valued latent vector ε, with a Gaussian prior, that transforms the adjusted template T_z^Φ into a final template T̃ via a neurally parametrized function with neural network parameters θ. This function is a convolution over T_z^Φ whose kernel is parametrized by ε, followed by nonlinear operations. Intuitively, parametrizing the filter by ε leads this latent variable to appropriately account for variations like inking, because convolution filters capture local variations in appearance. Srivatsan et al. (2019) also observed the effectiveness of using a latent vector to define a deconvolutional kernel for font generation.
2.3 Learning and Inference
Our aim is to maximize the log likelihood of the observed data, a collection of images X^(1), …, X^(N), with respect to the model parameters, the templates T and the neural network parameters θ:

  L(T, θ) = Σ_i log Σ_z π_z ∫_Φ ∫_ε p(X^(i), Φ, ε, z; T, θ) dε dΦ

During training, we maximize the likelihood with respect to Φ instead of marginalizing over it, an approximation inspired by iterated conditional modes (Besag, 1986):

  L(T, θ) ≈ Σ_i log Σ_z π_z max_Φ ∫_ε p(X^(i), Φ, ε, z; T, θ) dε

However, marginalizing over ε remains intractable. Therefore, we perform amortized variational inference to define and maximize a lower bound on the above objective (Kingma and Welling, 2014). We use a convolutional inference network, parametrized by λ (Fig. 4), that takes as input the mixture component z and the residual image X − T_z^Φ, and produces the mean and variance parameters of an isotropic Gaussian proposal distribution q_λ(ε | z, X). This results in the final training objective, in which the intractable integral over ε is replaced by the corresponding evidence lower bound:

  Σ_i log Σ_z π_z max_Φ exp( E_{q_λ(ε | z, X^(i))}[ log p(X^(i), Φ, ε, z; T, θ) − log q_λ(ε | z, X^(i)) ] )

We use stochastic gradient ascent to maximize this objective with respect to T, θ, and λ.
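To make the pieces concrete, here is a minimal numpy sketch, with hypothetical names and toy shapes throughout: an editor that maps ε through a linear layer to a convolution kernel applied to the adjusted template (Section 2.2), and a single-image Monte-Carlo estimate of the evidence lower bound using a reparameterized Gaussian posterior (Section 2.3):

```python
import numpy as np

def edit_template(adjusted, eps, W_k, b_k, ksize=3):
    """Toy editor: a linear map turns eps into a ksize x ksize convolution
    kernel, which is convolved with the adjusted template; a sigmoid keeps the
    outputs in (0, 1) so they can serve as Bernoulli pixel parameters."""
    kernel = (W_k @ eps + b_k).reshape(ksize, ksize)
    H, W = adjusted.shape
    pad = ksize // 2
    padded = np.pad(adjusted, pad)
    out = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = (padded[i:i + ksize, j:j + ksize] * kernel).sum()
    return 1.0 / (1.0 + np.exp(-out))

def elbo_estimate(x, mu, logvar, decode, rng, n_samples=8):
    """Monte-Carlo ELBO for one image: E_q[log p(x | eps)] - KL(q || N(0, I)),
    with q = N(mu, diag(exp(logvar))) sampled via the reparameterization trick."""
    kl = 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)
    ll = 0.0
    for _ in range(n_samples):
        eps = mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)
        p = np.clip(decode(eps), 1e-6, 1 - 1e-6)
        ll += np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    return ll / n_samples - kl
```

In the full model this bound would be combined with the explicit sum over mixture components and the maximization over Φ; the sketch covers only the inner, ε-specific term.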
3 Experiments
We train our models on printed occurrences of 10 different uppercase character classes that scholars have found useful for bibliographic analysis because of their distinctiveness (Warren et al., 2020). As a preprocessing step, we ran Ocular (Berg-Kirkpatrick et al., 2013) on the grayscale scanned images of historical books in the EEBO dataset and extracted the estimated image segments for the letters of interest.
3.1 Quantitative analysis
We show that our model is superior to strong baselines at clustering subtly distinct typefaces (using realistic synthetic data), as well as at fitting real data from historical books.
3.1.1 Baselines for comparison
Ocular: Based on the emission model of Ocular, which uses discrete latent variables for the vertical/horizontal offset and inking variables, and hence has limited expressivity.
Φ-only: This model has only the interpretable continuous latent variables Φ pertaining to spatial adjustment.
VAE-only: This model is expressive but does not have any interpretable latent variables for explicit control. It is an extension of Kingma et al.'s (2014) model for semi-supervised learning with a continuous latent variable vector, in which we obtain tighter bounds by marginalizing over the cluster identities explicitly. For a fair comparison, the encoder and decoder convolutional architectures are the same as those in our full model. The corresponding training objective for this baseline can be written as:

  Σ_i log Σ_z π_z exp( E_{q_λ(ε | z, X^(i))}[ log p(X^(i), ε, z) − log q_λ(ε | z, X^(i)) ] )
No-residual: The only difference from the full model is that the encoder for the inference network conditions the variational distribution on the entire input image X instead of just the residual image X − T_z^Φ.
3.1.2 Font discovery in Synthetic Data
Early modern books were frequently composed from two or more type cases, resulting in documents with mixed fonts. We aim to learn the different shapes of the metal stamps that were used, as the templates for each cluster component in our model.
Data: To quantitatively evaluate our model's performance, we experiment with a synthetically generated, realistic dataset for which we know the ground-truth cluster identities, constructed in the following manner. For each character of interest, we pick three distinct images from scanned, segmented EEBO images, corresponding to three different metal casts. We then randomly add spatial perturbations related to scale, offset, rotation, and shear. To incorporate varying inking levels and other distortions, we randomly perform erosion, dilation, or a combination of these warpings using OpenCV (Bradski, 2000) with randomly selected kernel sizes. Finally, we add small Gaussian noise to the pixel intensities, and generate 300 perturbed examples per character class.
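A sketch of this perturbation pipeline, using hand-rolled max/min filters in place of OpenCV's dilate/erode so the snippet stays self-contained; the kernel sizes and noise scale are illustrative, not the values used in the paper:

```python
import numpy as np

def _window_stack(img, k, pad_value):
    """Stack all k x k shifted views of img, padded with pad_value."""
    pad = k // 2
    p = np.pad(img, pad, constant_values=pad_value)
    H, W = img.shape
    return np.stack([p[i:i + H, j:j + W] for i in range(k) for j in range(k)])

def dilate(img, k=3):
    """Max filter: thickens strokes, imitating heavier inking / ink bleed."""
    return np.max(_window_stack(img, k, 0.0), axis=0)

def erode(img, k=3):
    """Min filter: thins strokes, imitating lighter inking."""
    return np.min(_window_stack(img, k, 1.0), axis=0)

def perturb(img, rng):
    """Randomly erode, dilate, or both, then add small Gaussian pixel noise."""
    k = rng.choice([3, 5])
    op = rng.choice(["erode", "dilate", "both"])
    if op == "erode":
        out = erode(img, k)
    elif op == "dilate":
        out = dilate(img, k)
    else:
        out = erode(dilate(img, k), k)
    out = out + rng.normal(0.0, 0.02, size=img.shape)
    return np.clip(out, 0.0, 1.0)
```

Spatial perturbations (offset, rotation, shear, scale) would be applied in the same pipeline with a standard affine warp before the morphological step.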
Results:
We report macro-averaged results across all the character classes on three different clustering measures: V-measure (Rosenberg and Hirschberg, 2007), Mutual Information, and the Fowlkes-Mallows Index (Fowlkes and Mallows, 1983). In Table 1, we see that our model significantly outperforms all other baselines on every metric. The Ocular and Φ-only models fail because they lack the expressiveness to explain the variations due to random jitters, erosions, and dilations. The VAE-only model, while very expressive, performs poorly because it lacks the inductive bias needed for successful clustering. The No-residual model performs decently, but our model's superior performance emphasizes the importance of designing a restrictive inference network such that ε does not explain any variation due to the interpretable variables.
Model  V-measure  Mutual Info  F&M  NLL
Ocular  0.42  0.45  0.61  379.21
Φ-only  0.49  0.51  0.70  322.04
VAE-only  0.22  0.29  0.38  263.45
No-residual  0.54  0.58  0.73  264.27
Our Model  0.73  0.74  0.85  257.92
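For reference, all three measures can be computed from the contingency table of true and predicted cluster labels. The following self-contained implementation (natural-log mutual information) is a sketch of the standard definitions, not the paper's evaluation code:

```python
import numpy as np

def contingency(y_true, y_pred):
    """Counts n[i, j] of items with true label i assigned to cluster j."""
    ts, t_idx = np.unique(y_true, return_inverse=True)
    ps, p_idx = np.unique(y_pred, return_inverse=True)
    n = np.zeros((len(ts), len(ps)))
    for i, j in zip(t_idx, p_idx):
        n[i, j] += 1
    return n

def mutual_info(y_true, y_pred):
    n = contingency(y_true, y_pred)
    N = n.sum()
    a, b = n.sum(axis=1), n.sum(axis=0)          # marginal counts
    nz = n > 0
    return np.sum(n[nz] / N * np.log(N * n[nz] / np.outer(a, b)[nz]))

def v_measure(y_true, y_pred):
    """Harmonic mean of homogeneity MI/H(true) and completeness MI/H(pred)."""
    def entropy(labels):
        _, c = np.unique(labels, return_counts=True)
        p = c / c.sum()
        return -np.sum(p * np.log(p))
    mi = mutual_info(y_true, y_pred)
    h, c = mi / entropy(y_true), mi / entropy(y_pred)
    return 2 * h * c / (h + c) if h + c > 0 else 0.0

def fowlkes_mallows(y_true, y_pred):
    """Pairwise-counting measure: TP / sqrt((TP + FP)(TP + FN))."""
    n = contingency(y_true, y_pred)
    pairs = lambda v: np.sum(v * (v - 1) / 2)
    tp = pairs(n.ravel())
    return tp / np.sqrt(pairs(n.sum(axis=1)) * pairs(n.sum(axis=0)))
```

A perfect clustering scores 1.0 on V-measure and Fowlkes-Mallows, and reaches the label entropy in mutual information.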
3.1.3 Fitting Real Data from Historical Books
For the analysis of real books, we selected three books from the EEBO dataset printed by different printers. We modeled each character class for each book separately and report the macro-averaged upper bounds on the negative log likelihood (NLL) in Table 1. We observe that adding a small amount of expressiveness makes our Φ-only model better than Ocular. The upper bounds of the other inference-network-based models are much better than the (likely tight) bounds of both interpretable models.^1 Our model has the lowest upper bound of all the models while retaining interpretability and control.

^1 For the Ocular and Φ-only models, we report the upper bound obtained via maximization over the interpretable latent variables. Intuitively, these latent variables are likely to have unimodal posterior distributions with low variance, hence this approximation is likely tight.
3.2 Qualitative analysis
We provide visual evidence of the desirable behavior of our model on collections of character extractions from historical books with mixed fonts. Specifically, we discuss the performance of our model on the mysterious edition of Thomas Hobbes' Leviathan known as the "25 Ornaments" edition (Hobbes, 1651 [really 1700?]). The 25 Ornaments Leviathan is an interesting test case for several reasons. While its title page indicates a publisher and year of publication, both are fabricated (Malcolm, 2014). The identities of its printer(s) remain speculative, and the actual year of publication is uncertain. Further, the 25 Ornaments exhibits two distinct fonts.
3.2.1 Quality of learned templates
Our model is successful in discovering distinctly shaped typefaces in the 25 Ornaments Leviathan. We focus on a case study of the majuscule letters F and R, each of which has two different typefaces mixed throughout. The two typefaces for F differ in the length of the middle arm (Fig. 1), and the two typefaces for R have differently shaped legs. In Fig. 5, we show that our model successfully learns the two desired templates for each of these characters, which indicates that the clusters in our model mainly focus on subtle differences in the underlying glyph shapes. We also illustrate how the latent variables transform the model templates into final templates T̃ for four example F images. The model learns complex functions to transform the templates, going beyond simple affine and morphological transformations in order to account for inking differences, random jitter, contrast variations, etc.
3.2.2 Interpretable variables (Φ) and Control
Finally, we visualize the ability of our model to appropriately separate responsibility for modeling variation between the interpretable and non-interpretable variables. We use the inferred values of the interpretable variables Φ for each image in the dataset to adjust the corresponding image. Since the templates represent the canonical shapes of the letters, the Φ variables, which shift the templates to explain the images, can be reverse-applied to the input images themselves in order to align them, accounting for offset, rotation, shear, and minor size variations. In Fig. 6, we see that the input images (top row) are uneven and vary in size and orientation. By reverse-applying the inferred Φ values, we are able to project the images onto a fixed size such that they are aligned and any remaining variation in the data is caused by other sources. Moreover, this alignment method could be crucial for automating aspects of bibliographic studies that focus on comparing specific imprints.
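A sketch of this reverse application: the inferred Φ is composed into a 2x2 linear part plus a translation, and each canonical-frame pixel is pulled from its transformed location in the observed image with nearest-neighbor sampling. The parametrization is our assumption; the model itself uses the differentiable attention warp:

```python
import numpy as np

def phi_to_affine(rot, off_x, off_y, shear_x, shear_y, scale):
    """Compose scale, shear, and rotation into a linear map A plus offset t."""
    c, s = np.cos(rot), np.sin(rot)
    R = np.array([[c, -s], [s, c]])
    S = np.array([[1.0, shear_x], [shear_y, 1.0]])
    return scale * (R @ S), np.array([off_x, off_y])

def reverse_apply(img, A, t):
    """Align an observed image back to the canonical template frame: each
    canonical pixel samples the observed image at its transformed coordinate,
    undoing the inferred spatial adjustment (nearest-neighbor sampling)."""
    H, W = img.shape
    out = np.zeros_like(img)
    for i in range(H):
        for j in range(W):
            x, y = A @ np.array([j, i], dtype=float) + t
            xi, yi = int(round(x)), int(round(y))
            if 0 <= yi < H and 0 <= xi < W:
                out[i, j] = img[yi, xi]
    return out
```

Running this on every extraction with its own inferred Φ projects all the images onto a common, fixed-size frame, which is what makes side-by-side imprint comparison possible.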
4 Conclusion
Beyond applications to typeface clustering, the general approach we take might apply more broadly to other clustering problems, and the model we developed might be incorporated into OCR models for historical text.
5 Acknowledgements
This project is funded in part by the NSF under grants 1618044 and 1936155, and by the NEH under grant HAA25604417.
References
 Bepler et al. (2019) Tristan Bepler, Ellen Zhong, Kotaro Kelley, Edward Brignole, and Bonnie Berger. 2019. Explicitly disentangling image content from translation and rotation with spatial-VAE. In Advances in Neural Information Processing Systems, pages 15409–15419.
 Berg-Kirkpatrick et al. (2013) Taylor Berg-Kirkpatrick, Greg Durrett, and Dan Klein. 2013. Unsupervised transcription of historical documents. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 207–217, Sofia, Bulgaria. Association for Computational Linguistics.
 Besag (1986) Julian Besag. 1986. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society: Series B (Methodological), 48(3):259–279.
 Bradski (2000) G. Bradski. 2000. The OpenCV Library. Dr. Dobb’s Journal of Software Tools.

 Fowlkes and Mallows (1983) Edward B. Fowlkes and Colin L. Mallows. 1983. A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383):553–569.
 Garrette et al. (2015) Dan Garrette, Hannah Alpert-Abrams, Taylor Berg-Kirkpatrick, and Dan Klein. 2015. Unsupervised code-switching for multilingual historical document transcription. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1036–1041, Denver, Colorado. Association for Computational Linguistics.
 Guu et al. (2018) Kelvin Guu, Tatsunori B Hashimoto, Yonatan Oren, and Percy Liang. 2018. Generating sentences by editing prototypes. Transactions of the Association for Computational Linguistics, 6:437–450.
 Hinman (1963) Charlton Hinman. 1963. The printing and proofreading of the first folio of Shakespeare, volume 1. Oxford: Clarendon Press.
 Hobbes (1651 [really 1700?]) Thomas Hobbes. 1651 [really 1700?]. Leviathan, or, the matter, form, and power of a commonwealth ecclesiastical and civil. By Thomas Hobbes of Malmesbury. Number R13935 in ESTC. [false imprint] printed for Andrew Crooke, at the Green Dragon in St. Pauls Churchyard, London.
 Kingma and Welling (2014) Diederik P. Kingma and Max Welling. 2014. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14–16, 2014, Conference Track Proceedings.
 Kingma et al. (2014) Durk P. Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 2014. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems, pages 3581–3589.
 Malcolm (2014) Noel Malcolm. 2014. Editorial Introduction. In Leviathan, volume 1. Clarendon Press, Oxford.

 Rosenberg and Hirschberg (2007) Andrew Rosenberg and Julia Hirschberg. 2007. V-measure: A conditional entropy-based external cluster evaluation measure. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pages 410–420.
 Ryskina et al. (2017) Maria Ryskina, Hannah Alpert-Abrams, Dan Garrette, and Taylor Berg-Kirkpatrick. 2017. Automatic compositor attribution in the First Folio of Shakespeare. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 411–416, Vancouver, Canada. Association for Computational Linguistics.
 Srivatsan et al. (2019) Nikita Srivatsan, Jonathan Barron, Dan Klein, and Taylor Berg-Kirkpatrick. 2019. A deep factorization of style and structure in fonts. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2195–2205, Hong Kong, China. Association for Computational Linguistics.
 Takano (2016) Akira Takano. 2016. Thomas Warren: A Printer of Leviathan (head edition). Annals of Nagoya University Library Studies, 13:1–17.
 Warren et al. (2020) Christopher N. Warren, Pierce Williams, Shruti Rijhwani, and Max G’Sell. 2020. Damaged type and Areopagitica’s clandestine printers. Milton Studies, 62.1.
 Weiss (1992) Adrian Weiss. 1992. Shared Printing, Printer’s Copy, and the Text(s) of Gascoigne’s ”A Hundreth Sundrie Flowres”. Studies in Bibliography, 45:71–104.
Appendix A Character-wise quantitative analysis
The quantitative experiments were performed on the following character classes: A, B, E, F, G, H, M, N, R, W.
Model  V-measure  Mutual Info  F&M  NLL
Φ-only  0.77  0.82  0.89  264.90
VAE-only  0.33  0.38  0.5  230.45
No-residual  0.79  0.85  0.90  231.45
Our Model  0.78  0.86  0.89  226.25

Model  V-measure  Mutual Info  F&M  NLL
Φ-only  0.37  0.39  0.59  261.1
VAE-only  0.15  0.2  0.32  229.1
No-residual  0.37  0.39  0.58  228.1
Our Model  0.68  0.73  0.81  226.25

Model  V-measure  Mutual Info  F&M  NLL
Φ-only  0.33  0.36  0.55  282.4
VAE-only  0.17  0.19  0.30  253.2
No-residual  0.33  0.35  0.56  251.45
Our Model  0.65  0.70  0.76  234.05

Model  V-measure  Mutual Info  F&M  NLL
Φ-only  0.09  0.10  0.55  258.40
VAE-only  0.03  0.05  0.31  218.2
No-residual  0.12  0.09  0.59  208.1
Our Model  0.81  0.56  0.94  204.48

Model  V-measure  Mutual Info  F&M  NLL
Φ-only  0.60  0.62  0.73  268.40
VAE-only  0.28  0.38  0.40  250.8
No-residual  0.64  0.66  0.77  244.5
Our Model  0.60  0.62  0.73  240.84

Model  V-measure  Mutual Info  F&M  NLL
Φ-only  0.72  0.71  0.79  313.75
VAE-only  0.32  0.32  0.40  254.2
No-residual  0.90  0.97  0.94  258.8
Our Model  0.92  1.01  0.96  249.81

Model  V-measure  Mutual Info  F&M  NLL
Φ-only  0.62  0.64  0.78  392.06
VAE-only  0.29  0.38  0.40  323.5
No-residual  0.70  0.83  0.74  329.25
Our Model  0.75  0.84  0.87  323.04

Model  V-measure  Mutual Info  F&M  NLL
Φ-only  0.65  0.70  0.73  331.6
VAE-only  0.30  0.45  0.40  265.2
No-residual  0.74  0.81  0.82  270.11
Our Model  0.69  0.75  0.75  264.23

Model  V-measure  Mutual Info  F&M  NLL
Φ-only  0.07  0.08  0.55  330.6
VAE-only  0.03  0.04  0.34  247.1
No-residual  0.06  0.07  0.53  251.32
Our Model  0.46  0.32  0.78  246.02

Model  V-measure  Mutual Info  F&M  NLL
Φ-only  0.65  0.71  0.79  418.01
VAE-only  0.31  0.45  0.42  364.2
No-residual  0.72  0.78  0.82  369.5
Our Model  0.72  0.79  0.84  364.21