Visual Font Pairing

11/19/2018 ∙ by Shuhui Jiang, et al. ∙ Northeastern University ∙ Adobe

This paper introduces the problem of automatic font pairing. Font pairing is an important design task that is difficult for novices. Given a font selection for one part of a document (e.g., header), our goal is to recommend a font to be used in another part (e.g., body) such that the two fonts used together look visually pleasing. There are three main challenges in font pairing. First, this is a fine-grained problem, in which the subtle distinctions between fonts may be important. Second, rules and conventions of font pairing given by human experts are difficult to formalize. Third, font pairing is an asymmetric problem in that the roles played by header and body fonts are not interchangeable. To address these challenges, we propose automatic font pairing through learning visual relationships from large-scale human-generated font pairs. We introduce a new database for font pairing constructed from millions of PDF documents available on the Internet. We propose two font pairing algorithms: dual-space k-NN and asymmetric similarity metric learning (ASML). These two methods automatically learn fine-grained relationships from large-scale data. We also investigate several baseline methods based on the rules from professional designers. Experiments and user studies demonstrate the effectiveness of our proposed dataset and methods.




1 Introduction

In the multimedia field, applying artificial intelligence to facilitate art and design has drawn a lot of attention recently, for example in automatic generation of visual-textual presentation layout [1], font recognition, selection and prediction [2, 3, 4, 5], and visual document analysis [6, 7]. Pairing fonts is an important task in graphic design for documents, posters, logos, advertisements and many other types of design. A designer typically picks a title font, sub-header fonts, and body text fonts, and does so in a way that is harmonious and appropriate for the style. Fonts should complement each other without “clashing” or appearing disconnected. For example, Figure 1 shows the same advertisement with two choices of font for the sub-header (“Continues”), and the same font for the header (“The Heritage”). Each pair of choices conveys a different design quality: in one case, the fonts complement each other and appear more interesting, whereas when they are nearly the same the layout is less appealing. Each pair of header and sub-header fonts represents a font pair. Despite the importance of font pairing, current design tools do not provide much assistance in the challenging task of pairing fonts, aside from providing a few template designs.

Design books and websites provide many rules-of-thumb for font pairing, such as “Use Fonts from the Same Family”, “Mix Serifs and Sans Serifs” and “Create Contrast”, but such rules can be hard to apply in practice or to formalize. Font pairing is especially challenging for novices creating designs, who may lack the intuitions for selecting fonts.

Fig. 1: Examples of PDF pages with different font pairings. The header fonts are the same, while the sub-header fonts differ. We show the same PDF page rendered with two choices of font for the sub-header (“Continues”), and the same font for the header (“The Heritage”).

This paper introduces the problem of automatic font pairing. Given a font selection in a document, our goal is to recommend matched fonts that produce a pleasing visual effect when they are used together in different parts of a document. For example, given a header font, recommend a body font, or vice versa. There are three main challenges in font pairing. First, this is a fine-grained problem, which means that subtle distinctions between fonts may be important, as opposed to object-level co-occurrence problems (e.g., sky and airplane) [8, 9, 10] or category-level pairing (e.g., tops and skirts [11, 12, 13, 14, 15, 16]). Second, designers have listed many rules for font pairing, but they are difficult to formalize. Font pairing is not simply a problem of similarity: designers typically pair contrasting fonts as well as similar fonts. Third, font pairing is an asymmetric problem. Pairing font A as header and font B as body is different from pairing font B as header and font A as body.

It should be noted that font pairing is a complex task: a good font combination is decided by many factors beyond the fonts themselves, e.g., text sentiment, layout and even personal taste. As a first attempt to attack this problem, the goal in this paper is to recommend font pairs that satisfy most users’ aesthetics based only on visual font information. We believe this is less challenging than also considering other elements of the design context. It is also a well-posed problem, since most tutorials and books [17] on this topic recommend font pairs in the same setting.

To address these challenges, we propose to learn font pairing from large-scale human-designed font pairs. However, obtaining appropriate data for this problem is challenging, as there is no existing dataset and the font pairs from Internet web pages are noisy or biased to a small set of popular fonts. We collected a new database called “FontPairing” from millions of PDF pages on the Internet. These PDFs embody a very diverse set of designs with font meta data embedded. We devise some heuristics to automatically identify header/sub-header and header/body pairs from the PDF pages with a high accuracy verified on a subset.

Given this data, we investigate two algorithms to learn font pairing: dual-space k-NN and asymmetric similarity metric learning. The intuition behind dual-space k-NN is that users may choose the same body font for similar header fonts, and vice versa. For example, if the font Univers is similar to Helvetica, then fonts that pair well with Univers should pair well with Helvetica. Given an input query font (e.g., header), we first find the k most similar fonts among the training header fonts. We then rank their corresponding body fonts by their appearance frequency in the training data. Appearance similarity is measured by a deep neural network trained for this task.

The goal of asymmetric similarity metric learning is to learn a distance function by which fonts that pair well have small distances to each other, and, conversely, mismatched fonts are far apart. The metric is discriminatively trained from our training font pairs. In particular, we jointly learn a model that bridges the asymmetric similarity and the distance metric. At test time, an online prediction entails finding the nearest font pairs in the dataset.

To the best of our knowledge, our work is the first to address the automatic font pairing problem. Since there is no prior work, we compare with several baseline methods (e.g., same font family, similarity, contrast) according to the rules provided by professional designers. Experimental results show the effectiveness of our visual-learning-based dual-space k-NN and asymmetric similarity metric learning methods.

2 Related Work

Fig. 2: Examples for font pair detection in freely-available PDFs. The left four columns show documents with detected header/body pairs, marked with red bounding boxes. The right four columns show examples of documents with detected header/sub-header pairs, with headers marked in red and sub-headers in blue.

To the best of our knowledge, our work is the first to address the font pairing problem, and there is very little related work.

In the multimedia and computer vision fields, several methods have been proposed for font recognition [18, 19, 3, 4] and font prediction [5] based on large datasets of fonts and their images. Wang et al. [3, 4] train deep neural networks for font recognition. Zhao et al. [5] proposed multi-task deep neural networks to jointly predict font face, color and size for each text element on a web design, by considering multi-scale visual features and semantic tags of the web design. Our work is also related to systems for learning to parse web pages, such as WebZeitGeist [20]. The most relevant to our work is by O’Donovan et al. [2], who present interfaces for finding fonts based on learned models of font style. However, their work focuses only on single fonts in isolation, whereas we consider how two fonts pair with each other. Font pairing is also related to visual document analysis (e.g., [6, 7]) and automatic generation of visual-textual presentation layout [1].

In terms of methodology, our work is highly related to other visual pairing tasks, particularly pairing clothing [11, 12, 13, 14, 15, 16], furniture [21], and food [22]. Here we address font pairing, which entails particular difficulties, including the lack of an appropriate data source, fine-grained differences between font types, and asymmetric pairing of entities of the same category rather than of different categories.

3 A Database for Font Pairing

In this section, we introduce “FontPairing”, the new database we generated for the visual font pairing task. We collect millions of freely available PDF pages on the Internet, then analyze them and extract header/body font pairs and header/sub-header pairs.

3.1 Font Pairings From Web

Perhaps the most obvious approach to gather font pairing data is to obtain them from webpages such as Google Fonts, Typ.io, and Typewolf. For example, each font on Google Fonts is provided with a list of suggested pairings. However, we found these datasets inadequate for two reasons. First, they each provide a small set of pairings. The second and more significant problem is that these pairing lists are extremely unbalanced: these websites generally recommend only popular body fonts. Out of the above pairs, 43% of the font pairs involve one of the five most-popular fonts.

3.2 PDF Dataset

To address these problems, we propose to detect font pairs from PDF documents available on the Internet. We have collected more than 300,000 PDF files from various websites. Each PDF usually contains dozens to hundreds of pages; from all these PDFs, we obtain more than 15 million pages in total. As shown in Figure 2, these PDF pages exhibit various layouts, topics, and font styles. We believe this dataset could be useful for training other models for document design as well.

A key challenge is then to extract visual information from this large dataset. Although PDF is a structured document format, it is complex and does not include the annotations we need (e.g., “header font”). Parsing such structured representations is a major challenge in itself [20]. Rather than attempting to fully parse the document, we focus only on identifying the font pairs containing the header, sub-header, and/or body fonts.

Of the PDFs we collected, 43% are scanned documents. We omit these from the dataset, to simplify parsing and avoid additional noise introduced by the parsing process. For the remaining data, we apply open PDF tools to extract text, image, and layout information from each page of each document. We define a text box as several words with the same font style and size in a line. Each text box is annotated with the font style, font size, and the bounding box. We discard pages that contain fewer than two text boxes. We also focus only on pages with the Roman alphabet, which we identify using Python language detection tools. Of the dataset, 75% of documents are in English.

To detect a header and sub-header pair on a page, we first find the largest text box, and call this the header text. We then identify the largest text box that lies within a fixed threshold of the header text box. We then call this a header/sub-header pair, and extract the fonts from the two text boxes. Only one header/sub-header pair is found for each document. We also detect body text boxes by finding text boxes with number of characters above a threshold. The nearest body text box to a header is used to form a header/body pair. Figure 2 shows example detection results on both header/body and header/sub-header pairs.
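The detection heuristic described above can be sketched roughly as follows. This is an illustration only, not the authors' implementation: the `TextBox` record, the reading of "largest" as largest font size, the corner-to-corner distance in `gap`, and the threshold values `max_gap` and `min_body_chars` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class TextBox:
    font: str    # PostScript font name
    size: float  # font size in points
    x: float     # left edge of the bounding box
    y: float     # top edge of the bounding box
    text: str


def gap(a, b):
    """Hypothetical box distance: Euclidean distance between top-left corners."""
    return ((a.x - b.x) ** 2 + (a.y - b.y) ** 2) ** 0.5


def detect_pairs(boxes, max_gap=100.0, min_body_chars=200):
    """Return (header/sub-header pair, header/body pair) as font-name tuples.

    Either element is None when no matching text box is found.
    """
    if len(boxes) < 2:  # pages with fewer than two text boxes are discarded
        return None, None

    # Header: the largest text box (assumed here to mean largest font size).
    header = max(boxes, key=lambda b: b.size)
    others = [b for b in boxes if b is not header]

    # Sub-header: the largest text box within a fixed distance of the header.
    near = [b for b in others if gap(header, b) <= max_gap]
    sub = max(near, key=lambda b: b.size) if near else None

    # Body: boxes with many characters; keep the one nearest to the header.
    bodies = [b for b in others if len(b.text) >= min_body_chars]
    body = min(bodies, key=lambda b: gap(header, b)) if bodies else None

    return ((header.font, sub.font) if sub else None,
            (header.font, body.font) if body else None)
```

In practice the two thresholds would be tuned against manually labeled pages, which is the role the verification subset plays in the text.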

To evaluate the accuracy of automatic pair detection on PDFs, we manually label header/body and header/sub-header pairs on a small subset of PDFs (i.e., 20 PDFs with various topics and layouts, totalling 3,000 pages). Here, our purpose is not to evaluate whether these are good pairings or not. We manually compare the automatic detection results with human labeling for verification. By adjusting our detection thresholds (e.g., the distance between the text boxes of header and sub-header), we achieve about 95% precision (true positives) for header/body pairs in our automatic detection. For header/sub-header pairs, our detector achieves 85% precision (true positives). There are more variations in the layout of header/sub-header pairs, which makes this task much harder than detecting header/body pairs.

The total numbers of unique header fonts, sub-header fonts, body fonts and pairs are shown in Table I. Figure 3 shows the top 5 header and body fonts used in header/body pairs, and Figure 5 shows the histogram of how often a header or body font appears in unique font pairs. Only 2.7% of header/body pairs involve one of the five most popular header fonts, and 7.5% of pairs involve one of the five most popular body fonts. This indicates a far more diverse set of pairings than the web-recommended pairings, of which, as reported above, 43% involve one of the five most popular body fonts.

Figure 4 shows sample font pairings in our dataset when given a query header font “CaeciliaLTStd-Heavy”. For this font, our dataset includes 10 font pairs, and 20 for the entire “Caecilia” family (including Bold, Heavy, Italic, etc.). This is much more diverse than those in web font sources. For example, for this same font, there are only 4 header/sub-header pairings on, 1 pairing on, 2 on, and 0 on Google Fonts.

Fig. 3: Top 5 header and body fonts used in header/body pairs.
Fig. 4: Examples of header/sub-header pairing in the FontPairing dataset, for the header font “CaeciliaLTStd-Heavy”.

3.3 Quality Verification

Following Veit et al. [12], we conduct an online user study to compare whether designers and ordinary users prefer the real font pairs we detected from PDFs or random alternatives. Our study includes 60 participants: 15 experts in graphic design (either students majoring in art and design or staff at design companies) from Upwork, and 45 non-designers with other backgrounds, recruited from Amazon Mechanical Turk and from volunteers.

The study comprised a set of paired comparisons. In each comparison, a user is shown two images of the same layout, but with one font changed (Figure 1). In particular, either the header or sub-header font is replaced by selecting a random alternative. The viewer is then asked which design they prefer. We perform two variants of the study: in the first one, the entire page layouts are shown to the viewer; in the second, the user is only shown the part of the page containing the relevant text boxes, so that they will focus more on the font choices rather than the context. In the whole-page setting, we show 20 comparisons to the users, and in the sub-page setting, we show 50 comparisons to the users. These samples are drawn at random from all the pairs.

Under both whole-page and sub-page settings, experts prefer the original layout 75% of the time. Non-experts prefer the original 65% of the time when viewing the full page, and 60% of the time when viewing the sub-page. Note that the original layout is not necessarily superior to the font choice in the random selection, for various reasons; however, we would expect that it would be more likely to be better. Hence, these results indicate that the pairing combinations included in the dataset are aligned with the preference of expert and common users. These results also suggest that non-experts are much less sensitive to good font choices than experts, and that there is potential value to recommend good pairings to them.

We would like to clarify that, although the font pairs extracted from the PDF dataset are of varying quality, by training on a large amount of data we aim to smooth out the noise therein and discover the general pairing rules that match most users’ preferences.


font set                   full      non-popular

unique headers             2,086     616
unique bodies              1,443     1,343
header/body pairs          13,251    5,337

unique headers             2,159     1,054
unique sub-headers         2,168     1,573
header/sub-header pairs    8,733     5,174

TABLE I: Numbers of unique fonts and pairs for header/body pairing (upper block) and header/sub-header pairing (lower block) are shown in the “full” column. Numbers after removing pairs that involve the top 50 most popular body/sub-header fonts are shown in the “non-popular” column.
Fig. 5: Data distribution of header/body pairs in the FontPairing dataset. The histogram of the number of times a header font appears in unique font pairs is shown in (a), and the corresponding histogram for body fonts is shown in (b). Fonts are written in the PostScript font name format, which typically includes both font family (e.g., “Helvetica”) and style (e.g., “Bold”).

4 Methods

Given a dataset of font pairs, our goal is to learn a model for predicting good font pairs. For example, given a header font, we would like to recommend good body fonts to go with it. We learn separate models for header/body and header/sub-header pairings. Without loss of generality, we discuss the header/body pairing in the rest of the section.

Suppose we have $N_h$ training header fonts with feature vectors $\{x^h_i\}_{i=1}^{N_h}$, and $N_b$ training body fonts with features $\{x^b_j\}_{j=1}^{N_b}$. For each header font $x^h_i$, there is a list of body fonts that pair with it, i.e., $P(x^h_i)$. Fonts may repeat in this list, so that the popularity of pairings can be captured in the data.

We use the pretrained feature of each font from the DeepFont method [3, 4] as the input font feature representation. The DeepFont model is trained for font recognition on the large-scale Visual Font Recognition (VFR) dataset. DeepFont introduces a Convolutional Neural Network (CNN) decomposition approach, using a domain adaptation technique based on a Stacked Convolutional Auto-Encoder (SCAE) that exploits a large corpus of unlabeled real-world text images combined with synthetic data. Using this model, we obtain the feature vector for a font, denoted as $x$, where $x \in \mathbb{R}^d$. As shown in Figure 6, distances in the DeepFont feature space correspond to perceptual similarity between fonts, which demonstrates the effectiveness of the DeepFont feature for searching for perceptually similar fonts. We choose not to apply an end-to-end deep neural network (DNN) to learn font pairing. The main reason is that the numbers of unique header/body pairs and header/sub-header pairs in our database are 13,251 and 8,733 respectively, as shown in Table I, which is not enough to train an end-to-end DNN.

In the following, we discuss two methods designed for font pairing: dual-space k-NN (DS-kNN) and asymmetric similarity metric learning (ASML).

4.1 Dual Space k-NN Search based Method

The intuition behind the dual-space k-NN search based method (DS-kNN) is that, if header fonts $x^h_i$ and $x^h_j$ are similar, then fonts that pair with $x^h_i$ should be good pairings with $x^h_j$.

Suppose we are querying which body font will go with a header font $x^h_q$. We first find the top $k$ nearest header fonts $\{x^h_1, \ldots, x^h_k\}$, based on the cosine similarity $\cos(x^h_q, x^h_i)$ in feature space between $x^h_q$ and all the training headers $x^h_i$. Each header $x^h_i$ has a list of body fonts that pair with it, i.e., $P(x^h_i)$. The fonts in $P(x^h_1), \ldots, P(x^h_k)$ are regarded as candidate body fonts for pairing with $x^h_q$. We assume that there are $n$ fonts in the candidate body font set $C = \{c_1, \ldots, c_n\}$. Note that fonts may repeat in this list. A font that repeats often in this list is one that is paired with many of the similar headers in the training set.

The candidate body fonts may only cover a part of the good pairings among all the body fonts. Fonts similar to the candidate body fonts may also result in pleasing pairings. Therefore, we rank all the body fonts by a similarity score $s$ computed against the candidate body fonts, and recommend the top $K$ fonts with the highest scores.

Here we introduce the way to calculate $s(x^b)$ for a body font $x^b$. We first calculate the cosine similarity $\cos(x^b, c_j)$ between $x^b$ and each candidate body font $c_j \in C$. Second, we select the top $m$ candidate body fonts with the largest $\cos(x^b, c_j)$. Then we average these cosine similarities, multiplying each by $\cos(x^h_q, x^h_{(j)})$, the similarity between the query header $x^h_q$ and the similar header $x^h_{(j)}$ from whose list $c_j$ was taken:

$$ s(x^b) = \frac{1}{m} \sum_{j=1}^{m} \cos(x^b, c_j)\, \cos(x^h_q, x^h_{(j)}) \tag{1} $$

where the sum runs over the $m$ selected candidates. Note that fonts may also repeat in this list of top $m$ candidate body fonts. This is similar to adding a tf weight, as in tf-idf (short for term frequency-inverse document frequency [23]), for each unique font in the candidate body fonts. In this way, the fonts with a high frequency in the list are assigned higher weights. The idf weight could be further integrated in Eq. (1) by multiplying by a weight $w_j$ for each $c_j$:

$$ s(x^b) = \frac{1}{m} \sum_{j=1}^{m} w_j \cos(x^b, c_j)\, \cos(x^h_q, x^h_{(j)}) \tag{2} $$

where $w_j$ is the idf weight of $c_j$, which decreases with $n_j$, the number of header fonts with which body font $c_j$ is paired in the training set. The main purpose of adding the idf weight is to reduce the impact of popular body fonts.
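The scoring procedure of this section can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the dictionary-based data layout, the function names, the default values of `k` and `m`, and in particular the idf form `log(#headers / #headers paired with the body)` are choices made for the example. Plain Python lists keep the sketch dependency-free.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two feature vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ds_knn_scores(query, headers, bodies, pairs, k=2, m=2, use_idf=False):
    """Score every body font for a query header feature vector.

    headers, bodies: dict mapping font name -> feature vector
    pairs: dict mapping header name -> list of paired body names (repeats allowed)
    """
    # Step 1: the k training headers most similar to the query header.
    sims = {h: cosine(query, x) for h, x in headers.items()}
    nearest = sorted(sims, key=sims.get, reverse=True)[:k]

    # Step 2: candidate bodies, each remembering the header it came from.
    candidates = [(b, h) for h in nearest for b in pairs.get(h, [])]

    # Optional idf weight (assumed form): rarer body fonts weigh more.
    n_paired = Counter(b for h in pairs for b in set(pairs[h]))

    def idf(b):
        return math.log(len(headers) / n_paired[b]) if use_idf else 1.0

    # Step 3: for each body font, average its m best candidate matches,
    # each weighted by the similarity of the source header to the query.
    scores = {}
    for name, x in bodies.items():
        terms = sorted(((cosine(x, bodies[c]), c, h) for c, h in candidates),
                       reverse=True)[:m]
        total = sum(s * sims[h] * idf(c) for s, c, h in terms)
        scores[name] = total / max(len(terms), 1)
    return scores
```

Sorting the returned dictionary by value gives the ranked list from which the top recommendations would be taken. Because repeated candidates stay in the list, frequently paired body fonts contribute more than once, which is the tf-like effect described in the text.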

However, DS-kNN does not perform well if truly similar header fonts are hard to find in the dataset. Also, popular body fonts have more chances of being paired with similar header fonts and of appearing in the set of candidate body fonts in this method. Meanwhile, there are some font pairing rules that may be missed by DS-kNN, for example, using the same font family for the header and body (e.g., Helvetica Bold for header and Helvetica for body). These rules are difficult to capture and cannot easily be handled by calculating font similarity in the original feature space. These concerns motivate us to learn the metric in the next section to capture the common pairing strategies.

Fig. 6: Two examples of similar font retrieval based on DeepFont features. In each column, the first row is the input query font. The following rows present the top 5 similar fonts measured by distance in DeepFont feature space. The robustness of DeepFont features benefits the performance of the dual-space k-NN method.

4.2 Asymmetric Similarity Metric Learning

The goal of the Asymmetric Similarity Metric Learning method is to learn a better distance scoring function between fonts, so that fonts that pair well have a low distance, and mismatched fonts have a large distance. We train this scoring function offline. Predictions are then generated for a given query by finding the fonts with the lowest distance under the new scoring function.

We treat the training dataset as comprising font pairs $(x^h_i, x^b_j)$, with an indicator $y_{ij} = 1$ when fonts $x^h_i$ and $x^b_j$ are paired in the training dataset. Since our FontPairing dataset only contains positive pairs, we randomly sample negative pairs from all the other possible pairs, excluding these positive pairs. The numbers of negative pairs and positive pairs are the same. The indicator $y_{ij} = -1$ when fonts $x^h_i$ and $x^b_j$ form a negative pair. There may exist good pairings among the negative set, but these should be in the minority, especially since the user study found that positive pairs were more attractive than randomly picked pairs to designers. Generally speaking, the original font pairs in PDFs are usually specifically designed and should be better matched than randomly picked ones.
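The negative sampling step described above could look roughly like this. The function name, the rejection-sampling strategy, and the fixed `seed` are assumptions made for the illustration; the source only states that negatives are drawn at random from the non-positive pairs, in equal number to the positives.

```python
import random


def sample_negatives(positives, headers, bodies, seed=0):
    """Draw as many random (header, body) pairs as there are positive pairs,
    skipping any pair that already occurs in the positive set."""
    positive_set = set(positives)
    n_target = len(positive_set)
    if n_target > len(headers) * len(bodies) - len(positive_set):
        raise ValueError("not enough unpaired combinations to sample from")

    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n_target:
        pair = (rng.choice(headers), rng.choice(bodies))
        if pair not in positive_set:
            negatives.add(pair)
    return sorted(negatives)
```

Positive pairs would then be labeled +1 and the sampled pairs -1 before training.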

The main idea of conventional metric learning (ML) [24] is to learn a better scoring function $d_M$ that enlarges the distance between the two font points of a non-matching pair and narrows the distance between the font points of a matched pair. The learned function takes the form of a distance parameterized by a positive semidefinite matrix $M$:

$$ d_M(x_i, x_j) = (x_i - x_j)^\top M (x_i - x_j) \tag{3} $$

Although metric learning is very important for many supervised learning applications (e.g., classification), it has a few limitations for our font pairing problem. First, whereas a classification problem can apply a nearest neighbor classifier after ML, in pairing we still need to make a decision after ML, such as with a constant threshold $c$:

$$ d_M(x_i, x_j) \le c \tag{4} $$
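However, a simple constant threshold may be sub-optimal, even if the associated metric is correct. Another challenge is that font pairing is asymmetric: pairing A as header and B as body is different from pairing B as header and A as body. To address these challenges, we consider a joint model that bridges a distance metric and an asymmetric similarity decision rule, and propose asymmetric similarity metric learning as:

$$ f_{(M,G)}(x_i, x_j) = s_G(x_i, x_j) - d_M(x_i, x_j) \tag{5} $$

where $d_M(x_i, x_j) = (x_i - x_j)^\top M (x_i - x_j)$ is the learned distance and $G$ is asymmetric. $s_G(x_i, x_j) = x_i^\top G x_j$ measures the similarity of font pairs.

Let $\mathcal{P} = \mathcal{S} \cup \mathcal{D}$ denote the index set of all pairwise constraints, where $\mathcal{S}$ indexes the positive pairs and $\mathcal{D}$ the negative pairs. Let $y_{ij} = 1$ if $(i,j) \in \mathcal{S}$ and $y_{ij} = -1$ if $(i,j) \in \mathcal{D}$. We derive the formulation of the empirical discrimination using the hinge loss:

$$ \min_{M, G} \sum_{(i,j) \in \mathcal{P}} \left[ 1 - y_{ij} f_{(M,G)}(x_i, x_j) \right]_+ + \frac{\gamma}{2} \left( \|M - I\|_F^2 + \|G - I\|_F^2 \right) \tag{6} $$

where $[z]_+ = \max(0, z)$, the regularization term prevents the image vector from being distorted too much, $\|\cdot\|_F$ is the Frobenius norm, and $\gamma$ is the trade-off parameter. This objective function can be solved with a dual formulation as in [25].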

After learning the new scoring function of Eq. (5) offline, in online pairing we recommend font pairs according to the distance between header and body under the new scoring function.
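To make the role of the asymmetric matrix concrete, here is a small sketch of a scoring function and a hinge-loss objective of the kind this section describes. It assumes the form used in similarity metric learning [25], namely a bilinear similarity minus a Mahalanobis-style distance, with both matrices regularized towards the identity. The function names, the value of `gamma`, and the toy matrices are hypothetical, and the optimization itself (the dual formulation) is not shown.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def score(x_h, x_b, M, G):
    """Assumed form: f(x_h, x_b) = x_h^T G x_b - (x_h - x_b)^T M (x_h - x_b)."""
    diff = [a - b for a, b in zip(x_h, x_b)]
    return dot(x_h, matvec(G, x_b)) - dot(diff, matvec(M, diff))


def sq_frobenius_to_identity(A):
    """Squared Frobenius norm of (A - I)."""
    n = len(A)
    return sum((A[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(n) for j in range(n))


def objective(pairs, labels, M, G, gamma=0.1):
    """Hinge loss over labeled pairs plus the regularizer on M and G."""
    hinge = sum(max(0.0, 1.0 - y * score(h, b, M, G))
                for (h, b), y in zip(pairs, labels))
    reg = 0.5 * gamma * (sq_frobenius_to_identity(M) + sq_frobenius_to_identity(G))
    return hinge + reg
```

With a non-symmetric `G`, swapping the header and body arguments changes the score, which is how such a model can treat the two roles differently. With a symmetric `G` the score is the same in both orders, which corresponds to a symmetric baseline.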

5 Experiments

5.1 Compared Methods

We implement the following baselines for comparison, including several based on design rules-of-thumb.

Popularity: This method recommends the most popular body fonts. First, we rank all body fonts according to the frequency with which they appear in font pairs in the collected dataset. The top-ranked body fonts are defined as popular fonts. These same fonts are always recommended, regardless of the query header.

Simple k-NN (S-kNN): This method recommends the body fonts with the highest visual similarity to the query header font as pairs. The similarity between two fonts is measured with DeepFont features.

Contrast similarity (ConSim): The main intuition of this method [26] is that the ideal pairing weighs similarities and contrasts with equal importance. Its authors manually designed a contrast similarity distance metric. More details can be found in [26].

Similarity metric learning (SML): We also implement a similarity metric learning method (SML) to evaluate the effectiveness of making the metric G asymmetric in ASML. We replace the asymmetric G in Eq. (5) and (6) with a symmetric metric. This idea is similar to [25], but [25] addresses the face verification problem.

Dual-space k-NN (DS-kNN) (Ours): Our proposed dual-space k-NN method.

Asymmetric similarity metric learning (ASML) (Ours): Our proposed asymmetric similarity metric learning method.
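The two simplest baselines in the list above are easy to express in code. The following is a rough sketch, with hypothetical function names and a plain-list feature representation standing in for DeepFont features.

```python
import math
from collections import Counter


def popularity_ranking(pairs):
    """Rank body fonts by how often they appear in (header, body) pairs.
    The same ranking is returned for every query header."""
    counts = Counter(body for _, body in pairs)
    return [font for font, _ in counts.most_common()]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def simple_knn(query, bodies, top_k=3):
    """Recommend the body fonts whose features are most similar to the query header."""
    ranked = sorted(bodies, key=lambda name: cosine(query, bodies[name]), reverse=True)
    return ranked[:top_k]
```

Neither baseline uses the pairing structure of the data beyond counting, which is why they serve as reference points for the learned methods.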

Fig. 7: Performance of top-$K$ recommendation on header/body (first row) and header/sub-header (second row) pairing, measured with top-$K$ precision and recall and with weighted top-$K$ precision and recall.

Fig. 8: Examples of header/sub-header font pairing results. The input is a query header font “NewBaskerville-BoldSC”. All the pairs are rendered with the same format and header font, and differ only in the sub-header font. The leftmost column shows three pairings in our collection. In each of the other columns, we show the top 3 sub-header fonts recommended by one method. The PostScript name of each sub-header is shown below the image. The number of times the sub-header font appears in the unique font pairs is shown in parentheses. We recommend zooming in to see more details of the fonts.

5.2 Experimental Setting

We perform quantitative evaluation similar to other pairing tasks [12, 11, 27]. We conduct two sets of experiments: top-$K$ recommendation and binary classification. Without loss of generality, in the rest of the section we discuss the setting in which, given a header font, we recommend body pairings.

The first evaluation formalizes the pairing problem as a retrieval problem. Given a header font, we rank all the body fonts and recommend the top-$K$ body fonts as good pairings. The second evaluation formalizes the pairing problem as a binary classification problem: given a font pair, we want to classify whether it is a good pairing or not.

We split the header fonts in FontPairing dataset into training header set and test header set by a ratio of 9:1 with no overlap. Only the pairings with training headers are used as positive training pairings. In this way, in the test stage, we are able to evaluate the performance of recommending body fonts to pair an unseen header font. The body fonts in the training and test set may have overlaps.
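A split of this kind, in which whole header fonts rather than individual pairs are held out, might be implemented as follows. The function name, the shuffling with a fixed `seed`, and the rounding rule are assumptions for the illustration.

```python
import random


def split_by_header(pairs, test_ratio=0.1, seed=0):
    """Split (header, body) pairs so that no header font appears on both sides.
    Body fonts are allowed to overlap between the two sets."""
    headers = sorted({h for h, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(headers)

    n_test = max(1, round(len(headers) * test_ratio))
    test_headers = set(headers[:n_test])

    train = [(h, b) for h, b in pairs if h not in test_headers]
    test = [(h, b) for h, b in pairs if h in test_headers]
    return train, test
```

Splitting by header instead of by pair is what makes every test query an unseen header font.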

5.2.1 Top-K recommendation

In real-world font pairing interfaces, we would like to recommend multiple candidate fonts for pairing, and let the user pick from this list [2]. Thus, we evaluate top-$K$ recommendation performance, namely precision and recall at $K$, which are widely used in recommender systems [28].

Assuming the user gets a top-$K$ recommended list of fonts, recall is the percentage of relevant fonts selected out of all the ground truth fonts, and precision is the percentage of the results that are good recommendations.
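These two measures can be computed as in the short sketch below, where `ranked` is a method's ordered list of recommended body fonts and `relevant` is the set of ground truth body fonts for the query header. Both names are chosen for the example.

```python
def precision_recall_at_k(ranked, relevant, k):
    """Precision and recall of the first k recommendations."""
    top = ranked[:k]
    hits = sum(1 for font in top if font in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# One of the two ground truth fonts appears in the top 2.
print(precision_recall_at_k(["B3", "B1", "B4", "B2"], {"B1", "B2"}, k=2))  # (0.5, 0.5)
```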

Besides the conventional top-$K$ precision and recall, we also apply weighted precision and weighted recall as evaluation metrics. Popular fonts are easier to think of and may be less interesting to users. We add an IDF weight [29] to each font (the more popular the font, the lower its weight) as:


where weighted_TP is the sum of all the IDF weights of true positive fonts. weighted_GT is the sum of all the IDF weights of ground truth fonts.
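One possible reading of the weighted metrics is sketched below. Weighted recall follows the description directly (weighted_TP divided by weighted_GT). For weighted precision, the sketch normalizes by the total weight of the recommended fonts, and the idf weight is taken as `log(#headers / #headers paired with the font)`; both of these are assumptions, as are the function names.

```python
import math


def idf_weights(pair_counts, n_headers):
    """Assumed idf form. pair_counts[font] is the number of header fonts
    the font is paired with, so popular fonts get lower weights."""
    return {font: math.log(n_headers / count) for font, count in pair_counts.items()}


def weighted_precision_recall(ranked, relevant, weights, k):
    """idf-weighted precision and recall of the first k recommendations."""
    top = ranked[:k]
    weighted_tp = sum(weights[f] for f in top if f in relevant)
    weighted_gt = sum(weights[f] for f in relevant)
    weighted_rec = sum(weights[f] for f in top)  # assumed precision denominator
    precision = weighted_tp / weighted_rec if weighted_rec else 0.0
    recall = weighted_tp / weighted_gt if weighted_gt else 0.0
    return precision, recall
```

Under this weighting, correctly recommending a rarely used font counts for more than correctly recommending one of the dominant popular fonts.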

5.2.2 Binary classification

Following the experimental settings from clothing pairing works [12, 11], we formalize evaluation in terms of binary classification. Given a header and body font pair, we want to classify whether or not it is a good pairing.

We regard all the font pairs we extracted from PDFs as positive samples. The test set is formed with positive samples and negative samples in equal proportion. Thus, chance performance is 50% for all experiments. The negative samples are randomly sampled pairs, excluding all positive pairs, following [12, 11]. In the “Quality Verification” section, we have shown that both designers and average users generally prefer the pairings extracted from PDF documents to random alternatives.

5.3 Top-K Recommendation Results

Figure 7 shows the performance of top-$K$ recommendation on header/body pairing and header/sub-header pairing for the 6 methods. We show the performance of each method under two metrics: top-$K$ precision and recall, and weighted top-$K$ precision and recall. The number of recommended fonts $K$ is shown on the x-axis and the corresponding top-$K$ precision and recall are shown on the y-axis. The best results of each method are shown in the figure, and the parameters are set based on cross-validation.

In all the cases, ASML achieves the highest performance among 6 methods. It demonstrates the effectiveness of regarding visual font pairing as the asymmetric metric learning problem. Also, ASML outperforms SML, which demonstrates the effectiveness of the asymmetric constraint. ASML could automatically learn various font pairing rules and outperform existing rules such as similarity and contrast similarity.

Figure 8 visualizes the top recommended sub-header fonts of the compared methods, given a query header font “NewBaskerville-BoldSC”. The leftmost column shows the pairings extracted from PDFs. The Popularity method tends to recommend the fonts with the highest frequency, shown in parentheses. The Same Family method randomly picks fonts from the same font family. Generally, the font pairings in PDFs include many cases in which the fonts are from the same font family, since this is easy to implement. While the Same Family and Popularity methods would hit more of the PDF extractions, it is very easy for users to pick a font from the same family by themselves, so users, especially designers, may not be interested in these recommendations. Another concern is that the Same Family method may fail to find a same-family font for some less popular fonts. S-kNN shows the pairings with the smallest visual distances. With the DS-kNN (ours) and ASML (ours) methods, we are able to recommend font pairings that are both interesting and unpopular, and that meanwhile remain well coordinated.

5.4 Binary Classification Results

Table II shows binary classification results on header/body and header/sub-header pairing under two settings: (1) with all the fonts (full) and (2) after removing the top 50 popular body/sub-header fonts (non-popular), as described in Table I. Classification thresholds are set by cross-validation on the training data for each method separately.


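| Method | header/body (full) | header/body (non-popular) | header/sub-header (full) | header/sub-header (non-popular) |
|---|---|---|---|---|
| Popularity | 73.60 | 55.29 | 68.04 | 61.35 |
| S-kNN | 52.87 | 60.43 | 62.32 | 67.82 |
| ConSim [26] | 55.81 | 67.28 | 61.32 | 65.79 |
| SML [25] | 60.80 | 67.34 | 67.61 | 72.69 |
| DS-kNN | 76.93 | 59.28 | 71.30 | 63.46 |
| ASML | 64.97 | 68.23 | 68.43 | 73.41 |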


TABLE II: Performance on binary classification of header/body and header/sub-header pairing under two settings as Table I.

In the “full” setting, DS-kNN achieves the highest performance in both header/body and header/sub-header pairing. In header/body pairing, Popularity achieves the second highest performance. This is consistent with the phenomenon shown in Figure 5 that popular body fonts account for a large proportion of header/body pairs in PDF designs. The main reason is that a few popular fonts dominate the “full” setting. Thus, in DS-kNN, the popular fonts appear frequently in the candidate body set and have more chances to hit the ground truth. In header/sub-header pairing, ASML achieves the second highest performance.

To reduce the effect of dominant popular fonts, we also conduct the experiment under the “non-popular” setting. In this setting, ASML achieves the highest performance, followed by SML. Table II demonstrates the effectiveness of DS-kNN and ASML in both regular and non-popular font pairing tasks.

An interesting observation is that DS-kNN performs much better in binary classification than in top-$k$ recommendation. In top-$k$ recommendation, according to the evaluation metric, the top recommended fonts have higher weights than the bottom ones. This means that if the top recommended fonts are not the same as the ground truth, it is hard to achieve a high top-$k$ accuracy. In DS-kNN, since we rank all the test body fonts by their similarity score to the candidate body fonts, there is a high probability that fonts similar to the ground truth are ranked higher than the ground truth itself. This degrades the top-$k$ precision and recall scores of DS-kNN. In binary classification, however, the performance does not depend on the order of recommendation.

5.5 Subjective Evaluation

Besides the quantitative evaluations, we also conduct a subjective evaluation through a user study on AMT and Upwork, crowdsourcing platforms that target average users and professionals, respectively.

The study comprises a set of paired comparisons. In each comparison, one pair comes from either DS-kNN or ASML, and the other comes from one of the compared methods. Users are shown only the part of the page that contains the text, as in Figure 8. We evaluate 500 comparisons, and each comparison receives at least 11 ratings from average users and 3 ratings from designers.

Before describing the evaluation results, we first analyze the consistency of the users’ ratings. If users’ ratings of the two pairs diverge on most comparisons, users do not hold consistent opinions on the font pairing task, and the task is very likely too subjective to be learnable. Conversely, if users hold consistent opinions about which pair is superior, the ratings are more convincing and can be used in the following studies.

As described above, we evaluate about 500 comparisons on AMT and Upwork, and each comparison receives at least 11 ratings from average users and 3 ratings from designers. In total, there are almost 150 average users and 10 designers.

Suppose that there are $N$ comparisons, and for the $i$-th comparison, we denote the hits of pair1 as $a_i$ and the hits of pair2 as $b_i$. The normalized difference of the $i$-th comparison is:

$$d_i = \frac{|a_i - b_i|}{a_i + b_i}$$

For example, assuming there are 11 ratings for one comparison, if the ratio of hits of the two methods is 5:6, then $d_i = 1/11 \approx 0.09$. The value of $d_i$ lies between 0 and 1. The higher the normalized difference, the higher the consistency.
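The normalized difference can be illustrated with a minimal sketch. It assumes the measure is the absolute difference of hits divided by the total number of hits, which agrees with the 5:6 example and the [0,1] range stated above; the function name is hypothetical.

```python
from fractions import Fraction


def normalized_difference(hits_pair1: int, hits_pair2: int) -> Fraction:
    """Assumed form: |hits_pair1 - hits_pair2| / (hits_pair1 + hits_pair2)."""
    total = hits_pair1 + hits_pair2
    if total == 0:
        raise ValueError("a comparison needs at least one rating")
    return Fraction(abs(hits_pair1 - hits_pair2), total)


# Three illustrative vote splits for a comparison with 11 ratings.
for a, b in [(5, 6), (3, 8), (11, 0)]:
    print(f"{a}:{b} -> {float(normalized_difference(a, b)):.3f}")
```

With 11 ratings, the measure can take only six distinct values (1/11, 3/11, ..., 11/11), from a near-even split to a unanimous choice.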

To verify that the users’ ratings are consistent, we compare the distribution of users’ ratings with the pure random distribution and use hypothesis testing to check whether the two distributions are significantly different.

We first introduce the rating consistency of average users on AMT. There are three steps. In the first step, we turn the continuous comparison results into binned data by grouping the comparisons into specified ranges according to their normalized difference. We evenly divide [0,1] into six ranges from lowest to highest. The pdf of the normalized difference of users’ ratings is shown in Figure 9 (a).
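The binning step can be sketched as follows. This is an illustration only: the helper names are hypothetical, the toy values are not data from the study, and placing a difference of exactly 1 in the last bin is an assumption about how the upper boundary is handled.

```python
from collections import Counter


def bin_index(d: float, n_bins: int = 6) -> int:
    """Map a normalized difference in [0, 1] to one of n_bins equal-width bins."""
    if not 0.0 <= d <= 1.0:
        raise ValueError("normalized difference must lie in [0, 1]")
    # d == 1.0 would fall just past the last bin, so clamp it into that bin.
    return min(int(d * n_bins), n_bins - 1)


def empirical_pdf(differences, n_bins: int = 6):
    """Fraction of comparisons that fall into each bin, lowest bin first."""
    if not differences:
        raise ValueError("need at least one comparison")
    counts = Counter(bin_index(d, n_bins) for d in differences)
    return [counts.get(b, 0) / len(differences) for b in range(n_bins)]


# Hypothetical normalized differences for five comparisons with 11 ratings each.
toy_differences = [1 / 11, 3 / 11, 3 / 11, 9 / 11, 11 / 11]
print(empirical_pdf(toy_differences))
```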

Fig. 9: pdf of the normalized difference of average users’ ratings (a) and pdf of pure random ratings (b). There are six bins for both pdfs. The x-axis from left to right demonstrates the consistency from highest to lowest.

In the second step, we calculate the pdf of pure random choice analytically, shown in Figure 9 (b).
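One way to obtain such an analytic pdf is sketched below, under two assumptions that the text does not state explicitly: every comparison receives exactly `n_ratings` ratings, and each random rater picks either pair with probability 0.5 independently. The function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def random_pdf(n_ratings: int) -> dict:
    """Probability of each normalized difference under fair, independent random votes."""
    pdf = {}
    for hits_pair1 in range(n_ratings + 1):
        hits_pair2 = n_ratings - hits_pair1
        d = Fraction(abs(hits_pair1 - hits_pair2), n_ratings)
        # Binomial probability of this exact vote split.
        pdf[d] = pdf.get(d, 0.0) + comb(n_ratings, hits_pair1) / 2 ** n_ratings
    return dict(sorted(pdf.items()))


for d, p in random_pdf(11).items():
    print(f"normalized difference {d}: probability {p:.4f}")
```

Under these assumptions, 11 ratings give exactly six possible values, one per bin, and a unanimous split is very unlikely. With 3 raters, the probability of a unanimous split is 2/8 = 0.25.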

In the third step, we apply $\chi^2$ hypothesis testing to check whether these two distributions are significantly different. Suppose that $O_j$ is the number of events observed in the $j$-th bin, and that $E_j$ is the number expected according to the random distribution. The $\chi^2$ statistic is calculated as:

$$\chi^2 = \sum_{j} \frac{(O_j - E_j)^2}{E_j}$$

Any term with $O_j = E_j = 0$ should be omitted from the sum. The average $\chi^2$ is 717.43 when we sum over all six bins. For $\chi^2$ testing, it is also suggested to omit the bins in which the expected count is too small. In most cases, the expected count of one bin is very small under the random distribution. Thus, we also calculate $\chi^2$ over the remaining five bins. The average $\chi^2$ is 117.32. According to the $\chi^2$ distribution table (footnote 8: mga/401/tables/Chi-square-table.pdf), the critical value at the 0.005 level is 16.750 under 6 bins (5 degrees of freedom) and 14.860 under 5 bins (4 degrees of freedom). Thus we can safely conclude that the users’ ratings are consistent and significantly different ($p < 0.005$) from the pure random distribution.
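The test can be sketched in a few lines. The observed counts below are hypothetical and are not the study's data, the expected counts assume 500 comparisons with exactly 11 fair random ratings each, and the `min_expected=5` cutoff is a common rule of thumb used here as an assumption rather than the threshold used in the paper.

```python
from math import comb


def chi_square(observed, expected, min_expected=0.0):
    """Pearson chi-square statistic; returns (statistic, number of bins used)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    stat, used = 0.0, 0
    for o, e in zip(observed, expected):
        if o == 0 and e == 0:
            continue  # empty term, omitted from the sum
        if e < min_expected:
            continue  # bin with too small an expected count, omitted
        if e == 0:
            raise ValueError("observed events in a bin with zero expected count")
        stat += (o - e) ** 2 / e
        used += 1
    return stat, used


# Expected counts, ordered from the most even vote split (6:5) to unanimous (11:0).
n_comparisons, n_ratings = 500, 11
expected = [n_comparisons * 2 * comb(n_ratings, k) / 2 ** n_ratings
            for k in (5, 4, 3, 2, 1, 0)]
observed = [60, 90, 110, 100, 90, 50]  # hypothetical counts

full_stat, full_bins = chi_square(observed, expected)
trim_stat, trim_bins = chi_square(observed, expected, min_expected=5.0)

# Critical values at the 0.005 level, as quoted in the text.
print(full_bins, full_stat > 16.750)
print(trim_bins, trim_stat > 14.860)
```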

We also analyze the designers’ rating consistency in the same way. In about 43% of the pairs, all the designers make the same choice (highest consistency), whereas under pure random choice only 25% of the pairs reach the highest consistency. Hypothesis testing on the pdf of designers’ ratings against the pdf of pure random ratings again lets us safely conclude that the designers’ choices are consistent and significantly different from the pure random distribution.

5.5.1 User Study Results

Fig. 10: Subjective evaluation score of six methods by average user in (a) and designer in (b) by Bradley-Terry method.

We apply the Bradley-Terry model (footnote 9: dhunter/code/btmatlab/) to obtain rankings from the pairwise comparisons of ASML and DS-kNN against PDF, Random, Popularity and Family. The ranking scores of each method, based on average users’ and designers’ ratings, are shown in Figure 10. For average users, the ranking from best to worst is ASML, PDF, DS-kNN, Popularity, Family and Random. For designers, the ranking is Popularity, ASML, PDF, DS-kNN, Family and Random.

Average users’ ratings demonstrate that ASML outperforms the hand-crafted methods and even the pairs extracted from PDFs. Designers prefer popular fonts most, which we attribute to their greater familiarity with these fonts. However, as discussed before, recommending only popular fonts may be less interesting to designers. The remaining ranks for designers are the same as those for average users.
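For readers unfamiliar with the Bradley-Terry model, the following is a minimal Python sketch of its standard minorization-maximization (MM) update. It is not the MATLAB code referenced in the footnote, and the win counts are hypothetical. Here `wins[i][j]` is the number of times method `i` was preferred over method `j`.

```python
def bradley_terry(wins, iters=500, tol=1e-12):
    """Estimate Bradley-Terry strengths (normalized to sum to 1) with MM updates."""
    m = len(wins)
    p = [1.0 / m] * m
    for _ in range(iters):
        new_p = []
        for i in range(m):
            total_wins = sum(wins[i][j] for j in range(m) if j != i)
            # Each pair contributes its number of comparisons over the summed strengths.
            denom = sum((wins[i][j] + wins[j][i]) / (p[i] + p[j])
                        for j in range(m) if j != i)
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        if norm == 0:
            return p  # no wins recorded at all, keep the uniform start
        new_p = [x / norm for x in new_p]
        converged = max(abs(a - b) for a, b in zip(p, new_p)) < tol
        p = new_p
        if converged:
            break
    return p


# Hypothetical win counts for three methods, 11 ratings per pair of methods.
wins = [
    [0, 8, 9],
    [3, 0, 7],
    [2, 4, 0],
]
scores = bradley_terry(wins)
print([round(s, 3) for s in scores])
```

A higher score means the method is more likely to be preferred in a head-to-head comparison, so sorting methods by score gives a ranking of the kind reported in Figure 10.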

5.6 Users’ Rating Prediction

In this section, we want to evaluate the performance of predicting users’ preference between pair1 (header A/sub-header B), and pair2 (header A/sub-header C).

5.6.1 Experimental settings

For each comparison, the ground-truth label is the pairing that receives the higher rating in the user study. We predict users’ choices with each method and compare the predictions with the ground-truth labels to obtain each method’s prediction accuracy. For both average users and designers, we use only the comparisons with the highest rating consistency as the evaluation set, since these ratings are the most convincing.

We compare the performance of Popularity, S-kNN, ConSim, SML, DS-kNN and ASML. In Popularity, we compare the popularity of the two sub-headers and choose the more popular one as the result. In S-kNN, we calculate the distance between the header and each sub-header and choose the pair with the smaller distance as the result. ConSim, DS-kNN, SML and ASML are evaluated in the same way as S-kNN, but with different scoring functions for calculating the distance between header and sub-header fonts.
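The evaluation protocol can be sketched as below. All font names, popularity counts and two-dimensional features are hypothetical, and plain Euclidean distance stands in for the visual distance; the scoring functions of ConSim, SML, DS-kNN and ASML would plug in as other `score` callables.

```python
import math


def predict_choice(score, header, sub_b, sub_c):
    """Return 1 if pair1 (header, sub_b) scores at least as high as pair2, else 2."""
    return 1 if score(header, sub_b) >= score(header, sub_c) else 2


def accuracy(score, comparisons):
    """Fraction of comparisons whose predicted choice matches the user label."""
    correct = sum(predict_choice(score, h, b, c) == label
                  for h, b, c, label in comparisons)
    return correct / len(comparisons)


# Hypothetical data: sub-header popularity counts and toy font features.
popularity = {"FontB": 120, "FontC": 45, "FontD": 300}
features = {"FontA": (0.0, 0.0), "FontB": (0.1, 0.2),
            "FontC": (0.9, 0.8), "FontD": (0.5, 0.5)}


def popularity_score(header, sub_header):
    # Popularity ignores the header and prefers the more frequent sub-header.
    return popularity[sub_header]


def nearest_score(header, sub_header):
    # Smaller distance is better, so negate it to obtain a score.
    return -math.dist(features[header], features[sub_header])


# Each entry: (header A, sub-header B, sub-header C, label preferred by users).
comparisons = [
    ("FontA", "FontB", "FontC", 1),
    ("FontA", "FontC", "FontD", 2),
    ("FontA", "FontB", "FontD", 1),
]
print(accuracy(popularity_score, comparisons))
print(accuracy(nearest_score, comparisons))
```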

5.6.2 Rating prediction results

Table III shows the accuracy of average users’ and designers’ ratings prediction with comparison methods under the highest consistency level.

For predicting average users’ ratings, DS-kNN and ASML achieve the highest and second highest performance, respectively. For predicting designers’ ratings, DS-kNN, Popularity and ASML achieve the three highest performances. This is generally consistent with the user study results in Section 5.5.

Looking at S-kNN under average users’ and designers’ ratings, it is interesting to see that average users prefer pairings with similar header and sub-header fonts, while designers prefer pairings with contrasting header and sub-header fonts. This shows how hard it is to predict both tasks with a single uniform scoring function. However, ASML achieves the second and third highest performance in the two tasks, which shows the effectiveness of its learned scoring function.


| Method | average user | designer |
|---|---|---|
| Popularity | 55.56 | 57.89 |
| S-kNN | 55.33 | 45.45 |
| ConSim | 54.32 | 52.15 |
| SML | 56.22 | 54.59 |
| DS-kNN (ours1) | 68.18 | 59.81 |
| ASML (ours2) | 58.67 | 56.94 |


TABLE III: Accuracy of predicting average users’ and designers’ ratings with comparison methods.

6 Conclusion

In this paper, we introduced the problem of visual font pairing. To the best of our knowledge, this is the first time automatic font pairing has been addressed in the multimedia and computer vision fields. We introduced a new database called FontPairing, built from millions of PDF documents on the Internet, from which we automatically extracted header/sub-header and header/body pairs. We proposed two automatic font pairing methods that learn fine-grained visual relationships from large-scale human-generated font pairs: dual-space k-NN and asymmetric similarity metric learning. Comparisons are conducted against several baseline methods based on rules from professional designers. Experiments and user studies demonstrate the effectiveness of our proposed dataset and methods.


  • [1] X. Yang, T. Mei, Y.-Q. Xu, Y. Rui, and S. Li, “Automatic generation of visual-textual presentation layout,” ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), vol. 12, no. 2, p. 33, 2016.
  • [2] P. O’Donovan, J. Lībeks, A. Agarwala, and A. Hertzmann, “Exploratory font selection using crowdsourced attributes,” ACM Transactions on Graphics (TOG), vol. 33, no. 4, p. 92, 2014.
  • [3] Z. Wang, J. Yang, H. Jin, J. Brandt, E. Shechtman, A. Agarwala, Z. Wang, Y. Song, J. Hsieh, S. Kong et al., “Deepfont: A system for font recognition and similarity,” in Proceedings of the 23rd ACM international conference on Multimedia.   ACM, 2015, pp. 813–814.
  • [4] Z. Wang, J. Yang, H. Jin, E. Shechtman, A. Agarwala, J. Brandt, and T. S. Huang, “Deepfont: Identify your font from an image,” in Proceedings of the 23rd ACM international conference on Multimedia.   ACM, 2015, pp. 451–459.
  • [5] N. Zhao, Y. Cao, and R. W. Lau, “Modeling fonts in context: Font prediction on web designs,” Computer Graphics Forum (Proc. Pacific Graphics 2018), vol. 37, 2018.
  • [6] A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, and A. Farhadi, “A diagram is worth a dozen images,” in ECCV.   Springer, 2016, pp. 235–251.
  • [7] N. Siegel, Z. Horvitz, R. Levin, S. Divvala, and A. Farhadi, “Figureseer: Parsing result-figures in research papers,” in ECCV.   Springer, 2016, pp. 664–680.
  • [8] C. Galleguillos, A. Rabinovich, and S. Belongie, “Object categorization using co-occurrence, location and appearance,” in IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2008, pp. 1–8.
  • [9] L. Ladicky, C. Russell, P. Kohli, and P. H. Torr, “Graph cut based inference with co-occurrence statistics,” in European Conference on Computer Vision.   Springer, 2010, pp. 239–253.
  • [10] L. Feng and B. Bhanu, “Semantic concept co-occurrence patterns for image annotation and retrieval,” IEEE transactions on pattern analysis and machine intelligence, vol. 38, no. 4, pp. 785–799, 2016.
  • [11] J. McAuley, C. Targett, Q. Shi, and A. van den Hengel, “Image-based recommendations on styles and substitutes,” in Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval.   ACM, 2015, pp. 43–52.
  • [12] A. Veit, B. Kovacs, S. Bell, J. McAuley, K. Bala, and S. Belongie, “Learning visual clothing style with heterogeneous dyadic co-occurrences,” in ICCV, 2015, pp. 4642–4650.
  • [13] J. McAuley, R. Pandey, and J. Leskovec, “Inferring networks of substitutable and complementary products,” in Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.   ACM, 2015, pp. 785–794.
  • [14] V. Jagadeesh, R. Piramuthu, A. Bhardwaj, W. Di, and N. Sundaresan, “Large scale visual recommendations from street fashion images,” in Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2014, pp. 1925–1934.
  • [15] S. Liu, J. Feng, Z. Song, T. Zhang, H. Lu, C. Xu, and S. Yan, “Hi, magic closet, tell me what to wear!” in Proceedings of the 20th ACM international conference on Multimedia.   ACM, 2012, pp. 619–628.
  • [16] L.-F. Yu, S.-K. Yeung, D. Terzopoulos, and T. F. Chan, “Dressup! outfit synthesis through automatic optimization,” ACM TOG, 2012.
  • [17] D. Bonneville, The Big Book of Font Combinations.   BonFX Press, 2010.
  • [18] G. Chen, J. Yang, H. Jin, J. Brandt, E. Shechtman, A. Agarwala, and T. X. Han, “Large-scale visual font recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 3598–3605.
  • [19] Z. Wang, J. Yang, H. Jin, E. Shechtman, A. Agarwala, J. Brandt, and T. S. Huang, “Real-world font recognition using deep network and domain adaptation,” 2015.
  • [20] R. Kumar, A. Satyanarayan, C. Torres, M. Lim, S. Ahmad, S. R. Klemmer, and J. O. Talton, “Webzeitgeist: Design mining the web,” in Proc. CHI, 2013.
  • [21] T. Liu, A. Hertzmann, W. Li, and T. Funkhouser, “Style compatibility for 3d furniture models,” ACM TOG, 2015.
  • [22] Y.-Y. Ahn, S. E. Ahnert, J. P. Bagrow, and A.-L. Barabási, “Flavor network and the principles of food pairing,” Scientific Reports, vol. 1, p. 196, Dec. 2011.
  • [23] A. Aizawa, “An information-theoretic perspective of tf–idf measures,” Information Processing & Management, vol. 39, no. 1, pp. 45–65, 2003.
  • [24] K. Q. Weinberger and L. K. Saul, “Distance metric learning for large margin nearest neighbor classification,” Journal of Machine Learning Research, vol. 10, no. Feb, pp. 207–244, 2009.
  • [25] Q. Cao, Y. Ying, and P. Li, “Similarity metric learning for face recognition,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 2408–2415.
  • [26] “Font pairing made simple,” [Online].
  • [27] S. Jiang, X. Qian, J. Shen, Y. Fu, and T. Mei, “Author topic model-based collaborative filtering for personalized poi recommendations,” IEEE Transactions on Multimedia, vol. 17, no. 6, pp. 907–918, 2015.
  • [28] A. Gunawardana and G. Shani, “A survey of accuracy evaluation metrics of recommendation tasks,” Journal of Machine Learning Research, vol. 10, no. Dec, pp. 2935–2962, 2009.
  • [29] G. Salton and C. Buckley, “Term-weighting approaches in automatic text retrieval,” Information processing & management, vol. 24, no. 5, pp. 513–523, 1988.