In natural scenes and documents, the color of text can have a correlation to the meaning of the text. For example, as shown in Fig. 1, the word, “hot,” is often printed in red and the word, “cold,” is often blue. This correlation is caused by various reasons. The color of the object described by a text will be a reason; for example, words relating to plants are often printed in green. Visual saliency (i.e., visual prominence) is another reason; for example, cautions and warnings are often printed in red or yellow for a higher visual saliency. Another important reason is the impression of color; for example, red gives an excitement impression and pink gives a feminine impression, according to color psychology .
Based on the above observations, one might imagine that the correlation between a word and its color is useful for word embeddings. Fig. 2
demonstrates this concept. Word embedding is the technique to convert a word into a vector that represents the meaning of the word. Word2vec[9, 10] is a standard and popular technique to create word embeddings. In addition, recently, BERT 
and its variants have become popular. Although those word embedding methods have made large progress in natural language processing (NLP), further improvement is still necessary. For example, word2vec often gives similar vectors for antonyms (such as hot and cold) since they are often grammatically interchangeable in sentences. We, therefore, expect that color usage would be helpful to have better vector representations for such cases, as shown in Fig.2.
However, we must note that the correlation between a word and its color is not always strong. For example, the word, “food,” will be printed in various colors depending on the type of food. Moreover, the color of the word, “the,” will not have any specific trend, except for the general trend that words are often printed in achromatic color .
The purpose of this paper is to understand what kinds of words have a positive or negative or no effect in their vector representation by the incorporation of color information. For this purpose, we first determine the color distribution (i.e., color histogram) of each word by using word images collected from book titles of about 210,000 book cover images. Since book cover images (especially titles) are carefully created by professional designers, we can expect that the word images from them are more helpful to find the meaningful correlation between a word and its color than word images with thoughtless color usage. We then analyze the color histograms to find the words whose color histogram is very different from the standard one.
After those observations, we develop a new word-embedding method that can reflect the color usage of each word. On one hand, we expect positive effects using the colors; for example, antonyms with contrasting color usages, like “hot” and “cold,” will have more discriminative semantic vectors (i.e., vectors that represent the meanings of the words) than the standard word embeddings. On the other hand, we expect negative effects, since there are cases with no strong word-color correlation. Accordingly, we observe the improved or degraded cases from the standard word2vec embeddings.
The main contributions of this paper are summarized as follows:
To the authors’ best knowledge, this is the first trial of an objective analysis of the correlation between color and meaning of words, by using a large amount of real word images.
We discover words and word classes whose color usage is very particular.
We proposed a novel word-embedding method that reflects the color usage of the word. If a pair of words have similar color usages, their semantic vectors become more similar through the proposed method.
Since synonyms and antonyms do not always have similar and dissimilar color usages, the effect of word-embedding was not always positive. However, we also revealed that many synonyms and antonyms have positive effects from the proposed word-embedding method.
2 Related Work
. For word2vec, two types of models, Continuous-Bag-of-Words (CBoW) and Skip-gram, have been proposed. In both models, it is assumed that words with similar meanings in sentences are used in similar contexts. In other words, interchangeable words in a sentence can be said to have similar meanings. Word2vec uses this assumption to derive a semantic vector of the word by using a neural network.
Although word2vec gives reasonable semantic vectors for most words, it still have a drawback caused by the above assumption. Considering two sentences, “I drink cold water” and “I drink hot water”, the antonym words, “cold” and “hot” are interchangeable and will have similar semantic vectors in spite of their opposite meanings. Similar situations often happen for (especially, adjective and adverb) antonyms and their semantic difference is underestimated.
To make semantic vectors more discriminative, a lot multimodal word embedding methods have been proposed to capture word semantics from multiple modalities. For example, Multimodal Skip-gram (MMSG) . In normal Skip-gram, the co-occurrence of words in sentences is used to predict the context of a word. However, in MMSG, in addition to the co-occurrence of words, image features related to the word are simultaneously used. This makes it possible to generate word embeddings by taking into account the visual information of the object. Similarly, Sun et al.  propose Visual Character-Enhanced Word Embeddings (VCWE) that extract features related to the composition of Chinese character shapes and applies them to a Skip-gram framework. A Chinese word is composed of a combination of characters containing a wealth of semantic information, and by adding those shape-features to the word embedding, a richer Chinese word expression can be obtained.
In addition to directly using features from modalities outside the text, a method using some “surrogate labels” has also been proposed. Visual word2vec 
uses embedded expressions based on the visual similarity between words. In Visual word2vec, first, cartoon images with captions are clustered based on the accompanying image features as meta information. Next, using the obtained cluster labels as surrogate labels, the word-embeddings are retrained. During this, the label of the cluster to which the image belongs is predicted from the words in the caption of the image using a CBoW framework. As a result, words belonging to the same cluster have their similarity increased. For example, words with high visual relevance, such as “eat” and “stare,” have their embedded expressions changed to more similar values. A similar technique can be used to learn word-embeddings based on any modality. In Sound-word2vec, surrogate labels obtained by clustering the sounds are predicted from word tags, thereby acquiring word embedding that takes into account relevance in sounds.
In addition, there are methods to acquire multimodal word embeddings based on Autoencoders. In Visually Enhanced Word Embeddings (ViEW), an autoencoder is used to encode linguistic representations into an image-like visual representation encoding. Next, a hidden layer vector is used as a multimodal representation of the word.
The purpose of this paper is to investigate the relationship between words observed in real environments and their colors and to analyze how the color information can affect the recognition of word semantics. Although there were several data-driven trials on font color usages [3, 13], there is no trial to correlate the font color usage to the meaning of words, to the authors’ best knowledge. Throughout this trial, we first observe the color usages on words and word categories, then propose a word-embedding method with color information, and finally reveal how the font color usage is useful and not useful for word embedding.
3 Collecting Word Images from Book Covers
The book covers sourced for the experiment are from the Book Cover Dataset111https://github.com/uchidalab/book-dataset . This dataset consists of 207,572 book cover images from Amazon Inc. Fig. 1 shows some example book cover images.
In order to extract the words from the book cover images, the Efficient and Accurate Scene Text detector (EAST)  is used. EAST is a popular text detector that implements a Fully-Convolutional Network (FCN) 
combined with locality-aware Non-Maximum-Suppression (NMS) to create multi-oriented text boxes. After detecting the words using EAST, a Convolutional Recurrent Neural Network (CRNN) is used to recognize the text boxes into text strings. Note that book title information is given as the meta-data of the Book Cover Dataset and thus it is possible to remove the words that are not contained in the book title. Consequently, we collected 358,712 images of words. Their vocabulary size (i.e., the number of different words) was 18,758. Each word has 19 images on average, 4,492 at maximum, and 1 at minimum. Note that stop words, proper nouns, digits, compounded words (e.g., “don’t” and “all-goals”) are removed from the collection222From 579,240 images of words given by EAST, 492,363 images remain after removing non-title words and misrecognized words. Finally, 358,712 images remain after removing stop words, etc.. In addition, we identify word variants (e.g., “meets” = “meet”), by using a lemmatizer333www.nltk.org/_modules/nltk/stem/wordnet.html.
4 Color Usage of a Word as a Histogram
In this paper, we represent the color of a word in the CIELAB color space. Each color is represented by a three-dimensional vector , where is lightness (intensity) and and are color channels. The advantage of using CIELAB is that the Euclidean distance in the CIELAB color space is approximately relative to human’s perceptual difference.
CIELAB color values of each word in the detected bounding box are determined as follows. The first step is the separation of the character pixels from the background pixels. Fortunately, most word images on book covers have a high contrast between the characters and the background (possibly due to better legibility). Therefore, a simple Otsu binarization technique was enough for the separation. As the second step, the CIELAB color vector is determined for a word by taking the average of all character pixels.
In order to analyze the distribution of colors, we implement a color histogram. Specifically, all colors are quantized into one of 13 basic colors (pink, red, orange, brown, yellow, olive, yellow-green, green, blue, purple, white, gray, and black), which are defined by as basic colors in the ISCC-NBS system. The color quantization is performed using the nearest neighbor rule with Euclidean distance in CIELAB space. If a word occurs times, in the dataset, we initially have a color histogram with votes for bins. We then normalize it to be its total votes equal to one, in order to remove the difference of .
In the following analysis, we introduce the following two conditions. First, we do not use the word images whose word color is achromatic (i.e., white or black or gray) and background color is chromatic (i.e., one of the 10 chromatic colors). A survey of color usages of scene texts and their background  proved that if the background color is chromatic and the foreground text is achromatic then the designer typically uses the background color to portray meaning. We, therefore, extend this idea to book cover title colors and remove the word images in this case. Note that if both title and background colors are achromatic, we do not remove the image. For the second condition, we removed the histogram for words which occur less than five times to limit the analysis to reliable color usages.
Consequently, we have color histograms of 6,762 words. Fig. 3 shows several examples, where each bar of the color histogram is painted by the corresponding basic color. The leftmost histogram, “indonesian,” shows a peak on a specific color, “Orange,” whereas the second and the third show rather flat usages. As shown in the rightmost histogram, we sometimes find histograms with frequent use of achromatic colors.
5 Words with Particular Color Usages
5.1 How to Find Words with Particular Color Usage
Our interest is to find the words with particular color usage rather than words with standard color usage because particular color usage might show specific correlations between the meanings and the color of the word. Fig. 4
shows a scatter plot for finding such words. Each dot corresponds to a word and the horizontal axis is the color variance, where is the (normalized) color histogram of the word and . When this variance is large for a word , the histogram is not flat but has some peaks at specific colors; that is, the word has a more particular color usage. The vertical axis is the distance from the average color histogram , i.e., , where , where is the vocabulary size. When this distance is large for word , the histogram is largely deviated from the average color histogram and word has a more particular color usage. Note again, .
By splitting each axis by its average, we have four regions in Fig. 4. Among them, we call the region with red color the first quadrant. Similarly, orange, green, and blue regions as the second, third, and fourth quadrants, respectively. Each word belonging to the first quadrant will have a particular color usage. In contrast, the words in the third quadrant will have a flat and standard color usage. The first, second, third, and fourth quadrants contain 1,973, 1,081, 3,381, and 327 words, respectively. This suggests that words with particular color usages are not dominant but exist to a certain amount. Specifically, although words have an average and flat color usage and their color usage is not reflecting their meanings, the remaining words have a particular color usage which might reflect their meanings. In addition, the first quadrant with words have a very particular color usage.
5.2 Words with Particular Color Usage
Fig. 5 shows a word-cloud of the words in the first quadrant of Fig. 4, colored by each word’s most frequent basic color. Different from typical word-clouds, the size of the word is proportional to the particularity of the word, which is calculated as the product of “the distance from the average color histogram” and “variance of basic color frequency” in Fig. 4. As a general tendency, achromatic colors are the most frequent color for most words and it can be seen that the meaning and color of words are not necessarily related. On the other hand, some relationships between the color and the meaning of words can be found. For example, “premier” is reminiscent of position and victory and is yellow while “amour” means love and frequently appears in pink, etc.
5.3 Word Category with Particular Color Usages
Words can be categorized into hierarchical groups, thus, in this section, we analyze the relationship between word category and color usage. Fig. 6 shows the ratio of the four quadrants for 45 word categories, which are defined as lexnames444wordnet.princeton.edu/documentation/lexnames5wn in WordNet. In this figure, the categories are ranked by the ratio words in the first quadrant. Note that a word can belong to multiple categories by the hierarchical structure of WordNet.
The category “noun.plant” is high-ranked; namely, words relating to plants have particular color usages. This is because words related to plants tend to appear in green or their fruit color. Fig. 7 shows the color histograms of “lemon” and “strawberry” and sample books of the dominant colors. In particular, “lemon” often appears as yellow or olive and “strawberry” frequently is in red. It is noteworthy that the rank of “noun.plant” is much higher than the rank of “noun.food.” This is because of the large variations of foods which result in the large variations of color usages.
The category “verb.emotion” is also high-ranked. As shown in Fig. 7, the color histograms of “anger” shows a high frequency of warm reddish colors. Compared to “anger”, the word “sorrow” is not printed with red but more with pale colors. This may be because the impression of colors affects the word color usages.
It should be emphasized that categories of adjectives such as “adj.all” (all adjective clusters) and “adj.pert” (relational adjectives, i.e., so-called pertainyms) are also high-ranked. These categories contain many antonym adjective pairs, like “hot” and “cold,” as well as synonym adjective pairs. Fig. 7 also shows the color histograms of pairs of antonyms, “intelligent” and “stupid”, and “guilty” and “innocent”. The word, “intelligent,” tends to be printed in paler color than “stupid.” The word, “guilty,” was printed in purple more than “innocent” and did not show up as a warm color.
6 Word Embeddings Grounded in Color
Based on the analysis results of Section 5, we attempt to create a word embedding grounded on word color usage using a neural network. Fig. 8 shows the network structure of the proposed word embedding method. This network is inspired by Sound-Word2vec . Specifically, this network is similar to an autoencoder or the skip-gram of word2vec. Its input is the -dimensional one-hot representation of the input word and its output should be close to the -dimensional vector representing the color histogram of . The value is the number of bins of the color histogram. Accordingly, its output is not a -dimensional one-hot vector but a -dimensional vector representing a color histogram with bins. The -dimensional output vector from the hidden layer is the semantic vector representation of input word .
The neural network is trained to minimize the following loss function:
where the matrices and are the weight matrices to be trained. The function is softmax. The network tries to predict a color histogram through and the softmax from , which is the semantic vector representation of the word . The matrix is initialized as which is a word-embedding matrix trained by a word-embedding method, such as word2vec. The matrix is initialized with random values.
By updating and along with the minimization of , semantic vectors become more different (similar) between words with different (similar) color usages. In other words, the proposed method retrains from for giving more different (similar) semantic vectors for a pair of words, if they have different (similar) color usages. As noted above, typical word embedding methods, such as word2vec, often give similar semantic vectors even for a pair of antonyms. We, therefore, can expect that they have more different vectors if they have different color usages, as shown in Fig. 2. More specifically, for a pair of antonyms, and , we can expect , where
is the cosine similarity of the semantic vectors forand without color usage information, and is with color usage information.
To analyze the positive or negative effect of using the color usage information of words, we focused pairs of synonyms and antonyms and evaluated their similarity before and after integrating the color usage information. For the initial matrix , GoogleNews-vectors (word2vec trained with Google News corpus) were used555The vocabulary assumed in GoogleNews-vectors is slightly different from that of our book title words. Specifically, 198 words in the book titles do not appear in GoogleNews-vectors. We, therefore, used 6,762-198 = 6,514 words in the experiment in this section. The data of GoogleNews-vectors is provided by https://code.google.com/archive/p/word2vec/.. We used synonym pairs and antonym pairs defined in Wordnet.
6.2.1 Effect of Word Color on Similarity in Synonyms and Antonyms
As a quantitative evaluation, we observed the distributions of and for all of the synonym pairs and antonym pairs. Fig. 9 shows the similarity distributions with and without color. The highlight of this result is that the similarity between antonym pairs is successfully decreased when using color. This suggests that antonym pairs tend to have different color usages. However, this decrease in similarity also happens for the synonym pairs. Thus, we cannot assume that synonyms are always printed in similar colors. Fig. 9 also shows that the range of the similarity values for synonym pairs is similar to antonym pairs, even without color usage, that is, even by the original word2vec. (Ideally, synonym pairs should have higher similarities than antonym pairs.) This clearly shows the difficulty of the word embedding task.
A more detailed analysis is shown in Fig. 10. This plot shows for each synonym and antonym pair. Each point corresponds to a word pair. If a point is on the diagonal line, the corresponding word pair is not affected by the use of the color information in their semantic vector representation. The color of each point represents the similarity of the color information (i.e., the similarity of the color histograms of the paired words). It is observed that the proposed neural network works appropriately. In fact, pairs with different color usages (blue dots) are trained to have more different semantic vectors, that is, . In contrast, pairs with similar color usages (red dots) are trained to have more similar vectors, that is,
Another observation of Fig. 10 reveals that there are many blue points for synonym pairs and red points for antonym pairs. Namely, synonym pairs often have different color usages and antonym pairs have similar usages. As we anticipated, the correlations between color usages and meanings for words are not always consistent and not always strongly correlated. This fact coincides with the quantitative evaluation result of Fig. 9.
Examples of synonyms and antonyms with an improved similarity between two words are shown in the Tables 2 and 2. In Table 2, improvement is defined by a case where the similarity of the word embedding is increased for synonyms. Conversely, in Table 2, improvement is defined by a case where the similarity is reduced for antonyms. The lexnames in these tables indicate the category in which the two words are treated as synonyms or antonyms. From these, it can be seen that the word pairs used in verbs occupy the top of synonyms and word pairs of adjectives are common in antonyms. Next, Fig. 11 shows examples of color histograms of synonym and antonym word pairs. For the synonym pairs, the colors tend to be achromatic. This is because words, such as verbs, do not necessarily associate with color impressions and the visual characteristics of verbs tend to achromatic. As a result, it can be seen that the word embedding with achromatic colors tended to be updated as more similar under the proposed method. In contrast, for antonyms, frequent chromatic differences between the word pairs can be confirmed. The correlation between the psychological impression of the color and the meaning of the word is strong. Thus, it can be said that for these words, namely adjectives, it is a good example of compensating for the differences in opposite words.
To the authors’ best knowledge, this paper is the first trial to analyze the relationship between the meaning and color of words, by using a data-driven (i.e., totally non-subjective) approach. For this analysis, 358,712 word images were collected from the titles of 207,572 book cover images. First, we observed the words and the word categories with their color usages. Our analysis revealed the existence of words with particular, i.e. specific, color usage. For example, words relating to plants have more particular color usages than words for foods. On the other hand, we also found there are many words whose color usage is not particular.
We then observed the effect of the color usage information has on the word embedding, i.e. semantic vector representation of words. To do this, we developed a novel word-embedding method, which is inspired by Sound-Word2vec . The method modifies the word-embedding matrix given by word2vec to reflect color usages of individual words. Specifically, the similarity of the words with different color usages become more different and vise versa. We confirmed that the proposed method can provide a reasonable semantic vector representation. By using antonym pairs and synonym pairs, we also confirmed that color information has “positive” or “negative” effects for the semantic vector representation. This is because there are synonym pairs with similar color usages but also synonyms with different color usages. The same situation happens to antonym pairs.
Since this is the first trial on a new research topic, we have many future works. The first and most important work is a more detailed analysis of the “class” of the words that have positive or negative effects from their color usage information. As we found through the analysis in this paper, there are two classes of antonyms; antonyms with different color usages and those with similar color usages. The former class will be improved by color information and the later will have degradation. If we can find trends in these classes and use the proposed word-embedding method only for the former class, we can improve the performance of word embedding by color. This classification can also be done in a more general way, instead of focusing only on antonyms and synonyms. Of course, in addition to book covers, we can also introduce more word-color samples from the internet.
Another future work is to consider a loss function specialized for color histogram in the proposed word-embedding method. Currently, our loss function is the distance between the output and the actual color usage. This means we treat 13 basic colors independently and thus the affinity between the basic colors is not considered. For example, red and orange are treated independently but a word often with red can be printed with orange. Moreover, the current results depend too much on the choice of 13 colors and their color space (e.g. CIELAB). Therefore, it will be meaningful to find a more suitable way for color usage representation.
This work was supported by JSPS KAKENHI Grant Number JP17H06100.
-  (2018) Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §1.
-  (2014) Color psychology: effects of perceiving color on psychological functioning in humans. Annual Review of Psychology 65 (1), pp. 95–120. Cited by: §1.
-  (2015) True color distributions of scene text and background. In International Conference on Document Analysis and Recognition, pp. 506–510. Cited by: §1, §2, §4.
-  (2017) Incorporating visual features into word embeddings: a bimodal autoencoder-based approach. In International Conference on Computational Semantics—Short papers, Cited by: §2.
-  (2017) Judging a book by its cover. arXiv preprint arXiv:1610.09204. Cited by: §3.
-  (2016) Visual word2vec (vis-w2v): learning visually grounded word embeddings using abstract scenes. In , pp. 4985–4994. Cited by: §2.
-  (2015) Combining language and vision with a multimodal skip-gram model. In Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Cited by: §2.
-  (2015) Fully convolutional networks for semantic segmentation. In IEEE Conference on computer vision and Pattern Recognition, pp. 3431–3440. Cited by: §3.
Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §1, §2.
-  (2013) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §1, §2.
-  (1979) A threshold selection method from gray-level histograms. IEEE Trans. Systems, Man, and Cybernetics 9 (1), pp. 62–66. Cited by: §4.
-  (2016) An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (11), pp. 2298–2304. Cited by: §3.
-  (2019) Serif or sans: visual font analytics on book covers and online advertisements. In International Conference on Document Analysis and Recognition, Cited by: §2.
-  (2019) VCWE: visual character-enhanced word embeddings. arXiv preprint arXiv:1902.08795. Cited by: §2.
-  (2017) Sound-word2vec: learning word representations grounded in sounds. In Conference on Empirical Methods in Natural Language Processing, Cited by: §2, §6.1, §7.
-  (2017) EAST: an efficient and accurate scene text detector. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 5551–5560. Cited by: §3.