Emojich – zero-shot emoji generation using Russian language: a technical report

This technical report presents "Emojich", a text-to-image neural network that generates emojis conditioned on captions in Russian. We aim to preserve the generalization ability of the large pretrained model ruDALL-E Malevich (XL), which has 1.3B parameters, during fine-tuning, while giving the generated images a distinctive emoji style. We present the engineering methods, the code implementation, all hyper-parameters needed to reproduce the results, and a Telegram bot where anyone can create their own customized sets of stickers. We also show some newly generated emojis obtained with the Emojich model.




I Introduction

Transformer models, which are normally pretrained on large amounts of data, have proved able to cope with a variety of downstream tasks and to adapt to new domain-specific data; this is achieved through fine-tuning. Examples of this approach are abundant in Natural Language Processing [DBLP:conf/coling/GriesshaberMV20], [DBLP:journals/corr/abs-2106-10199], [DBLP:conf/iclr/MosbachAK21] and Computer Vision [DBLP:journals/corr/abs-2109-13925], [DBLP:journals/corr/abs-2110-05270]. Meanwhile, multimodal models are currently at the cutting edge of ML research, with pretrained text-to-image transformer models [DBLP:conf/icml/RameshPGGVRCS21], [DBLP:journals/corr/abs-2105-13290], [rudalle_github] being among the most prominent examples in the field. There is still considerable room for experiments on improving the fine-tuning of such models.

Preparing and submitting a proposal for a new emoji to the Unicode Emoji Subcommittee is a rather time-consuming and challenging procedure (see https://unicode.org/emoji/proposals.html); most importantly, success is not guaranteed. Emojis, with their expressiveness, universality and ease of use, have become part of our everyday digital communication. It would be a great option to generate unique, customized emojis on the fly, especially since some messengers (such as Telegram) allow users to create their own sets of icons. In this regard, recently introduced text-to-image models, which model the joint distribution over texts and images and generate the latter from the former, seem very appealing.

Similar efforts have already been undertaken in [DBLP:journals/corr/abs-1712-04421]; however, that work has several limitations. First, the proposed approach is based on DC-GANs and word2vec embeddings, whereas the transformer architecture of recent models provides high-quality, context-dependent representations that capture meaning much better, and allows deeper interaction between modalities (for instance, in [DBLP:journals/corr/abs-1712-04421] the samples generated from more than one word embedding or from new, unexpected text prompts were very noisy). Second, the training data is rather poor and restricted to 82 emotive faces. Third, the resulting model cannot synthesize new emojis (the best prospect for it, as seen by the authors, is to generate icons that combine existing ones). In view of the above, we fine-tune the ruDALL-E model to test its ability to generate emojis from textual descriptions, just like an artist doing commissions to delight customers.

II Dataset

The proposed Emoji dataset is collected by web scraping: emoji icons are retrieved together with their text captions in Russian. The emoji pictures are natively in RGBA format, with the last (alpha) channel encoding opacity; they are converted to RGB by assigning the white color code to every pixel whose opacity value is less than 128. Small emoji pictures are upscaled to 256 px using the super-resolution approach Real-ESRGAN [wang2021realesrgan], which has been trained by Sber AI with “x4” scaling [realesrgan_sberai].
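
The RGBA-to-RGB conversion described above can be sketched as follows. This is a minimal illustration on nested lists of pixel tuples, not the authors' preprocessing code; the function name `rgba_to_rgb_on_white` and the list-based image representation are assumptions made for the example.

```python
from typing import List, Tuple

RGBA = Tuple[int, int, int, int]
RGB = Tuple[int, int, int]

WHITE: RGB = (255, 255, 255)


def rgba_to_rgb_on_white(
    pixels: List[List[RGBA]], alpha_threshold: int = 128
) -> List[List[RGB]]:
    """Drop the alpha channel; pixels with alpha below the threshold become white."""
    return [
        [WHITE if a < alpha_threshold else (r, g, b) for (r, g, b, a) in row]
        for row in pixels
    ]


# Tiny 1x3 "image": fully transparent, alpha 127, and alpha 128 red pixels.
image = [[(255, 0, 0, 0), (255, 0, 0, 127), (255, 0, 0, 128)]]
converted = rgba_to_rgb_on_white(image)
```

Only the last pixel keeps its color, since its opacity is not below the threshold of 128 stated in the text.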

The full version of the dataset is available on Kaggle (https://www.kaggle.com/shonenkov/russian-emoji). The dataset contains 2749 images and 1611 unique texts. This mismatch has two causes: some sets contain emojis that differ only in color (for instance, images of hands with different skin tones), and some captions are homonyms in Russian (for example, the descriptions of a medieval castle and of a padlock are identical: “замок”). The length of the text descriptions varies from 1 to 7 words (about 67% of all samples are one- or two-word descriptions).

Figure 2: Distribution of word count values per description in the Emoji dataset

III Fine-Tuning

The main goal of fine-tuning is to keep the “knowledge of the world” of the ruDALL-E Malevich (XL) model while adjusting the style of the generated images to that of emojis. ruDALL-E Malevich is a large multimodal pretrained transformer that learns the correlations between images and texts [rudalle_github]. Freezing the feedforward and self-attention layers of a pretrained transformer has demonstrated high performance in a multimodal setup [lu2021pretrained], [bakshandaeva2021heads]. Thus, freezing these layers in our model lets us avoid catastrophic forgetting, makes the process more efficient and reduces GPU memory usage.
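
A minimal sketch of such freezing is given below. It is not the authors' code: the name patterns `"attention"` and `"mlp"` and the parameter names in the example are hypothetical, and stand-in objects are used so that the snippet runs without a deep learning framework. With a PyTorch model, the iterator returned by `model.named_parameters()` could be passed in instead.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List, Tuple

# Hypothetical substrings that identify self-attention and feed-forward weights.
FROZEN_PATTERNS = ("attention", "mlp")


def freeze_matching(
    named_parameters: Iterable[Tuple[str, Any]],
    patterns: Tuple[str, ...] = FROZEN_PATTERNS,
) -> List[str]:
    """Set requires_grad=False on every parameter whose name matches a pattern."""
    frozen = []
    for name, param in named_parameters:
        if any(pattern in name for pattern in patterns):
            param.requires_grad = False
            frozen.append(name)
    return frozen


# Stand-in parameters with hypothetical names (assumption, not ruDALL-E's names).
params = [
    ("text_embeddings.weight", SimpleNamespace(requires_grad=True)),
    ("layers.0.attention.dense.weight", SimpleNamespace(requires_grad=True)),
    ("layers.0.mlp.dense.weight", SimpleNamespace(requires_grad=True)),
]
frozen_names = freeze_matching(params)
```

Parameters that do not match any pattern (here the embedding) stay trainable, so only a small part of the model receives gradient updates.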

The proposed Emoji dataset has limited variation in its captions (see Section II for more details), so the model is likely to overfit on the text modality and lose its generalization ability. To deal with this issue, the coefficient of the weighted cross-entropy loss is increased for image representations (the codebook vectors). No data augmentations are used.
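
The effect of this coefficient can be illustrated with a small sketch. The normalized weighted sum below is an assumption about the form of the loss (it is common in DALL-E-style implementations), and the weights used in the example are placeholders rather than the value used for Emojich.

```python
def weighted_cross_entropy(
    loss_text: float, loss_image: float, image_weight: float
) -> float:
    """Combine text and image cross-entropy terms as a normalized weighted sum.

    The text term has weight 1 and the image term has weight `image_weight`
    (hypothetical parameter name); a larger value shifts the objective
    towards the image tokens.
    """
    if image_weight <= 0:
        raise ValueError("image_weight must be positive")
    return (loss_text + image_weight * loss_image) / (1.0 + image_weight)


# Placeholder values: with equal weights both terms contribute equally;
# with a larger image weight the total approaches the image loss.
balanced = weighted_cross_entropy(2.0, 4.0, image_weight=1.0)
image_heavy = weighted_cross_entropy(2.0, 4.0, image_weight=9.0)
```

As the image weight grows, the text term contributes less to the gradient, which is one way to limit overfitting on a small set of captions.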

8-bit Adam [dettmers20218bit] is used for optimization. This implementation reduces the amount of GPU memory needed for gradient statistics and provides more stable training with fp16 precision. A one-cycle learning rate schedule is used. The training parameters are: start lr 4e-7, max lr 1e-5, final lr 2e-8, warmup 0.1, epochs 40, batch size 2, gradient clipping 1.0. Training takes 5h 30m on a single A100 GPU.
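
The schedule can be sketched in plain Python as follows. The report gives only the start, maximum and final learning rates and the warmup value; the cosine shape of both phases and the reading of warmup as a fraction of the total number of steps are assumptions, and `one_cycle_lr` is a hypothetical helper name.

```python
import math


def _cosine_interp(start: float, end: float, pct: float) -> float:
    """Cosine interpolation from `start` (pct=0) to `end` (pct=1)."""
    return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)


def one_cycle_lr(
    step: int,
    total_steps: int,
    start_lr: float = 4e-7,
    max_lr: float = 1e-5,
    final_lr: float = 2e-8,
    warmup: float = 0.1,
) -> float:
    """Learning rate at `step` (0-based) of a one-cycle schedule."""
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    warmup_steps = max(1, int(round(warmup * total_steps)))
    if step <= warmup_steps:
        # Phase 1: rise from start_lr to max_lr.
        return _cosine_interp(start_lr, max_lr, step / warmup_steps)
    # Phase 2: decay from max_lr to final_lr, reached at the last step.
    decay_steps = total_steps - 1 - warmup_steps
    return _cosine_interp(max_lr, final_lr, (step - warmup_steps) / decay_steps)


# Illustrative run length (assumption): 1000 optimizer steps in total.
schedule = [one_cycle_lr(s, 1000) for s in range(1000)]
```

With these defaults the rate peaks at 1e-5 after the first 10% of the steps and then decays to 2e-8 by the last step.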

The source code used for fine-tuning is available on Kaggle (https://www.kaggle.com/shonenkov/emojich-rudall-e).

Figure 3: Some measurements during fine-tuning: (a) training loss; (b) learning rate.

IV Evaluation

Regular evaluation is an essential part of fine-tuning (and, without a doubt, of pretraining as well), as it indicates when fine-tuning should be stopped. For image generation, the most popular metric is FID [DBLP:conf/nips/HeuselRUNH17], which is used to evaluate the quality of images produced by Generative Adversarial Networks (GANs). Representations of the generated and the real images are obtained with the Inception network, a multivariate Gaussian is fitted to each set of feature embeddings, and the two Gaussians are compared by computing the Fréchet distance between them. As rightly noted in [DBLP:journals/corr/abs-2105-13290], this metric is suited to evaluating general-domain unconditional generation from simple distributions, so it is not the best option in our case.
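
For reference, the quantity reported as FID for two Gaussians with means μ1, μ2 and covariances Σ1, Σ2 is ||μ1 − μ2||² + Tr(Σ1 + Σ2 − 2(Σ1Σ2)^(1/2)). The sketch below evaluates it under a simplifying assumption that is not part of FID itself: both covariance matrices are taken to be diagonal, so the matrix square root reduces to an element-wise one. The function name is hypothetical, and the full metric additionally requires Inception features and full covariances.

```python
import math
from typing import Sequence


def frechet_distance_diagonal(
    mu1: Sequence[float],
    var1: Sequence[float],
    mu2: Sequence[float],
    var2: Sequence[float],
) -> float:
    """Squared Fréchet distance between Gaussians with diagonal covariances."""
    if not (len(mu1) == len(var1) == len(mu2) == len(var2)):
        raise ValueError("all inputs must have the same length")
    # Squared Euclidean distance between the means.
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    # Tr(S1 + S2 - 2 (S1 S2)^(1/2)) for diagonal S1 and S2.
    cov_term = sum(v1 + v2 - 2.0 * math.sqrt(v1 * v2) for v1, v2 in zip(var1, var2))
    return mean_term + cov_term


# Toy feature statistics (made-up numbers, for illustration only).
same = frechet_distance_diagonal([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0])
shifted = frechet_distance_diagonal([0.0, 0.0], [1.0, 1.0], [3.0, 4.0], [1.0, 1.0])
```

Identical distributions give a distance of zero, and shifting the mean by (3, 4) with unchanged variances adds the squared shift length.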

Another metric, Caption Loss, is proposed in [DBLP:journals/corr/abs-2105-13290]. It involves fine-tuning the model for the inverse task (image captioning) and then self-reranking using cross-entropy values for the text tokens. This metric is worth considering and will be used in further experiments.

In the current experiments, human evaluation is used: every 6 thousand iterations of fine-tuning, 64 images generated by the model are assessed on criteria such as the degree of generality and abstraction, conformity to the emoji aesthetics, and overall image quality (for instance, smoothness and absence of glitches or artefacts). Figure 4 shows image samples generated by the model after every 2 thousand iterations of fine-tuning, starting with the original ruDALL-E Malevich (XL) (upper left corner) and finishing with the model fine-tuned for 68 thousand iterations in total (lower right corner). Initially the images are realistic photos that do not correspond to the emoji style; later they change and come to resemble drawings. After about 12 thousand iterations the images look more and more like emojis (although some features or artefacts, such as the pose, may still be out of style) while keeping the recognizable traits of Einstein. After about 56 thousand iterations, however, the model is consistently losing its “knowledge of the world”: the images become uniform and impersonal, Einstein turns into an unknown man with a moustache, and standard emojis start to prevail in the generation results.

Figure 4: From the upper left corner to the lower right corner: images generated by the model after every 2 thousand iterations of fine-tuning using the text prompt “Эйнштейн улыбается” (“Einstein is smiling”).

V Emoji generation

Emoji generation starts with a text prompt that describes the desired emoji content. Once the tokenized text is fed to Emojich, the model generates the remaining codebook vectors auto-regressively. Each codebook vector is selected one by one from the predicted multinomial probability distribution over the image latents, using nucleus (top-p) and top-k sampling with a temperature [DBLP:journals/corr/abs-1904-09751] as the decoding strategy. The image is rendered from the generated sequence of latent vectors by the decoder of the dVAE, which is a pretrained VQ-GAN [esser2020taming] with Gumbel-Softmax relaxation [kusner2016gans]. In addition, the decoder is modified with the inverse discrete wavelet transform (IDWT) used in the MobileStyleGAN neural network [DBLP:journals/corr/abs-2104-04767]; this allows restoring 512×512 (instead of 256×256) images almost without loss of quality. The authors would like to thank @bes-dev for his help with the implementation of this modification (https://github.com/bes-dev/vqvae_dwt_distiller.pytorch).
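
The decoding step for a single image token can be sketched as follows. This is an illustration in plain Python rather than the ruDALL-E implementation; the order of operations (temperature, then top-k, then top-p on the renormalized distribution) and the function names are assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple


def filter_top_k_top_p(
    logits: Sequence[float], top_k: int, top_p: float, temperature: float = 1.0
) -> Tuple[List[int], List[float]]:
    """Return the token indices kept by top-k and top-p, with renormalized probabilities."""
    if top_k < 1 or not 0.0 < top_p <= 1.0 or temperature <= 0.0:
        raise ValueError("invalid sampling parameters")
    scaled = [x / temperature for x in logits]
    # Top-k: keep the k highest-scoring tokens, best first.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the kept tokens (shifted by the maximum for stability).
    weights = [math.exp(scaled[i] - scaled[order[0]]) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    kept, cumulative = 0, 0.0
    while kept < len(probs) and cumulative < top_p:
        cumulative += probs[kept]
        kept += 1
    return order[:kept], [p / cumulative for p in probs[:kept]]


def sample_token(logits, top_k, top_p, temperature=1.0, rng=random):
    """Draw one token index from the filtered distribution."""
    indices, probs = filter_top_k_top_p(logits, top_k, top_p, temperature)
    return rng.choices(indices, weights=probs, k=1)[0]


# Toy logits over a 4-entry "codebook" (made-up numbers).
toy_logits = [2.0, 1.0, 0.0, -1.0]
kept_indices, kept_probs = filter_top_k_top_p(toy_logits, top_k=3, top_p=0.8)
token = sample_token(toy_logits, top_k=3, top_p=0.8, rng=random.Random(42))
```

In actual generation this step would be repeated for every position of the image token sequence, each time with logits conditioned on the text and on the previously sampled tokens.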

VI Results

The appendix shows some emoji pictures created with the proposed methods. All examples are generated automatically (without manual cherry-picking) with the following hyper-parameters: seed 42, batch size 16, top-k 2048, top-p 0.995, temperature 1.0, GPU A100. To obtain emojis of better quality, we advise generating more candidates and selecting the best one manually. Remember, for the great art makers creating just one masterpiece is enough to become “great”.

Moreover, we propose a segmentation procedure based on U-Net to remove the background of the generated emoji images, which is necessary to create good-looking stickers. Thus, everyone can generate their own customized sets of stickers using this Telegram bot: https://t.me/rudalle_emojich_bot.
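
How a segmentation result could be turned into a sticker with a transparent background is sketched below. The report does not describe this step in detail, so the snippet rests on assumptions: the segmentation network is taken to output a per-pixel foreground probability, the threshold of 0.5 is a placeholder, and `apply_foreground_mask` is a hypothetical helper name.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]
RGBA = Tuple[int, int, int, int]


def apply_foreground_mask(
    rgb_pixels: List[List[RGB]], mask: List[List[float]], threshold: float = 0.5
) -> List[List[RGBA]]:
    """Add an alpha channel: opaque where the mask marks foreground, transparent elsewhere."""
    return [
        [
            (r, g, b, 255 if prob >= threshold else 0)
            for (r, g, b), prob in zip(row, mask_row)
        ]
        for row, mask_row in zip(rgb_pixels, mask)
    ]


# 1x2 toy image: a white background pixel and a yellow emoji pixel (made-up values).
generated = [[(255, 255, 255), (255, 204, 0)]]
foreground_prob = [[0.1, 0.9]]
sticker = apply_foreground_mask(generated, foreground_prob)
```

The result is an RGBA image in which only the pixels classified as foreground remain visible.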


The authors would like to thank Sber AI, SberDevices and SberCloud for granting the GPU resources for the experiments.