A significant part of the information perceived by a person and required for making even the simplest everyday decisions is presented in multiple modalities, that is, with the help of different types of “input information”, requiring the use of various senses and types of knowledge. Visual information requires visual perception, processing natural language texts presupposes the knowledge of the language, auditory information implies the perception and analysis of sound, and so on. Each of these modalities is handled by separate, sometimes overlapping areas of machine learning and artificial intelligence: computer vision, natural language processing, speech processing, video processing, and so on.
*Both authors contributed equally to this research.
However, a successful solution to emerging problems often cannot be obtained by analyzing data coming from only one modality, just as it is not always sufficient for a human being to use only sight or only hearing to make a rational decision. In such cases, information required to solve such problems can be divided into several “input types”, called data modalities, all of which should be taken into consideration to make successful decisions.
Multi-task learning has a long history, mostly in the natural language processing domain. One of the possible reasons is that, having the correct representation and thus “understanding” of a text passage, one can solve many downstream tasks: sentiment analysis, question answering, language translation, etc. One of the most widely used approaches here is to share the lower (encoding) layers across all tasks, while keeping the upper layers (also called “heads”) task-specific and learned separately [liu2019multi].
It is only recently that scientists have proposed to combine multi-modal and multi-task learning in one model, taking the joint approach: using different encoders for different modalities, then combining different types of information during middle processing, and completing the process with task-specific heads. An example is the UniT [hu2021unit] approach, where visual and textual modalities are used and 7 tasks from the fields of computer vision (e.g. object detection), text processing (e.g. sentiment analysis) and vision-and-language (e.g. visual question answering) are solved.
The problem of training large pretrained multi-modal and multi-task models can be separated into 2 subtasks: 1) How to combine modalities, and 2) How to combine tasks.
As for the first question, current state-of-the-art research in multi-modal processing mostly focuses on the stage at which modalities should be fused (“early”, “middle” or “late” fusion) and on the ways to implement this fusion (through iterative processing or by a modality bottleneck) [liang2018multimodal, li2019visualbert, das2020detecting, savchenko2020ad]. Important approaches to modality fusion are Perceiver [jaegle2021perceiver] and Perceiver IO [jaegle2021perceiverio], where the modality-specific information serves as the key-value input for iterative cross-attention and is later processed by a GPT-2-like [radford2019language] transformer. Another interesting and promising example of sharing the modality information is the so-called multimodal bottleneck transformer (MBT) [nagrani2021attention], where the fusion of the modalities is done: a) close to the top of the transformer layers; b) only through a very small number of multimodal neurons (in the work … is used), so that cross-modality sharing passes only through a small bottleneck, which proves to be very efficient. Finally, different modalities (like RGB and OpticalFlow) can be incorporated inside a single model via mutual modality learning [komkov2020mutual].
The combination of tasks can also be implemented in different ways. An approach similar to the above-mentioned UniT is the so-called Frozen Pretrained Transformer (FPT) technique [lu2021pretrained], which is a source of inspiration for our proposed baseline. However, such a multi-task pipeline, in which different tasks/modalities are processed through separate heads, is not the only option. More interesting approaches use more sophisticated ways of dealing with multiple tasks: for instance, they incorporate either task-specific adapters [houlsby2019parameter, pfeiffer2020adapterfusion] between the frozen layers or a fully learnable (trainable) task representation (embedding) that can later be propagated in a non-trivial way through the major part of the model (see Perceiver IO, HyperGrid [tay2020hypergrid] or the conditionally adapted approach [pilault2020conditionally]).
The corresponding research in the field of information retrieval (IR) is also worth mentioning. For now, however, it seems that quite straightforward solutions are used for IR, e.g. combining all task-specific datasets to train an NLP model for multiple tasks [maillard2021multi], or processing multi-modal data with a single transformer that takes the representations obtained by modality-specific encoders as the inputs for multi-modal retrieval [gabeur2020multi, dzabraev2021mdmmt].
We aim to promote the development of such a promising and challenging field as multi-modal and multi-task research. Our main contributions are the following:
we prepared the data, task statement and leaderboard for the Fusion Brain Challenge;
we proposed task-specific metrics as well as an overall metric to evaluate the models;
we created a simple yet efficient baseline that combines the multi-modal and multi-task approaches.
Within the competition we proposed to solve 4 subtasks:
Code2code translation (C2C),
Handwritten text recognition (HTR),
Zero-shot object detection (ZsOD),
Visual question answering (VQA).
In the following subsections we discuss each of these subtasks in more detail.
II-A Subtask 1 - Code2code Translation
Among the various problems within the ML4Code field, the task of translating code snippets from one programming language (PL) to another was chosen. Even though source code can be attributed to the text modality, it is definitely more structured than natural language, so we would like to distinguish between them. The proposed task not only adds a “code modality” to the challenge but also requires the model to be multilingual, since it has to understand and generate code in two PLs.
Our C2C task requires a model to translate code snippets from Java to Python. The choice of such a pair of PLs adds extra complexity to the problem, since translation between statically and dynamically typed languages is more intricate than translation between PLs with the same type checking.
For training we proposed to use the dataset presented in [avatar]. AVATAR is a parallel corpus that consists of solutions written in Java and Python for 8,506 programming problems collected from competitive programming sites, online platforms, and open source repositories. We used solutions of 6,807 tasks from AVATAR for training, leaving 1,699 examples for the public part of the test set. The private test dataset was designed as follows: first, Python snippets with a length corresponding to that of the 90th percentile of the Python part of the AVATAR test set (up to 282 tokens obtained after tokenization [pytok]) were retrieved from the CodeNet [codenet] dataset; these code snippets were translated to Java by three annotators and then cross-checked; at the final stage, Java functions (not longer than 356 tokens, which matches the 90th percentile of the public test requests’ lengths) were back-translated to Python and cross-checked as well to ensure that the Python snippets generate the same outputs as the source functions when given the same inputs. The resulting number of Java-Python pairs is 322.
CodeBLEU is selected as the evaluation metric for this task.
II-B Subtask 2 - Handwritten Text Recognition
Handwritten Text Recognition is a task that naturally combines image and text modalities: the model is given an image with a handwritten piece of text in Russian or English and is required to transcribe it into digital text as an output. The dataset for this task was manually collected and annotated; it is composed of examples from school notebooks. The training data consist of 66,599 images of words written in Russian (participants of the Challenge may use open datasets with forms of handwritten English text, e.g., the IAM Handwriting Database [htrdb]). The public test set includes 14,973 images: 5,973 in English and 9,000 in Russian. The private test part consists of 12,556 images, 5,494 of which are in English and 7,062 in Russian. In total, our new handwritten dataset contains 82,661 images of Russian words, which makes it the largest Russian handwritten dataset in the world so far. We have also released this dataset [htrdatasets] for the benefit of the research community.
The evaluation metric for this task is string accuracy - the proportion of cases in which the predicted text (string) coincides with the ground truth transcription.
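The definition above can be sketched in a few lines of Python. This is a minimal illustration, not the organizers' scoring script: the function name `string_accuracy` is hypothetical, and it assumes a strict character-for-character comparison with no case or whitespace normalization, which the text does not specify.

```python
def string_accuracy(predictions, references):
    """Share of predictions that exactly match the reference transcription.

    Assumption: strict equality, no normalization of case or whitespace.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    matches = sum(pred == ref for pred, ref in zip(predictions, references))
    return matches / len(references)


# Toy example with made-up transcriptions: two of three strings match.
preds = ["привет", "hello", "мир"]
refs = ["привет", "hallo", "мир"]
print(string_accuracy(preds, refs))
```

Because a single wrong character makes the whole string count as an error, this metric is stricter than character-level or word-level error rates.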
II-C Subtask 3 - Zero-shot Object Detection
The ZsOD task poses the following problem to the model: it should accurately predict bounding boxes for various objects depicted in the images, given the descriptions of these objects in natural language [xiuye2021ZsOD]. In our case, such a common computer vision task as object detection is complicated by the fact that there is no set of predefined classes to choose from: a model is expected to detect classes not present in the training set (i.e. in a zero-shot regime). During inference, a model receives image-query pairs; a query is formatted as a list of textual descriptions (in Russian or English) of objects to detect. The query may contain entities that are absent in the image; a model should predict an empty list of bounding boxes for such objects.
The public test dataset is formed from a part of the VisualGenome [visualgenome] dataset (1,000 examples); the set of classes in it was hidden from the participants. Region descriptions from VisualGenome are used as positive classes (descriptions are normalized: reduced to lowercase, non-printable characters are removed, etc.; boxes related to the same entity are combined under a single description); negative classes are formed by replacing some objects/attributes in the description with those that are missing in the photo. For example, “a grey chair” is replaced by “a pink chair”. Also, descriptions of objects belonging to the same domain as the correct classes are used as negative examples: if the photo shows a street, then the negative examples may include, for instance, descriptions such as “tall green bricks wall”, “shingled home in distance”, “food stand in the street” (provided, of course, that the described objects are not in the photo). The images for the private test set were either extracted from the YFCC100M dataset [yfcc] or crawled from the Internet. In total, 827 images were annotated with positive (descriptions of objects that are present in the photo) and negative (descriptions of missing objects) labels by 10 annotators. The number of positive classes varies from 7 to 10, and the same holds for the negative ones. For a specific image, descriptions can be either in English or in Russian. There can be more than one bounding box for a particular description in the queries; a perfect model should predict all of them.
The F1-score metric is used for evaluation. Refer to the section VII-A for more details.
II-D Subtask 4 - Visual Question Answering
VQA is a classical multi-modal task that requires a model to understand a textual question and generate an answer to it based on the corresponding image. The peculiarity of the problem is that the questions are not homogeneous: a correct answer can consist of several words, be monosyllabic (a “yes/no” answer) or be a number. It is assumed that only one answer per question is required. As with other tasks, the model should be bilingual in order to perform well, since questions can be expressed in both English and Russian and the answer is expected to be in the same language, except when the question concerns the text in the image. For example, when the question is “What is written on the T-shirt?” the answer should be in the same language in which the text is written.
The public test dataset consists of questions in both Russian and English: the Russian-language part consists of translated examples from the first 10 thousand samples of the validation part of the VQA v2 dataset, and the English part of original examples from the next 10 thousand samples of the same dataset. The public test set size is 5,446 examples. The private test set was compiled similarly to the one for the ZsOD task, except for the nature of annotation: for each image (in total, 1,000 images), 6 questions in Russian or English and corresponding answers were formulated, resulting in 6,000 samples. The intersection with the private test set for the ZsOD task is 724 images.
The evaluation metric for this task is accuracy. Each question has a list of possible correct answers; if the prediction matches at least one of the ground truth answers, it is considered a true positive.
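A minimal sketch of this accuracy with several accepted answers per question follows. The name `vqa_accuracy` is hypothetical, and the comparison is assumed to be an exact string match; any normalization applied by the actual evaluation is not described in the text.

```python
def vqa_accuracy(predictions, accepted_answers):
    """Share of questions whose prediction matches any accepted answer.

    predictions: list of strings, one per question.
    accepted_answers: list of lists of strings, aligned with predictions.
    Assumption: exact string match, no normalization.
    """
    if len(predictions) != len(accepted_answers):
        raise ValueError("inputs must have the same length")
    if not predictions:
        return 0.0
    hits = sum(pred in answers
               for pred, answers in zip(predictions, accepted_answers))
    return hits / len(predictions)


# Toy example with made-up answers: the first and third predictions are accepted.
preds = ["yes", "3", "red"]
truth = [["yes"], ["2", "two"], ["red", "dark red"]]
print(vqa_accuracy(preds, truth))
```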
We provide a concept [fbconcept] of a single model that is trained on several tasks related to different modalities (visual, audio and text). The concept is inspired by a work [lu2021pretrained] that examines the ability of pretrained language models based on the Transformer architecture to form qualitative representations of arbitrary data sequences – thus, generalizing to other modalities with minimal finetuning. The basis of the architecture proposed in the concept is the pretrained GPT-2 [radford2019language] language model; experiments are carried out both with a “frozen” model (with only output layer being finetuned), and with a model in which all layers are trained on three modalities simultaneously.
We also build our baseline solution on top of the Frozen Pretrained Transformer. The overall architecture can be seen in Figure 2. The core, the “shared brain” of the whole pipeline, is GPT-2 Large, pretrained on natural language; each type of data for a particular task undergoes its specific transformations in order to match GPT-2’s input format, and also has its specific head to generate predictions in accordance with the task. The input and output layers for each of the subtasks are described below.
It is worth mentioning that one can use any of the so-called foundation models (see, e.g., the in-depth report [foundationmodels]) instead of GPT-2 as the Fusion Brain Core (see Figure 1). Following the researchers from Stanford University CRFM, we define foundation models as models trained on broad data at scale such that they can be adapted to a wide range of downstream tasks. Notable examples of such models are BERT [devlin2019bert], BART [lewis2019bart], T5 [raffel2020exploring], GPT-3 [brown2020language], CLIP [clip_open_ai], DALL-E [ramesh2021zeroshot].
III-A C2C (code)
As code is similar to natural language (although it is certainly more structured; the problem of choosing the best representation of source code goes beyond the scope of this work), no major transformations are needed in order to prepare the data for processing with GPT-2. The task is solved in a decoder-only machine translation manner: during training, the source sequence (a code snippet in Java) is concatenated with the target one (in Python) through the SEP token; the resulting sequence is fed into GPT-2 with an LM head on top in order to minimize the Categorical Cross-Entropy (CCE) loss [Rubinstein99thecross-entropy]. Once trained, the model auto-regressively generates Python code given a Java function.
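The data layout described above can be sketched in plain Python on toy token ids. Everything here is illustrative: the ids, the value of `SEP_ID` and the helper names are hypothetical, and the sketch does not say whether the baseline computes the loss over the whole sequence or only over the Python part, since the text does not specify it.

```python
import math

SEP_ID = 0  # hypothetical id of the SEP token


def build_training_sequence(java_ids, python_ids, sep_id=SEP_ID):
    """Concatenate source (Java) and target (Python) token ids through SEP."""
    return list(java_ids) + [sep_id] + list(python_ids)


def next_token_pairs(sequence):
    """Language-model inputs and targets: targets are inputs shifted by one."""
    return sequence[:-1], sequence[1:]


def categorical_cross_entropy(probabilities, targets):
    """Mean negative log-probability assigned to the target tokens.

    probabilities[i] is a list of vocabulary probabilities at position i.
    """
    total = sum(math.log(dist[target])
                for dist, target in zip(probabilities, targets))
    return -total / len(targets)


java_ids = [11, 12, 13]    # toy ids standing in for a tokenized Java snippet
python_ids = [21, 22]      # toy ids standing in for its Python translation
sequence = build_training_sequence(java_ids, python_ids)
inputs, targets = next_token_pairs(sequence)
print(sequence)  # [11, 12, 13, 0, 21, 22]
print(inputs)    # [11, 12, 13, 0, 21]
print(targets)   # [12, 13, 0, 21, 22]
```

At inference time the same layout is used without the target part: the model receives the Java ids followed by SEP and continues the sequence token by token.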
III-B HTR (image)
It is somewhat remarkable that images can also be processed using a language model and the proposed method. First, raw images are subjected to smart resizing, with proportions preserved and empty space padded; the resized images are then converted into vertical patches with full height and a width of 8 pixels. Image patch features are extracted with a linear projection layer in order to match the size of the GPT-2 embedding space (1280) before being processed with GPT-2. The transformer outputs are then passed through LSTM and linear layers. The training process is based on the Connectionist Temporal Classification (CTC) loss [ctc], which shows high performance in the handwritten text recognition task [shonenkov2021stackmix, de2019no, michael2019evaluating, DBLP:journals/corr/abs-2103-09354].
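The patching step can be illustrated with a small pure-Python sketch on a grayscale image stored as a list of rows. The function name, the zero padding of the last patch and the row-major flattening order are assumptions made for illustration; only the patch width of 8 pixels and the full-height patches come from the text.

```python
PATCH_WIDTH = 8  # patch width in pixels, as stated in the text


def to_vertical_patches(image, patch_width=PATCH_WIDTH, pad_value=0):
    """Split an image (list of equal-length rows) into full-height patches.

    Each patch is flattened row by row into one vector, so a patch of an
    image with height H has H * patch_width values.
    Assumption: the width is right-padded with pad_value to a multiple of
    patch_width.
    """
    width = len(image[0])
    padded_width = -(-width // patch_width) * patch_width  # ceiling division
    rows = [list(row) + [pad_value] * (padded_width - width) for row in image]
    patches = []
    for start in range(0, padded_width, patch_width):
        patch = [value for row in rows
                 for value in row[start:start + patch_width]]
        patches.append(patch)
    return patches


# Toy 4 x 20 image: the width is padded to 24, giving 3 patches of 4 * 8 values.
image = [[r * 20 + c for c in range(20)] for r in range(4)]
patches = to_vertical_patches(image)
print(len(patches), len(patches[0]))  # 3 32
```

Each flattened patch would then be mapped by the linear projection layer to a 1280-dimensional vector, so that the sequence of patches plays the role of a sequence of token embeddings for GPT-2.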
III-C VQA and ZsOD (image + text)
The proposed pipelines for solving the VQA and ZsOD tasks are similar. Raw images are resized and processed with a small convolutional backbone (ResNet-18) [he2015deep], a Conv2D layer with a kernel size equal to … and a Flatten layer in order to match the size of the embedding space before processing with GPT-2. Texts are converted to tokens with the pretrained GPT-2 tokenizer and processed with token and position embeddings. The transformer outputs (both for text tokens and for the image feature map) are then projected with a linear layer to a shared semantic space using the InfoNCE loss [oord2019representation], as in CLIP [clip_open_ai]. The interaction of projected multimodal features takes place in the Multi-Modality Cross-Attention (MMCA) mechanism [Wei_2020_CVPR]. The processing described above, as well as the weights, is common to both tasks, but the InfoNCE loss is used only for text-image pairs from the ZsOD input.
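To make the role of the InfoNCE loss concrete, here is a minimal pure-Python sketch of a symmetric, CLIP-style contrastive loss over a batch of matched image and text embeddings. It is an illustration under assumptions, not the baseline's code: the function names are hypothetical, the temperature value is a placeholder, and embeddings are assumed to be non-zero vectors where the i-th image matches the i-th text.

```python
import math


def l2_normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def log_softmax_at(logits, index):
    """Log-probability of `index` under a softmax over `logits`."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return logits[index] - log_total


def info_nce(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i of images and texts is the positive.

    Assumption: temperature is a fixed placeholder value.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # Cosine similarities scaled by the temperature.
    sims = [[sum(a * b for a, b in zip(img, txt)) / temperature
             for txt in texts] for img in images]
    image_to_text = -sum(log_softmax_at(sims[i], i) for i in range(n)) / n
    text_to_image = -sum(
        log_softmax_at([sims[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return (image_to_text + text_to_image) / 2


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

The loss is small when each image embedding is closest to its own text embedding and large when the pairs are mismatched, which is what pulls the two modalities into a shared space.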
In the case of VQA, text projections are used as queries (Q), and image feature map projections are used as keys (K) and values (V). The output of the MMCA blocks is passed through a linear layer in order to get a projection corresponding to the dimension of the vocabulary. The CCE loss is used when adjusting model weights during training. The answer is generated auto-regressively.
In the case of ZsOD it is vice versa: image feature map projections are used as Q, and text projections are used as K and V. The output of the MMCA blocks is passed through an adaptive max pool layer to reduce the number of resulting bounding boxes per text query to 8 items. The bounding box predictions in the format (x, y, w, h, probability score) are generated using MLP layers with the Binary Cross-Entropy (BCE) loss [Rubinstein99thecross-entropy], the Generalized Intersection over Union loss [Rezatofighi_2018_CVPR] and the L1 loss [mae_article].
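The swap of roles between the two tasks can be illustrated with a single-head scaled dot-product cross-attention in pure Python. This is a deliberate simplification: the MMCA mechanism of [Wei_2020_CVPR] is more elaborate, there are no learned projections here, and the toy vectors and function names are hypothetical.

```python
import math


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention; one output row per query."""
    dim = len(keys[0])
    outputs = []
    for query in queries:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return outputs


text_proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 toy text tokens
image_proj = [[0.5, 0.5], [1.0, -1.0]]            # 2 toy feature-map positions

# VQA: text attends to the image, so there is one output per text token.
vqa_out = cross_attention(text_proj, image_proj, image_proj)
# ZsOD: the image attends to the text, so there is one output per position.
zsod_out = cross_attention(image_proj, text_proj, text_proj)
print(len(vqa_out), len(zsod_out))  # 3 2
```

The output length follows the queries, which fits the two heads: VQA needs a representation per text position to generate answer tokens, while ZsOD needs a representation per image position to regress boxes.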
The main goal of our experiments is to compare the metrics of models trained separately for each task with those of a model trained on all tasks at once (Fusion). In the Fusion experiments, a task-type balanced sampler is used to avoid unbalanced learning; samples from different tasks can appear in the same mini-batch before the backpropagation step is performed. The AdamW optimizer and the OneCycleLR scheduler are used for optimization. All parameters for all experiments (single and fusion tasks) are equal: warmup 0.1, initial lr 4e-6, max lr 4e-5, final lr 2e-7, weight decay 1e-2, beta coefficients (0.9, 0.999), 10 million samples, batch size 4, 16×V100 32GB GPUs.
The results of our experiments are presented in Table I. The total score is the sum of the scores for the four subtasks, except that the CodeBLEU metric is multiplied by 0.01 (refer to section VII-B for more details). An interesting observation is that the Fusion experiment exhibited fewer over-fitting problems.
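The learning-rate values listed above can be turned into a schedule as in the following pure-Python sketch. It is an assumption-laden illustration: "warmup 0.1" is read as the fraction of steps spent increasing the rate, both phases are assumed to use cosine annealing, and `total_steps` is a hypothetical value, since the text does not state the number of optimizer steps.

```python
import math


def cosine_interpolate(start, end, fraction):
    """Move from start (fraction 0) to end (fraction 1) along a half cosine."""
    return end + (start - end) * (1 + math.cos(math.pi * fraction)) / 2


def one_cycle_lr(step, total_steps, initial_lr=4e-6, max_lr=4e-5,
                 final_lr=2e-7, warmup=0.1):
    """Learning rate at a given step of a two-phase one-cycle schedule.

    Assumptions: warmup is a fraction of total_steps; cosine annealing is
    used in both the warmup and the decay phase.
    """
    warmup_steps = max(1, int(warmup * total_steps))
    if step < warmup_steps:
        return cosine_interpolate(initial_lr, max_lr, step / warmup_steps)
    decay_steps = max(1, total_steps - warmup_steps)
    fraction = (step - warmup_steps) / decay_steps
    return cosine_interpolate(max_lr, final_lr, fraction)


total_steps = 1000  # hypothetical number of optimizer steps
for step in (0, 100, 1000):
    print(step, one_cycle_lr(step, total_steps))
```

With these values the rate rises tenfold during the first tenth of training and then decays to a value twenty times below the initial one.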
|training setup|C2C CodeBLEU|HTR Acc|ZsOD F1|VQA Acc|Overall|
IV-A Emissions reduction
Recently, reporting energy and carbon metrics of training deep learning models has become common practice to promote energy-efficient research [reportco2henderson, carbonpatterson]. In [mlco2] the Machine Learning Emissions Calculator (ML CO2) is proposed, which estimates carbon emissions based on GPU type, hours spent on training, cloud provider, and region. This approach is very useful as it does not require reproducing the training process [aigambit]. According to ML CO2, we estimate (see Table II) that training the model in the fusion setup generates almost one third less CO2eq (carbon-dioxide equivalent) than training in a single-task regime, thus proving multi-task learning to be more energy-efficient and climate-friendly.
|training setup|Training time (hours)|Training params|CO2 (kg)|
In this paper we have presented the AI Journey 2021 Challenge called Fusion Brain [fbchallenge], a competition dedicated to the creation of a unified architecture that can deal with different modalities and solve 4 tasks for vision, language and programming code: Code2code Translation, Handwritten Text Recognition, Zero-shot Object Detection, and Visual Question Answering. To test the participants’ submissions, datasets for each task were created; we have also described how the data were prepared. To date, the Russian part of the proposed dataset for the HTR task is the largest Russian handwritten dataset in the world. We also came up with a task statement and created a leaderboard for the Fusion Brain Challenge. In total, 41 teams took part in the competition and made at least one submission, with 513 submissions overall (refer to [dsworks] and section VII-C). Moreover, according to our estimations, the proposed multi-task fusion approach proves to be more energy-efficient and therefore provides a CO2 emissions reduction.
We would like to thank Sber and SberCloud for granting the GPU-resources to experiment with different architectures and for supporting the Fusion Brain Challenge.
VII-A F1-score for ZsOD
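To assess the quality of the detection model we use the F1-score:
\[ F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \]
The F1-score is calculated based on Recall and Precision, which, in turn, depend on a set of prediction statistics: true positive (TP), false positive (FP) and false negative (FN):
\[ \mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN} \]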
In our non-trivial case of multi-label object detection we calculate these statistics as follows:
FN – for a given label the model has predicted none or only some of the required bounding boxes;
TP – a bounding box predicted by the model has an IoU score (intersection-over-union) higher than 0.5 with at least one of the ground truth bounding boxes for the considered label;
FP – a predicted bounding box has an IoU score less than 0.5 with all ground truth bounding boxes, or there is no object of the given label in the image, yet the model has predicted boundaries for it instead of returning an empty list.
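The counting rules above can be sketched as follows. This is one possible reading, not the official scoring code: boxes are assumed to be (x, y, w, h) with (x, y) as the top-left corner, FN is assumed to count ground truth boxes that no prediction matched, an IoU of exactly 0.5 is treated as a miss, and all names and toy values are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x, y, w, h) boxes (top-left origin)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def count_statistics(predicted, ground_truth, threshold=0.5):
    """Count TP, FP and FN over labels; both inputs map label -> list of boxes.

    Negative labels appear in ground_truth with an empty list of boxes.
    """
    tp = fp = fn = 0
    for label, gt_boxes in ground_truth.items():
        matched = set()
        for pred in predicted.get(label, []):
            hits = [i for i, gt in enumerate(gt_boxes)
                    if iou(pred, gt) > threshold]
            if hits:
                tp += 1
                matched.update(hits)
            else:
                fp += 1  # poor overlap, or a box for an absent object
        fn += len(gt_boxes) - len(matched)  # assumption: unmatched GT boxes
    return tp, fp, fn


def f1_score(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


ground_truth = {"a grey chair": [(0, 0, 10, 10), (20, 20, 10, 10)],
                "a pink chair": []}  # negative label: nothing to detect
predicted = {"a grey chair": [(1, 1, 10, 10)],
             "a pink chair": [(5, 5, 4, 4)]}
tp, fp, fn = count_statistics(predicted, ground_truth)
print(tp, fp, fn, f1_score(tp, fp, fn))  # 1 1 1 0.5
```

In the toy example one of the two grey chairs is found (one TP, one FN) and a box is wrongly returned for the absent pink chair (one FP), so precision and recall are both 0.5.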
VII-B Overall metric
The total score is the sum of the scores for the four subtasks. All tasks are scored from 0 to 1, with the only exception being the CodeBLEU metric, which takes values in the range from 0 to 100 and is therefore multiplied by 0.01 to normalize it. As a result, the final score can range from 0 to 4.
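As a short illustration, the overall metric can be computed as below. The function name is hypothetical and the numbers in the example are made up; they are not results from the competition.

```python
def total_score(codebleu, htr_acc, zsod_f1, vqa_acc):
    """Sum of the four subtask scores, with CodeBLEU rescaled from 0..100 to 0..1."""
    return 0.01 * codebleu + htr_acc + zsod_f1 + vqa_acc


# Made-up scores for illustration only.
print(total_score(codebleu=30.0, htr_acc=0.5, zsod_f1=0.25, vqa_acc=0.25))
```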
VII-C Private leaderboard
We provide the private leaderboard of the Fusion Brain Challenge (see Figure 3). The metrics of the winner of the competition are the following:
|team name|C2C CodeBLEU|HTR Acc|ZsOD F1|VQA Acc|Total|