Efficient Attribute Injection for Pretrained Language Models

09/16/2021 ∙ by Reinald Kim Amplayo, et al. ∙ NAVER Corp.

Metadata attributes (e.g., user and product IDs from reviews) can be incorporated as additional inputs to neural-based NLP models, by modifying the architecture of the models, in order to improve their performance. Recent models, however, rely on pretrained language models (PLMs), where previously used techniques for attribute injection are either nontrivial or ineffective. In this paper, we propose a lightweight and memory-efficient method to inject attributes into PLMs. We extend adapters, i.e. tiny plug-in feed-forward modules, to include attributes either independently of or jointly with the text. To limit the increase of parameters, especially when the attribute vocabulary is large, we use low-rank approximations and hypercomplex multiplications, significantly decreasing the total number of parameters. We also introduce training mechanisms to handle domains in which attributes can be multi-labeled or sparse. Extensive experiments and analyses on eight datasets from different domains show that our method outperforms previous attribute injection methods and achieves state-of-the-art performance on various datasets.




1 Introduction


Yelp Review


Text: My boyfriend’s fav. place and the stein of beers are priced pretty good. Game nights get super packed so go early to save a seat. Kitchen closes at midnight which is too early when your buzz kicks in around 1am.
— User: n6LeAoIuDR3NfIBEsmL_zg
— Product: 7TMf1NuuAdvhG7IojZSKnw


Paper Abstract


Text: We present new and improved fixed-parameter algorithms for computing maximum agreement forests (MAFs) of pairs of rooted binary phylogenetic trees. The size of such a forest for two trees corresponds to their subtree prune-and-regraft distance and, if the agreement forest is acyclic, to their hybridization number …
— Authors: Chris Whidden, Robert G. Beiko, Norbert Zeh
— Research Areas: q-bio.PE, cs.DS


Figure 1: Examples of a Yelp review and an arXiv paper abstract and their corresponding attributes. Texts in typewriter font are attribute labels.

Neural-based NLP models are powered by large-scale textual datasets, which are mostly crawled from the web Denoyer and Gallinari (2006); Sandhaus (2008); Zhu et al. (2015); Ni et al. (2019); Raffel et al. (2020). Web texts usually come with metadata, i.e. attributes that describe the texts. For example, product reviews have user and product IDs, as well as their ratings, while research papers on arXiv have author lists and research areas as metadata attributes (see Figure 1). While most recent models disregard them and focus more on ungrounded language understanding (understanding language on its own, e.g., GLUE; Wang et al., 2018, inter alia), prior work has shown that incorporating these attributes into a model increases not just its performance but also its interpretability and customizability Tang et al. (2015); Chen et al. (2016); Kim et al. (2019). This work explores the task of attribute injection Amplayo (2019), which aims to effectively use attributes to improve the performance of NLP models.

Previous methods Tang et al. (2015) for attribute injection involve two steps: (a) designing an architecture that accepts both texts and attributes, and (b) training the model from scratch using task-specific datasets. Chen et al. (2016) and subsequent work Zhu et al. (2015); Ma et al. (2017); Amplayo et al. (2018); Wu et al. (2018) modify the pooling module of the classifier to inject attributes, while a few explore different locations such as additional memory Dou (2017); Long et al. (2018) and other parts of the classifier Kim et al. (2019); Amplayo (2019). However, these methods of modifying different modules of the model can be non-trivial when applied to pretrained language models (PLMs; Devlin et al., 2019; Liu et al., 2019; Qiu et al., 2020). For one thing, the use of PLMs disallows designing new and specialized architectures for different domains. More recent work on language model customization and controllability makes use of textual prompts Brown et al. (2020); Schick and Schütze (2021), specialized tokens Fan et al. (2018); Keskar et al. (2019), and additional neural modules Wang et al. (2019); Liu et al. (2021) to introduce additional contexts, such as style, topic, and end task. Unfortunately, these techniques do not generalize to all kinds of attributes, such as those that are non-textual (e.g., user IDs that are not text-translatable), multi-labeled (e.g., multiple authors of a paper), or drawn from large vocabularies (e.g., thousands of products available).

In this paper, we propose an attribute injection method that is applicable to PLMs. Specifically, we make use of adapters Houlsby et al. (2019), i.e. feed-forward modules inserted between layers of PLMs that are tiny in size, and extend them such that attributes are injected as additional inputs to the model. We introduce two kinds of injection methods, which incorporate attributes either independently of or jointly with the text representation. A naive implementation of the latter would substantially increase the number of parameters, especially when the attribute vocabulary is large, so we use ideas from low-rank matrix approximations as well as parameterized hypercomplex multiplications Zhang et al. (2021); Mahabadi et al. (2021) to significantly decrease the number of fine-tuned parameters, by a factor of up to 192 for a default base-sized BERT Devlin et al. (2019) setting. We also use two mechanisms, attribute dropout and post-aggregation, to handle attribute sparsity and multi-labeled attributes, respectively. Finally, our use of adapters enables us to train our model parameter-efficiently, i.e. by freezing pretrained weights and only updating new parameters at training time.

We perform experiments on five widely used benchmark datasets for attribute injection Tang et al. (2015); Yang et al. (2018); Kim et al. (2019), as well as three new datasets introduced in this paper on tasks where attributes are very important. These datasets contain attributes that have different properties (sparse vs. non-sparse, single-labeled vs. multi-labeled, etc.). Results show that our method outperforms previous approaches, as well as competitive baselines that fully fine-tune the pretrained language model. Finally, we also conduct extensive analyses to show that our method is robust to sparse and cold-start attributes and that it is modular, with attribute-specific modules transferable to other tasks using the same attributes. We make our code and dataset publicly available.

2 Related Work

Prior to the neural network and deep learning era, traditional methods for NLP have relied on feature sets as input to machine learning models. These feature sets include metadata attributes such as author lists and publication venue of research papers Rosen-Zvi et al. (2004); Joorabchi and Mahdi (2011); Kim et al. (2017), topics of sentences Ramage et al. (2009); Liu and Forss (2014); Zhao and Mao (2017), as well as spatial Yang et al. (2017) and temporal Fukuhara et al. (2007) metadata attributes found in tweets. Attributes are mostly used in the area of sentiment classification Gao et al. (2013), where most of the time textual data includes freely available user and product attributes. These methods rely on manually curated features that would represent the semantics of user and product information.

Deep neural networks gave rise to better representation learning Bengio et al. (2013); Mikolov et al. (2013), which allows us to learn from scratch semantic representations of attributes in the form of dense vectors Tang et al. (2015). The design of how to represent attributes has evolved from using attribute-specific word and document embeddings Tang et al. (2015) and attention pooling weights Chen et al. (2016); Ma et al. (2017); Amplayo et al. (2018); Wu et al. (2018), to more complicated architectures such as memory networks Dou (2017); Long et al. (2018) and importance matrices Amplayo (2019). These designs are model- and domain-dependent and can be non-trivial to apply to other models and datasets. Our proposed method, on the other hand, works well on any pretrained language model, most of which are based on the Transformer Vaswani et al. (2017); Devlin et al. (2019).

Our work is closely related to recent literature on controlled text generation, where most of the work uses either specialized control tokens concatenated with the input text Sennrich et al. (2016); Kikuchi et al. (2016); Ficler and Goldberg (2017); Fan et al. (2018); Keskar et al. (2019), or textual prompts that instruct the model what to generate Brown et al. (2020); Schick and Schütze (2021); Gao et al. (2021); Zhao et al. (2021). While these methods have been successfully applied to pretrained language models, the attributes used to control the text are limited to those that are text-translatable (e.g., topics such as “Technology” or tasks that are described in text) and those with limited vocabulary (e.g., “positive” or “negative” sentiment). In contrast, our method is robust to all kinds of attributes and performs well on all kinds of domains.

3 Modeling Approach

Let $x = (x_1, \dots, x_T)$ denote the input text of $T$ tokens, let $y$ be a task-specific output, and let $f$ be a discriminative model that predicts $y$ given $x$. Suppose there exists a set of non-textual and categorical attributes $A = \{a_1, \dots, a_m\}$ that describe text $x$ (e.g., user and product IDs of product reviews). These attributes can be multi-labeled, i.e. $a_i = \{a_i^{(1)}, a_i^{(2)}, \dots\}$ (e.g., multiple authors of a research paper), and use a finite yet possibly large vocabulary $V_i$, i.e. $a_i^{(j)} \in V_i$. The task of attribute injection aims to build a model $f'$ that additionally incorporates $A$ as input such that the difference in task performance between $f$ and $f'$ is maximized. In our setting, $f$ is a pretrained language model (PLM) fine-tuned to the task, while $f'$ is a PLM that also takes $A$ as additional input.

Our method can be summarized as follows. We extend adapters Houlsby et al. (2019), which are tiny feed-forward neural networks plugged into pretrained language models, such that they also accept attributes as input. Attributes can be represented as additional bias parameters or as perturbations to the weight matrix parameter of the adapter, motivated by how attributes are used to classify texts. We substantially decrease the number of parameters using low-rank matrix approximations and parameterized hypercomplex multiplications Zhang et al. (2021). Finally, we introduce two training mechanisms, attribute dropout and post-aggregation, to mitigate problems regarding attribute sparsity and multi-label properties.

The advantages of Injectors over previous methods are three-fold. Firstly, injecting attributes through adapters allows the model to leverage attribute information on all intermediate layers of PLMs, in contrast to previous methods where attributes can only be injected either at the beginning or at the end of PLMs. Secondly, our use of adapters opens the possibility of parameter-efficient fine-tuning Houlsby et al. (2019) where only a tiny percentage of parameters is fine-tuned. Finally, we can transfer attribute representations learned from one task to another effectively by plugging in adapters to another model. Figure 2 illustrates an overview of our proposed method, which we call Injectors.

3.1 Preliminary: Adapters

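We first briefly describe adapters. Let $h \in \mathbb{R}^{d_h}$ be the output hidden vector from a multi-head attention or feed-forward layer in a Transformer block. An adapter layer is basically two feed-forward networks that project $h$ down to a much smaller dimension $d_x \ll d_h$ and back up, producing a vector $h' \in \mathbb{R}^{d_h}$:

$$h' = \mathrm{FFNet}_{\mathrm{up}}\big(g(\mathrm{FFNet}_{\mathrm{down}}(h))\big) + h = W_{\mathrm{up}}\, g\big(W_{\mathrm{down}} h + b_{\mathrm{down}}\big) + b_{\mathrm{up}} + h$$

where $W_{\mathrm{down}} \in \mathbb{R}^{d_x \times d_h}$, $W_{\mathrm{up}} \in \mathbb{R}^{d_h \times d_x}$, and $b_{\mathrm{down}} \in \mathbb{R}^{d_x}$, $b_{\mathrm{up}} \in \mathbb{R}^{d_h}$ are learned weight and bias parameters of FFNet, $g$ is a non-linear function, and the addition represents a residual layer.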

Adapters are inserted after every multi-head attention and feed-forward layer in all Transformer blocks. PLMs with adapters are trained such that only the adapter parameters are updated while the original pretrained weights are left untouched. This makes training more memory-efficient than full fine-tuning Houlsby et al. (2019) and more robust to different hyperparameter settings Han et al. (2021).
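A minimal NumPy sketch of the adapter computation described above may help make the shapes concrete. The sizes (`d_h = 768`, `d_x = 64`), the ReLU non-linearity, and the zero initialization of the up-projection are assumptions chosen for illustration, not settings confirmed by the paper.

```python
import numpy as np

def adapter(h, W_down, b_down, W_up, b_up):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""
    z = np.maximum(0.0, W_down @ h + b_down)  # ReLU is an assumed choice of g
    return W_up @ z + b_up + h                # residual connection to the input

rng = np.random.default_rng(0)
d_h, d_x = 768, 64  # assumed hidden size and bottleneck size

W_down = rng.normal(scale=0.02, size=(d_x, d_h))
b_down = np.zeros(d_x)
W_up = np.zeros((d_h, d_x))  # assumed zero init: the adapter starts as identity
b_up = np.zeros(d_h)

h = rng.normal(size=d_h)
h_out = adapter(h, W_down, b_down, W_up, b_up)

# Only these parameters would be updated; the PLM weights stay frozen.
n_adapter_params = W_down.size + b_down.size + W_up.size + b_up.size
```

Under these assumed sizes a single adapter holds roughly 99K parameters, which is small next to the PLM layers it sits between.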

3.2 Our Method: Injectors

In this section, we describe our method, Injectors, in detail. Injectors are multi-adapter modules that transform a hidden vector $h$ into an attribute-injected hidden vector. These are inserted right after the multi-head attention and feed-forward layers of the pretrained language model, as shown in Figure 2.

Figure 2: Architecture of the Injector module when integrated into one block of a Transformer model (see left of figure). Injector starts with a task-specific adapter, followed by attribute-specific adapters, one for each attribute given in the task (see right of figure). Green-colored modules are trained and fine-tuned, while gray-colored modules are fixed.

Task-specific Adapter

Injectors start with a task-specific adapter that uses the adapter equation of Section 3.1 to transform the previous hidden vector $h$ to $h'$. The use of a separate task-specific adapter is essential to make our method modular and to make attributes learned on one task transferable to another. We show extensive analyses on the modularity of our method in the later sections.

Attribute-specific Adapters

Attributes are injected through attribute-specific adapters, where they are used in two different ways. Firstly, they are used as bias parameters independent of the text representation. This is motivated by the fact that attributes can have a prior disposition regardless of what is written in the text. For example, a user may tend to give lower review ratings than average. Secondly, they are also used as weight parameters. This allows our method to jointly model attributes with the text representation. This is motivated by how attributes can change the semantics of the text. For example, one user may like very sweet food while another user may dislike it, so the word sweet in the text may mean something different for each of them.

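More formally, for each attribute $a_i$ ($i = 1, \dots, m$), we sequentially transform the previously attribute-injected vector $h_{i-1}$ to the attribute-injected vector $h_i$ using the following equation:

$$h_i = W_{\mathrm{up}}^{(i)}\, g\big(W_{a_i} h_{i-1} + b_{a_i}\big) + b_{\mathrm{up}}^{(i)} + h_{i-1}$$

where $g$ is a non-linear function, $W_{\mathrm{up}}^{(i)}$ and $b_{\mathrm{up}}^{(i)}$ are the up-projection parameters, and $h_0$ is the output $h'$ of the task-specific adapter. Unlike standard adapters, the attribute-specific weight matrix $W_{a_i}$ and bias parameter $b_{a_i}$ of the down-project feed-forward network are not learned from scratch, but instead are calculated as follows.

The calculation of the bias parameter $b_{a_i}$ is trivial; we perform a linear transformation of the attribute embedding $e_i$:

$$b_{a_i} = \mathrm{lin}_b(e_i)$$

where $\mathrm{lin}_b$ is a linear projection from $\mathbb{R}^{d_a}$ to $\mathbb{R}^{d_x}$, $e_i \in \mathbb{R}^{d_a}$ is a learned vector, and $d_a$ is the attribute embedding size.

We also define $W_{a_i}$ in terms of a learned matrix $W \in \mathbb{R}^{d_x \times d_h}$ and a function $f_W$ of the attribute embedding $e_i$. The function $f_W$, however, cannot be defined similarly as a linear projection. This would require a tensor parameter of size $d_a \times d_x \times d_h$ to linearly project $e_i$ to a $d_x \times d_h$ matrix. Considering that we may have multiple attributes for each domain, the number of parameters would not scale well, making the model very large and difficult to train. Inspired by Mahabadi et al. (2021), we use ideas from low-rank matrix decomposition and parameterized hypercomplex multiplications (PHMs; Zhang et al., 2021) to substantially decrease the number of parameters.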
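To make the scaling argument concrete, the sketch below builds an attribute-specific down-projection in the naive way, with a full tensor that maps the attribute embedding to a weight matrix. All names (`E`, `P_b`, `T_w`, `W_shared`) are hypothetical, the sizes are toy values, and the additive combination of the shared matrix with the attribute-derived matrix is an assumption made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, d_a = 16, 4, 8   # toy sizes so the naive tensor stays small
vocab_size = 100           # hypothetical attribute vocabulary size

E = rng.normal(size=(vocab_size, d_a))   # attribute embeddings
P_b = rng.normal(size=(d_x, d_a))        # linear projection for the bias
T_w = rng.normal(size=(d_x, d_h, d_a))   # naive tensor for the weights
W_shared = rng.normal(size=(d_x, d_h))   # learned matrix shared by all labels

def attribute_down_projection(h_prev, attr_id):
    """Down-project h_prev with attribute-dependent bias and weights."""
    e = E[attr_id]
    b_attr = P_b @ e               # bias injection: independent of the text
    W_attr = W_shared + T_w @ e    # weight injection (additive form is assumed)
    return W_attr @ h_prev + b_attr

z = attribute_down_projection(rng.normal(size=d_h), attr_id=7)

def naive_weight_params(d_a, d_h, d_x):
    """Parameters in the full tensor that maps an embedding to a weight matrix."""
    return d_a * d_h * d_x

# Assumed BERT-base-like sizes: d_a = 64, d_h = 768, d_x = 64.
naive_bert_base = naive_weight_params(64, 768, 64)
```

With those assumed sizes the naive tensor alone would hold over three million parameters per attribute and per adapter layer, which is the cost the low-rank and PHM construction is meant to avoid.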

Figure 3: An illustration of how an attribute embedding is transformed into a weight matrix. The colored tensors are learned parameters, while the gray ones are derived. By using a set of tiny learned parameters, we are able to obtain large matrices. When there are multiple labels for an attribute, we process them separately and aggregate the resulting large matrices.

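Specifically, we first transform the attribute embedding $e_i \in \mathbb{R}^{d_a}$ into vectors in hypercomplex space with $n$ dimensions, i.e.:

$$z_k = \mathrm{lin}_k(e_i), \quad k = 1, \dots, n$$

where $\mathrm{lin}_k$ is a linear projection in the $k$th dimension. A hypercomplex vector with $n$ dimensions is basically a set of $n$ vectors with one real vector and $n - 1$ “imaginary” vectors. (Following Tay et al. (2019) and Zhang et al. (2021), we remove the imaginary units of these vectors to easily perform operations on them, thus these vectors are also in the real space.)

For each dimension $k$, we first define a small rank-one matrix $A_k$ as an outer product between $z_k$ and a learned vector $v_k$:

$$A_k = z_k v_k^\top$$

and then define a large matrix $B_k$ as the Kronecker product, denoted by $\otimes$, between $A_k$ and a learned matrix $S_k$, followed by a reshape and the hyperbolic tangent function:

$$B_k = \tanh\big(\mathrm{reshape}(A_k \otimes S_k)\big)$$

Finally, we add the large matrices of each dimension. To sum up, we define the function $f_W$ that maps the attribute embedding to a weight-sized matrix as:

$$f_W(e_i) = \sum_{k=1}^{n} B_k = \sum_{k=1}^{n} \tanh\Big(\mathrm{reshape}\big((\mathrm{lin}_k(e_i)\, v_k^\top) \otimes S_k\big)\Big)$$

and Figure 3 shows an illustration.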

Low-rank (Eq. 7) and PHMs (Eqs. 8-3.2) are both necessary to achieve a high performance with decreased parameters. While low-rank in itself reduces the most parameters, it also reduces the expressive power of the model since it outputs rank-one matrices. PHMs mitigate this by performing a sum of Kronecker products, increasing the rank of the matrix to potentially at most . Finally, this process effectively reduces the number of parameters from to , since the parameters in dominate the other parameters (see Appendix for a detailed parameter analysis).
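The combination of a rank-one outer product with a sum of Kronecker products can be sketched in NumPy as follows. The shapes are assumptions: each hypercomplex component and each learned vector has length `n`, and each learned matrix has shape `(d_x / n, d_h / n)`, so that the Kronecker product already has the target shape. The parameter names `P`, `v`, and `S` are hypothetical, and the resulting counts illustrate the idea rather than reproduce the paper's exact figures.

```python
import numpy as np

rng = np.random.default_rng(0)
d_h, d_x, d_a, n = 768, 64, 64, 4   # assumed sizes; n = hypercomplex dimensions

# Learned parameters (hypothetical names and shapes).
P = rng.normal(scale=0.1, size=(n, n, d_a))               # n projections from R^{d_a} to R^n
v = rng.normal(scale=0.1, size=(n, n))                    # one learned vector per dimension
S = rng.normal(scale=0.1, size=(n, d_x // n, d_h // n))   # one learned matrix per dimension

def attribute_weight(e):
    """Map an attribute embedding e to a (d_x, d_h) matrix."""
    total = np.zeros((d_x, d_h))
    for k in range(n):
        z_k = P[k] @ e               # hypercomplex component, shape (n,)
        A_k = np.outer(z_k, v[k])    # small rank-one matrix, shape (n, n)
        B_k = np.kron(A_k, S[k])     # Kronecker product, shape (d_x, d_h)
        # The reshape is a no-op under these assumed shapes.
        total += np.tanh(B_k.reshape(d_x, d_h))
    return total

e = rng.normal(size=d_a)
W_attr = attribute_weight(e)

compact_params = P.size + v.size + S.size   # parameters actually learned
naive_params = d_a * d_h * d_x              # full-tensor alternative
```

In this assumed configuration the learned matrices `S` account for most of the compact parameter count, and the total is more than two orders of magnitude below the full-tensor alternative.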

Attribute Dropout and Post-Aggregation

For cases where attributes are sparse and multi-labeled, we use the following mechanisms. Firstly, we add a dropout mechanism that randomly masks out attributes from training instances at a given rate. This replicates what instances look like at test time, where some attributes are not found in the vocabulary.

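Secondly, when an attribute has more than one label, instead of aggregating the label embeddings first before processing, as in Kim et al. (2019), we perform aggregation post hoc (as also shown in Figure 3 for the weight matrix), i.e., we transform each label embedding separately and sum the results. Writing $e_i^{(j)}$ for the embedding of the $j$th label of attribute $a_i$ and $f$ for the transformation of an embedding into an adapter parameter, we compute:

$$\sum_{j} f\big(e_i^{(j)}\big) \quad \text{instead of} \quad f\Big(\sum_{j} e_i^{(j)}\Big)$$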
Aggregating attribute embeddings reduces their individual representation power, while our post-aggregation mechanism preserves this since a sum of non-linear transformations is injective Xu et al. (2019).
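Both mechanisms can be illustrated with a small pure-Python sketch. Mapping a dropped attribute to a reserved unknown ID (`unk_id`) and using an elementwise `tanh` as the non-linear transformation are assumptions for the example; the function names are hypothetical.

```python
import math
import random

def attribute_dropout(attr_ids, rate, rng, unk_id=0):
    """Randomly replace attribute IDs with an unknown ID during training."""
    return [unk_id if rng.random() < rate else a for a in attr_ids]

def transform(e):
    """Stand-in non-linear transformation of one embedding (assumed: tanh)."""
    return [math.tanh(x) for x in e]

def pre_aggregate(embeddings):
    """Sum the label embeddings first, then transform once."""
    summed = [sum(col) for col in zip(*embeddings)]
    return transform(summed)

def post_aggregate(embeddings):
    """Transform each label embedding separately, then sum the results."""
    transformed = [transform(e) for e in embeddings]
    return [sum(col) for col in zip(*transformed)]

# Two different label sets whose embeddings have the same sum.
labels_a = [[1.0, -2.0], [3.0, 2.0]]
labels_b = [[2.0, 0.0], [2.0, 0.0]]

same_when_pre = pre_aggregate(labels_a) == pre_aggregate(labels_b)
same_when_post = post_aggregate(labels_a) == post_aggregate(labels_b)

kept = attribute_dropout([5, 6, 7], rate=0.0, rng=random.Random(0))
dropped = attribute_dropout([5, 6, 7], rate=1.0, rng=random.Random(0))
```

The two label sets collapse to the same vector when summed before the transformation, whereas post-aggregation keeps them apart, which is the loss of individual representation power that the paragraph above refers to.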

4 Experimental Setup



| Dataset | #Train | #Dev | #Test | #Words/Input | #Classes | #Attrs | #Attr. Vocab | %Sparse | Multi-label? |
|---|---|---|---|---|---|---|---|---|---|
| Yelp 2013 | 62.5K | 7.8K | 8.7K | 210 166 | 5 | 2 | 3.3K | 0.0% | No |
| Yelp 2014 | 183.0K | 22.7K | 25.4K | 218 175 | 5 | 2 | 9.0K | 0.0% | No |
| IMDB | 67.4K | 8.4K | 9.1K | 425 278 | 10 | 2 | 2.9K | 0.0% | No |
| AAPR | 33.5K | 2.0K | 2.0K | 97 36 | 2 | 2 | 51.6K | 97.8% | Yes |
| PolMed | 4.5K | n/a | 0.5K | 38 62 | 9 | 4 | 0.5K | 63.8% | No |
| Food.com | 162.4K | 20.3K | 20.3K | 101 65 | 16 | 3 | 40.5K | 80.0% | Yes |
| Goodreads | 714.7K | 10.0K | 10.0K | 132 72 | 2 | 3 | 43.8K | 34.4% | No |
| Beeradvocate | 1.5M | 10.0K | 10.0K | 133 56 | 4 × 9 | 3 | 98.0K | 75.3% | No |

Table 1: Dataset statistics. The last three rows (Food.com, Goodreads, and Beeradvocate) are new datasets introduced in this paper. Beeradvocate is a multi-task dataset, with nine classes for each of four given aspects. %Sparse is the percentage of attributes with fewer than 10 training examples. Multi-label attributes include lists of authors and research areas for AAPR, and lists of ingredients and tags for Food.com.

We performed experiments on a total of eight datasets. Five of them are widely used datasets for attribute injection, which include the following:

  1. Yelp 2013 Tang et al. (2015): A review rating prediction dataset where we are tasked to predict the rating of a review given two attributes, the user and the product.

  2. Yelp 2014 Tang et al. (2015): A dataset similar to Yelp 2013, but larger in size.

  3. IMDB Tang et al. (2015): A dataset similar to Yelp 2013, but with ten rating scales and longer reviews.

  4. AAPR Yang et al. (2018): A dataset for classifying whether an arXiv paper is accepted to a conference or not, with two attributes, a list of authors and a list of research areas.

  5. PolMed Kim et al. (2019): A message type classification dataset in the political domain, where the goal is to classify a tweet into one of nine classes, with four attributes, the politician who wrote the message, the media source, the audience, and the political bias.

We also introduce three new, larger benchmark datasets in which attributes are crucial to solving the task effectively, are sparser, and have large vocabularies.

  1. Food.com Majumder et al. (2019): A dataset where, given a recipe of a food and three attributes (the user, a list of ingredients, and a list of tags), we are tasked to predict the estimated number of minutes it takes to make the food, rounded down to the tens.

  2. Goodreads Wan and McAuley (2018): A spoiler prediction dataset where we classify whether a book review contains spoiler or not, with three attributes, the user, the book, and the rating of the review.

  3. Beeradvocate McAuley et al. (2012): A multi-aspect rating prediction dataset where given a beer review and three attributes, the user, the beer, and the overall review rating, we are tasked to predict the ratings of four aspects, or properties that influence user satisfaction, of the beer: appearance, aroma, palate, and taste.

Table 1 reports statistics of all the datasets.

Training Configuration

For our PLM, we used weights and settings of bert-base-uncased Devlin et al. (2019), available in the HuggingFace library Wolf et al. (2020). We set the dimensions of all parameters as follows: , , and . Using this setting and our parameter-saving method, we are able to decrease the parameters by the naive method. We set both the general and attribute dropout rates to and the batch size to . We used Adam with weight decay Loshchilov and Hutter (2019) to optimize our models with a learning rate of and training steps, with the first steps used to warm-up training linearly.

To train our models, we added a logistic classifier which transforms the [CLS] token into logits. The weights here are updated during training. We then used a cross entropy loss to train the models on all datasets except for Goodreads and Beeradvocate. The Goodreads dataset is very imbalanced towards the negative class (i.e., not spoiler). We thus placed more importance on detecting the spoiler class and used a weighted cross entropy loss with different weights on the negative and positive classes. For Beeradvocate, we treat the task as a multi-task problem, where each aspect rating prediction is a separate task. Thus, we used multiple classifiers, one for each aspect, and aggregated the losses from all classifiers by averaging. For PolMed, where there is no available development set, we performed 10-fold cross-validation, following Kim et al. (2019).
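The two non-standard losses can be sketched in plain Python as follows. The class weights `[0.1, 0.9]` are hypothetical placeholders, not the values used in the paper, and the per-example weighting shown here is one common formulation of weighted cross entropy.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def weighted_cross_entropy(logits, label, class_weights):
    """Cross entropy for one example, scaled by the weight of its gold class."""
    return -class_weights[label] * log_softmax(logits)[label]

def multi_task_loss(logits_per_task, labels_per_task):
    """Average of unweighted cross entropy losses, one classifier per task."""
    losses = [
        weighted_cross_entropy(logits, label, [1.0] * len(logits))
        for logits, label in zip(logits_per_task, labels_per_task)
    ]
    return sum(losses) / len(losses)

# Hypothetical weights: index 0 = not spoiler, index 1 = spoiler.
spoiler_weights = [0.1, 0.9]
loss_pos = weighted_cross_entropy([0.0, 0.0], 1, spoiler_weights)
loss_neg = weighted_cross_entropy([0.0, 0.0], 0, spoiler_weights)

# Four aspects with nine rating classes each, as in Beeradvocate.
beer_loss = multi_task_loss([[0.0] * 9] * 4, [0, 3, 5, 8])
```

With equal logits, a misjudged spoiler example costs nine times as much as a misjudged non-spoiler example under these placeholder weights, which is how the weighting shifts attention to the rare class.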

Comparison Systems

We compared our method with several approaches, including the following no-attribute baselines:

  1. BERT-base: The base model used in our experiments.

  2. + Adapters: Extra tiny parameters are added to the base model and are used for training instead of the full model.

Baselines with attributes injected include the following models. We use the same base model for all baselines for ease of comparison.

  1. + Tokens: Following work from controlled text generation Sennrich et al. (2016), the attributes are used as special control tokens prepended in front of the input.

  2. + UPA: Short for User-Product Attention Chen et al. (2016), attributes are used as additional bias vectors when calculating the weights of the attention pooling module.

  3. + CHIM: Short for Chunk-wise Importance Matrices Amplayo (2019), attributes are used as importance matrices multiplied to the weight matrix of the logistic classifier.
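As a rough illustration of the + Tokens baseline in the list above, attributes can be rendered as special tokens and placed in front of the text. The `[name=value]` token format and the function name are hypothetical; the source does not specify how the control tokens are spelled.

```python
def prepend_attribute_tokens(text, attributes):
    """Render each attribute as a control token and prepend it to the text."""
    tokens = [f"[{name}={value}]" for name, value in attributes.items()]
    return " ".join(tokens + [text])

example = prepend_attribute_tokens(
    "Great beer.", {"user": "u42", "product": "p7"}
)
```

Each distinct attribute value needs its own token in the model vocabulary under this scheme, which becomes costly when there are thousands of users or products.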

Finally, we also included in our comparisons the state-of-the-art from previous literature whenever available.

5 Experiments

Main Results


| Model | Yelp 2013 | Yelp 2014 | IMDB | AAPR | PolMed | Food.com | Goodreads | BeerAdvocate |
|---|---|---|---|---|---|---|---|---|
| BERT-base | 67.97 | 68.07 | 48.10 | 63.70 | 41.82 | 41.89 | 41.98 | 50.48 |
| + Adapters | 66.47 | 67.44 | 46.41 | 62.85 | 44.24 | 42.02 | 48.92 | 50.71 |
| + Tokens | 67.87 | 67.98 | 48.00 | 64.85 | 42.63 | 41.23 | 44.79 | 50.25 |
| + UPA | 68.38 | 68.82 | 48.90 | 64.40 | 42.83 | 43.97 | 43.96 | 51.98 |
| + CHIM | 68.71 | 68.56 | 49.36 | 65.30 | 43.64 | 43.35 | 43.58 | 52.29 |
| + Injectors (ours) | 70.76 | 71.35 | 55.13 | 67.10 | 47.27 | 45.01 | 57.78 | 57.69 |
| Prev. SOTA | 67.8 | 69.2 | 56.4 | 66.15 | 41.89 | n/a | n/a | n/a |


Table 2: Performance (F1-score on Goodreads, Accuracy otherwise) of competing methods on the eight datasets. The first block includes PLMs without injected attributes, while the second block includes those with injected attributes. Attribute injected PLMs that perform worse than the base model are colored red. Numbers for previous SOTA are copied from Amplayo (2019) for Yelp and IMDB datasets and from Kim et al. (2019) otherwise. Best systems are shown in bold. Asterisk (*) means there is a significant difference between our model and the CHIM model (paired bootstrap resampling; ).

We evaluated system outputs with accuracy for all datasets except Goodreads, where we used F1-score. For brevity in Beeradvocate, we took the average of the accuracy of all sub-tasks. Our results are summarized in Table 2. The non-attribute baselines perform similarly except in PolMed and Goodreads, where Adapters improve the base model significantly, which aligns with previous findings on the robustness of adapters Han et al. (2021). Overall, Injectors outperform all baselines on all datasets. When compared with the previous state-of-the-art, our model performs better on all datasets except IMDB. We attribute this gap to the length limit of PLMs, since most reviews in IMDB are longer than 512 words (see Table 1, not considering subwords), whereas the previous state-of-the-art Amplayo (2019) used a BiLSTM with no length truncation as base model.
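For reference, the three evaluation quantities mentioned above can be computed as in the following plain-Python sketch. The function names are hypothetical, and treating label `1` as the positive (spoiler) class is an assumption.

```python
def accuracy(gold, pred):
    """Fraction of predictions that match the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1_score(gold, pred, positive=1):
    """Binary F1 for the positive class."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def mean_accuracy(gold_per_task, pred_per_task):
    """Average accuracy over sub-tasks, as done for the multi-aspect dataset."""
    scores = [accuracy(g, p) for g, p in zip(gold_per_task, pred_per_task)]
    return sum(scores) / len(scores)

gold = [1, 0, 1, 1]
pred = [1, 1, 0, 1]
acc = accuracy(gold, pred)
f1 = f1_score(gold, pred)
avg = mean_accuracy([gold, gold], [pred, gold])
```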

Ablation Studies


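| Model | Food.com | Goodreads | BeerAdvocate |
|---|---|---|---|
| + Injectors | 45.01 | 57.78 | 57.69 |
| - bias injection | 44.86 | 57.06 | 57.67 |
| - weight injection | 44.74 | 57.61 | 57.29 |
| - task adapter | 44.30 | 56.57 | 57.28 |
| - attribute drop | 44.48 | 56.62 | 57.41 |
| - post-aggregation | 43.30 | 57.78 | 57.69 |
| - low-rank | OOM | OOM | OOM |
| - PHM | 43.89 | 55.99 | 56.51 |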


Table 3: Performance on Food.com, Goodreads, and BeerAdvocate of Injectors and versions thereof without some of our proposed components (second block), training mechanisms (third block), and parameter-saving methods (fourth block). OOM denotes the model does not run on our experimental setup due to out of memory error.

We present in Table 3 various ablation studies on the three new datasets (see Appendix for the other datasets), which assess the contribution of different model components. Our experiments confirm that the use of both bias and weight injection as well as the addition of task adapter improve performance. Interestingly, some datasets prefer one injection type over the other. Goodreads, for example, prefers bias injection, that is, using attributes as prior and independent of the text (e.g., the tendency of the user to write spoilers). Moreover, our training mechanisms also increase the performance of the model. This is especially true for post-aggregation on the Food.com dataset since two of its attributes are multi-labeled (ingredients and tags). Finally, we show that, on their own, the parameter-saving methods either perform worse or do not run at all.

On Attribute Sparsity

Figure 4: Performance plots per sparsity level of BERT-base (gray), CHIM (red), and Injectors (blue) for each attribute in the Goodreads dataset. The x-axis with a lower value has a higher sparsity level (0 is the most sparse).


Yelp 2013

| Model | 20% sparse | 50% sparse | 80% sparse |
|---|---|---|---|
| BERT-base | 64.69 | 62.65 | 56.83 |
| + CHIM | 64.63 | 62.43 | 56.01 |
| + Injectors | 67.22 | 63.51 | 58.14 |
| HCSC | 63.6 | 60.8 | 53.8 |

IMDB

| Model | 20% sparse | 50% sparse | 80% sparse |
|---|---|---|---|
| BERT-base | 43.49 | 37.45 | 29.82 |
| + CHIM | 43.24 | 35.72 | 31.48 |
| + Injectors | 50.74 | 43.68 | 34.02 |
| HCSC | 50.5 | 45.6 | 36.8 |


Table 4: Performance on the sparse versions of Yelp 2013 and IMDB. Best systems are shown in bold. Models that perform worse than BERT-base are colored red.

As shown in Table 1, most datasets contain attributes that are sparse. In this section, we analyze the ability of Injectors on sparse attributes using two experiments. In our first experiment, we looked at the model performance at different sparsity levels. That is, for each attribute in the dataset, we equally divided the test set into ten bins arranged according to the attribute sparsity. Figure 4 shows attribute-specific plots of the performance for each bin of the base model, CHIM, and Injectors on the Goodreads dataset (see Appendix for the plots on all datasets). For sparse attributes such as user and book, the performance difference of Injectors from the base model increases as the sparsity level increases, showing that our method mitigates attribute sparsity well. CHIM, on the other hand, has a uniform performance increase all throughout.
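The binning procedure of the first experiment can be sketched as follows. Measuring sparsity by how often a test instance's attribute value occurs in the training set is an assumption, as are the function and variable names.

```python
from collections import Counter

def accuracy_per_sparsity_bin(train_attrs, test_attrs, correct, n_bins=10):
    """Split test instances into equal bins ordered by attribute frequency.

    Bin 0 holds the instances whose attribute value is rarest in training.
    `correct` holds 1 for a correct prediction and 0 otherwise.
    """
    freq = Counter(train_attrs)  # unseen attribute values count as 0
    order = sorted(range(len(test_attrs)), key=lambda i: freq[test_attrs[i]])
    results = []
    for b in range(n_bins):
        start = b * len(order) // n_bins
        end = (b + 1) * len(order) // n_bins
        idx = order[start:end]
        results.append(sum(correct[i] for i in idx) / len(idx) if idx else None)
    return results

train_users = ["u1"] * 5 + ["u2"]
test_users = ["u1", "u2", "u3", "u1"]
is_correct = [1, 0, 0, 1]
per_bin = accuracy_per_sparsity_bin(train_users, test_users, is_correct, n_bins=2)
```

Comparing such per-bin scores between two models shows whether the gain of one over the other grows as attributes become rarer.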

In our second experiment, we checked the performance of the models when trained using synthetically created sparse versions of Yelp 2013 and IMDB provided in Amplayo et al. (2018), where the datasets are downsampled such that the attributes become 20/50/80% more sparse than the original. We compared the performance of BERT-base, CHIM, and Injectors. Table 4 shows their performance on the datasets, along with HCSC Amplayo et al. (2018), which is a BiLSTM- and CNN-based model with a sparsity-aware attribute injection method similar to the UPA method Chen et al. (2016). As can be seen, our method outperforms both BERT-base and CHIM on these datasets, while CHIM underperforms and is worse than the base model in all cases except IMDB 80% sparse. Our method performs better than HCSC on Yelp 2013, and competitively on IMDB, where input texts are longer than 512 tokens.

On Model Modularity




| Model | A | RPT → A | R | APT → R |
|---|---|---|---|---|
| CHIM | 52.6 | 52.5 (-0.2%) | 51.3 | 51.2 (-0.2%) |
| Injectors | 56.3 | 56.0 (-0.5%) | 54.0 | 54.5 (+0.9%) |


Table 5: Performance of models on single-task Beeradvocate (A: appearance, R: aroma, P: palate, T: taste). The arrow (→) indicates that the attribute-specific adapters of the model on the right-hand side (i.e., A and R) are initialized using parameters of the left-hand side model (i.e., RPT and APT) and are frozen during training.

Since Injectors are basically a sequence of adapters, which are known to be self-contained modular units Pfeiffer et al. (2020), modular composition across different models is also effective in our setting. We verify this using the following experiment. We first divide the Beeradvocate dataset, which is a multi-aspect rating prediction task with four aspects, into two subsets: (1) a single-task target dataset and (2) a 3-task source dataset. We train our model using the source dataset, obtaining attribute-specific parameters. We then transfer these fixed parameters when training the model using the target dataset, and only fine-tune parameters of the task adapter and the classifier.
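The transfer step can be pictured with a small sketch in which models are plain dictionaries of named parameters. The naming scheme (the `attribute_adapter.` prefix) and the function name are hypothetical and only serve to show which parts are copied and frozen.

```python
def transfer_attribute_modules(source_params, target_params,
                               prefix="attribute_adapter."):
    """Copy attribute-specific parameters from a source model and freeze them.

    Both arguments map (hypothetical) parameter names to values. Returns the
    merged parameters and the set of names that remain trainable.
    """
    merged = dict(target_params)
    frozen = set()
    for name, value in source_params.items():
        if name.startswith(prefix):
            merged[name] = value  # reuse what was learned on the source tasks
            frozen.add(name)
    trainable = {name for name in merged if name not in frozen}
    return merged, trainable

source = {
    "attribute_adapter.user": [0.3, -0.1],
    "task_adapter": [0.5],
    "classifier": [0.2],
}
target = {
    "attribute_adapter.user": [0.0, 0.0],
    "task_adapter": [0.0],
    "classifier": [0.0],
}
merged, trainable = transfer_attribute_modules(source, target)
```

Only the task adapter and the classifier remain trainable on the target task in this sketch, mirroring the setup described above.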

We arbitrarily chose the first two aspects alphabetically, appearance (A) and aroma (R), as target tasks. We split the training dataset into four parts, one for each aspect, to remove biases from overlapping training datasets. We combined the non-target datasets as the source dataset. We experimented with CHIM and Injectors, and report the results in Table 5. When compared to the same model trained directly on the target task (see A and R columns), both methods incur only minimal performance loss (see RPT → A and APT → R columns), with a marginal increase on the R target task when using Injectors. This confirms the results of previous work on the transferability of attribute embeddings in CHIM Amplayo (2019), as well as the modularity of adapter-based modules Pfeiffer et al. (2020).

6 Conclusions

We considered the use of attributes as additional contexts when fine-tuning PLMs for NLP tasks. We proposed the Injector module, an extension of adapters that also accepts attributes as input. Our method considers two kinds of injection strategies, uses parameter-saving techniques, and introduces training mechanisms to account for sparse and multi-labeled attributes. Experiments on eight datasets of various classification tasks showed that our method improves substantially over previous methods. Finally, we conducted extensive analyses on how Injectors handle attribute sparsity and to verify their modularity. In the future, we plan to apply our methods to real world data where there are millions of attributes. We also plan to explore the use of attribute injection methods to text generation tasks, i.e. injecting attributes when generating texts instead of during modeling.


Acknowledgments

We would like to thank Jaewook Kang and other members of the NAVER AI Lab for their insightful comments. Reinald is supported by a Google PhD Fellowship.


References

  • R. K. Amplayo, J. Kim, S. Sung, and S. Hwang (2018) Cold-start aware user and product attention for sentiment classification. In ACL, pp. 2535–2544. Cited by: §1, §2, §5.
  • R. K. Amplayo (2019) Rethinking attribute representation and injection for sentiment classification. In EMNLP-IJCNLP, pp. 5602–5613. Cited by: §1, §1, §2, item 5, §5, §5, Table 2.
  • Y. Bengio, A. C. Courville, and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35, pp. 1798–1828. Cited by: §2.
  • T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei (2020) Language models are few-shot learners. In NeurIPS, Cited by: §1, §2.
  • H. Chen, M. Sun, C. Tu, Y. Lin, and Z. Liu (2016) Neural sentiment classification with user and product attention. In EMNLP, pp. 1650–1659. Cited by: §1, §1, §2, item 4, §5.
  • L. Denoyer and P. Gallinari (2006) The Wikipedia XML corpus. In INEX, Cited by: §1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, Cited by: §1, §1, §2, §4.
  • Z. Dou (2017) Capturing user and product information for document level sentiment analysis with deep memory network. In EMNLP, pp. 521–526. Cited by: §1, §2.
  • A. Fan, D. Grangier, and M. Auli (2018) Controllable abstractive summarization. In NMT@ACL, Cited by: §1, §2.
  • J. Ficler and Y. Goldberg (2017) Controlling linguistic style aspects in neural language generation. ArXiv abs/1707.02633. Cited by: §2.
  • T. Fukuhara, H. Nakagawa, and T. Nishida (2007) Understanding sentiment of people from news articles: temporal sentiment analysis of social events. In ICWSM, Cited by: §2.
  • T. Gao, A. Fisch, and D. Chen (2021) Making pre-trained language models better few-shot learners. In ACL-IJCNLP, Cited by: §2.
  • W. Gao, N. Yoshinaga, N. Kaji, and M. Kitsuregawa (2013) Modeling user leniency and product popularity for sentiment classification. In IJCNLP, Cited by: §2.
  • W. Han, B. Pang, and Y. N. Wu (2021) Robust transfer learning with pretrained language models through adapters. In ACL, Cited by: §3.1, §5.
  • N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. de Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly (2019) Parameter-efficient transfer learning for NLP. In ICML, Cited by: §1, §3.1, §3, §3.
  • A. Joorabchi and A. Mahdi (2011) An unsupervised approach to automatic classification of scientific literature utilizing bibliographic metadata. Journal of Information Science 37, pp. 499 – 514. Cited by: §2.
  • N. S. Keskar, B. McCann, L. R. Varshney, C. Xiong, and R. Socher (2019) CTRL: a conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858. Cited by: §1, §2.
  • Y. Kikuchi, G. Neubig, R. Sasano, H. Takamura, and M. Okumura (2016) Controlling output length in neural encoder-decoders. In EMNLP, Cited by: §2.
  • J. Kim, R. K. Amplayo, K. Lee, S. Sung, M. Seo, and S. Hwang (2019) Categorical metadata representation for customized text classification. TACL 7, pp. 201–215. Cited by: §1, §1, §1, §3.2, item 5, §4, Table 2.
  • J. Kim, D. Kim, and A. H. Oh (2017) Joint modeling of topics, citations, and topical authority in academic corpora. Transactions of the Association for Computational Linguistics 5, pp. 191–204. Cited by: §2.
  • S. Liu and T. Forss (2014) Web content classification based on topic and sentiment analysis of text. In KDIR, Cited by: §2.
  • X. Liu, Y. Zheng, Z. Du, M. Ding, Y. Qian, Z. Yang, and J. Tang (2021) GPT understands, too. arXiv:2103.10385. Cited by: §1.
  • Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov (2019) RoBERTa: a robustly optimized BERT pretraining approach. ArXiv abs/1907.11692. Cited by: §1.
  • Y. Long, M. Ma, Q. Lu, R. Xiang, and C. Huang (2018) Dual memory network model for biased product review classification. In WASSA@EMNLP, pp. 140–148. Cited by: §1, §2.
  • I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In ICLR, Cited by: §4.
  • D. Ma, S. Li, X. Zhang, H. Wang, and X. Sun (2017) Cascading multiway attentions for document-level sentiment classification. In IJCNLP, pp. 634–643. Cited by: §1, §2.
  • R. K. Mahabadi, J. Henderson, and S. Ruder (2021) Compacter: efficient low-rank hypercomplex adapter layers. ArXiv abs/2106.04647. Cited by: §1, §3.2.
  • B. P. Majumder, S. Li, J. Ni, and J. McAuley (2019) Generating personalized recipes from historical user preferences. In EMNLP/IJCNLP, Cited by: item 1, item 6.
  • J. McAuley, J. Leskovec, and D. Jurafsky (2012) Learning attitudes and attributes from multi-aspect reviews. In ICDM, Cited by: item 3, item 8.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013) Efficient estimation of word representations in vector space. In ICLR, Cited by: §2.
  • J. Ni, J. Li, and J. McAuley (2019) Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In EMNLP-IJCNLP, pp. 188–197. Cited by: §1.
  • J. Pfeiffer, A. Rücklé, C. Poth, A. Kamath, I. Vulić, S. Ruder, K. Cho, and I. Gurevych (2020) AdapterHub: a framework for adapting transformers. In EMNLP, Cited by: §5, §5.
  • X. Qiu, T. Sun, Y. Xu, Y. Shao, N. Dai, and X. Huang (2020) Pre-trained models for natural language processing: a survey. Science China Technological Sciences, pp. 1–26. Cited by: §1.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2020) Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR 21, pp. 1–67. Cited by: §1.
  • D. Ramage, D. Hall, R. Nallapati, and C. D. Manning (2009) Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In EMNLP, Cited by: §2.
  • M. Rosen-Zvi, T. Griffiths, M. Steyvers, and P. Smyth (2004) The author-topic model for authors and documents. In UAI, Cited by: §2.
  • E. Sandhaus (2008) The New York Times annotated corpus. Linguistic Data Consortium, Philadelphia 6 (12), pp. e26752. Cited by: §1.
  • T. Schick and H. Schütze (2021) It’s not just size that matters: small language models are also few-shot learners. In NAACL, Cited by: §1, §2.
  • R. Sennrich, B. Haddow, and A. Birch (2016) Controlling politeness in neural machine translation via side constraints. In NAACL, Cited by: §2, item 3.
  • D. Tang, B. Qin, and T. Liu (2015) Learning semantic representations of users and products for document level sentiment classification. In ACL-IJCNLP, pp. 1014–1023. Cited by: §1, §1, §1, §2, item 1, item 2, item 3.
  • Y. Tay, A. Zhang, A. T. Luu, J. Rao, S. Zhang, S. Wang, J. Fu, and S. C. Hui (2019) Lightweight and efficient neural natural language processing with quaternion networks. In ACL, Cited by: footnote 1.
  • A. Vaswani, N. M. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS, Cited by: §2.
  • M. Wan and J. McAuley (2018) Item recommendation on monotonic behavior chains. In RecSys, Cited by: item 2, item 7.
  • A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman (2018) GLUE: a multi-task benchmark and analysis platform for natural language understanding. In ICLR, Cited by: §1.
  • Y. Wang, Y. Wu, L. Mou, Z. Li, and W. Chao (2019) Harnessing pre-trained neural networks with rules for formality style transfer. In EMNLP, Cited by: §1.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2020) Transformers: state-of-the-art natural language processing. In EMNLP, Cited by: §4.
  • Z. Wu, X. Dai, C. Yin, S. Huang, and J. Chen (2018) Improving review representations with user attention and product attention for sentiment classification. In AAAI, Cited by: §1, §2.
  • K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks?. In ICLR, Cited by: §3.2.
  • M. Yang, J. Mei, H. Ji, W. Zhao, Z. Zhao, and X. Chen (2017) Identifying and tracking sentiments and topics from social media texts during natural disasters. In EMNLP, Cited by: §2.
  • P. Yang, X. Sun, W. Li, and S. Ma (2018) Automatic academic paper rating based on modularized hierarchical convolutional neural network. In ACL, Cited by: §1, item 4.
  • A. Zhang, Y. Tay, S. Zhang, A. Chan, A. Luu, S. C. Hui, and J. Fu (2021) Beyond fully-connected layers with quaternions: parameterization of hypercomplex multiplications with 1/n parameters. In ICLR, Cited by: §1, §3.2, §3, footnote 1.
  • R. Zhao and K. Mao (2017) Topic-aware deep compositional models for sentence classification. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25, pp. 248–260. Cited by: §2.
  • T. Zhao, E. Wallace, S. Feng, D. Klein, and S. Singh (2021) Calibrate before use: improving few-shot performance of language models. In ICML, Cited by: §2.
  • Y. Zhu, R. Kiros, R. Zemel, R. Salakhutdinov, R. Urtasun, A. Torralba, and S. Fidler (2015) Aligning books and movies: towards story-like visual explanations by watching movies and reading books. In ICCV, pp. 19–27. Cited by: §1, §1.

Appendix A

A.1 Training Configurations for Reproducibility

Our model is implemented in Python 3 and mainly uses the following dependencies: torch as the machine learning library, nltk for text preprocessing, transformers for its BERT implementation, and numpy for high-level mathematical operations on CPU. During our experiments, we used machines with a single GeForce GTX 1080Ti GPU, 4 CPUs, and 16GB of RAM. Training takes less than a day on every dataset. The total number of parameters depends on the number of attributes, each of which has its own attribute-specific adapters. In our experiments, excluding the embedding matrices and classifiers, which vary considerably across datasets and tasks, BERT-base with Injectors has a total of 105M parameters with 19M (18%) trained for tasks with two attributes, or a total of 121M parameters with 36M (29%) trained for tasks with four attributes. Using the accuracy of the model on the development set, we tuned the learning rate (from , , , and ), the adapter size (from , , , and ), and the hypercomplex dimensions (from 2, 4, 6, and 8).

A.2 Descriptions of Newly Introduced Datasets

This section describes how we procured the three datasets we introduce in this paper:

  1. Food.com: We used the dataset gathered in Majumder et al. (2019), which was used as a personalized recipe generation dataset. We repurposed the dataset for a new classification task and used the recipes as input text and the duration (in minutes) as output class. We removed instances with outliers: (1) recipes that took less than 5 minutes or more than 150 minutes; (2) recipes with more than 500 tokens or fewer than 10 tokens; and (3) tags with more than 50 labels. We also removed from the attribute vocabulary tags that explicitly indicate the recipe duration (e.g., 60-minutes-or-less) and those that are used on almost all instances (e.g., time-to-make). We shuffled the data and used 10% each for the development and test sets, and the remaining 80% for the training set.

  2. Goodreads: We used the review corpus gathered in Wan and McAuley (2018), which was also used for spoiler detection. Since the original split is not publicly shared, we created our own. We first removed very short (fewer than 32 tokens) and very long (more than 256 tokens) reviews as outliers. We then divided the data into three splits, with two 10K splits as the development and test sets, and the remaining split as the training set.

  3. Beeradvocate: We used the review corpus gathered in McAuley et al. (2012). We removed outliers and split the dataset into three using the same procedure as for Goodreads.
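The filtering and splitting steps above can be sketched as follows. This is an illustrative sketch rather than the authors' preprocessing script: all function names are hypothetical, and criterion (3) for Food.com is read here as "recipes with more than 50 tags", which is an assumption.

```python
import random


def keep_recipe(tokens, minutes, tags):
    """Food.com outlier filter; the tag rule is an assumed reading of criterion (3)."""
    return (
        5 <= minutes <= 150           # (1) duration between 5 and 150 minutes
        and 10 <= len(tokens) <= 500  # (2) length between 10 and 500 tokens
        and len(tags) <= 50           # (3) assumed: at most 50 tags per recipe
    )


def keep_review(tokens):
    """Goodreads/Beeradvocate filter: keep reviews of 32 to 256 tokens."""
    return 32 <= len(tokens) <= 256


def split_80_10_10(items, seed=0):
    """Shuffle, then use 10% each for dev and test and the rest for training."""
    items = list(items)
    random.Random(seed).shuffle(items)
    k = len(items) // 10
    return items[2 * k:], items[:k], items[k:2 * k]  # train, dev, test


def split_by_count(items, heldout_size=10_000, seed=0):
    """Shuffle, then hold out two fixed-size splits (10K each in the paper)."""
    items = list(items)
    random.Random(seed).shuffle(items)
    k = heldout_size
    return items[2 * k:], items[:k], items[k:2 * k]  # train, dev, test


train, dev, test = split_80_10_10(range(100))
print(len(train), len(dev), len(test))
```

The two split helpers differ only in how the held-out size is chosen: a fraction of the data for Food.com, and a fixed count for Goodreads and Beeradvocate.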

A.3 Parameter Analysis of Weight-based Injection

Recall that we define as follows:


In a naive setting, we can trivially use a projection function as our , which would linearly transform into the shape . This would need a weight tensor of size , which can be prohibitively large. This parameter dominates all the other parameters in the module, thus the overall parameter of the naive method is .

Our parameter-saving methods remove this large tensor, but instead use three smaller parameters in hypercomplex space: the transform function that is basically a linear transformation with a projection matrix of size (Eq. 6), the vector of size , and the matrix of size . Since we have dimensions in our hypercomplex space, we have a total of , which we can reduce as follows:


given that and that we can treat as a constant ( in our experiment). Thus the overall parameter when using our parameter-saving method is . We emphasize that this is a huge improvement since the PLM hidden size is usually the largest dimension.

The output weight has a rank of at most , i.e., (1) the low-rank method (Eq. 7) outputs a matrix of rank ; (2) the Kronecker product (Eq. 8) returns a matrix of rank ; and finally, (3) the sum of multiple matrices (Eq. 3.2) has a rank .
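The three rank facts used in this argument can be checked numerically. The sketch below does not reproduce Eqs. 7 and 8; it only assumes, for illustration, that each term is a Kronecker product of a small n × n matrix with a rank-1 outer product, and that the weight is the sum of n such terms. The shapes (n, p, q), the random integer entries, and all function names are hypothetical. Exact rational arithmetic is used so that the ranks are not affected by floating-point error.

```python
from fractions import Fraction
import random


def outer(u, v):
    """Outer product u v^T, a matrix of rank at most 1."""
    return [[Fraction(a * b) for b in v] for a in u]


def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in row_a for b in row_b] for row_a in A for row_b in B]


def add(A, B):
    """Elementwise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def rank(M):
    """Exact matrix rank via Gaussian elimination over the rationals."""
    M = [[Fraction(x) for x in row] for row in M]
    n_rows, n_cols, r = len(M), len(M[0]), 0
    for c in range(n_cols):
        pivot = next((i for i in range(r, n_rows) if M[i][c] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, n_rows):
            factor = M[i][c] / M[r][c]
            M[i] = [x - factor * y for x, y in zip(M[i], M[r])]
        r += 1
    return r


rng = random.Random(0)
n, p, q = 2, 3, 4  # hypothetical sizes, chosen only to keep the example small
terms = []
for _ in range(n):
    A = [[rng.randint(1, 9) for _ in range(n)] for _ in range(n)]
    s = [rng.randint(1, 9) for _ in range(p)]
    t = [rng.randint(1, 9) for _ in range(q)]
    B = outer(s, t)  # low-rank factor (rank 1 here)
    terms.append((A, B, kron(A, B)))

W = terms[0][2]
for _, _, K in terms[1:]:
    W = add(W, K)  # sum of the Kronecker-product terms

print(rank(W), "<=", sum(rank(K) for _, _, K in terms))
```

The checks rely on two standard identities: the rank of a Kronecker product equals the product of the ranks of its factors, and the rank of a sum is at most the sum of the ranks. Under the assumed structure, the summed weight therefore has rank at most n times the bound for a single term, regardless of how large the full matrix is.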

A.4 Full Ablation Studies

Table 6 reports ablation studies on all datasets, which assess the contribution of the different components of our models. The observations closely match those from the ablation table in the main text of our paper.


| Model | Yelp 2013 | Yelp 2014 | IMDB | AAPR | PolMed | Food.com | Goodreads | BeerAdvocate |
|---|---|---|---|---|---|---|---|---|
| + Injectors | 70.76 | 71.35 | 55.13 | 67.10 | 47.27 | 45.01 | 57.78 | 57.69 |
| – bias injection | 70.33 | 71.24 | 55.06 | 66.45 | 46.67 | 44.86 | 57.06 | 57.67 |
| – weight injection | 70.51 | 71.27 | 54.82 | 66.85 | 45.66 | 44.74 | 57.61 | 57.29 |
| – task adapter | 69.21 | 69.68 | 54.03 | 65.55 | 46.89 | 44.30 | 56.57 | 57.28 |
| – attribute drop | 69.29 | 70.94 | 54.01 | 65.55 | 46.33 | 44.48 | 56.62 | 57.41 |
| – post-aggregation | 70.76 | 71.35 | 55.13 | 64.42 | 47.27 | 43.30 | 57.78 | 57.69 |
| – PHM | 68.27 | 69.05 | 54.15 | 65.15 | 46.16 | 43.89 | 55.99 | 56.51 |


Table 6: Performance of Injectors and versions thereof without some of our proposed components (second block), training mechanisms (third block), and parameter-saving methods (fourth block). OOM denotes that the model does not run on our experimental setup due to an out-of-memory error.

A.5 Performance Plots per Sparsity Level

Finally, we show the performance plots per sparsity level for all datasets in Figures 5 and 6. Overall, when a line is fitted to each plot, Injectors outperform CHIM on all datasets and sparsity levels. For very sparse attributes (e.g., AAPR authors and Goodreads users), the increase in performance is substantially larger at the sparser levels.

Figure 5: Performance plots per sparsity level of BERT-base (gray), CHIM (red), and Injectors (blue) for each attribute in the Food.com, Goodreads, and Beeradvocate datasets. Lower values on the x-axis correspond to higher sparsity levels (0 is the most sparse).
Figure 6: Performance plots per sparsity level of BERT-base (gray), CHIM (red), and Injectors (blue) for each attribute in the Yelp2013, Yelp2014, IMDB, AAPR, and PolMed datasets. Lower values on the x-axis correspond to higher sparsity levels (0 is the most sparse).