TX-Ray: Quantifying and Explaining Model-Knowledge Transfer in (Un-)Supervised NLP

12/02/2019 · by Nils Rethmeier, et al. · Københavns Uni

While state-of-the-art NLP explainability (XAI) methods focus on supervised, per-instance end or diagnostic probing task evaluation [4, 2, 10], this is insufficient to interpret and quantify model knowledge transfer during (un-)supervised training. By instead expressing each neuron as an interpretable token-activation distribution collected over many instances, one can quantify and guide visual exploration of neuron-knowledge change between model training stages to analyze transfer beyond probing tasks and the per-instance level. This allows one to analyze: (RQ1) how neurons abstract knowledge during unsupervised pretraining; (RQ2) how pretrained neurons zero-shot transfer knowledge to new domain data; and (RQ3) how supervised tasks reorder pretrained neuron knowledge abstractions. Since the meaningfulness of XAI methods is hard to quantify [11, 4], we analyze three example learning setups (RQ1-3) to empirically verify that our method (TX-Ray) identifies transfer-(ir)relevant neurons for pruning (RQ3), and that its transfer metrics coincide with traditional measures like perplexity (RQ1). We also find that TX-Ray-guided pruning of supervision-(ir)relevant neuron knowledge (RQ3) can identify `lottery ticket'-like [9, 40] neurons that drive model performance and robustness. Upon inspecting pruned neurons, we find that task-relevant neuron knowledge (`tickets') appears (over-)fit, while task-irrelevant neurons lower overfitting, i.e. TX-Ray identifies neurons that generalize, transfer or specialize model knowledge [25]. Finally, through RQ1-3, we find that TX-Ray helps to explore and quantify the dynamics of (continual) knowledge transfer, and that it can shed light on neuron-knowledge specialization and generalization, complementing (costly) supervised probing task procurement and established `summary' statistics like perplexity, ROC or F scores.


1 Introduction

Continual and transfer learning have gained importance across fields, and especially in NLP, where the de facto standard approach is to pretrain a sequence encoder and fine-tune it on a set of supervised end-tasks [DBLP:conf/acl/RuderH18, TuneNotTune19]. Knowledge transfer in NLP is currently focused on ‘decision understanding’ [CSI19] by analyzing supervised probing tasks [belinkov-glass-2019-analysis] using either performance metrics [Senteval, Glue] or laborious per-instance explainability [arrasACL19, 2019-errudite]. Such approaches analyze input-output relations for ‘decision understanding’, treating models as a black box, while grey-box ‘model-understanding’ methods [CSI19] like DeepEyes or Activation Atlas [DeepEyes18, carter2019activation] visualize interpretable model abstractions learned via supervision.

Figure 1: Example uses of TX-Ray for transfer learning and model interpretability. Left: pre-train a sequence encoder on a corpus and collect token-activation distributions (§ 2.1, red bars) over input features (e.g. words).³ Middle: apply, but do not re-train, the encoder on new domain inputs and observe the changed neuron activations (green). Similarities between red and green reveal zero-shot forward-transfer potential, or data match between the two corpora. Right: fine-tune the encoder on supervision labels to reveal ‘backward’ transfer of supervision knowledge into the encoder’s knowledge abstractions.⁴ (³Features are discrete inputs like tokens or POS tags. ⁴‘Backward transfer’ since the encoder changes, while the labels do not.)

Unfortunately, supervised probing annotation is costly, is not guaranteed to be reliable under domain shifts, and can only evaluate expected knowledge absorption, while unforeseen, perhaps more important model-knowledge properties remain hidden [BERTsFeather, RightWrong19] – i.e., “We have to remember that what we observe is not nature herself, but nature exposed to our method of questioning” – W. Heisenberg [heisenberg1958physics]. In fact, hypothesis testing, while useful for question verification, has little utility in identifying whether the questions are the correct or interesting ones to begin with. Despite NLP’s heavy reliance on pretraining transfer, the field currently lacks methods that allow one to explore, analyze, and quantify a model’s nuanced knowledge transfer mechanisms during pretraining or supervised fine-tuning of even basic sequence encoders like LSTMs. We argue that transfer learning cannot be understood in depth using a supervised evaluation setting only [BERTsFeather], because this merely reveals a limited, expectation-biased fraction of how model neurons build, use, transfer and catastrophically forget their knowledge; especially when analyzing continual transfer, model knowledge generalization [BERTsFeather, RightWrong19, Wallace2019Triggers], or low-resource learning. To thoroughly understand and select the best models, we need to start developing XAI methods that can reveal and quantify (unsupervised) knowledge loss, shift or gain, at both the neuron (detail) and model (overview) level, to enable us to discover unforeseen, potentially fundamental knowledge transfer mechanics through explorative analysis and to minimize Clever Hans model optimization [BERTsFeather, RightWrong19, Wallace2019Triggers].

Contributions and research questions: In this work, we introduce a simple, yet general, token-activation distribution based model-knowledge analysis method for NLP, inspired by recent activation maximization based vision XAI methods [olah2018the, carter2019activation, DeepEyes18]. We show that this enables exportable, quantifiable neuron-level knowledge transfer analysis in a unified view of unsupervised and supervised model analysis, while remaining framework agnostic, using an overview-detail analysis approach [DBLP:journals/csur/CockburnKB08] that supports exploration of expected and unforeseen model learning effects. This puts a human evaluator’s world understanding and knowledge at the center of the analysis, to scale model and learning understanding beyond probing task evaluation. To answer the overarching research question of “how to exhaustively explore and analyze knowledge transfer”, we demonstrate our method, TX-Ray, on three digestible research questions (RQ1-3) – i.e. three common learning setups.

RQ1 unsupervised knowledge absorption: How can TX-Ray visualize and quantify neuron knowledge abstraction building and change during early and late (unsupervised) pretraining stages of a sequence encoder? Do TX-Ray’s transfer (knowledge change) metrics coincide with standard measures like loss and perplexity?

RQ2 unsupervised to zero-shot transfer: When applying pretrained knowledge to a new domain without re-training, e.g. in a zero-shot setting, where and how much knowledge is transferable?

RQ3 un- to supervised transfer: Does knowledge transfer ‘backwards’ from supervision labels into a pretrained encoder? Does TX-Ray successfully identify knowledge-neurons that become (ir)relevant due to supervision and does pruning these neurons affect generalization and specialization performance as it indicates?

Through RQ1-3, we gain instructive insights into the knowledge interplay between unsupervised and supervised learning, where knowledge is added and lost by supervision, and find through pruning experiments (RQ3) that TX-Ray successfully identifies task-(ir)relevant neurons. We also find evidence suggesting that (catastrophic) forgetting, expressed as neuron (de-)specialization, is more informative within unsupervised settings than within solely supervised ones [PARISI201954].

2 Approach

While for ‘decision understanding’ methods [arrasACL19, belinkov-glass-2019-analysis] input (token) importance is related to prediction strength, we instead provide ‘model understanding’ [CSI19] by recording input token importance for neuron activation strength. We adapt the broadly used activation maximization (AM) method [olah2018the, carter2019activation, DeepEyes18] for discrete inputs in NLP by recording, for each input sequence token, which neuron maximally activates (responds). Thus we can express individual neurons as activation probability distributions over discrete features (see § 2.1), such that each neuron describes the input features it prefers, i.e. maximally responds to.

This enables our method, TX-Ray (Transfer eXplainability as neuron Response⁵ Aggregation analYsis), which visualizes and quantifies neuron feature distribution changes using Hellinger distance (see § 2.2), and hence analyzes transfer between un- and supervised models and to new domains. (⁵We use the term ‘activation’ and its neuroscience analog ‘response’ synonymously, since both describe a neuron’s behavior under changing ‘stimuli’ and ‘environments’ [responseNeuroScience], and since a like-minded methodology to TX-Ray has recently been used to analyze and quantify knowledge change during task learning in the prefrontal cortex of rats [singh2019medial] (§ 2.2).)

The main benefit of token-activation distributions is that they condense ‘explanations’ over an entire corpus instead of creating them per-instance [belinkov-glass-2019-analysis], thereby reducing the cognitive load and work load when analyzing knowledge transfer in neural models. Moreover, changes in a neuron’s token-activation distribution can be used to measure knowledge transfer within training stages and models, allowing us to automatically propose interesting starting points (see Fig. 6, 8) for nuanced, per-neuron evaluation (see Fig. 7 and 9).

2.1 Neurons as token-activation distributions:

We express each neuron as a distribution of features with activation probabilities (Fig. 1) that have been aggregated over an entire corpus. Each distribution is constructed as follows.

(1) Record maximally activated features for neurons: Given a corpus $C$, its text sequences $s \in C$, input features (tokens) $x \in s$, a sequence encoder $E$, and hidden-layer neurons $n_1, \dots, n_k$, we calculate for each input token feature $x$ in the corpus sequences: its encoder neuron activations $E(x)$, along with $x$'s maximally active neuron $n_{max} = \arg\max_i E(x)_i$ and (maximum) activation value $a_{max} = \max_i E(x)_i$; we then record a single feature's activation as a row vector $r = (x, n_{max}, a_{max})$. If the encoder is part of a classifier model, we also record the sequence's class probability $p(\hat{y})$ and true class $y$ in a longer vector $r = (x, n_{max}, a_{max}, p(\hat{y}), y)$. For analyses in RQ1-3, we also record part-of-speech tags (POS, see § 3.1) in the row vectors. This produces a matrix $R$ of neuron feature activations, which we aggregate to express each neuron as a probability distribution over feature activations in step (2).
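To make step (1) concrete, here is a minimal PyTorch sketch of the recording pass, assuming a uni-directional LSTM encoder that returns per-token hidden states; the class and function names (`LSTMEncoder`, `record_max_activations`) are illustrative, not the paper's released code.

```python
# Minimal sketch of step (1): record, for every input token, which hidden
# neuron fires strongest and how strongly. All names are illustrative.
import torch
import torch.nn as nn

class LSTMEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=1500):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):                  # (batch, seq_len)
        hidden, _ = self.lstm(self.embedding(token_ids))
        return hidden                              # (batch, seq_len, hidden_dim)

@torch.no_grad()
def record_max_activations(encoder, token_ids, id2token):
    """Return one row (token, max_neuron, max_activation) per input token."""
    rows = []
    hidden = encoder(token_ids)                    # (batch, seq_len, hidden_dim)
    max_vals, max_idx = hidden.max(dim=-1)         # strongest neuron per token
    for b in range(token_ids.size(0)):
        for t in range(token_ids.size(1)):
            rows.append((id2token[token_ids[b, t].item()],
                         max_idx[b, t].item(),
                         max_vals[b, t].item()))
    return rows
```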

(2) Per-neuron token-activation distributions via aggregation: From the rows $r \in R$, we generate for each neuron $n_i$ its discrete feature activation (response) distribution $A_i$, where each entry holds a feature $x$ the neuron maximally activated on and the mean (maximum) activation $\bar{a}_x$ of that feature in $R$. We then turn each activation (response) distribution $A_i$ into a probability distribution by calculating the sum $Z_i$ of its feature activation means and dividing each mean by $Z_i$, to produce the normalized distribution $P_i$, where each $p_x = \bar{a}_x / Z_i$ is now the activation probability of feature $x$. Finally, for the $k$ neurons in a model, $P = \{P_1, \dots, P_k\}$ describes their per-neuron activation distributions.
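A small follow-up sketch of step (2), aggregating the recorded rows into normalized per-neuron token-activation distributions; the helper name is ours and this is one plausible implementation under the notation reconstructed above, not the authors' exact code.

```python
# Sketch of step (2): aggregate recorded rows into per-neuron token-activation
# probability distributions (mean max-activation per feature, then normalize).
from collections import defaultdict

def aggregate_neuron_distributions(rows):
    """rows: iterable of (token, neuron_id, activation) from step (1)."""
    sums = defaultdict(lambda: defaultdict(float))   # neuron -> token -> sum
    counts = defaultdict(lambda: defaultdict(int))   # neuron -> token -> count
    for token, neuron, activation in rows:
        sums[neuron][token] += activation
        counts[neuron][token] += 1

    distributions = {}
    for neuron, token_sums in sums.items():
        # mean (maximum-)activation per feature, then normalize to probabilities
        means = {tok: s / counts[neuron][tok] for tok, s in token_sums.items()}
        z = sum(means.values())
        distributions[neuron] = {tok: m / z for tok, m in means.items()}
    return distributions   # neuron_id -> {token: activation probability}
```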

Each feature $x$ can be an n-gram, though we only use word uni-grams to focus on transfer understanding. Nevertheless, like [arrasACL19, carter2019activation], TX-Ray’s aggregation works for deeper models. Using more maximum activations per token would provide another, denser learning analysis, but would also multiply computation and cognitive load, while blurring instructive insight about neuron change, over- and under-specialization (§ 3.3.1, 3.3.2 and 3.3.3).

2.2 Quantify neuron change – as transfer:

We use Hellinger distance [hellinger1909neue] and neuron distribution length to quantify the difference between the discrete feature probability distributions $P$ and $Q$ of two neurons, where features absent from a distribution have probability zero:

$H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{x} \left(\sqrt{p_x} - \sqrt{q_x}\right)^2}$

We chose neuron length because it tells us the number of (unique) features in a token-activation distribution, and Hellinger distance because it is symmetric, unlike the Kullback-Leibler divergence [Hellinger15]. Importantly, when one of the token-activation distributions $P$ or $Q$ is empty, i.e. has zero features, we return the resulting Hellinger distance as ill-defined. That way, we can use Hellinger distance to easily identify how many neurons were ‘alive’, i.e. actively used, and how many were ‘dead’ (ill-defined), when analysing training stages and transfer. Hellinger distance provides an easily quantifiable measure of neuron differences in terms of distributional shift of features, which we use to compare how the activation of a neuron differs during pretraining (RQ1), zero-shot transfer (RQ2), and supervised fine-tuning (RQ3). A similar use of Hellinger distance to analyze neuron feature distribution changes during supervised learning (RQ3) was recently explored by [singh2019medial], who “measured changes in neuron response dictionaries after task learning success” in the prefrontal cortex of rats. Additionally, by analyzing distribution length changes over time, i.e. during training epochs and stages (RQ1-3), we can identify whether a neuron specializes to a small set of features or responds to (generalizes over) a broad feature set.
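The following sketch implements the § 2.2 quantities under the assumptions above: Hellinger distance over the union of two neurons' feature vocabularies (absent features get probability zero), an ill-defined result for empty ('dead') neurons, and neuron length as the number of unique features.

```python
# Sketch of § 2.2: Hellinger distance between two neuron distributions,
# treating empty ('dead') neurons as ill-defined, plus neuron length.
import math

def hellinger(p, q):
    """p, q: dicts mapping feature -> activation probability."""
    if not p or not q:
        return None                      # 'dead' neuron: distance is ill-defined
    features = set(p) | set(q)           # missing features have probability 0
    s = sum((math.sqrt(p.get(f, 0.0)) - math.sqrt(q.get(f, 0.0))) ** 2
            for f in features)
    return math.sqrt(s) / math.sqrt(2.0)

def neuron_length(p):
    return len(p)                        # number of unique features

# Example: compare the same neuron at two training stages
# d = hellinger(dists_epoch1[296], dists_epoch48[296])
```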

3 Experiments and Results

We showcase TX-Ray’s usefulness for interpreting, analyzing and quantifying transfer in answering the previously stated research questions. For RQ1, we pretrain an LSTM sequence encoder⁶ with 1500 hidden units on WikiText-2, similarly to [DBLP:conf/iclr/MerityX0S17, merityRegOpt, DBLP:conf/acl/RuderH18], and apply (RQ2) or fine-tune it (RQ3) on IMDB [maas-EtAl:2011:ACL-HLT2011], so we can analyze its zero-shot and supervised transfer properties. Each research question, its experimental setup and results are detailed in the respective subsections. (⁶Due to a lack of computational resources, we do not train costly architectures such as BERT, though this would be possible. Instead we focus on demonstrating TX-Ray’s analytical versatility, which especially benefits true low-resource scenarios, where large-scale pre-training is not available.)

3.1 RQ1: sequence encoder pretraining

In this experiment, we explore how pretraining builds knowledge abstractions. To this end, we first analyze neuron abstraction shift between early and late training epochs, and then verify that Hellinger distance and neuron length changes converge similarly to standard measures like training loss.

We pretrain a single-layer LSTM paragraph encoder on paragraphs from the WikiText-2 corpus using a standard language modeling setup until loss and perplexity converge, which takes 50 training epochs. We save model states at Epochs 1, 48 and 49 for later analysis. To produce neuron activation distributions for Epoch 1 (grey), 48 (pink) and 49 (red), we feed the first 400,000 tokens of WikiText-2 into each of the three model snapshots and compare their neuron adaptation and incremental abstraction building using Hellinger distance and distribution length. Additionally, we record POS feature activation distributions using one POS tag per token, to later group token activations by their word function and thus better read, analyze and compare token-activation distributions – see Fig. 3, 5, 7 or 9. POS tags are produced by the state-of-the-art Flair tagger [akbik-etal-2019-flair] using the Penn Treebank II tag set (https://www.clips.uantwerpen.be/pages/mbsp-tags).
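For the POS grouping, a hedged sketch of how one might tag tokens with Flair and attach the tags to the recorded rows; this assumes a Flair release where `SequenceTagger.load('pos')` and `token.get_tag('pos')` are available, and that the tagger's tokenization lines up with the encoder's token sequence.

```python
# Sketch: attach one POS tag per token so token activations can later be
# grouped by word function. Assumes the Flair API calls below exist in the
# installed Flair version; alignment with the encoder tokenization is assumed.
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load('pos')       # Penn Treebank II style tags

def pos_tag_tokens(text):
    sentence = Sentence(text)
    tagger.predict(sentence)
    return [(token.text, token.get_tag('pos').value) for token in sentence]

# e.g. extend each (token, neuron, activation) row with the token's POS tag,
# assuming identical tokenization between tagger and encoder:
# rows_pos = [(tok, neuron, act, pos) for (tok, neuron, act), (_, pos)
#             in zip(rows, pos_tag_tokens(paragraph_text))]
```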

We use this experiment to verify the feasibility of the token-activation distribution approach, since comparing Epochs 1 vs. 48 is expected to reveal large changes to neuron abstractions, while Epochs 48 and 49 should show few changes. The resulting changes in terms of Hellinger distance, number of ‘alive’ (active) neurons, and neuron feature distribution length can be seen in Fig. 2.

Figure 2: Pretraining neuron length shifts: neuron length (token variety) becomes longer (blue), shorter (red), or remains unchanged (black) across Epochs 1, 48 and 49. Token variety settles in later epochs.

While the Epoch 1 vs. 48 comparison produced 544 ‘alive’ neurons, the later 48 vs. 49 comparison shows 1335 ‘alive’ (§ 2.2) neurons. This means that the model learned during pretraining to distribute maximum input activations across increasingly many neurons, which is also reflected in more neurons becoming longer (blue lines) and fewer neurons becoming shorter (red lines). As expected, between Epochs 48 and 49 we see almost unchanged neuron lengths – seen as dotted vertical (:) lines between epochs. Additionally, in later training stages, shorter neurons are more frequent than longer ones, reflected in the opacity of the dotted vertical bars decreasing for longer neurons. This is further confirmed by the average length of alive neurons dropping from 944.76 in Epoch 1 to 524.55 and 519.34 in Epochs 48 and 49.
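A small sketch of how the epoch-to-epoch statistics above (alive counts, longer/shorter/unchanged neurons, average length) can be computed from two per-epoch distribution snapshots; the function name and return format are illustrative.

```python
# Sketch: summarize neuron length shifts between two epoch snapshots, as in
# Fig. 2. A comparison counts as 'alive' when both snapshots are non-empty.
def length_shift_summary(dists_a, dists_b):
    longer = shorter = unchanged = 0
    alive = [n for n in dists_b if dists_b.get(n) and dists_a.get(n)]
    for n in alive:
        la, lb = len(dists_a[n]), len(dists_b[n])
        if lb > la:
            longer += 1
        elif lb < la:
            shorter += 1
        else:
            unchanged += 1
    avg_len = sum(len(dists_b[n]) for n in alive) / max(len(alive), 1)
    return {'alive': len(alive), 'longer': longer, 'shorter': shorter,
            'unchanged': unchanged, 'avg_length_b': avg_len}

# summary_1_vs_48 = length_shift_summary(dists_epoch1, dists_epoch48)
```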

Since neuron lengths in terms of POS distributions change significantly in the early epochs, we also analyze whether the encoder’s activations actually learned to represent the original POS tag frequency distribution of WikiText-2. To do so, we express both corpus POS tag frequencies and encoder activation masses as proportional (relative) frequencies per token. In Fig. 3, we see relative corpus POS tag frequencies (black), compared with encoder POS activation percentages for Epoch 1 (dark grey) and 49 (red). The encoder already learns a good approximation of the original distribution (black) after the first epoch (dark grey), which is consistent with findings by [LM_learn_POS_first], who showed that “language model pretraining learns POS first”. During later epochs (49) the encoder POS representation changes little, but ultimately almost perfectly replicates the original POS distribution. We thus see that POS is well represented by the encoder, and that neuron adaptation and length shifts converge in later epochs in accordance with the quality of the POS match. This also tells us that TX-Ray, compared with more involved, task-specialized analysis methods [LM_learn_POS_first], can reveal comparably deep insights into the mechanisms of unsupervised learning, while being simpler and more versatile (RQ1-3).

Figure 3: Encoder learns POS fast. Black: corpus POS tag frequencies via FLAIR (y-axis, %-per-tag). Grey: Epoch 1 encoder activation percentages. Red: fully trained encoder. POS is learned early, i.e. in Epoch 1, confirming findings in [LM_learn_POS_first].
Figure 4: Pretraining Hellinger epoch change: Hellinger distance reduces, as expected, when comparing later epochs (48 vs. 49, red) rather than earlier epochs (1 vs. 48, black).

A similar analysis of neuron feature distribution changes stabilizing at later training stages can be made using Hellinger distances, as seen in Fig. 2. When visualizing distances in Fig. 4, we see that distances shrink on average in later epochs, as expected, and that neuron distance comparisons concentrate on medium-length distributions of 10-200 features each. For short (specialized) neuron distributions, we notice a trend towards higher Hellinger distances than for longer, broadly responding neurons. Since distances over different neuron lengths are not directly comparable, nor should they be, this visualization provides an explorable overview of neuron distances at different lengths, used to identify and examine interesting neurons in detail.

Figure 5: High vs. low Hellinger neurons: neuron token and POS activation probabilities (bars) for Epochs 1, 48 and 49, for a neuron with high (296) and a neuron with low (38) Hellinger distance between Epochs 1 and 48. (Bar charts: https://plot.ly/python/bar-charts.)

To run such a detail analysis, we pick two neurons from the figure for closer inspection of their feature distribution changes between Epochs 1, 48 and 49. We pick Neuron 296 from the 10 most distant (head) Epoch 1 vs. 48 neurons, and Neuron 38 from the 10 least changed ones (tail) – see Fig. 5. As expected from Neuron 296’s high Hellinger distance between Epochs 1 and 48, we see that its token and POS distribution at Epoch 1 – an outlined grey bar for the word ‘condition’ – is very different from the Epoch 48 and 49 distributions, which show no significant change in token and POS distribution between one another. Equally expected from Neuron 38’s low Hellinger distance between Epochs 1 and 48, we see that it keeps exactly the same token, ‘with’, and POS, ‘IN’, across all three epochs. This demonstrates that Hellinger distance identifies neuron change, and that later epochs, as expected, lead to small neuron abstraction changes, while earlier ones, also as expected, experience larger changes.

3.2 RQ2: Zero-shot transfer to a new domain

In this section, we analyze where and to what extent knowledge is zero-shot transferred when applying a pretrained encoder to a new domain’s text without re-training.

To do so, we apply the pretrained encoder, in prediction-only mode, to both its original WikiText-2 corpus and to the new-domain IMDB corpus, to generate token-activation distributions from the encoder’s hidden layer on each corpus, as before. We also record activation distributions for POS which, despite the FLAIR tagger being SOTA across several datasets and tasks, had noticeably low quality on the noisy IMDB corpus; on the WikiText-2 corpus, tagging produced comparatively sensible results. By comparing neuron token and tag activations on the new domain vs. on the pretraining corpus, using Hellinger distances for the same neuron positions as in RQ1, we can now analyze zero-shot transfer as distribution shifts. Put differently, we estimate domain transfer between the pretrained model abstractions and text input from a new domain. High distances between the same neuron on the two corpora tell us that the pretrained neuron did not abstract the new-domain texts well, resulting in low transfer and poor cross-domain generalization. When comparing the two sets of distributions in terms of Hellinger distance vs. neuron length in Fig. 6, we see that 1323 out of 1500 pretrained neurons are ‘alive’ when applying the encoder to the IMDB domain. A drop in the number of alive neurons compared to the RQ1 analysis, though small at 1335 vs. 1323, is expected, since the pretraining corpus covers a broader set of domains.
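A sketch of this RQ2 comparison, reusing the `hellinger` helper from the § 2.2 sketch to rank neurons by how much their distributions shift from the pretraining corpus to the new domain; the wrapper function is our illustrative addition.

```python
# Sketch of the RQ2 analysis: compare each pretrained neuron's distribution on
# the pretraining corpus with its distribution on the new-domain corpus, count
# 'alive' neurons, and pick the most/least shifted ones for detail inspection.
# Uses hellinger() from the § 2.2 sketch above.
def zero_shot_transfer_ranking(dists_pre, dists_new, top_k=10):
    distances = {}
    for neuron in set(dists_pre) | set(dists_new):
        d = hellinger(dists_pre.get(neuron, {}), dists_new.get(neuron, {}))
        if d is not None:                 # both distributions non-empty: 'alive'
            distances[neuron] = d
    ranked = sorted(distances, key=distances.get, reverse=True)
    return {'alive': len(distances),
            'head': ranked[:top_k],       # largest shift, i.e. least transfer
            'tail': ranked[-top_k:]}      # smallest shift, i.e. most transfer
```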

Figure 6: Hellinger distance vs. token-activation distribution length: 1323 ‘alive’ neurons with well-defined Hellinger distances between their activation distributions on WikiText-2 and IMDB.

However, to gain a detailed view of model abstraction behavior and zero-shot transfer, we analyze activation differences between the pretraining-corpus (red) and new-domain (green) distributions for two specific neurons, visualizing one each from the 10 most (head) and 10 least (long tail) Hellinger-distant neurons. In Fig. 7 (top), we see Neuron 637, which has a high Hellinger distance between its token feature distributions and, as expected, responds very differently on its pretraining corpus than on the new domain data.

Figure 7: Low vs. high zero-shot transfer neurons: Neuron 637 transferred little, while the ‘but-no’ neuron 1360 transferred (applied) well from pretraining to the new IMDB domain.

In fact, the distance in Neuron 637 is high in terms of both POS classes (word function semantics) and non-synonymous tokens – see x-axis annotated with POS tags and tokens sorted by POS class. Overall, we see very little knowledge transfer between data sets within Neuron 637 due to its feature over-specialization, which is also observable through its short distribution length – only up to 2 features activate. When looking at the low Hellinger distance Neuron 1360 in Fig. 7 (lower plot), we see that the neuron focuses on tokens such as ‘no’ on both datasets and ‘but’ on IMDB, suggesting that its pretrained sensitivity to disagreement (red), is useful when processing sentiment in the new domain dataset. Furthermore, we see that IMDB specific tokens have many strong activations for movie terms like ‘dorothy’ or ‘shots’ (green). We thus conclude that Neuron 1360 is both able to apply (zero-shot transfer) its knowledge to the new domain, as expected from the low Hellinger distance, while also being adaptive to the new domain inputs, despite not being fine-tuned to do so, which is more surprising.

In summary, we find that during zero-shot application of an encoder to new-domain data, the pretrained encoder exhibits broad transfer, indicated by almost equal numbers of alive neurons during pretraining (1335) and when applied to the new-domain data (1323). This result provides a baseline that exhibits broad transfer, compared to the supervised setting in RQ3, which, as we describe below, shows much less transfer, as expected.

3.3 RQ3: Supervised transfer via classification fine-tuning

In this experiment, we analyze whether transfer comprises more phenomena than high-level observations like catastrophic forgetting. Here, we want to see if knowledge also transfers ‘backwards’ from supervised annotations into a pretrained encoder. Specifically, we analyze whether knowledge is added or discarded in two experiments. In Experiment 1, we demonstrate how TX-Ray can identify knowledge addition or loss induced by supervision at the individual neuron level (§ 3.3.1). In Experiment 2, we verify our understanding of neuron specialization and generalization by first pruning neurons that add or lose knowledge during supervision, and then measuring end-task performance changes (§ 3.3.2). Finally, we show how neuron activity increasingly sparsifies over RQ1-3, to gain overall insights about model-neuron specialization and generalization during unsupervised and supervised transfer (§ 3.3.3).

For this RQ, we extend the pretrained encoder with a shallow, binary classifier⁷ to classify IMDB reviews as positive or negative while fine-tuning, creating a domain-adapted encoder. To guarantee a controlled experiment, we freeze the embedding layer weights and do not use a language modeling objective, such that model re-fitting is exclusively based on supervised feedback – i.e., on knowledge encoded in the labels. We tune the model to a moderate score on the IMDB test set, to be able to analyze the effects of even moderate amounts of supervised fine-tuning before task (over-)fitting occurs. To produce token-activation distributions for the fine-tuned encoder, we feed it the IMDB corpus – i.e. using the same IMDB text input as in RQ2 – and once more record POS tags for tokens. This time, since POS distributions are compared on the same corpus, their distances are more consistent than in RQ2. Analyzing Hellinger distance and neuron length change between the distributions before and after fine-tuning tells us which neuron abstractions changed the most due to supervision – i.e., shows us ‘backward knowledge transfer’. (⁷One fully connected layer with sigmoid activation, fed by the end-of-sequence hidden state.)
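A minimal PyTorch sketch of this fine-tuning setup, assuming the illustrative `LSTMEncoder` from the § 2.1 sketch: a single fully connected sigmoid head fed by the end-of-sequence hidden state, with embedding weights frozen. All names are ours, not the paper's released code.

```python
# Sketch of the RQ3 setup: shallow binary classifier on top of the pretrained
# LSTM encoder, frozen embeddings, no language-modeling objective.
import torch
import torch.nn as nn

class SentimentClassifier(nn.Module):
    def __init__(self, pretrained_encoder):
        super().__init__()
        self.encoder = pretrained_encoder
        self.encoder.embedding.weight.requires_grad = False   # freeze embeddings
        hidden_dim = self.encoder.lstm.hidden_size
        self.head = nn.Linear(hidden_dim, 1)                   # one FC layer

    def forward(self, token_ids):
        hidden = self.encoder(token_ids)            # (batch, seq_len, hidden_dim)
        last = hidden[:, -1, :]                     # end-of-sequence hidden state
        return torch.sigmoid(self.head(last)).squeeze(-1)      # P(positive review)

# training step (illustrative):
# loss = nn.BCELoss()(model(batch_ids), labels.float())
```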

In Fig. 8, we notice that only 675 neurons are ‘alive’, compared to 1323 neurons in the zero-shot transfer setting (Fig. 6). In other words, supervision ‘erased’ approximately half of the sequence-encoder neurons. We can deduce this erasure because the pre-fine-tuning distributions contain at least 1323 ‘alive’ neurons, which leaves only the fine-tuned encoder as the source of ‘dead’ or retired neurons.

Figure 8: Hellinger distances vs. neuron lengths after supervision: 675 ‘alive’ Hellinger distances for neuron distributions before and after supervised encoder fine-tuning – dropped from 1323.

3.3.1 Supervision adds new alive neurons

Somewhat surprisingly, supervision not only erased neurons, but also added distributions for 85 new neurons that previously had empty distributions in the pretrained encoder. We analyzed these neurons and found that they represent new, supervision-task-specific feature detectors. In Tab. 1, we show token features for the three strongest-firing and the three least-activating of these 85 neurons – i.e. supervision-specific neurons with the highest or lowest overall activation magnitude. Note: we removed stop-words like ‘the’ or ‘a’ as well as spelling duplicates from the table’s feature lists for brevity. Features are sorted by decreasing activation mass from left to right. Without reading too much into the results, we see that the first three highly active neurons roughly encode movie-related nouns and entities as well as sentiment terms like ‘clever’ or ‘great’, though they seem unspecialized (general), fitting many genres.

#neuron : activation sum : features (sorted by decreasing activation mass)
200 : 1307.42 : great, james, superb, famous, strange, possible, french, english, grand, indian, kate, final, guy, solid, huge, disappointing, gorgeous, imaginary, legendary, short, wooden, … (141 features total)
1210 : 501.97 : original, overall, good, real, some, dear, french, british, black, odd, italian, entire, many, cold, railway, henderson, dvd, perfect, crap, japanese, united, bach, … (161 features total)
125 : 299.12 : more, two, best, one, few, most, three, nice, four, fellow, films, somewhat, lot, favorite, rare, movie, eight, … (77 features total)
1289 : 7.92 : terrific, dull, essential, celia, unbelievable, gentle, melancholy, intended, shaggy, unremarkable, amateurism, … (14 features total)
372 : 4.18 : walter
688 : 0.48 : archer
Table 1: Six supervision neurons added by supervised fine-tuning: three highly active ones (top 3), three seldomly active ones (bottom 3).

When looking at the three least-activating ‘supervision’ neurons, we find more specialized feature lists. Some are short and very specialized to a specific feature – e.g. the 372 ‘walter’ neuron seems to be a ‘Breaking Bad’ review detector, while ‘archer’ (688) may detect the animated show of the same name. Somewhat surprisingly, Neuron 1289, despite its low activation sum, comprises many features that focus on sentiment, like ‘terrific’ or ‘unremarkable’, making it more specialized than the top three. This suggests that ‘supervision’ neurons with low activation mass, somewhat independent of their feature variety, are more specialized than the highly active ones – which is also reflected in their lower ‘neuron length’, i.e. fewer features.

Additionally, as done in explainability methods, we can generate a rough approximation of how important neuron features are for the classification prediction by (re-)weighting features by task importance, i.e. multiplying each feature with the recorded class prediction probability score. When doing so, the features of the 85 neurons reorder to show fewer review-score-irrelevant terms, like the numeric expressions in Neuron 125 or ‘guy’ – which is then expected. Detailed, preliminary ‘discoveries’ like this reinforce our motivation that an exploration-investigation approach can reveal detailed insights into a model’s inner workings if ‘drilled down’⁸ far enough, which we showcase here to further underline our method’s application potential for gaining insights. (⁸A fundamental visualization-technique design pattern used to describe incrementally more focused analysis.)
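A sketch of this re-weighting, assuming each recorded row also carries the classifier's class prediction probability for the sequence the token came from; the function name and row layout are ours.

```python
# Sketch: re-weight neuron features by task importance by multiplying each
# feature's activation with the recorded class prediction probability, then
# re-normalizing and sorting, so task-irrelevant features drop in rank.
from collections import defaultdict

def reweight_by_prediction(rows_with_pred):
    """rows_with_pred: iterable of (token, neuron_id, activation, class_prob)."""
    weighted = defaultdict(lambda: defaultdict(float))
    for token, neuron, activation, class_prob in rows_with_pred:
        weighted[neuron][token] += activation * class_prob
    reweighted = {}
    for neuron, token_mass in weighted.items():
        z = sum(token_mass.values())
        reweighted[neuron] = dict(sorted(((t, m / z) for t, m in token_mass.items()),
                                         key=lambda kv: kv[1], reverse=True))
    return reweighted   # features reordered by task-weighted importance
```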

3.3.2 Pruning dead and alive neurons

To understand how much the ‘dead’ and ‘alive’ neurons, as well as the 85 neurons added by supervision, actually affect predictive task performance, we run four pruning experiments (A-D) that remove specific neurons and then measure the relative change from the unpruned score in % – e.g., a drop from 80 to 77 is -3.75%. Experiment (A) cuts ‘dead’ neurons from the encoder, i.e., neurons with distribution length zero after supervision, which removes 740 such neurons. Experiments (B) and (C) cut the 20 least and most active ‘alive’ neurons as measured by their activation mass – i.e. the sum of (max) activations these 20 neurons produced, relative (in %) to the sum of (max) activation masses (AM) over all 760 ‘alive’ supervision neurons. In the last pruning experiment (D), we prune the 85 neurons that became alive after (due to) supervision – i.e., that were dead in the pretrained encoder. The relative changes in training and test set performance caused by pruning (A-D) can be seen in Tab. 2. We also show what percentage of the encoder’s (max) activation mass (AM) each pruning affected.
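One way to realize such pruning, sketched below under our assumptions: 'pruning' a neuron is approximated by zeroing its hidden-state output before the classifier head, and the relative score change is computed against the unpruned model. This mirrors the described experiments but is not the authors' exact implementation.

```python
# Sketch of the pruning experiments (A-D): mask selected hidden units to zero
# before the classifier head, then compare scores against the unpruned model.
import torch
import torch.nn as nn

class PrunedClassifier(nn.Module):
    def __init__(self, classifier, pruned_neuron_ids, hidden_dim=1500):
        super().__init__()
        self.classifier = classifier
        mask = torch.ones(hidden_dim)
        mask[list(pruned_neuron_ids)] = 0.0        # zero out pruned neurons
        self.register_buffer('mask', mask)

    def forward(self, token_ids):
        hidden = self.classifier.encoder(token_ids) * self.mask
        last = hidden[:, -1, :]
        return torch.sigmoid(self.classifier.head(last)).squeeze(-1)

def relative_change(score_pruned, score_unpruned):
    return 100.0 * (score_pruned - score_unpruned) / score_unpruned

# e.g. experiment (A): prune all neurons whose post-supervision distribution is empty
# dead = [n for n in range(1500) if not dists_sup.get(n)]
# pruned_model = PrunedClassifier(classifier, dead)
```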

Table 2: Pruning dead, alive and supervision-added neurons: after supervised fitting of the encoder, (A) prunes dead (empty) neurons, (B, C) prune the least and most used alive neurons, and (D) prunes the 85 neurons added by supervision – i.e., not alive in the pretrained encoder. Colors represent relative score change in % from the original – drops (red), gains (blue).

For pruning experiment (A), we see that removing dead neurons not only avoids the performance drop commonly observed when removing irrelevant neurons [transformerheads19, 16heads], it actually increases training and test set performance by 3.65 and 2.80 respectively, resulting in better generalization – at least as far as test set scores reflect generalization. In Experiment (B), when removing seldomly activated supervision neurons, as indicated by their low activation mass percentage, we lose significant training performance, but no test set performance, telling us that those neurons were over-specialized or over-fit to the training set. It also tells us that these neurons were likely short (over-specialized), similar to those in Tab. 1 with low activation mass (372, 688). When we checked this intuition, we found that each of the 20 neurons has a length of exactly one – i.e. is over-specialized. When pruning the 20 most heavily used supervision neurons (C), with the highest (max) activation mass, we see the largest drop in training set performance of all experiments (A-D). This confirms that TX-Ray successfully identifies important neurons, which were over-fit to the training data, while showing a significantly smaller change on the test set, similar to observations in Experiment (B). Thus, Experiments (B, C) indicate that cutting supervision-specific neurons after training can help preserve generalization performance, i.e., reduce generalization loss. Lastly, for (D), when pruning the 85 neurons that became ‘alive’ due to supervision and were ‘dead’ in the pretrained encoder, both training and test performance drop by equal amounts. Since these 85 supervision-only neurons experienced no transfer from pretraining – they were not ‘alive’ yet – this seems to indicate that neurons with pretraining exposure, as seen in Experiments (B) and (C), suffer less from overfitting on new (test set) data, even when pruned. We reason that neurons that were exposed to pretraining (B, C) have their knowledge partially duplicated in other neurons, while the 85 neurons added only after supervision (D) have no such backups. (Neuron) generalization and specialization: these observations are not only consistent with known effects of pretraining on generalization [TuneNotTune19, DBLP:conf/acl/RuderH18], but also show that TX-Ray can identify and distinguish, at the individual neuron level, which parts of a neural network improve or preserve generalization (A, B) and which do not (C, D). Moreover, though the results in Experiments (A, B) initially contradict established views on pruning [transformerheads19, 16heads], i.e. that it should lead to a slight performance drop, they are perfectly consistent with the notions of neuron specialization and generalization used throughout the TX-Ray analysis – and also demonstrate the method’s effectiveness in identifying neurons that affect generalization and specialization.

Figure 9: Low and high transfer to supervision: Neuron 47 saw no transfer, while Neuron 877 transferred well between its zero-shot state (before) and its state after supervised fine-tuning on IMDB.

To analyze what individual neurons actually learned, as was done in RQ1-2, we inspect neurons with high and low Hellinger distances between encoder activations before (green) and after supervision (blue). In Fig. 9, we show Neuron 47 (top), taken from the top 10 highest Hellinger distances. We see that the neuron is over-specialized and changed in both POS and token distribution after supervision, which suggests catastrophic forgetting, or supervised reconfiguration, of Neuron 47. For the low-Hellinger-distance Neuron 877 (bottom), we see some POS and token distribution overlap before and after supervision, and that a few movie-review-related terms become relevant after supervision, compared to noticeably war-related tokens before supervision (green). This shows the neuron’s semantic shift (POS, token) due to supervision – i.e., limited knowledge transfer occurred despite the low Hellinger distance. Moreover, the neuron’s distribution length changed from 9 tokens before to 15 tokens after supervision, which may indicate a lack of transfer. Furthermore, we recall that in the zero-shot case more neurons were active than after supervision, 1323 vs. 675 (Fig. 6 vs. Fig. 8), which should be reflected in the overall activation magnitude produced by the encoder before and after supervision.

3.3.3 Supervision sparsifies neural activity (knowledge)

To investigate the distribution length shift and activation sum hypotheses formulated above, we visualize the shift of neuron lengths before and after supervision (Fig. 10), as well as the activation mass (Fig. 11) for the three research questions: (RQ1) pretraining, (RQ2) zero-shot, and (RQ3) supervision.

Figure 10: Neuron length before and after supervision: after supervision, some neurons are shorter, some are longer, and some are unchanged.

In Fig. 10, we show neurons that shortened after supervision (red lines) and neurons that got longer (blue lines). Overall, supervision both shortens and lengthens neuron distributions. At the token level, neurons actually lengthen slightly on average over the 675 shared neurons,⁹ while at the POS level we observe severe shortening (not shown). Similar neuron lengthening, a ‘feature coverage extension’ caused by supervision, was already apparent in Neuron 877 (Fig. 9), where supervision appeared to have specialized and extended a previously unspecific neuron into a movie sentiment detector.¹⁰ (⁹Over the entire 1500 neurons, neuron token length shortens after supervision. ¹⁰Again, without deeper analysis, we are not claiming that this is the case, only that such points for investigation and new, interesting hypotheses can be identified via TX-Ray.)

In Fig. 11, we see that the activation mass – i.e., the sum of activation values – differs across corpora and training stages. A much more peaked activation mass is produced after the encoder has been fine-tuned via supervision and then again applied to IMDB (blue), compared to before supervision (green), which is a strong indicator that supervision sparsified the neuron activations and therefore the representations in the encoder.

Figure 11: Sorted neuron activation masses for the pretrained (largest), zero-shot (middle), and supervision-tuned encoder (smallest). Supervision sparsifies activations – i.e. the head peaks and the tail shortens.

The activation mass of the pretrained encoder on its pretraining corpus (WikiText-2, red) is, unsurprisingly, the broadest, while the same encoder responds less strongly to the same amount of text (400k tokens) from IMDB (green), due to the domain mismatch between the pretrained encoder and the new data domain – as previously detailed in RQ2.
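The activation-mass curves of Fig. 11 can be reproduced from the recorded rows with a small sketch like the one below; the row layout follows the earlier illustrative helpers, not the paper's released code.

```python
# Sketch of the Fig. 11 analysis: per-neuron activation mass (sum of recorded
# max-activations), sorted descending, for the three settings of RQ1-3.
from collections import defaultdict

def sorted_activation_mass(rows):
    """rows: iterable of (token, neuron_id, activation, ...) tuples."""
    mass = defaultdict(float)
    for row in rows:
        mass[row[1]] += row[2]                     # accumulate per-neuron mass
    return sorted(mass.values(), reverse=True)     # peaked head = sparser encoder

# curves = {'pretrain/Wiki-2': sorted_activation_mass(rows_pre),
#           'zero-shot/IMDB':  sorted_activation_mass(rows_zero),
#           'supervised/IMDB': sorted_activation_mass(rows_sup)}
```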

4 Related Work

From summarizing recent model analysis and explainability methods [CSI19, belinkov-glass-2019-analysis, ExplainingExplanations], two kinds of approaches emerge: supervised ‘model understanding (MU)’ and ‘decision understanding (DU)’. DU treats models as black boxes, visualizing interactions in the input and output space to understand model decisions. MU enables a grey-box view by visualizing the model abstraction space to understand what knowledge a model has learned. Both DU and MU focus heavily on supervision analysis, while understanding transfer learning in un- and supervised models remains an open challenge. Supervised ‘DU’: these techniques use probing tasks to hypothesis-test models for language properties like syntax and semantics [Senteval], or language understanding [Glue, Decathlon, DiagnosticClassifiers]. DU also uses per-instance, supervised explainability [arrasACL19, ExplainingExplanations] for model decision analysis [belinkov-glass-2019-analysis] by highlighting prediction-relevant input words per instance [arras-etal-2017-explaining]. ‘Model understanding’: techniques like Activation Atlas or Summit [carter2019activation, hohman2019summit] enable exploration of supervised model knowledge in vision, while NLP or medical methods like Seq2Seq-Vis or RetainVis [SEQ2SEQVIS, RetainVis] compare models using many per-instance explanations. However, these methods produce a high cognitive load by showing many details, which makes it harder to understand overarching learning phenomena. (Un-)supervised ‘model and transfer understanding’: TX-Ray extends these ideas by guiding exploration through quantifying interesting starting points for analyzing neuron change, specialization and generalization during transfer learning. Surprisingly, a similarly spirited approach [singh2019medial] “calculates Hellinger distances over ‘neuron feature dictionaries’ to measure neuron adaptation during task learning” in the prefrontal cortex of rats – similar to our RQ3. Measuring changes in both neuron feature distributions and their lengths enables fine-grained analysis of neuron (de-)specialization and model knowledge transfer in RQ1-3. TX-Ray thus addresses a surprising lack of (un-)supervised transfer interpretability [belinkov-glass-2019-analysis, ExplainingExplanations], supporting a deeper understanding of transfer in current and future (continual) pretraining methods [DBLP:conf/acl/RuderH18, TuneNotTune19, radford2019language, RuderEpisodic2019] as well as the discovery of unforeseen hypotheses, to help scale learning analysis beyond probing tasks.

5 Conclusion and future work

We present TX-Ray, a simple yet powerfully nuanced model-knowledge XAI method for analyzing how neuron-level knowledge transfer affects models during pretraining (RQ1), zero-shot knowledge application (RQ2), and supervised fine-tuning (RQ3). We showed that, through TX-Ray’s explorative analysis, one can reveal fine-grained, sometimes unforeseen, insights about: loss and addition of neural knowledge due to supervision (RQ3), neuron specialization or generalization (RQ1-3), how pretraining builds knowledge abstractions (RQ1), and how greatly zero-shot and supervised transfer differ in terms of knowledge preservation and variety (RQ2 vs. RQ3). TX-Ray’s design focuses on reducing computational and cognitive load while remaining flexible and scalable to future extensions. While we consciously focused on cognitive-load reduction by using only the maximum activation per token to build token-activation distributions, this is a strong assumption for densely activating architectures like LSTMs, despite its empirical usefulness (RQ3) and proven correctness in pooling architectures [hohman2019summit, carter2019activation]. Thus, in the future, we plan to extend TX-Ray to select and use multiple maximum activations per token, to run analyses at different density levels. Furthermore, we plan to extend and refine TX-Ray to more advanced transfer, task, activation and model settings, and to integrate its visualizations and extensions with the Weights & Biases platform.

References