Understanding and interpreting how deep neural networks process natural language is a crucial and challenging problem. While deep neural networks have achieved state-of-the-art performance in neural machine translation (NMT) (Sutskever et al., 2014; Cho et al., 2014; Kalchbrenner et al., 2016; Vaswani et al., 2017), sentiment classification (Zhang et al., 2015; Conneau et al., 2017), and many other tasks, the sequence of non-linear transformations makes it difficult for users to make sense of any part of the model. Because of this lack of interpretability, deep models are often regarded as hard to debug and unreliable for deployment; they also prevent users from learning how to make better decisions based on the model’s outputs.
An important research direction toward interpretable deep networks is to understand what their hidden representations learn and how they encode informative factors when solving the target task. Some studies, including Bau et al. (2017), Fong & Vedaldi (2018), and Olah et al. (2017, 2018), have investigated what information is captured by individual or multiple units in visual representations learned for image recognition tasks. These studies showed that some individual units are selectively responsive to specific visual concepts, as opposed to being activated in an uninterpretable manner. By analyzing individual units of deep networks, they not only obtained more fine-grained insights than analyzing representations as a whole would give, but also found meaningful connections to various problems such as the generalization of networks (Morcos et al., 2018), generating explanations for model decisions (Zhou et al., 2018a; Olah et al., 2018; Zhou et al., 2018b), and controlling the output of generative models (Bau et al., 2019).
Since these studies of unit-level representations have mainly been conducted on models learned for computer vision tasks, little is known about the representations of models learned from natural language processing (NLP) tasks. Several studies that have previously analyzed individual units of natural language representations assumed that the units align to a predefined set of specific concepts, such as sentiment present in the text (Radford et al., 2017) or text lengths, quotes, and brackets (Karpathy et al., 2015). They discovered the emergence of certain units that selectively activate on those specific concepts. Building upon these lines of research, we consider the following question: What natural language concepts are captured by each unit in the representations learned from NLP tasks?
To answer this question, we propose a simple but highly effective concept alignment method that can discover which natural language concepts are aligned to each unit in the representation. Here we use the term unit to refer to each channel in a convolutional representation, and natural language concepts to refer to the grammatical units of natural language that preserve meaning, i.e. morphemes, words, and phrases. Our approach first identifies the most activated sentences per unit and breaks those sentences into these natural language concepts. It then aligns specific concepts to each unit by measuring the activation value of replicated text, which indicates how much each concept contributes to the unit activation. This method also allows us to systematically analyze the concepts carried by units in diverse settings, including the depth of layers, the form of supervision, and data-specific or task-specific dependencies.
The contributions of this work can be summarized as follows:
We show that the units of deep CNNs learned in NLP tasks can act as natural language concept detectors. Without any additional labeled data or re-training process, we can discover, for each unit of the CNN, natural language concepts including morphemes, words, and phrases that are present in the training data.
We systematically analyze what information is captured by units in representation across multiple settings by varying network architectures, tasks, and datasets. We use VDCNN (Conneau et al., 2017) for sentiment and topic classification tasks on Yelp Reviews, AG News (Zhang et al., 2015), and DBpedia ontology dataset (Lehmann et al., 2015) and ByteNet (Kalchbrenner et al., 2016) for translation tasks on Europarl (Koehn, 2005) and News Commentary (Tiedemann, 2012) datasets.
We also analyze how aligned natural language concepts evolve as they are represented in deeper layers. As part of our analysis, we show that our interpretation of learned representations can be used to design network architectures with fewer parameters but with performance comparable to baseline models.
2 Related Work
2.1 Interpretation of Individual Units in Deep Models
Recent work on interpreting hidden representations at the unit level was mostly motivated by its counterparts in computer vision. In the computer vision community, Zhou et al. (2015) retrieved the image samples with the highest unit activation for each unit in a CNN trained on image recognition tasks. They used these retrieved samples to show that visual concepts like color, texture, and object parts are aligned to specific units, where the concepts were aligned to units by human annotators. Bau et al. (2017) introduced the BRODEN dataset, which consists of pixel-level segmentation labels for diverse visual concepts, and then analyzed the correlation between the activation of each unit and such visual concepts. In their work, although aligning concepts that are absent from the BRODEN dataset requires additional labeled images or human annotation, they quantitatively showed that some individual units respond to specific visual concepts.
On the other hand, Erhan et al. (2009), Olah et al. (2017), and Simonyan et al. (2013) discovered visual concepts aligned to each unit by optimizing a random initial image with gradient descent to maximize the unit activation. In these cases, the resulting interpretation of each unit takes the form of optimized images rather than the natural language form of the aforementioned studies. However, such continuous interpretation results make it hard to carry out further quantitative analyses of discrete properties of representations, such as quantifying how representations change with layer depth (Bau et al., 2017) or correlating the interpretability of a unit with regularization (Zhou et al., 2018a). Nevertheless, these methods have the advantage that the results are not constrained to a predefined set of concepts, giving flexibility as to which concepts are captured by each unit.
In the NLP domain, studies including Karpathy et al. (2015), Tang et al. (2017), Qian et al. (2016), and Shi et al. (2016a) analyzed the internal mechanisms of deep models used for NLP and found intriguing properties that appear in units of hidden representations. Among those studies, the closest one to ours is Radford et al. (2017), who defined a unit as each element in the representation of an LSTM learned for language modeling and found that the concept of sentiment was aligned to a particular unit. Compared with these previous studies, we focus on discovering a much wider variety of natural language concepts, including any morphemes, words, and phrases found in the training data. To the best of our knowledge, this is the first attempt to discover concepts among all those that exist in the form of natural language in the training corpus. By extending the scope of detected concepts to meaningful building blocks of natural language, we provide insights into how various linguistic features are encoded by the hidden units of deep representations.
2.2 Analysis of Deep Representations Learned for NLP Tasks
Most previous work that analyzes the learned representations of NLP tasks focused on constructing downstream tasks that predict concepts of interest. A common approach is to measure the performance of a classification model that predicts the concept of interest, to see whether that concept is encoded in the representation of an input sentence. For example, Conneau et al. (2018), Adi et al. (2017), and Zhu et al. (2018) proposed several probing tasks to test whether a (non-)linear regression model can predict syntactic or semantic information well from the representation learned on translation tasks, or from skip-thought or word embedding vectors. Shi et al. (2016b) and Belinkov et al. (2017) constructed regression tasks that predict labels such as voice, tense, part-of-speech tag, and morpheme from the encoder representation of a model learned on a translation task.
Compared with previous work, our contributions can be summarized as follows. (1) By identifying the role of the individual units, rather than analyzing the representation as a whole, we provide more fine-grained understanding of how the representations encode informative factors in training data. (2) Rather than limiting the linguistic features within the representation to be discovered, we focus on covering concepts of fundamental building blocks of natural language (morphemes, words, and phrases) present in the training data, providing more flexible interpretation results without relying on a predefined set of concepts. (3) Our concept alignment method does not need any additional labeled data or re-training process, so it can always provide deterministic interpretation results using only the training data.
We focus on convolutional neural networks (CNNs), particularly their character-level variants. CNNs have shown great success on various natural language applications, including translation and sentence classification (Kalchbrenner et al., 2016; Kim et al., 2016; Zhang et al., 2015; Conneau et al., 2017). Compared to deep architectures based on fully connected layers, CNNs are natural candidates for unit-level analysis because their channel-level representations are reported to work as templates for detecting concepts (Bau et al., 2017).
Our approach for aligning natural language concepts to units is summarized as follows. We first train a CNN model for each natural language task (e.g. translation and classification) and retrieve the training sentences that highly activate specific units. Interestingly, we discover morphemes, words, and phrases that appear dominantly within these retrieved sentences, implying that those concepts have a significant impact on the activation value of the unit. Then, we find the set of concepts that contribute most to the unit activation by measuring the activation value of each replicated candidate concept, and align them to the unit.
3.1 Top Activated Sentences Per Unit
Once we train a CNN model for a given task, we feed all sentences in the training set to the model again and record their activations. Given a layer $\ell$ and a sentence $s \in S$, let $A^{\ell}_{u,t}(s)$ denote the activation of unit $u$ at spatial location $t$. Then, for unit $u$, we average activations over all spatial locations as $a_u(s) = \frac{1}{Z} \sum_{t} A^{\ell}_{u,t}(s)$, where $Z$ is a normalizer. We then retrieve the top $K$ training sentences per unit with the highest mean activation $a_u$. Interestingly, some natural language patterns such as morphemes, words, and phrases frequently appear in the retrieved sentences (see Figure 1), implying that those concepts might contribute strongly to the activation value of that unit.
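The retrieval step can be illustrated with a minimal sketch. It assumes that the normalizer is simply the number of spatial locations and that the activations of one unit have already been recorded per sentence; the function names and the toy activation values are hypothetical and are not taken from our implementation.

```python
from typing import Dict, List, Sequence


def mean_activation(activations: Sequence[float]) -> float:
    """Average one unit's activations over all spatial locations.

    Assumption: the normalizer Z is the number of spatial locations.
    """
    return sum(activations) / len(activations)


def top_k_sentences(unit_activations: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Return the k sentences with the highest mean activation for one unit."""
    scored = {s: mean_activation(a) for s, a in unit_activations.items()}
    return sorted(scored, key=scored.get, reverse=True)[:k]


# Toy activations of a single unit (hypothetical values, one per spatial location).
toy_unit = {
    "the software update failed": [0.9, 0.7, 0.8],
    "he scored a late goal": [0.1, 0.2, 0.0],
    "new software for phones": [0.6, 0.8, 0.4],
}
top_sentences = top_k_sentences(toy_unit, k=2)
print(top_sentences)
```

In this toy case the two retrieved sentences both contain "software", which mirrors the kind of recurring pattern described above.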
3.2 Concept Alignment with Replicated Text
We propose a simple approach for identifying the concepts as follows. To construct candidate concepts, we parse each of the top $K$ sentences with a constituency parser (Kitaev & Klein, 2018). Within the constituency-based parse tree, we define candidate concepts as all terminal and non-terminal nodes (e.g. from the sentence John hit the balls, we obtain the candidate concepts John, hit, the, balls, the balls, hit the balls, John hit the balls). We also break each word into morphemes using a morphological analysis tool (Virpioja et al., 2013) and add them to the candidate concepts (e.g. from the word balls, we obtain the morphemes ball, s). We repeat this process for every top sentence and build a set of candidate concepts for unit $u$, which is denoted as $C_u = \{c_1, \ldots, c_N\}$, where $N$ is the number of candidate concepts of the unit.
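The candidate construction can be sketched as a traversal that collects the text span of every node in a parse tree. This is only an illustration: the nested-tuple tree below is written by hand rather than produced by a parser, and the `MORPHEMES` lookup is a hypothetical stand-in for a morphological analyzer.

```python
from typing import Dict, List, Tuple, Union

# A leaf is a word; an internal node is (label, child, child, ...).
Tree = Union[str, Tuple]

# Hypothetical stand-in for the output of a morphological analysis tool.
MORPHEMES: Dict[str, List[str]] = {"balls": ["ball", "s"]}


def candidate_concepts(tree: Tree) -> List[str]:
    """Collect the text span of every terminal and non-terminal node."""
    spans: List[str] = []

    def visit(node: Tree) -> str:
        if isinstance(node, str):
            spans.append(node)
            return node
        span = " ".join(visit(child) for child in node[1:])
        spans.append(span)
        return span

    visit(tree)
    # Add the morphemes of each word, then drop duplicates while keeping order.
    for word in [s for s in spans if " " not in s]:
        spans.extend(MORPHEMES.get(word, []))
    return list(dict.fromkeys(spans))


# Hand-written constituency tree for "John hit the balls".
tree = ("S", ("NP", "John"), ("VP", ("V", "hit"), ("NP", ("Det", "the"), ("N", "balls"))))
concepts = candidate_concepts(tree)
print(concepts)
```

For this tree the sketch yields the seven phrase and word candidates listed in the example above, followed by the morphemes ball and s.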
Next, we measure how each candidate concept contributes to the unit’s activation value. To normalize the degree of an input signal to the unit activation, we create a synthetic sentence by replicating each candidate concept so that its length is identical to the average length of all training sentences (e.g. the candidate concept the ball is replicated as the ball the ball the ball…). Replicated sentences are denoted as $R = \{r_1, \ldots, r_N\}$; each $r_n \in R$ is forwarded to the CNN, and its activation value for unit $u$ is measured as $a_u(r_n) \in \mathbb{R}$, which is averaged over all spatial entries. Finally, the degree of alignment (DoA) between a candidate concept $c_n$ and a unit $u$ is defined as follows:

$$\mathrm{DoA}_{u,c_n} = a_u(r_n)$$

In short, the DoA measures the extent to which unit $u$’s activation is sensitive to the presence of candidate concept $c_n$. (Footnote 1: We tried other metrics for DoA, but all of them induce intrinsic bias. See Appendix A for details.) If a candidate concept $c_n$ appears in the top $K$ sentences and the unit’s activation value $a_u$ is highly responsive to $c_n$, then $\mathrm{DoA}_{u,c_n}$ gets large, suggesting that candidate concept $c_n$ is strongly aligned to unit $u$.
Finally, for each unit $u$, we define the set of its aligned concepts $C^*_u = \{c^*_1, \ldots, c^*_M\}$ as the $M$ candidate concepts with the largest DoA values in $C_u$. Depending on how we set $M$, we can detect different numbers of concepts per unit. In this experiment, we set $M$ to 3.
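The replication and ranking steps can be sketched as follows. Several details here are assumptions made only for illustration: length is counted in characters, the replicated text is truncated to the target length, and `toy_unit_activation` is a hypothetical scoring function standing in for the mean activation of a real CNN unit.

```python
from typing import Callable, List


def replicate(concept: str, target_len: int) -> str:
    """Repeat a concept, separated by spaces, up to target_len characters.

    Assumption: the result is truncated at the target length.
    """
    piece = concept + " "
    repeats = target_len // len(piece) + 1
    return (piece * repeats)[:target_len]


def align_concepts(
    candidates: List[str],
    unit_activation: Callable[[str], float],
    target_len: int,
    m: int = 3,
) -> List[str]:
    """Rank candidates by the unit's activation on their replicated text."""
    doa = {c: unit_activation(replicate(c, target_len)) for c in candidates}
    return sorted(doa, key=doa.get, reverse=True)[:m]


def toy_unit_activation(text: str) -> float:
    """Hypothetical unit that fires on the character pattern 'soft'."""
    return text.count("soft") / len(text)


aligned = align_concepts(
    ["the", "software", "update", "soft"], toy_unit_activation, target_len=40, m=2
)
print(aligned)
```

Because every replicated input has the same length, a short concept such as soft is not penalized relative to a longer one such as software, which is the purpose of the normalization described above.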
Table 1: Datasets, tasks, and models analyzed.

| Dataset | Task | Model | # of Layers | # of Units |
|---|---|---|---|---|
| AG News | Topic Classification | VDCNN | 4 | [64, 128, 256, 512] |
| DBpedia | Ontology Classification | VDCNN | 4 | [64, 128, 256, 512] |
| Yelp Review | Polarity Classification | VDCNN | 4 | [64, 128, 256, 512] |
| WMT17’ EN-DE | Translation | ByteNet | 15 | for all |
| WMT14’ EN-FR | Translation | ByteNet | 15 | for all |
| WMT14’ EN-CS | Translation | ByteNet | 15 | for all |
| EN-DE Europarl-v7 | Translation | ByteNet | 15 | for all |
4.1 The Model and The Task
We analyze representations learned on three classification and four translation datasets shown in Table 1. Training details for each dataset are available in Appendix B. We then focus on the representations in each encoder layer of ByteNet and convolutional layer of VDCNN, because as Mou et al. (2016) pointed out, the representation of the decoder (the output layer in the case of classification) is specialized for predicting the output of the target task rather than for learning the semantics of the input text.
4.2 Evaluation of concept alignment
To quantitatively evaluate how well our approach aligns concepts, we measure how selectively each unit responds to the aligned concepts. Motivated by Morcos et al. (2018), we define the concept selectivity of a unit $u$, with respect to the set of concepts $C^*_u$ that our alignment method detects, as follows:

$$\mathrm{Sel}_u = \frac{\mu_+ - \mu_-}{\max_{s \in S} a_u(s) - \min_{s \in S} a_u(s)}$$

where $S$ denotes all sentences in the training set, and $\mu_+ = \frac{1}{|S_+|} \sum_{s \in S_+} a_u(s)$ is the average value of unit activation when forwarding a set of sentences $S_+$, which is defined as one of the following:
replicate: $S_+$ contains the sentences created by replicating each concept in $C^*_u$. As before, the sentence length is set to the average length of all training sentences for fair comparison.
one instance: $S_+$ contains just one instance of each concept in $C^*_u$. Thus, the input sentence is in general shorter than in the other settings.
inclusion: $S_+$ contains the training sentences that include at least one concept in $C^*_u$.
random: $S_+$ contains randomly sampled sentences from the training data.
In contrast, $\mu_- = \frac{1}{|S_-|} \sum_{s \in S_-} a_u(s)$ is the average value of unit activation when forwarding $S_-$, which consists of training sentences that do not include any concept in $C^*_u$. Intuitively, if unit $u$’s activation is highly sensitive to $C^*_u$ (i.e. the concepts found by our alignment method) and not to other factors, then $\mathrm{Sel}_u$ gets large; otherwise, $\mathrm{Sel}_u$ is near 0.
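As a small numeric sketch of this measure, assume that selectivity is the difference between the mean activation on the concept-bearing sentences and the mean activation on the concept-free sentences, divided by the range of the unit's mean activation over the whole training set. That normalization, the function name, and the toy activation values are assumptions made for illustration.

```python
from typing import Sequence


def selectivity(
    act_plus: Sequence[float],
    act_minus: Sequence[float],
    act_all: Sequence[float],
) -> float:
    """Concept selectivity of one unit under the assumed definition.

    act_plus:  mean activations on sentences built from the aligned concepts.
    act_minus: mean activations on training sentences without those concepts.
    act_all:   mean activations on all training sentences (for the range).
    """
    mu_plus = sum(act_plus) / len(act_plus)
    mu_minus = sum(act_minus) / len(act_minus)
    return (mu_plus - mu_minus) / (max(act_all) - min(act_all))


# Hypothetical mean activations of a single unit.
sel_selective = selectivity([0.8, 0.6], [0.1, 0.3], [0.0, 0.1, 0.3, 0.6, 0.8, 1.0])
sel_indifferent = selectivity([0.4, 0.6], [0.5, 0.5], [0.0, 0.4, 0.5, 0.6, 1.0])
print(sel_selective, sel_indifferent)
```

The first unit responds much more strongly to its aligned concepts and receives a clearly positive score, while the second responds equally to both sets and scores near 0.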
Figure 2 shows the mean and variance of selectivity values for all units learned in each dataset for the four categories. Consistent with our intuition, in all datasets, the mean selectivity of the replicate set is the highest by a significant margin, those of the one instance and inclusion sets are the runners-up, and that of the random set is the lowest. These results support our claims that units are selectively responsive to specific concepts and that our method succeeds in aligning such concepts to units. Moreover, the mean selectivity of the replicate set is higher than that of the one instance set, which implies that a unit’s activation increases as its concepts appear more often in the input text.
4.3 Concept Alignment of Units
Figure 3 shows examples of the top $K$ sentences and the aligned concepts discovered by our method for selected units. For each unit, we find the top $K$ sentences that activate it the most in several encoding layers of ByteNet and VDCNN, and select some of them (only up to five sentences are shown due to space constraints). We observe that some patterns appear frequently within the top $K$ sentences. For example, in the top $K$ sentences that activate unit 124 of the 0th layer of ByteNet, the concepts ‘(’, ‘)’, and ‘-’ appear in common, while the concepts soft, software, and wi appear frequently in the sentences for unit 19 of the 1st layer of VDCNN. These results qualitatively show that individual units are selectively responsive to specific natural language concepts.
More interestingly, we discover that many units capture specific meanings or syntactic roles beyond superficial, low-level patterns. For example, unit 690 of the 14th layer in ByteNet captures the concepts (what, who, where), all of which play a similar grammatical role. On the other hand, unit 224 of the 14th layer in ByteNet and unit 53 of the 0th layer in VDCNN each capture semantically similar concepts, with the ByteNet unit detecting the meaning of certainty in knowledge (sure, know, aware) and the VDCNN unit detecting years (1999, 1969, 1992). This suggests that, although we train character-level CNNs by feeding sentences in the form of discrete symbols (i.e. character indices), individual units can capture natural language concepts that share a similar semantic or grammatical role. More quantitative analyses for such concepts are available in Appendix E.
We note that there are units that detect concepts more abstract than just morphemes, words, or phrases, and for these units, our method tends to align relevant lower-level concepts. For example, in unit 244 of the 3rd layer in VDCNN, while each aligned concept emerges only once in the top $K$ sentences, all top $K$ sentences have similar nuances, such as negative sentiment. In this case, our method does capture relevant phrase-level concepts (e.g., very disappointing, absolute worst place), indicating that the higher-level nuance (e.g., negativity) is indirectly captured.
We note that, because the number of morphemes, words, and phrases present in the training corpus is usually much greater than the number of units per layer, we do not expect every natural language concept in the corpus to be aligned to some unit. Our approach thus tends to find concepts that are frequent in the training data or that are more important than others for solving the target task.
Overall, these results suggest how input sentences are represented in the hidden layers of the CNN:
Several units in the CNN learned on NLP tasks respond selectively to specific natural language concepts, rather than getting activated in an uninterpretable way. This means that these units can serve as detectors for specific natural language concepts.
There are units capturing syntactically or semantically related concepts, suggesting that they model the meaning or grammatical role shared between those concepts, as opposed to superficially modeling each natural language symbol.
4.4 Concept Distribution in Layers
Using the concept alignments found earlier, we can visualize how concepts are distributed across layers. Figure 4 shows the concepts of the units in the 0th, 1st, and 3rd layers of VDCNN learned on the AG News dataset, and in the 0th, 4th, and 14th layers of the ByteNet encoder learned on the English-to-German Europarl dataset, together with their number of aligned units. For each layer, we sort concepts in decreasing order of the number of aligned units and show the 30 most aligned concepts. Recall that, since we align concepts for each unit, some concepts are aligned to multiple units simultaneously. Concept distributions for other datasets are available in Appendix G.
Overall, we find that data-specific and task-specific concepts are likely to be aligned to many units. In AG News, since the task is to classify given sentences into the following categories: World, Sports, Business, and Science/Tech, concepts related to these topics commonly emerge. Similarly, we can see that units learned for the Europarl dataset focus on encoding some key words (e.g. vote, propose, environment) in the training corpus.
4.5 How does Concept Granularity Evolve with Layer?
In computer vision, the visual concepts captured by units in CNN representations learned for image recognition tasks evolve with layer depth; color and texture concepts emerge in earlier layers, and more abstract concepts like parts and objects emerge in deeper layers. To check whether this also holds for representations learned in NLP tasks, we divide the granularity of natural language concepts into morpheme, word, and $N$-gram phrase, and observe the number of units to which they are aligned in different layers.
Figure 5 shows this trend: in lower layers such as the 0th layer, fewer phrase concepts but more morphemes and words are detected. This is because we use a character-level CNN, whose convolutional receptive fields may not be large enough to detect lengthy phrases. Further, in the translation cases, we observe that concepts change significantly in shallower layers (e.g. from the 0th to the 4th), but do not change much from middle to deeper layers (e.g. from the 5th to the 14th).
Thus, it remains for us to answer the following question: for the representations learned on translation datasets, why does concept granularity not evolve much in deeper layers? One possibility is that the capacity of the network is large enough that the representations in the middle layers are already sufficiently informative to solve the task. To validate this hypothesis, we re-train ByteNet from scratch while varying only the layer depth of the encoder and fixing other conditions. We record the BLEU scores on the validation data as shown in Figure 7. The performance of the translation model does not change much with more than six encoder layers, but it drops significantly for models with fewer than 4 encoder layers. This trend coincides with the result from Figure 5 that the evolution of concept granularity stops around the middle-to-higher layers. This shared pattern suggests that about six encoder layers are enough to encode the informative factors in the given datasets and to perform optimally on the translation task. In deeper models, this may suggest that the middle layers’ representations are already informative enough to encode the input text. Our result may also partly coincide with that of Mou et al. (2016), which shows that the representations of intermediate layers are more transferable than those of deeper layers in language tasks, unlike in computer vision, where deeper layers are usually more useful and discriminative.
4.6 What Makes Certain Concepts Emerge More than Others?
We show how many units each concept is aligned to per layer in Section 4.4 and Appendix G. We observe that the concepts do not appear uniformly; some concepts are aligned to many units, while others are aligned to few or even no units. Then, the following question arises: What makes certain concepts emerge more than others?
Two possible hypotheses may explain the emergence of dominant concepts. First, the concepts with a higher frequency in training data may be aligned to more units. Figure 7-(a) shows the correlation between the frequency of each concept in the training corpus and the number of units where each concept is aligned in the last layer of the topic classification model learned on AG News dataset.
Second, the concepts that have more influence on the objective function (expected loss) may be aligned to more units. We can measure the effect of concept $c$ on the task performance as the Delta of Expected Loss (DEL) as follows:

$$\mathrm{DEL}(c) = \mathbb{E}_{s \in S,\, y \in Y}\left[\mathcal{L}(\mathrm{Occ}_c(s), y)\right] - \mathbb{E}_{s \in S,\, y \in Y}\left[\mathcal{L}(s, y)\right]$$

where $S$ is the set of training sentences, $Y$ is the set of ground-truths, and $\mathcal{L}(s, y)$ is the loss function for the input sentence $s$ and label $y$. $\mathrm{Occ}_c(s)$ is an occlusion of concept $c$ in sentence $s$, where we replace concept $c$ by dummy character tokens that have no meaning. If sentence $s$ does not include concept $c$, $\mathrm{Occ}_c(s)$ equals the original sentence $s$. As a result, $\mathrm{DEL}(c)$ measures the impact of concept $c$ on the loss function, where a large positive value implies that concept $c$ has an important role in solving the target task. Figure 7-(b) shows the correlation between the DEL and the number of units per concept. The Pearson correlation coefficients for hypotheses (a) and (b) are 0.732 and 0.492, respectively. Such values indicate that the representations are learned to identify both the frequent concepts in the training data and the concepts important for solving the target task.
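The occlusion-based measure can be sketched on a toy classifier. Everything model-related here is hypothetical: the keyword rule in `toy_predict`, the 0/1 loss, and the use of an underscore as the meaningless dummy character are assumptions chosen only to make the computation concrete.

```python
from typing import Callable, Sequence


def occlude(sentence: str, concept: str, dummy: str = "_") -> str:
    """Replace every occurrence of the concept with dummy characters.

    A sentence that does not contain the concept is returned unchanged.
    """
    return sentence.replace(concept, dummy * len(concept))


def delta_expected_loss(
    concept: str,
    sentences: Sequence[str],
    labels: Sequence[str],
    loss: Callable[[str, str], float],
) -> float:
    """Mean loss with the concept occluded minus mean loss on the originals."""
    diffs = [loss(occlude(s, concept), y) - loss(s, y) for s, y in zip(sentences, labels)]
    return sum(diffs) / len(diffs)


def toy_predict(sentence: str) -> str:
    """Hypothetical classifier that relies on a single keyword."""
    return "sports" if "goal" in sentence else "business"


def toy_loss(sentence: str, label: str) -> float:
    """0/1 loss of the toy classifier."""
    return 0.0 if toy_predict(sentence) == label else 1.0


sentences = ["late goal wins match", "stocks fall again"]
labels = ["sports", "business"]
del_goal = delta_expected_loss("goal", sentences, labels, toy_loss)
del_again = delta_expected_loss("again", sentences, labels, toy_loss)
print(del_goal, del_again)
```

Occluding goal flips the toy prediction for one of the two sentences and raises the expected loss, whereas occluding again changes nothing, so only the former receives a positive score.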
We proposed a simple but highly effective concept alignment method for character-level CNNs to confirm that each unit of the hidden layers serves as a detector of natural language concepts. Using this method, we analyzed the characteristics of units with multiple datasets on classification and translation tasks. Consequently, we shed light on how deep representations capture natural language and how they vary across conditions.
An interesting future direction is to extend the concept coverage from natural language to more abstract forms such as sentence structure, nuance, and tone. Another direction is to quantify the properties of individual units in other models widely used in NLP tasks. In particular, combining our definition of concepts with the attention mechanism (e.g. Bahdanau et al. (2015)) could be a promising direction, because it can reveal how the representations are attended by the model to capture concepts, helping us better understand the decision-making process of popular deep models.
We appreciate Insu Jeon, Jaemin Cho, Sewon Min, Yunseok Jang and the anonymous reviewers for their helpful comments and discussions. This work was supported by Kakao and Kakao Brain corporations, IITP grant funded by the Korea government (MSIT) (No. 2017-0-01772) and Creative-Pioneering Researchers Program through Seoul National University. Gunhee Kim is the corresponding author.
- Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, 2015.
- Adi et al. (2017) Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks. ICLR, 2017.
- Bahdanau et al. (2015) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR, 2015.
- Bau et al. (2017) David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network Dissection: Quantifying Interpretability of Deep Visual Representations. In CVPR, 2017.
- Bau et al. (2019) David Bau, Jun-Yan Zhu, Hendrik Strobelt, Bolei Zhou, Joshua B. Tenenbaum, William T. Freeman, and Antonio Torralba. Visualizing and Understanding Generative Adversarial Networks. In ICLR, 2019.
- Belinkov et al. (2017) Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. What do Neural Machine Translation Models Learn about Morphology? In ACL, 2017.
- Bojanowski et al. (2017) Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. Enriching Word Vectors with Subword Information. TACL, 5:135–146, 2017. ISSN 2307-387X.
- Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In EMNLP, 2014.
- Conneau et al. (2017) Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. Very Deep Convolutional Networks for Text Classification. In EACL, 2017.
- Conneau et al. (2018) Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. What You Can Cram into a Single Vector: Probing Sentence Embeddings for Linguistic Properties. In ACL, 2018.
- Erhan et al. (2009) Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing Higher-layer Features of a Deep Network. University of Montreal, 2009.
- Fong & Vedaldi (2018) Ruth Fong and Andrea Vedaldi. Net2Vec: Quantifying and Explaining how Concepts are Encoded by Filters in Deep Neural Networks. In CVPR, 2018.
- Kalchbrenner et al. (2016) Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aaron van den Oord, Alex Graves, and Koray Kavukcuoglu. Neural Machine Translation in Linear Time. arXiv preprint arXiv:1610.10099, 2016.
- Karpathy et al. (2015) Andrej Karpathy, Justin Johnson, and Li Fei-Fei. Visualizing and Understanding Recurrent Networks. arXiv preprint arXiv:1506.02078, 2015.
- Kim et al. (2016) Yoon Kim, Yacine Jernite, David Sontag, and Alexander M Rush. Character-Aware Neural Language Models. In AAAI, 2016.
- Kingma & Ba (2015) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. ICLR, 2015.
- Kitaev & Klein (2018) Nikita Kitaev and Dan Klein. Constituency Parsing with a Self-Attentive Encoder. ACL, 2018.
- Koehn (2005) Philipp Koehn. Europarl: A Parallel Corpus for Statistical Machine Translation. In MT summit, volume 5, pp. 79–86, 2005.
- Lehmann et al. (2015) Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. DBpedia–a Large-Scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web, 6(2):167–195, 2015.
- Morcos et al. (2018) Ari S. Morcos, David G.T. Barrett, Neil C. Rabinowitz, and Matthew Botvinick. On the Importance of Single Directions for Generalization. In ICLR, 2018.
- Mou et al. (2016) Lili Mou, Zhao Meng, Rui Yan, Ge Li, Yan Xu, Lu Zhang, and Zhi Jin. How Transferable are Neural Networks in NLP Applications? In EMNLP, 2016.
- Müllner (2011) Daniel Müllner. Modern Hierarchical, Agglomerative Clustering Algorithms. arXiv preprint arXiv:1109.2378, 2011.
- Olah et al. (2017) Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature Visualization. Distill, 2017. doi: 10.23915/distill.00007. https://distill.pub/2017/feature-visualization.
- Olah et al. (2018) Chris Olah, Arvind Satyanarayan, Ian Johnson, Shan Carter, Ludwig Schubert, Katherine Ye, and Alexander Mordvintsev. The Building Blocks of Interpretability. Distill, 2018. doi: 10.23915/distill.00010. https://distill.pub/2018/building-blocks.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher Manning. Glove: Global Vectors for Word Representation. In EMNLP, 2014.
- Qian et al. (2016) Peng Qian, Xipeng Qiu, and Xuanjing Huang. Analyzing Linguistic Knowledge in Sequential Model of Sentence. In EMNLP, 2016.
- Radford et al. (2017) Alec Radford, Rafal Jozefowicz, and Ilya Sutskever. Learning to Generate Reviews and Discovering Sentiment. arXiv preprint arXiv:1704.01444, 2017.
- Role & Nadif (2011) François Role and Mohamed Nadif. Handling the Impact of Low Frequency Events on Co-occurrence based Measures of Word Similarity. In KDIR, 2011.
- Shi et al. (2016a) Xing Shi, Kevin Knight, and Deniz Yuret. Why Neural Translations are the Right Length. In EMNLP, 2016a.
- Shi et al. (2016b) Xing Shi, Inkit Padhi, and Kevin Knight. Does String-Based Neural MT Learn Source Syntax? In EMNLP, 2016b.
- Simonyan et al. (2013) Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Deep inside Convolutional Networks: Visualising Image Classification Models and Saliency maps. arXiv preprint arXiv:1312.6034, 2013.
- Speer et al. (2017) Robert Speer, Joshua Chin, and Catherine Havasi. ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. In AAAI, 2017.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. Sequence to Sequence Learning with Neural Networks. In NIPS, 2014.
- Tang et al. (2017) Zhiyuan Tang, Ying Shi, Dong Wang, Yang Feng, and Shiyue Zhang. Memory Visualization for Gated Recurrent Neural Networks in Speech Recognition. In ICASSP, 2017.
- Tiedemann (2012) Jörg Tiedemann. Parallel Data, Tools and Interfaces in OPUS. In LREC. ELRA, 2012. ISBN 978-2-9517408-7-7.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Llion Jones, Jakob Uszkoreit, Aidan N Gomez, and Łukasz Kaiser. Attention is All You Need. In NIPS, 2017.
- Virpioja et al. (2013) Sami Virpioja, Peter Smit, Stig-Arne Gronroos, and Mikko Kurimo. Morfessor 2.0: Python Implementation and Extensions for Morfessor Baseline. In Aalto University publication series. Department of Signal Processing and Acoustics, Aalto University, 2013.
- Zhang et al. (2015) Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level Convolutional Networks for Text Classification. In NIPS, 2015.
- Zhou et al. (2015) Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Object Detectors Emerge in Deep Scene CNNs. In ICLR, 2015.
- Zhou et al. (2018a) Bolei Zhou, David Bau, Aude Oliva, and Antonio Torralba. Interpreting Deep Visual Representations via Network Dissection. IEEE TPAMI, 2018a.
- Zhou et al. (2018b) Bolei Zhou, Yiyou Sun, David Bau, and Antonio Torralba. Interpretable Basis Decomposition for Visual Explanation. In ECCV, 2018b.
- Zhu et al. (2018) Xunjie Zhu, Tingfeng Li, and Gerard Melo. Exploring Semantic Properties of Sentence Embeddings. In ACL, 2018.
Appendix A Other Metrics for DoA with Biased Alignment Result
In Section 3.2, we define the Degree of Alignment (DoA) between a concept and a unit as the activation value of the unit for a replication of the concept. We experimented with several other DoA metrics, but many of them give biased concept alignment results for several reasons. Here we describe the metrics we tried and the reasons they fail.
a.1 Point-wise Mutual Information
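Point-wise Mutual Information (PMI) is a measure of association used in information theory and statistics. The PMI of a pair of samples $x$ and $y$ sampled from random variables $X$ and $Y$ quantifies the discrepancy between the probability of their coincidence under the joint distribution and under the assumption of independence as follows:
$$\mathrm{pmi}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$$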
We then define the DoA between a candidate concept and a unit as the PMI between them.
However, this metric is biased toward lengthy concepts even in earlier layers, which is implausible given the limited receptive field of the convolution. Our intuition for this bias is consistent with Role & Nadif (2011): a well-known problem with PMI is its tendency to give very high association scores to pairs involving low-frequency events, as the denominator is small in such cases. If a certain concept in the top sentences is very lengthy, its frequency in the corpus is very small, and its PMI with the unit is large regardless of the actual correlation between the two.
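The low-frequency bias can be illustrated with a small numeric sketch of the standard PMI estimate, log p(x, y) / (p(x) p(y)). The counts below are made up for illustration, `pmi_from_counts` is a hypothetical helper rather than part of the original code, and treating "being one of the unit's top sentences" as the second event is an assumption about how the probabilities are estimated.

```python
import math

def pmi_from_counts(n_xy, n_x, n_y, n_total):
    """PMI estimated from co-occurrence counts via relative frequencies."""
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return math.log(p_xy / (p_x * p_y))

# Hypothetical corpus of 10,000 sentences; 10 of them are a unit's top sentences.
N_TOTAL, N_TOP = 10_000, 10

# A frequent short concept: 1,000 sentences overall, 8 of the 10 top sentences.
pmi_short = pmi_from_counts(n_xy=8, n_x=1_000, n_y=N_TOP, n_total=N_TOTAL)

# A long phrase that occurs once in the corpus, inside a single top sentence.
pmi_long = pmi_from_counts(n_xy=1, n_x=1, n_y=N_TOP, n_total=N_TOTAL)

print(f"short concept: {pmi_short:.2f}, long phrase: {pmi_long:.2f}")
```

In this toy setting the one-off phrase receives the higher score even though the short concept covers most of the top sentences, which is the kind of preference for rare, lengthy concepts described above.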
a.2 Concept Occlusion
We tested concept alignments with the following concept occlusion method. For each of the top sentences, we replace a candidate concept with dummy character tokens that have no meaning, forward the sentence to the model, and measure the reduction of the unit activation value. We repeat this for every candidate concept in the sentences; as a result, we can identify which candidate concepts greatly reduce the unit activation values. We thus define the concepts aligned to each unit as the candidate concepts that consistently lower the unit activation across the top sentences.
More formally, for each unit, consider its top activated sentences. Since we occlude each candidate concept in these sentences, we define the set of candidate concepts as the concepts obtained from parsing each of these sentences.
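We define the degree of alignment (DoA) between a candidate concept $c$ and a unit $u$ as:
$$\mathrm{DoA}_{u,c} = \frac{1}{Z} \sum_{s \in S} \mathbb{1}(c \in s)\,\bigl(a_u(s) - a_u(\mathrm{Occ}_c(s))\bigr)$$
where $S$ is the set of top activated sentences of unit $u$, $Z$ is a normalizing factor, $a_u(\cdot)$ indicates the mean activation of unit $u$, $\mathrm{Occ}_c(s)$ is the sentence $s$ where candidate concept $c$ is occluded, and $\mathbb{1}(c \in s)$ is an indicator of whether $c$ is included in the sentence $s$. In short, the DoA measures how much a candidate concept contributes to the activation of the unit’s top sentences. If a candidate concept $c$ appears in the top sentences and occluding it greatly reduces the activation of unit $u$, then $\mathrm{DoA}_{u,c}$ gets large, implying that $c$ is strongly aligned to unit $u$.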
Unfortunately, this metric cannot fairly compare the attribution of several candidate concepts. For example, suppose that two concepts, hit and hit the ball, are included in one sentence. Occluding hit the ball might give a relatively larger decrement in the unit activation value than occluding hit, since hit the ball includes hit. For this reason, the occlusion-based metric depends unnecessarily on the length of a concept rather than on its attribution.
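The length dependence can be reproduced with a toy sketch of the occlusion procedure. Everything here is hypothetical: the per-token scores stand in for a real unit, `<occ>` stands in for the dummy tokens, and the normalizing factor is assumed to be the number of top sentences that contain the concept.

```python
DUMMY = "<occ>"  # hypothetical dummy token with no meaning

# Hypothetical per-token responses of a toy unit; unlisted tokens score 0.1.
TOKEN_SCORE = {"hit": 1.0, "ball": 0.5, DUMMY: 0.0}

def unit_activation(tokens):
    """Mean activation of the toy unit over a tokenized sentence."""
    return sum(TOKEN_SCORE.get(t, 0.1) for t in tokens) / len(tokens)

def find(tokens, concept):
    """Start index of the first occurrence of `concept` in `tokens`, or -1."""
    n = len(concept)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == concept:
            return i
    return -1

def occlude(tokens, concept):
    """Replace the first occurrence of `concept` with dummy tokens."""
    i = find(tokens, concept)
    if i < 0:
        return list(tokens)
    return tokens[:i] + [DUMMY] * len(concept) + tokens[i + len(concept):]

def occlusion_doa(top_sentences, concept):
    """Average activation drop over the top sentences containing `concept`."""
    drops = [unit_activation(s) - unit_activation(occlude(s, concept))
             for s in top_sentences if find(s, concept) >= 0]
    return sum(drops) / len(drops) if drops else 0.0

top = [["he", "hit", "the", "ball", "hard"]]
doa_hit = occlusion_doa(top, ["hit"])
doa_phrase = occlusion_doa(top, ["hit", "the", "ball"])
print(doa_hit, doa_phrase)
```

Although the toy unit responds most strongly to hit, the longer phrase obtains the larger score simply because occluding it removes more tokens.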
a.3 Inclusion Selectivity
Note that the inclusion selectivity in Section 4.2 can also be used as a DoA. Recall that the inclusion selectivity is calculated as in Equation 2. In this case, the relevant quantity is the average value of the unit activation when forwarding the set of sentences that include the candidate concept.
However, this induces a bias similar to the one in Section A.1. It always prefers lengthy phrases, since such lengthy concepts occur only a few times in the entire corpus. For example, assume that the activation value of a unit for a sentence including a specific lengthy phrase is very high. If such a phrase occurs only once in the entire corpus, the average is equal to the activation value of that sentence, which is much higher than the averages for other candidate concepts. This error could be alleviated on a very large corpus where every candidate concept occurs often enough for the estimate of the average to be relatively accurate, which is not possible in practice.
a.4 Computing DoA Values without Replication
In Section 3.2, we replicate each candidate concept into the input sentence for computing the DoA in Eq. (1). Since each unit works as a concept detector whose activation value increases with the length of the input sentence (Section 4.2), it is essential to normalize the length of the input for a fair comparison of DoA values between concepts of different lengths. Without the length normalization (i.e., when each input sentence consists of just one instance of the candidate concept), the DoA metric is biased toward lengthy concepts (e.g., phrases) because they typically carry more signals that affect the unit activation than short candidate concepts (e.g., single words).
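A minimal sketch of such length normalization is given below. The helper `replicate`, the choice of a fixed `target_len`, and the truncation of the last repetition are all assumptions made for illustration; the original implementation may pad or cut differently.

```python
def replicate(concept_tokens, target_len):
    """Repeat `concept_tokens` until the input has exactly `target_len` tokens.

    The final repetition is truncated if it does not fit (an assumption).
    """
    if not concept_tokens or target_len <= 0:
        return []
    reps = -(-target_len // len(concept_tokens))  # ceiling division
    return (concept_tokens * reps)[:target_len]

# A single word and a phrase end up with the same input length.
word_input = replicate(["hit"], 6)
phrase_input = replicate(["hit", "the", "ball"], 6)
print(word_input)
print(phrase_input)
```

With both inputs at the same length, a difference in unit activation reflects the content of the concept rather than how many tokens it spans.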
Appendix B Training Details
In this work, we trained a ByteNet for the translation tasks and a VDCNN for the classification tasks, both to analyze properties of representations for language. Training details are as follows.
We trained a ByteNet on the translation tasks, in particular on the WMT’17 English-to-German Europarl dataset, the English-to-German news dataset, and the WMT’16 English-to-French and English-to-Czech news datasets. We used the same model architecture and hyperparameters for all datasets. We set the batch size to 8 and the learning rate to 0.001. The parameters were optimized with Adam (Kingma & Ba, 2015) for 5 epochs, with early stopping used to find parameters that generalize well. Our code is based on a TensorFlow (Abadi et al., 2015) implementation of ByteNet found at https://github.com/paarthneekhara/byteNet-tensorflow.
b.2 Very Deep CNN (VDCNN)
We trained a VDCNN for classification tasks, in particular on the AG News dataset, the binarized version of the Yelp Reviews dataset, and the DBpedia ontology dataset. For each task, we used 1 temporal convolutional layer followed by 4 convolutional blocks, with each convolutional layer having a filter width of 3. In our experiments, we analyze the representations of each convolutional block. The number of units in the representation of each block is 64, 128, 256, and 512, respectively. We set the batch size to 64 and the learning rate to 0.01. The parameters were optimized using SGD for 50 epochs, with early stopping. For each of the AG News, Yelp Reviews and DBpedia datasets, a VDCNN was trained with the same structure and hyperparameters. Our code is based on a TensorFlow implementation of VDCNN found at https://github.com/zonetrooper32/VDCNN.
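For quick reference, the hyperparameters stated in this appendix can be collected in a small configuration sketch. The `TrainConfig` class and its field names are hypothetical and are not taken from either of the referenced repositories; only the values come from the text above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container for the training settings reported above."""
    model: str
    batch_size: int
    learning_rate: float
    optimizer: str
    epochs: int
    early_stopping: bool = True

# Translation model settings.
BYTENET = TrainConfig(model="ByteNet", batch_size=8, learning_rate=0.001,
                      optimizer="adam", epochs=5)

# Classification model settings.
VDCNN = TrainConfig(model="VDCNN", batch_size=64, learning_rate=0.01,
                    optimizer="sgd", epochs=50)

# Units per analyzed VDCNN convolutional block, and the filter width.
VDCNN_UNITS_PER_BLOCK = (64, 128, 256, 512)
VDCNN_FILTER_WIDTH = 3
```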
Appendix C Variants of Alignment with Different M values
In Section 3.2, we fix the value of $M$. Although $M$ is used as a threshold that sets how many concepts per unit are considered, different values of $M$ have little influence on quantitative results such as the selectivity in Section 4.2. Figure 8 shows the mean and variance of the selectivity values for different $M$, where there is little variation in the overall trend: the selectivity of the replicate set is the highest, that of one instance is the runner-up, and that of random is the lowest.
Appendix D Non-interpretable Units
Whereas some units are sensitive to specific natural language concepts as shown in Section 4.3, other units are not sensitive to any concepts at all. We call such units non-interpretable units, which deserve to be explored. We first define the unit interpretability for a unit as follows:
$$\mathrm{Interpretability}(u) = \begin{cases} 1 & \text{if } \max_{s \in S} a_u(s) < \max_{i} r_{u,i} \\ 0 & \text{otherwise} \end{cases}$$
where $u$ is the unit, $S$ is the set of training sentences, $a_u(s)$ is the activation value of unit $u$ for sentence $s$, and $r_{u,i}$ is the activation value of the sentence that is made up of replicating the $i$-th concept. We define unit $u$ as interpretable when its interpretability equals 1, and as non-interpretable otherwise. The intuition is that if a replicated sentence composed of only one concept has a lower activation value than the top-activated sentences, the unit is not sensitive to the concept compared to a sequence of different words.
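One reading of this definition can be sketched as a simple indicator. The function name and the strict inequality are assumptions: the sketch treats a unit as interpretable when at least one replicated-concept input activates it more strongly than every training sentence does.

```python
def unit_interpretability(train_activations, replicated_activations):
    """Return 1 if some replicated-concept input beats all training sentences.

    `train_activations`: activation values of the unit on training sentences.
    `replicated_activations`: activation values of the unit on inputs built by
    replicating each candidate concept. Both are plain lists of floats.
    """
    if not train_activations or not replicated_activations:
        return 0
    return int(max(replicated_activations) > max(train_activations))

# Hypothetical activations for two units.
print(unit_interpretability([0.2, 0.5], [0.7, 0.1]))  # replicated input wins
print(unit_interpretability([0.2, 0.9], [0.7, 0.1]))  # a training sentence wins
```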
Figure 9 shows the ratio of the interpretable units in each layer on several datasets. We observe that more than 90% of units are interpretable across all layers and all datasets.
Figure 10 illustrates some examples of non-interpretable units with their top five activated sentences and their concepts. Unlike in Figure 3, the aligned concepts do not appear frequently in the top-activated sentences. This is expected given that the concepts have little influence on the unit activation. There are several possible reasons why non-interpretable units appear. One possibility is that some units align with concepts that fall outside the natural language forms we consider. For example, for unit 001 on the left of Figure 10, we find that the top activated sentences have a structure involving many commas. Since we limit the candidate concepts to morphemes, words and phrases, such punctuation concepts are hard to detect. Another possibility is that some units may be so-called dead units that are not sensitive to any concept at all. For example, unit 260 on the right of Figure 10 has no pattern that appears consistently in its top activated sentences.
Appendix E Concept clusters
In Section 4.3, we introduce some units whose concepts share a meaning. Here, we use the term concept cluster for concepts that are aligned to the same unit and have similar semantics or grammatical roles. We analyze how clusters are formed in the units and how they vary with the target task and layer depth.
e.1 Concept Clusters by Target Tasks
Figure 11 illustrates some concept clusters of units in the final layer learned on each task. The top and left dendrograms of each figure show the hierarchical clustering results of the 30 concepts aligned with the largest number of units. We use the clustering algorithm of Müllner (2011); we define the distance between two concepts as the Euclidean distance between their vector space embeddings. We use fastText (Bojanowski et al., 2017) pretrained on the Wikipedia dataset to project each concept into the vector space. Since fastText is a character-level $n$-gram based word embedding, we can uniformly obtain embeddings for morphemes as well as for words and phrases. For a phrase embedding, we split the phrase into words, project each word, and average the word embeddings. The distance between two clusters is defined as the distance between their centroids.
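The embedding and distance computations described here can be sketched in a few lines. The toy two-dimensional vectors below stand in for pretrained fastText embeddings, and all function names are hypothetical; a real fastText model would also produce vectors for unseen morphemes from their character n-grams, which a plain dictionary lookup does not.

```python
import math

def average(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def embed_concept(concept, word_vectors):
    """Embed a word or phrase by averaging the vectors of its words."""
    return average([word_vectors[w] for w in concept.split()])

def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_distance(cluster_a, cluster_b):
    """Distance between two clusters, measured between their centroids."""
    return euclidean(average(cluster_a), average(cluster_b))

# Toy vectors standing in for pretrained embeddings (illustrative only).
TOY_VECTORS = {"hit": [1.0, 0.0], "the": [0.0, 0.0], "ball": [0.0, 1.0]}

phrase_vec = embed_concept("hit the ball", TOY_VECTORS)
dist = cluster_distance([[0.0, 0.0], [2.0, 0.0]], [[1.0, 3.0], [1.0, 5.0]])
print(phrase_vec, dist)
```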
Each central heat map represents the number of times each concept pair is aligned to the same unit. Since the concepts on the x- and y-axes are ordered by the clustering result, stronger diagonal blocks (concept clusters) indicate that the concepts in the same unit are more likely to have similar meanings.
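The counts behind such a heat map can be sketched as follows, assuming the alignment result is available as a mapping from unit index to its aligned concepts. The mapping and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations

def pair_counts(unit_to_concepts):
    """Count how many units each unordered concept pair shares."""
    counts = Counter()
    for concepts in unit_to_concepts.values():
        # Sort so that each pair has one canonical (a, b) ordering.
        for a, b in combinations(sorted(set(concepts)), 2):
            counts[(a, b)] += 1
    return counts

# Hypothetical alignment of three units.
units = {0: ["film", "movie", "actor"], 1: ["film", "movie"], 2: ["ship", "film"]}
counts = pair_counts(units)
print(counts[("film", "movie")], counts[("film", "ship")])
```

Arranging these counts in a matrix whose rows and columns follow the dendrogram order would give the block structure discussed above.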
In Figure 11, the units learned in the classification tasks tend to have stronger concept clusters than those learned in the translation tasks. In particular, the concept clusters are highly evident in the units learned on the DBpedia and AG News datasets. Our intuition is that clustering similar concepts is more beneficial for solving classification than for solving translation. That is, in classification, input sentences that contain similar concepts tend to belong to the same class label, while in translation, different concepts should be translated to different words or phrases even if they have similar meanings in general.
e.2 Concept Clusters by Layers
We analyze how concept clusters change by layer in each task. We compute the averaged pairwise distance between the concepts in each layer. We project each concept into the vector space using three pretrained embeddings: (1) GloVe (Pennington et al., 2014), (2) ConceptNet (Speer et al., 2017), and (3) fastText. The GloVe and fastText embeddings are pretrained on the Wikipedia dataset, and the ConceptNet embedding is pretrained based on the ConceptNet graph structure.
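A minimal sketch of this statistic is shown below. It assumes that the distance is Euclidean, that pairs are formed among the concepts aligned to the same unit, and that the per-unit means are then averaged over the units of a layer; the function names are hypothetical.

```python
import math
from itertools import combinations

def mean_pairwise_distance(vectors):
    """Mean Euclidean distance over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    total = sum(math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
                for a, b in pairs)
    return total / len(pairs)

def layer_distance(units):
    """Average the per-unit mean pairwise distance over a layer's units."""
    per_unit = [mean_pairwise_distance(vectors) for vectors in units]
    return sum(per_unit) / len(per_unit) if per_unit else 0.0

# Two hypothetical units, each with the embeddings of its aligned concepts.
layer = [[[0.0, 0.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
print(layer_distance(layer))
```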
Figure 12 shows the averaged pairwise distances in each layer. In all tasks, the concepts in the same unit tend to become closer in the vector space as the layer goes deeper. This indicates that individual units in earlier layers tend to capture more basic text patterns or symbols, while units in deeper layers capture more abstract semantics.
Appendix F What Makes Certain Concepts Emerge More than Others?: Other Datasets
In Section 4.6, we investigate why certain concepts emerge more than others when the ByteNet is trained on the English-to-French news dataset. Here, Figure 14 shows more results on other datasets. Consistent with our intuition, in all datasets, both the document frequency and the delta of expected loss are closely related to the number of units per concept. We conclude that the representations are learned to identify not only the frequent concepts in the training set but also the concepts that are important for solving the target task.
Appendix G Concept Distribution in Layers for Other Datasets
In Section 4.4, we visualized how concepts are distributed across layers, where the model is trained on the AG News dataset and the English-to-German Europarl dataset. Here, Figure 15 shows the concept distribution on the other datasets noted in Table 1.
In the classification tasks, we expect to find more concepts that are directly related to predicting the output label, as opposed to the translation tasks where the representations may have to include information on most of the words for an accurate translation. While our goal is not to relate each concept to one of the labels, we find several concepts that are more predictive of a particular label than others.
Consistent with Section 4.4, there are data-specific and task-specific concepts aligned in each layer, e.g., worst, 2 stars, awful for Yelp Review; film, ship, school for DBpedia; and some key words for the translation datasets. Note that Yelp Review and DBpedia are classification datasets, where the model is required to predict the polarity (i.e., +1 or -1) or the ontology class (i.e., Company, Educational Institution, Artist, Athlete, Officeholder, Mean of Transportation, Building, Natural Place, Village, Animal, Plant, Album, Film, Written Work) of a given sentence in a supervised setting.
Appendix H Multiple Occurrences of Each Concept at Different Layers
Figure 16 shows the number of occurrences of each concept at different layers. We count how many times each concept appears across all layers and sort the concepts in decreasing order. We select two concepts per layer in the translation model and seven concepts per layer in the classification model according to their number of occurrences. For example, since there are 15 encoder layers in the ByteNet translation model, we select 30 concepts in total. Although task- and data-specific concepts emerge at different layers, there is no strong pattern between the concepts and their occurrences at multiple layers.
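The counting step can be sketched as below, assuming the alignment result is available as a mapping from layer index to the concepts aligned in that layer, and assuming that an occurrence means "aligned to at least one unit in the layer". Both the data layout and the function name are hypothetical.

```python
from collections import Counter

def concept_layer_occurrences(layer_to_concepts):
    """Count the layers in which each concept appears, most frequent first."""
    counts = Counter()
    for concepts in layer_to_concepts.values():
        counts.update(set(concepts))  # count each concept once per layer
    # Sort by decreasing count, breaking ties alphabetically.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical alignment over three layers.
layers = {1: ["the", "film"], 2: ["film", "ship"], 3: ["film", "the", "the"]}
ranking = concept_layer_occurrences(layers)
print(ranking)
```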