Read Beyond the Lines: Understanding the Implied Textual Meaning via a Skim and Intensive Reading Model

The non-literal interpretation of a text is hard for machine models to understand due to its high context sensitivity and heavy use of figurative language. In this study, inspired by human reading comprehension, we propose a novel, simple, and effective deep neural framework, called the Skim and Intensive Reading Model (SIRM), for figuring out implied textual meaning. The proposed SIRM consists of two main components, namely the skim reading component and the intensive reading component. N-gram features are quickly extracted by the skim reading component, a combination of several convolutional neural networks, as skim (entire) information. The intensive reading component enables a hierarchical investigation of both local (sentence) and global (paragraph) representations, which encapsulates the current embedding and the contextual information with a dense connection. More specifically, the contextual information includes the near-neighbor information and the skim information mentioned above. Finally, besides the normal training loss function, we employ an adversarial loss function as a penalty over the skim reading component to eliminate noisy information arising from special figurative words in the training data. To verify the effectiveness, robustness, and efficiency of the proposed architecture, we conduct extensive comparative experiments on several sarcasm benchmarks and an industrial spam dataset with metaphors. Experimental results indicate that (1) the proposed model, which benefits from context modeling and consideration of figurative language, outperforms existing state-of-the-art solutions with a comparable parameter scale and training speed; (2) the SIRM yields superior robustness in terms of parameter-size sensitivity; (3) compared with ablation and addition variants, the final framework is sufficiently efficient.




1 Introduction

Language does not always express its literal meaning, e.g., sarcasm and metaphor. People often use words that deviate from their conventionally accepted definitions in order to convey complicated and implied meanings Tay et al. (2018). A typical example is shown in Figure 1.

Figure 1: A typical example of a sentence that expresses a non-literal meaning.

Compared with standard (literal) text usage, non-literal text is associated with two typical linguistic phenomena:

  • From a syntactic and semantic viewpoint, non-literal text is highly context-sensitive. People can perceive the implied meaning of a text through unnatural language usage in context. For instance, as Figure 1 shows, ‘rope’ has two different meanings: the literal meaning for ‘bungee jumping’ and the implied meaning (‘umbilical cord’) for ‘came into this world’. People can tell the difference after digesting the whole sentence.

  • From a lexical viewpoint, non-literal text is often created by presenting words that are equated, compared, or associated with normally unrelated or figurative meanings. These words express different or even opposite meanings, which can change the word distribution under a semantic topic or sentiment polarity and thus hinder the training of machine models. In addition, some of these words, appearing frequently in the training set, will mislead the machine model during inference. For instance, no matter how many times ‘rope’ refers to ‘umbilical cord’ in the training set, we cannot assert that it conveys the same meaning in a newly arriving text.

Existing text representation studies, which mainly rely on content embeddings Mikolov et al. (2013b) fed into deep neural networks LeCun et al. (2015) such as Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and attention mechanisms, are not fully suited to the aforementioned problems. RNNs, such as the Long Short-Term Memory (LSTM) Hochreiter and Schmidhuber (1997), the Gated Recurrent Unit (GRU) Cho et al. (2014), and the Simple Recurrent Unit (SRU) Lei and Artzi (2018), draw on the idea of the language model Bengio et al. (2003). However, RNNs, including bidirectional ones, can neglect long-term dependencies, as demonstrated in Wang et al. (2016); Li et al. (2018); Shen et al. (2019), since the current term directly depends on the previous term as opposed to the entire information. Although attention mechanisms Yang et al. (2016) over RNNs provide an important potential to aggregate all hidden states, they focus more on local parts of a text. CNNs Kalchbrenner et al. (2014) can characterize local spatial features and then assemble these features through deeper layers, which excel at extracting phrase-level features. The self-attention mechanism Vaswani et al. (2017) characterizes the dependency of one term on the others in the input sequence and encodes the mutual relationships to capture contextual information. Unfortunately, none of the standard text representation models effectively utilizes the contextual representation directly as input when encoding the current term, which is necessary for understanding the implied meaning. There are also several models tailored for sarcasm detection Tay et al. (2018), which concentrate more on word incongruity within the text.

1.1 Research Objectives

Hence, this study aims to cope with the following drawbacks of existing work:

  • Existing text representation models do not provide a dedicated mechanism to effectively use contextual/global information when interpreting the implied meaning of the input text.

  • Meanwhile, all existing models neglect the potentially harmful effect of figurative words, which can appear frequently in the training set of implied texts.

  • Existing methods do not consider both model complexity and model performance at design time, which can be very important in practical applications.

To this end, we try to design a simple and effective model that interprets implied meaning by overcoming the challenges mentioned above. From a human reading comprehension viewpoint, to understand a difficult text, a human may first skim it quickly to estimate the entire information of the target text. Then, in order to consume the content, he/she can read the text word by word and sentence by sentence with respect to the entire information. Inspired by this procedure, we propose a novel deep neural network, namely the Skim and Intensive Reading Model (SIRM), to address the problem of identifying implied textual meaning. Furthermore, a human can skip noisy or unimportant figurative phrases seen before in order to quickly grasp the main idea of the target text. An adversarial loss is used in the SIRM to achieve the same effect. To the best of our knowledge, the SIRM is the first model that tries to simulate such a human reading procedure for understanding and identifying non-literal text.

Furthermore, taking efficiency into consideration, we design the details of the proposed SIRM under Occam’s razor: ‘More things should not be used than are necessary’. In other words, under the premise of optimal task performance, we remove unnecessary components and use the simplest architecture.

1.2 Contributions

Briefly, given the above research objectives, our main contributions of this work can be summarized as follows:

  • The challenges of understanding implied textual meaning are well investigated in this research and can be summarized as context sensitivity and the use of figurative meaning. To the best of our knowledge, these challenges have not been thoroughly studied.

  • We propose the SIRM to understand implied textual meaning where the intensive reading component, which enables a hierarchical investigation for sentence and paragraph representation, depends on the global information extracted by the skim reading component. The cooperation of the skim reading component and the intensive reading component in the SIRM achieves a positive impact on comprehending nonliteral interpretation by modeling the global contextual information directly.

  • We introduce an adversarial loss as a penalty over the skim reading component to cut down noise due to special figurative words during the training procedure.

  • We conduct extensive comparative experiments to show the effectiveness, robustness, and efficiency of the SIRM. Compared with the existing alternative models, the SIRM achieves superior performance on F1 score and accuracy with a comparable parameter size and training speed. In addition, the SIRM outperforms all other models in terms of model robustness, and the ablation and addition tests show that the final SIRM is sufficiently efficient.

The remainder of this paper is structured as follows: related work is summarized in Section 2, and the details of the proposed SIRM are introduced in Section 3. Section 4 presents the experimental settings, followed by results and analyses in Section 5. Section 6 offers our concluding remarks.

2 Related Work

This work is related to deep neural networks and semantic representation for text understanding.

Recently, a large number of CNN and RNN variants with potential benefits have attracted researchers’ attention. Existing efforts mainly focus on the application of LSTMs Hochreiter and Schmidhuber (1997); Pichotta and Mooney (2016); Palangi et al. (2016), GRUs Cho et al. (2014); Chung et al. (2014), SRUs Lei and Artzi (2018), and CNNs Kalchbrenner et al. (2014); Auli et al. (2017); Johnson and Zhang (2017) based on word embeddings Mikolov et al. (2013b, a), drawing on the idea of either the language model Bengio et al. (2003); Mikolov et al. (2010) or spatial parameter sharing. All of these models have demonstrated impressive results in NLP applications. Many previous works have shown that the performance of deep neural networks can be improved by attention mechanisms Bahdanau et al. (2015). In addition, the self-attention mechanism with position embeddings characterizes the mutual relationship between one term and the others as a dependency to capture semantic encoding information Vaswani et al. (2017). Other works combine RNNs and CNNs for text classification Zhou et al. (2015); Wang et al. (2017) or use hierarchical structures for language modeling Lin et al. (2015); Yang et al. (2016). Besides hybrid neural networks, graph-based models Noekhah et al. (2020) are widely employed to capture textual semantics.

Recently, sarcasm detection, one form of implied semantic recognition, has been widely studied by linguistic researchers Kunneman et al. (2015); Camp (2012); Campbell and Katz (2012); Ghosh and Veale (2016); Ivanko and Pexman (2003). Fersini et al. (2016) show that it is important to consider several valuable expressive forms to capture the sentiment orientation of messages, and external sentiment analysis resources benefit sarcasm detection Zhang et al. (2019). Furthermore, Tay et al. (2018) realize a neural network that represents a sentence by comparing word-to-word embeddings, achieving state-of-the-art performance. More specifically, an intra-attention mechanism allows their model to search for conflicting sentiments while maintaining compositional information.

However, none of the approaches mentioned above makes specific use of the contextual representation as a direct input when interpreting implied meaning, and none of them accounts for possible noise, such as special figurative phrases, in the training data.

3 Skim and Intensive Reading Model

In this section, we propose a novel deep neural network inspired by the human reading comprehension procedure, namely the Skim and Intensive Reading Model (SIRM), to address the essential issues in understanding texts with implied meanings. The architecture of the model is depicted in Figure 2.

Figure 2: The architecture of the proposed SIRM, mainly including a skim reading component, an intensive reading component, and an adversarial loss. Each part of the SIRM is designed under Occam’s razor.

3.1 Overview

People often consume a difficult text word by word and sentence by sentence with respect to the global information extracted by reading quickly. Besides the input layer and the embedding layer, the SIRM consists of two main parts, the skim reading component (SRC) associated with an adversarial loss and the intensive reading component (IRC), which together simulate the procedure of human reading comprehension. For efficiency, the model is designed under Occam’s razor, meaning that we use the simplest and most minimal components to realize each part of the SIRM. More specifically, the SRC is a set of shallow CNNs that enables global feature extraction, while the IRC is a hierarchical framework that enhances the contextual information from the sentence level to the paragraph level. Finally, over the output layer, a common cross entropy and an adversarial loss together form the cost function of the end-to-end deep neural network.

3.2 Input

Each example of this task is represented as a pair (P, y), where the input P = {s_1, s_2, ..., s_m} is a paragraph with m sentences, s_i is the i-th sentence in paragraph P with n words, and y ∈ {0, 1} is the label representing the category of P. We can represent the task as estimating the conditional probability Pr(y | P) based on the training set, and identifying whether a testing example belongs to the target class by ŷ = argmax_y Pr(y | P).

3.3 Word Embedding

The goal of the word embedding layer is to represent the j-th word in sentence s_i with a d-dimensional dense vector w_ij. Given an input paragraph P, it is represented as {S_1, S_2, ..., S_m}, where each sentence representation S_i is a matrix consisting of the word embedding vectors of the i-th sentence.

3.4 Position Embedding

Position information can be potentially important for text understanding. In the SIRM, two types of position information are encoded: word position in a sentence and sentence position in a paragraph. Leveraging the position encoding method of Vaswani et al. (2017), word/sentence positions are captured via sine and cosine functions of different frequencies added to the input embeddings. Furthermore, the positional encodings have the same dimension as the corresponding embedding matrix, so that the results can be easily aggregated. The mathematical formulas are as follows:

PE(pos, 2k) = sin(pos / 10000^(2k/d)),
PE(pos, 2k+1) = cos(pos / 10000^(2k/d)),

where pos is the position and k indexes the dimension. Moreover, for any fixed offset o, PE(pos + o) can be represented as a sinusoidal function of PE(pos).

After that, we add the corresponding position embedding matrix E_pos to each sentence embedding matrix S_i:

S_i ← S_i + E_pos
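As a concrete sketch, the sinusoidal scheme can be implemented in a few lines of NumPy (the function name and the toy dimensions below are illustrative, not part of the released SIRM code):

```python
import numpy as np

def positional_encoding(length, d_model):
    """Sinusoidal position embeddings: sine on even dims, cosine on odd dims."""
    pos = np.arange(length)[:, None]          # (length, 1)
    k = np.arange(d_model)[None, :]           # (1, d_model)
    angles = pos / np.power(10000.0, (2 * (k // 2)) / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles[:, 0::2])     # even dimensions
    pe[:, 1::2] = np.cos(angles[:, 1::2])     # odd dimensions
    return pe

# Adding position information to one sentence embedding matrix S (n words, d dims)
n, d = 10, 64
S = np.random.randn(n, d)
S_pos = S + positional_encoding(n, d)
```

Because the encodings share the embedding dimension, the addition needs no projection, which keeps this layer parameter-free.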
3.5 Skim Reading Component (SRC)

Since each word and sentence with implied textual meaning can be highly dependent on the contextual information, the proposed model needs to characterize the dynamic entire information in a quick manner, like a human, as shown in Figure 3.

Figure 3: The SRC characterizes the entire information via convolutional neural networks with different kernel/window size.

A tailored CNN employs three key properties, i.e., sparse interaction, parameter sharing, and equivariant representation Wang et al. (2017), which can encode partial spatial information. Hence, in the SRC, we use CNN layers with different window sizes in order to extract n-gram-like features. Given a paragraph embedding X reshaped from {S_1, S_2, ..., S_m}, the global feature g is extracted as follows:

g = SRC(X)

More specifically, convolution filters are applied to a window of h words to produce a corresponding local feature. For example, a feature c_j is generated from a window of words x_{j:j+h-1}:

c_j = f(W_c ∗ x_{j:j+h-1} + b_c)

where ∗ denotes the convolution operation and the feature map from a filter is represented as c = [c_1, c_2, ..., c_{L-h+1}].

We then apply an average-over-time pooling operation over the feature map and obtain the pooled feature:

c̄ = mean(c_1, c_2, ..., c_{L-h+1})

In this part, we utilize filters with several window sizes to extract more accurate relevant information by taking consecutive words (i.e., n-grams) into account, and then concatenate all c̄ from these filters to obtain the global semantic feature g mentioned above, represented as g = [c̄_1; c̄_2; ...; c̄_K], where K is the total number of filters.
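A minimal NumPy sketch of this skim step follows. The filters here are randomly initialized and the tanh nonlinearity is an assumption for illustration; in the actual model the filters are learned end to end:

```python
import numpy as np

def conv1d_avg(X, W, b):
    """One filter: slide a window of h words over X, then average over time."""
    L, d = X.shape
    h = W.shape[0]
    feats = [np.tanh(np.sum(W * X[j:j + h]) + b) for j in range(L - h + 1)]
    return float(np.mean(feats))  # average-over-time pooling

def skim_reading(X, window_sizes=(1, 2, 3, 4), n_filters=16, seed=0):
    """Concatenate pooled n-gram features from CNNs with several window sizes."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    g = []
    for h in window_sizes:
        for _ in range(n_filters):
            W = rng.standard_normal((h, d)) * 0.1
            g.append(conv1d_avg(X, W, 0.0))
    return np.array(g)  # global skim feature g

X = np.random.default_rng(1).standard_normal((20, 64))  # a paragraph of 20 words
g = skim_reading(X)  # 4 window sizes x 16 filters -> 64-dim skim feature
```

The window sizes 1–4 and the 16 filters per size mirror the hyper-parameters reported in the experiments section.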

3.6 Intensive Reading Component (IRC)

Inspired by the human reading comprehension procedure, the IRC employs a hierarchical framework to characterize and explore the implied semantic information from sentence level to paragraph level. In other words, the sentence encoding outcomes will be used as the input of the paragraph-level part. The structure of IRC is shown in Figure 4.

Figure 4: The IRC encodes the current embedding, the near-neighbor information, and the skim information with a dense connection.

For the sentence-level part (IRC_s), given the i-th sentence embedding S_i from the embedding layer with position embedding and the global information g from the SRC, the sentence encoding information is extracted as a vector:

u_i = IRC_s(S_i, g)

and the paragraph embedding information is represented as a matrix U = [u_1, u_2, ..., u_m].

Before the paragraph-level model (IRC_p), a corresponding position embedding matrix E_pos is added to U:

U ← U + E_pos

Then, the paragraph is encoded as a vector:

v = IRC_p(U, g)

Note that IRC_s and IRC_p share the same structure, but their trainable parameter values are quite different. Detailed descriptions of the components follow.

3.6.1 Near-Neighbor Information Encoder

For people, in order to understand the implied meaning of the current word/sentence, besides the entire information of the whole paragraph, the near-neighbor information around the word/sentence, within a window of r words/sentences, also plays an important role in characterizing the contextual information of the target word/sentence.

Hence, we pad r words/sentences at both the head and the tail of the input sentence embedding S_i or paragraph embedding U, respectively. Taking the sentence-level part as an example, filters with window size 2r + 1 are applied to produce the near-neighbor information. Thus, the near-neighbor information of the j-th word in the i-th sentence is represented as a vector n_ij:

n_ij = f(W_n ∗ x_{j-r:j+r} + b_n)

Finally, the near-neighbor information of all words in the i-th sentence is encoded as a matrix N_i = [n_{i1}, n_{i2}, ..., n_{in}]. The near-neighbor information is an important part of the contextual information for the current word.

3.6.2 Dense Connection

To comprehensively understand the implied semantics of a given text, the main effort of this work is to take advantage of the contextual information as a guidance for, and dependency of, each word/sentence, as people always do. Hence, inspired by Huang et al. (2017), the most direct idea is to concatenate the entire information (the skim information), the near-neighbor information, and the pure word/sentence embedding, and then feed them into a Multilayer Perceptron (MLP), also called the dense connection layer, to realize an aggregate encoding. Taking the sentence-level part as an example, the aggregate encoding h_ij is obtained as below:

h_ij = MLP([w_ij; n_ij; g])

where [·; ·] is the concatenation operation.

Eventually, the i-th sentence from the sentence-level model is encoded as u_i via average pooling over the aggregate encodings:

u_i = mean(h_{i1}, h_{i2}, ..., h_{in})
For the paragraph-level IRC, the outputs of the near-neighbor information encoder and the aggregate encoder are defined analogously over the sentence vectors u_1, ..., u_m.
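The sentence-level dense connection can be sketched as follows. This is a NumPy toy in which a single tanh layer stands in for the MLP; all names and shapes are illustrative:

```python
import numpy as np

def dense_connection_encode(S, N, g, W, b):
    """Sentence-level aggregate encoding: for each word, concatenate its
    embedding, its near-neighbor vector, and the skim feature g, pass the
    result through a one-layer MLP, then average-pool over words."""
    n = S.shape[0]
    H = []
    for j in range(n):
        z = np.concatenate([S[j], N[j], g])   # dense connection input
        H.append(np.tanh(W @ z + b))          # aggregate encoding h_j
    return np.mean(H, axis=0)                 # sentence vector (average pooling)

rng = np.random.default_rng(0)
n, d = 12, 64
S = rng.standard_normal((n, d))          # word embeddings of one sentence
N = rng.standard_normal((n, d))          # near-neighbor vectors
g = rng.standard_normal(d)               # skim feature from the SRC
W = rng.standard_normal((d, 3 * d)) * 0.05
u = dense_connection_encode(S, N, g, W, np.zeros(d))
```

Because g is appended to every word, the global skim information conditions each local encoding directly, which is the core idea of the IRC.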

Note that, a gate mechanism Cho et al. (2014) could replace the dense connection and an attention mechanism Yang et al. (2016) could replace the last average pooling. The results of comparison are shown in Figure 6.

3.7 Output

Taking the skim feature g from the SRC and the paragraph encoding v from the IRC, a Multilayer Perceptron (MLP) is applied to generate the output ŷ:

ŷ = MLP([v; g])

Here, the output ŷ is the probability of the target category.

3.8 Model Training with Adversarial Learning

In the SIRM, the skim information is extracted by a set of shallow CNNs. Because this feature resembles n-grams rather than a deep semantic representation, it can be polluted by noisy information such as special phrases highly related to the training data, e.g., special figurative phrases.

Hence, the proposed model should be able to penalize features strongly associated with the training data, while general features should be boosted for the IRC optimization.

In this study, we implement this idea by utilizing an adversarial learning mechanism when training the model. For more theoretical details, refer to Goodfellow et al. (2014); Ganin et al. (2016); Ganin and Lempitsky (2015); Miyato et al. (2016). Specifically, we add an MLP over the SRC as follows:

ŷ_adv = MLP_adv(g)
Since the n-gram-based global feature tends to overfit during the training procedure, we expect it to perform somewhat poorly when connected directly to the output.

In short, the final loss minimizes the normal loss while maximizing the adversarial-learning-based loss, which is represented as:

L = L_normal − λ · L_adv

where both L_normal and L_adv are negative log likelihoods and λ is an adjustment factor far less than 1. In addition, L_adv is referred to as the adversarial loss (Adv) in this paper.
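The combined objective can be sketched numerically as follows. The λ value and the probabilities here are illustrative only; the paper states merely that λ is far less than 1:

```python
import numpy as np

def nll(p, y):
    """Negative log likelihood of label y under predicted probability p."""
    eps = 1e-12
    return -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def sirm_loss(p_main, p_adv, y, lam=0.01):
    """Final training loss: minimize the normal loss while maximizing the
    adversarial loss over the skim branch (lam is far less than 1)."""
    return nll(p_main, y) - lam * nll(p_adv, y)

# Toy values: main head confident and correct, skim-only head less so.
loss = sirm_loss(p_main=0.9, p_adv=0.6, y=1, lam=0.01)
```

Subtracting the skim-branch term rewards the SRC for features that do not, on their own, fit the training labels too closely, which is the intended penalty on memorized figurative phrases.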

The SIRM is an end-to-end deep neural network that can be trained using stochastic gradient descent (SGD) methods such as Adam Kingma and Ba (2015). More implementation details are given in the experiments section. In addition, all W and b mentioned above are weight matrices and bias vectors, respectively.

4 Experiments

In this section, we conduct extensive experiments to evaluate the proposed SIRM against baseline models and several variants of the SIRM. As a byproduct of this study, we release the code and the hyper-parameter settings to benefit other researchers.

4.1 Datasets

In order to validate the performance of the proposed SIRM and make it comparable with alternative baseline models, we conduct our experiments on three publicly available benchmark datasets about sarcasm detection and one real-world industrial spam detection dataset with metaphor. Details for all datasets are summarized in Table 1 and described as below:

Name Train Size Test Size Total Size Max Min Avg +/-
Tweets/ghosh 50,736 3,680 54,416 56 6 17 1/1
Reddit/movies 13,535 1,504 15,039 129 6 13 1/1
IAC/v1 1,483 371 1,854 1,045 6 57 1/1
Industry/spam 20,609 6,871 27,480 3,447 149 393 1/3
Table 1: Statistics for all datasets: Max/Min/Avg refer to the text length and +/- is the ratio of positive to negative samples.
  • Sarcasm Benchmark Datasets: Following previous works, we use Tweets/ghosh, collected by Ghosh and Veale (2016, 2017) from Twitter; Reddit/movies, collected by Khodak et al. (2018) from Reddit; and IAC/v1, collected from the Internet Argument Corpus (IAC) by Walker et al. (2012).

  • Industrial Dataset: We also evaluate the performance of the proposed SIRM on a Chinese online novel collection for spam detection with metaphor. The spam novels are first complained about/reported by readers, e.g., parents of children/teenagers, and then confirmed by auditors. Note that the authors of these novels may purposely avoid explicit and sensitive words, using figurative words instead, because of censorship.

4.2 Baselines

We employ the following baseline models (also see Table 3) for comparison, including word embedding Mikolov et al. (2013b) based shallow neural networks, deep learning based models, and recent state-of-the-art models:

NBOW Shen et al. (2018): is a simple model based on word embeddings with average pooling.

CNN Kim (2014): is a simple CNN model with average pooling using different kernels. There are 7 filter widths, from 1 to 7, with 100 filters of each width.

LSTM Hochreiter and Schmidhuber (1997): is a vanilla Long Short-Term Memory Network. We set the LSTM dimension to 100.

Atten-LSTM Yang et al. (2016): is a LSTM applying an attention mechanism. The dimension is set to 100.

GRNN Zhang et al. (2016): employs a gated pooling method and a standard pooling method to extract content features and contextual features, respectively, from a gated recurrent neural network. This model has demonstrated improvements over feature-engineering-based traditional models for sarcasm detection.

SIARN and MIARN Tay et al. (2018): capture incongruities between words with an intra-attention mechanism. SIARN employs a single-dimension intra-attention and MIARN a multi-dimension one. Both are state-of-the-art models for sarcasm detection. We use the authors’ default settings.

Self-Atten Vaswani et al. (2017): is the state-of-the-art model from Google that encodes deep semantic information using the self-attention mechanism. For feasibility of training, given the large parameter scale, we set all dimensions to 64, the same as ours; the other hyper-parameters follow the given settings.

4.3 Evaluation Metrics

We report the parameter size (Param) and running time (Time) to evaluate the efficiency of the proposed SIRM. More specifically, the unit of the parameter size is thousands, and the whole running time of the NBOW is taken as the unit of Time. For effectiveness, we select the Macro-Averaged F1 score (M F1) to show performance on the label-balanced datasets and employ the F1 score (F1) for the label-unbalanced dataset. In addition, we report accuracy (Acc) for all datasets.
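For reference, the macro-averaged F1 used for the balanced datasets is the unweighted mean of the per-class F1 scores, which can be computed as follows (a plain-Python sketch; libraries such as scikit-learn provide the same metric):

```python
def f1_binary(y_true, y_pred, positive=1):
    """F1 score for one class, treated as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: unweighted mean of per-class F1 scores."""
    classes = set(y_true)
    return sum(f1_binary(y_true, y_pred, c) for c in classes) / len(classes)

y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
score = macro_f1(y_true, y_pred)
```

Because macro averaging weights both classes equally, it is informative on the balanced sarcasm benchmarks, while the plain F1 is preferred for the 1:3 imbalanced spam dataset.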

4.4 Experiment Settings

For experimental fairness, we use the same data preprocessing as Tay et al. (2018). For the SIRM, the number of convolution filters in the SRC is 16 and the window sizes range from 1 to 4. The near-neighbor size is 1. The dimensions of all other layers are set to 64. The adjustment factor for the adversarial loss is far less than 1, and the batch size is 64. For the Chinese novel dataset, we use JIEBA for tokenization. Furthermore, statistical significance is assessed via the t-test.
5 Results and Analysis

In this section, we give detailed experimental results and analysis to show insights into our model.

5.1 Performance Comparison

The parameter size, the running time, and the performance of the SIRM compared with baseline models are shown in Table 2 and Table 3.

Model Param Time
NBOW 10.3 1
CNN 30.3 2
LSTM 60.6 18
Atten-LSTM 71.0 22
GRNN 131.0 33
SIARN 100.9 150
MIARN 102.3 180
Self-Atten 254.9 17
SIRM 63.7 2
Table 2: Experimental results of efficiency comparison.
Tweets/ghosh Reddit/movies IAC/v1 Industry/spam
Model M F1 Acc M F1 Acc M F1 Acc F1 Acc
NBOW 72.42 69.37 68.50 68.18 61.32 59.61 84.96 92.25
CNN 74.84 74.54 65.50 65.03 60.98 58.40 85.46 92.42
LSTM 75.08 75.16 67.71 66.74 44.73 53.84 82.16 90.86
Atten-LSTM 75.15 73.73 65.20 63.84 61.80 60.46 83.74 91.40
GRNN 79.43 79.24 64.59 63.19 52.45 54.78 86.30 93.04
SIARN 78.84 79.59 67.50 68.17 60.86 61.33 77.73 92.91
MIARN 72.71 72.31 63.44 62.12 55.74 58.95 86.14 92.25
Self-Atten 76.01 75.19 66.29 65.47 61.32 60.12 86.99 93.48
SIRM 82.54 82.38 70.01 69.94 63.01 62.13 88.18 93.94
Table 3: Experimental results of performance comparison.

NBOW achieves a decent performance on all datasets, especially Reddit/movies. More importantly, NBOW has the smallest parameter size and the lowest time cost. This means NBOW can be a good choice in the vast majority of cases, as also demonstrated by Shen et al. (2018); Conneau et al. (2018).

Unfortunately, the standard text representation models, such as CNN, LSTM, Atten-LSTM, and GRNN, do not outperform NBOW significantly, and they cannot even achieve stable performance across all datasets because of the lack of training data. For example, GRNN performs well on Tweets/ghosh and Industry/spam but worse on Reddit/movies and IAC/v1. The state-of-the-art models SIARN, MIARN, and Self-Atten do not perform as well in this work as one would intuitively expect: despite more parameters and a higher running time cost, these models can be even worse than NBOW. Moreover, RNN-based models take more time than the other models.

The proposed SIRM significantly outperforms all the baseline models in terms of accuracy and F1 score. It is clear that the proposed SIRM, with its SRC, IRC, and Adv, is more stable across all datasets, which have diverse data sizes and text lengths, thanks to an architecture specially designed to simulate the human reading comprehension process. Furthermore, other recent advanced models do not perform well because they ignore the contextual information (for the word/sentence) and suffer from the adverse impact of figurative expressions. For example, the SIRM manages to identify [Tweets/ghosh: sarcasm] ‘Do you know what I love? Apartment construction at 7 a.m. 3 mornings in a row!’, but SIARN and Self-Atten fail to do so. The reason may be that the words in the last two sentences look irrelevant, rather than explicitly contradictory, to the first sentence, which misleads the two models. The SIRM, however, can capture the real meaning by reading each word/sentence with the global knowledge.

It is worth mentioning that the parameter size and running time cost of the SIRM are comparable with those of all baselines. This is because we do not use any recurrent unit, which means the SIRM can be fully parallelized during training and testing.

Figure 5: Results of the comparison of parameter size sensitivity: the x-axis is the main dimension of the model.

5.2 Parameter Size Sensitivity

As shown in Figure 5, the parameter size sensitivity of the proposed SIRM against other baseline models is investigated on Tweets/ghosh. The proposed SIRM clearly outperforms all representative baseline models, especially SIARN, the tailored state-of-the-art model for sarcasm detection, in macro F1 score and accuracy across all alternative parameter sizes. In contrast, Self-Atten achieves a lower score than even NBOW at the lowest dimension setting, while NBOW shows the most stable performance across all alternative dimension settings. All this evidence demonstrates the robustness and superiority of the proposed SIRM.

Figure 6: The performance of ablation and addition for the SIRM: - denotes the ablation and + denotes the addition.

5.3 Ablation and Addition of SIRM

For efficiency purposes, we design each part of the SIRM with respect to Occam’s razor. Hence, we investigate the impact of the SIRM’s complexity, as shown in Figure 6. Removing any part, such as the IRC, SRC, or Adv, causes a decrease compared with the full SIRM. This is because each part of the SIRM plays a distinct, necessary, and important role in understanding implied semantic meaning across different datasets. Meanwhile, we find that replacing a simple component with a more sophisticated one is not advisable: gate and attention mechanisms do not increase performance. In particular, the more complex the model, the more space it takes.

6 Conclusion

In this study, we propose a novel model, the SIRM, for understanding and identifying implied textual meaning in an efficient manner. In the SIRM, the SRC is designed to capture the dynamic global information, while the IRC is employed to characterize the fine-grained semantics via a hierarchical framework that takes the contextual information into consideration through the dense connection. In addition, the adversarial loss is applied over the SRC to eliminate potential noise. We conduct extensive experiments on several sarcasm benchmarks and an industrial spam dataset with metaphor. The results indicate that the proposed model practically outperforms all alternatives in terms of performance, robustness, and efficiency.

7 Acknowledgments

This work is supported by the National Natural Science Foundation of China (71473183, 61876003) and the Fundamental Research Funds for the Central Universities (18lgpy62).



References
  • M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. In Proceedings of the 34th International Conference on Machine Learning (ICML-17), pp. 1243–1252. Cited by: §2.
  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. International Conference on Learning Representations, pp. 1–15. Cited by: §2.
  • Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin (2003) A neural probabilistic language model. Journal of machine learning research 3 (Feb), pp. 1137–1155. Cited by: §1, §2.
  • E. Camp (2012) Sarcasm, pretense, and the semantics/pragmatics distinction. Noûs 46 (4), pp. 587–634. Cited by: §2.
  • J. D. Campbell and A. N. Katz (2012) Are there necessary conditions for inducing a sense of sarcastic irony?. Discourse Processes 49 (6), pp. 459–480. Cited by: §2.
  • K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1724–1734. Cited by: §1, §2, §3.6.2.
  • J. Chung, C. Gulcehre, K. Cho, and Y. Bengio (2014) Empirical evaluation of gated recurrent neural networks on sequence modeling. In NIPS 2014 Workshop on Deep Learning, December 2014, pp. 1–9. Cited by: §2.
  • A. Conneau, G. Kruszewski, G. Lample, L. Barrault, and M. Baroni (2018) What you can cram into a single vector: probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2126–2136. Cited by: §5.1.
  • E. Fersini, E. Messina, and F. A. Pozzi (2016) Expressive signals in social media languages to improve polarity detection. Information Processing & Management 52 (1), pp. 20–35. Cited by: §2.
  • Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In Proceedings of the 32nd International Conference on Machine Learning, Vol. 37, pp. 1180–1189. Cited by: §3.8.
  • Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky (2016) Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §3.8.
  • A. Ghosh and T. Veale (2016) Fracking sarcasm using neural network. In Proceedings of the 7th workshop on computational approaches to subjectivity, sentiment and social media analysis, pp. 161–169. Cited by: §2, item 1.
  • A. Ghosh and T. Veale (2017) Magnets for sarcasm: making sarcasm detection timely, contextual and very personal. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 482–491. Cited by: item 1.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §3.8.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §1, §2, §4.2.
  • G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger (2017) Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708. Cited by: §3.6.2.
  • S. L. Ivanko and P. M. Pexman (2003) Context incongruity and irony processing. Discourse Processes 35 (3), pp. 241–279. Cited by: §2.
  • R. Johnson and T. Zhang (2017) Deep pyramid convolutional neural networks for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 562–570. Cited by: §2.
  • N. Kalchbrenner, E. Grefenstette, and P. Blunsom (2014) A convolutional neural network for modelling sentences. Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, ACL 2014, Volume 1: Long Papers, pp. 655–665. Cited by: §1, §2.
  • M. Khodak, N. Saunshi, and K. Vodrahalli (2018) A large self-annotated corpus for sarcasm. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC-2018), pp. 641–646. Cited by: item 1.
  • Y. Kim (2014) Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, pp. 1746–1751. Cited by: §4.2.
  • D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In International Conference on Learning Representations, pp. 1–15. Cited by: §3.8.
  • F. Kunneman, C. Liebrecht, M. Van Mulken, and A. Van den Bosch (2015) Signaling sarcasm: from hyperbole to hashtag. Information Processing & Management 51 (4), pp. 500–509. Cited by: §2.
  • Y. LeCun, Y. Bengio, and G. Hinton (2015) Deep learning. Nature 521 (7553), pp. 436–444. Cited by: §1.
  • T. Lei and Y. Artzi (2018) Simple recurrent units for highly parallelizable recurrence. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 4470–4481. Cited by: §1, §2.
  • Z. Li, D. He, F. Tian, W. Chen, T. Qin, L. Wang, and T. Liu (2018) Towards binary-valued gates for robust LSTM training. In Proceedings of the 35th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 80, pp. 2995–3004. Cited by: §1.
  • R. Lin, S. Liu, M. Yang, M. Li, M. Zhou, and S. Li (2015) Hierarchical recurrent neural network for document modeling. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 899–907. Cited by: §2.
  • T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013a) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. Cited by: §2.
  • T. Mikolov, M. Karafiát, L. Burget, J. Cernockỳ, and S. Khudanpur (2010) Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, Vol. 2, pp. 3. Cited by: §2.
  • T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean (2013b) Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pp. 3111–3119. Cited by: §1, §2, §4.2.
  • T. Miyato, A. M. Dai, and I. Goodfellow (2016) Adversarial training methods for semi-supervised text classification. In International Conference on Learning Representations, pp. 1–15. Cited by: §3.8.
  • S. Noekhah, N. binti Salim, and N. H. Zakaria (2020) Opinion spam detection: using multi-iterative graph-based model. Information Processing & Management 57 (1), pp. 102140. Cited by: §2.
  • H. Palangi, L. Deng, Y. Shen, J. Gao, X. He, J. Chen, X. Song, and R. Ward (2016) Deep sentence embedding using long short-term memory networks: analysis and application to information retrieval. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP) 24 (4), pp. 694–707. Cited by: §2.
  • K. Pichotta and R. J. Mooney (2016) Using sentence-level lstm language models for script inference. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, Volume 1: Long Papers, pp. 279–289. Cited by: §2.
  • D. Shen, G. Wang, W. Wang, M. R. Min, Q. Su, Y. Zhang, C. Li, R. Henao, and L. Carin (2018) Baseline needs more love: on simple word-embedding-based models and associated pooling mechanisms. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Volume 1: Long Papers, pp. 440–450. Cited by: §4.2, §5.1.
  • Y. Shen, S. Tan, A. Sordoni, and A. Courville (2019) Ordered neurons: integrating tree structures into recurrent neural networks. In International Conference on Learning Representations, pp. 1–14. Cited by: §1.
  • Y. Tay, A. T. Luu, S. C. Hui, and J. Su (2018) Reasoning with sarcasm by reading in-between. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1010–1020. Cited by: §1, §1, §2, §4.2, §4.4.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §1, §2, §3.4, §4.2.
  • M. A. Walker, J. E. F. Tree, P. Anand, R. Abbott, and J. King (2012) A corpus for research on deliberation and debate.. In LREC, pp. 812–817. Cited by: item 1.
  • B. Wang, K. Liu, and J. Zhao (2016) Inner attention based recurrent neural networks for answer selection. In The Annual Meeting of the Association for Computational Linguistics, pp. 1288–1297. Cited by: §1.
  • C. Wang, F. Jiang, and H. Yang (2017) A hybrid framework for text modeling with convolutional rnn. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2061–2069. Cited by: §2, §3.5.
  • Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy (2016) Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1480–1489. Cited by: §1, §2, §3.6.2, §4.2.
  • M. Zhang, Y. Zhang, and G. Fu (2016) Tweet sarcasm detection using deep neural network. In Proceedings of COLING 2016, The 26th International Conference on Computational Linguistics: Technical Papers, pp. 2449–2460. Cited by: §4.2.
  • S. Zhang, X. Zhang, J. Chan, and P. Rosso (2019) Irony detection via sentiment-based transfer learning. Information Processing & Management 56 (5), pp. 1633–1644. Cited by: §2.
  • C. Zhou, C. Sun, Z. Liu, and F. Lau (2015) A c-lstm neural network for text classification. arXiv preprint arXiv:1511.08630. Cited by: §2.