Automatic text causality mining is a critical but difficult task because causality is thought to play an essential role in human cognition when making decisions. Thus, automatic text causality mining has been studied extensively in a wide range of areas, such as industry, physics, and healthcare. A tool that automatically scours the plethora of textual content on the web and extracts meaningful causal relations could help us construct causal chains to unveil previously unknown relationships between events and accelerate the discovery of the intrinsic logic of the events.
Many research efforts have been made to mine causality from text corpora with complex sentence structures, such as books and newspapers. In Causal-TimeBank, the authors introduced the “CLINK” and “C-SIGNAL” tags to mark event causal relations and causal signals, respectively, based on specific templates (e.g., “A happened because of B”). Q. Do et al. collected 25 newswire articles from CNN in 2010 and released an event causality dataset that provides relatively dense causal annotations. Recently, Q. Do et al. improved the annotation method and implemented joint reasoning for causal and temporal relations. However, the volume of textual data in the wild, e.g., on the web, is much larger than that in books and newspapers. With the help of mobile technologies, people tend to express personal opinions and record memorable moments on the web, which has consequently become a rich source of causality. There is a huge demand for an approach that mines both explicit and implicit causality from web text. Despite the success of existing studies on extracting explicit causality, there are a few reasons why most existing works cannot be directly applied to causality mining on web text, where a large number of implicit causality cases exist. First, most public datasets for causality mining are collected from books and newspapers, where the language is more formal and less diverse than the textual data on the web. Second, existing works mainly focus on explicit causal relations that are expressed by intra-sentence or inter-sentence connectives, without considering ambiguous and implicit cases, which leaves the perception of causality incomplete. Implicit commonsense causality can be expressed by a simple sentence structure without any connectives: for example, “got wet” is the cause of “fever” in Example 1, which has no connectives to assist causality detection, while Example 2 contains explicit connectives (i.e., “since” and “result”) that benefit complex causality detection.
Example 1 I got wet during the day and came home with a fever at night.
Example 2 Since computers merely execute the instructions they are given, bugs are nearly always the result of programmer error or an oversight made in the program’s design.
Normally, causality mining is divided into two sequential tasks: causality detection and cause-effect pair extraction. When dealing with large-scale web text, detecting causality with classifiers that have relational reasoning capacity is a prerequisite for extracting cause-effect pairs, and the performance of causality mining largely depends on how well the detection is performed. In this paper, we mainly focus on the detection step. This procedure can overcome the weakness of manual templates, which hardly cover the linguistic variations of explicit causality expressions. It could also help build a causality dataset with varied expressions for extraction, which results in much lower model complexity. Most existing works on causality detection have two common limitations. First, using linguistic methods, such as part-of-speech (POS) tagging and syntactic dependency parsing, to obtain handcrafted features is labor-intensive and time-consuming. Zhao et al. divided causal connectives into different classes as a new category feature based on the similarity of the syntactic dependency structure within causality sentences. Their model also copes with the interaction between the category feature and other frequently used features such as contextual, syntactic, and position features. However, these extracted features hardly capture the wide variety of causality expressions, and algorithms that rely on NLP toolkits to extract features can propagate the errors made by those toolkits. Hidey and McKeown incorporated world knowledge, such as FrameNet, WordNet, and VerbNet, to measure the correlations between words and segments, but the method barely handles words that never appear in the learning phase. Second, the quality of co-occurrence extraction by pre-defined patterns is affected by ambiguous connectives, such as “subsequently” and “force.” As seen in Table I, “consequently” is observed in both causal and non-causal examples. Luo et al. leveraged causal pairs extracted by a set of pre-defined patterns to form CausalNet, where the weight of a causality pair is the frequency of causality co-occurrence. Unfortunately, due to the volume of their corpus, there was no further analysis of the sentences’ syntactic dependencies. To some extent, this restricts the performance of causal pair detection.
To address the above problems, we propose a Multi-level Causality Detection Network (MCDN) for causality detection based on the following observations: 1) methods based on end-to-end deep neural networks could reduce the labor cost of feature engineering and relieve the error propagation from existing toolkits; 2) causality is a complicated relation that calls for multi-level analysis: first parsing each word with its context, and then inferring causality via the segments on both sides of the connectives. Therefore, at the word level, we integrate word, position, and segment embeddings to encode the input sentence and feed the result into stacked Transformer blocks, which have been widely used in various NLP tasks [13, 14]. In our research, the Transformer can attend to cause and effect entities while capturing long-distance dependencies across connectives. With this end-to-end module, we combine local context and long-distance dependency to acquire a semantic representation at the word level, which relaxes the first limitation (i.e., feature engineering and accumulated errors). At the segment level, to infer the cause and effect near the AltLex, we split the sentence into three segments: “segment before AltLex”, “AltLex”, and “segment after AltLex”. To address the second limitation, we propose a novel causality inference module, namely the Self Causal Relation Network (SCRN). Due to the characteristics of the dataset, the input of SCRN is a single sentence, which differs from Relation Networks in other areas. The feature maps of the segments are constructed into four pair-wise groups, each of which is concatenated with a sentence representation. Our intuition is that if the sentence can be expressed as “B-AltLex-A”, we can reason over these segments in pairs to identify: 1) the semantic relation of “B-AltLex” and “AltLex-A”; 2) the cause-effect relation between “B-A” or “A-B”. Then the segment-level representation is inferred by two non-linear layers. Finally, we combine the word-level and segment-level representations to obtain the detection result.
In general, our model MCDN has a simple architecture but effective reasoning potential for causality detection. The contributions can be summarized as three-fold:
We introduce the task of mining causality from web text, which is divided into a detection step and an extraction step. Using detection instead of specific templates is a new direction and can provide a rich diversity of causality text with low-noise data for the subsequent extraction step and downstream applications.
We propose a neural model MCDN to tackle the problem at the word and segment levels without any feature engineering. MCDN contains a relational reasoning module named Self Causal Relation Network (SCRN) to infer the causal relations within sentences.
To evaluate the effectiveness of the proposed framework, we have conducted extensive experiments on a publicly available dataset. The experimental results show that our model achieves significant improvement over baseline methods, including many state-of-the-art text classification models, which illustrates that detecting causality is a complex task. Detection not only requires multi-level information but also needs more reasoning capability than text classification.
| Label | English Wikipedia | Simple Wikipedia |
|---|---|---|
| Causal | A moving observer thus sees the light coming from a slightly different direction and consequently sees the source at a position shifted from its original position. | A moving observer thus sees the light coming from a slightly different direction and consequently sees the source at a position shifted from its original position. |
| Non-causal | His studies were interrupted by World War I, and consequently taught at schools in Voronezh and Saratov. | However, he had to stop studying because of the World War I, and instead taught at schools in Voronezh and Saratov. |
II Related Work
II-A Causality Relation
Causality mining is a fundamental task with abundant downstream applications. Early works utilize Bayesian networks [15, 16], syntactic constraints, and dependency structures to extract cause-effect pairs. Nevertheless, they could hardly summarize moderate patterns and rules to avoid overfitting. Further studies incorporate world knowledge that supplements lexico-syntactic analysis. Generalizing nouns to their hypernyms in WordNet and each verb to its class in VerbNet [19, 20] eliminates the negative effect of lexical variations and discovers frequent patterns of cause-effect pairs. As is well known, implicit expressions of causality are more frequent. J.-H. Oh et al. exploited cue words and sequence labeling with CRFs and selected the most relevant causality expressions as complements to implicitly expressed causality. However, the method requires retrieval and ranking over enormous web texts. From the perspective of its natural properties, causality describes relations between regularly correlated events or phenomena. Constructing a cause-effect network or graph could help discover co-occurrence patterns and evolution rules of causation [4, 20]. Therefore, Zhao et al. conducted causality reasoning on a heterogeneous network to extract implicit relations across sentences and find new causal relations.
Our work is similar to previous works on detecting causalities [11, 19]. The difference is that we do not incorporate the knowledge bases they used. We propose a neural-based multi-level model to tackle the problem without any feature engineering. Oh et al. proposed a multi-column convolutional neural network with causality-attention (CA-MCNN) to enhance MCNNs with the causality-attention based question and answer passage, which does not coincide with our task. Compared with CA-MCNN, the multi-head self-attention within the Transformer block we use at the word level is more effective, and the SCRN at the segment level augments the reasoning ability of our model.
II-B Relation Networks
Relation Networks (RNs) were initially proposed as a simple plug-and-play module to solve Visual-QA problems that fundamentally hinge on relational reasoning. RNs can be effectively coupled with convolutional neural networks (CNNs), long short-term memory networks (LSTMs), and multi-layer perceptrons (MLPs) to reduce overall network complexity while gaining a general ability to reason about the relations between entities and their properties. The original RNs can only perform single-step inference rather than multi-step inference. For tasks that require complex multi-step relational reasoning, Palm et al. introduced the recurrent relational network, which operates on a graph representation of objects. Pavez et al. added complex reasoning ability to Memory Networks with RNs, which reduced the computational complexity from quadratic to linear. However, their tasks remain text QA and visual QA. In this paper, RNs are applied to relation extraction for the first time, in the form of the proposed SCRN.
III Preliminary Statement
III-A Linguistic Background
This section describes the linguistic background of causal relations and the AltLexes dataset that we used. It is commonly held that causality can be expressed explicitly and implicitly using various propositions. In the Penn Discourse Treebank (PDTB), over of explicit discourse connectives are marked as causal, such as “hence”, “as a result” and “consequently”, as are nearly of implicit discourse relationships. In addition to these, PDTB contains a type of implicit connective named AltLex (alternative lexicalization) that is capable of indicating causal relations; it is an open class of markers and is potentially infinite.
The definition of AltLex was extended with an open class of markers that occur within a sentence in . The following are examples widespread in the new AltLexes set but are not contained in PDTB explicit connectives. The word ”made” with many meanings here is used to express causality. Moreover, the expression of causality in the second example is somewhat obscure.
Ambiguous causal verbs, e.g. The flood made many houses to collapse.
Partial prepositional phrases, e.g. They have made l4 self-driving car with the idea of a new neural network.
According to our statistics on the parallel data, 1144 AltLexes indicate causality and 7647 AltLexes indicate non-causality. Their intersection contains 144 AltLexes, which is 12.6% of the causal set and 1.9% of the non-causal set.
In conclusion, ambiguous connectives and implicit expressions are frequently observed in the AltLexes dataset. Methods based on statistical learning with manual patterns struggle to build a reliable model in such contexts. However, with its abstraction and reasoning capacity, our model MCDN can be well adapted to these situations.
III-B Notations and Definitions
A given Wikipedia sentence is assumed to consist of a sequence of filtered tokens, each indexed by its position. Within the sentence, we distinguish the AltLex, the segment before the AltLex (B), and the segment after the AltLex (A). Our objective is to generate a sentence-level prediction of the causal label, as in Equation 1. The proposed model MCDN is shown in Figure 2. We will detail each component in the rest of this section.
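As a small illustration of the notation above, the following sketch splits a tokenized sentence into the segment before the AltLex, the AltLex itself, and the segment after it. The function name `split_by_altlex` and the assumption that the AltLex span is given as token indices are ours for illustration only; the source does not specify how the span is located.

```python
from typing import List, Tuple


def split_by_altlex(
    tokens: List[str], start: int, end: int
) -> Tuple[List[str], List[str], List[str]]:
    """Split tokens into (before, altlex, after).

    The AltLex is assumed to occupy tokens[start:end]; this index-based
    interface is a hypothetical convenience, not the paper's API.
    """
    if not 0 <= start < end <= len(tokens):
        raise ValueError("AltLex span must be a non-empty range inside the sentence")
    return tokens[:start], tokens[start:end], tokens[end:]


# Example sentence taken from the linguistic background section.
tokens = "The flood made many houses to collapse".split()
before, altlex, after = split_by_altlex(tokens, 2, 3)
print(before, altlex, after)
```

Either outer segment may be empty when the AltLex opens or closes the sentence, which a downstream encoder would need to handle (for instance by padding).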
It’s worth noting that Hidey and McKeown  utilized English Wikipedia and Simple Wikipedia sentence pair to create a parallel corpus feature but still took one sentence as input each time. Unlike this approach, MCDN only leverages the input sentence for causal inference.
In this section, we elaborate the MCDN, a multi-level neural network-based approach with Transformer blocks at the word level and SCRN at the segment level for causality detection, which is primarily targeted at ambiguous and implicit relations.
IV-A Input Representation
Our input representation incorporates multi-source information into one token sequence. Inspired by BERT, the representation of each token in the input sentence is constructed by summing the corresponding word, position, and segment embeddings. Unlike BERT, the segment embeddings here indicate whether a token belongs to the segment before the AltLex, the AltLex, or the segment after the AltLex. As shown in Fig. 1, we first adopt a word2vec toolkit (https://radimrehurek.com/gensim/) to pretrain word embeddings on the English Wikipedia dump. Next, we use positional embeddings to encode positional information because our model has no recurrent architecture at the word level. Similarly, we use segment embeddings to involve more linguistic details. By summing the three embeddings, we finally obtain a new representation for each token, which provides basic features for the higher-level modules.
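The summation of the three embeddings can be sketched in a few lines. This is a toy illustration with hand-written two-dimensional lookup tables; the names `encode` and `add_vectors`, the segment labels "B", "L", "A", and all numeric values are assumptions made for the example, not the trained embeddings of the model.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def add_vectors(*vectors: Sequence[float]) -> Vector:
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def encode(
    tokens: List[str],
    segments: List[str],
    word_emb: Dict[str, Vector],
    pos_emb: List[Vector],
    seg_emb: Dict[str, Vector],
) -> List[Vector]:
    """Token representation = word + position + segment embedding."""
    return [
        add_vectors(word_emb[token], pos_emb[index], seg_emb[segment])
        for index, (token, segment) in enumerate(zip(tokens, segments))
    ]


# Toy lookup tables (hypothetical values, dimension 2).
word_emb = {"rain": [1.0, 0.0], "caused": [0.0, 1.0], "floods": [1.0, 1.0]}
pos_emb = [[0.0, 0.0], [0.5, 0.5], [1.0, 1.0]]
seg_emb = {"B": [0.25, 0.0], "L": [0.0, 0.25], "A": [0.25, 0.25]}

encoded = encode(["rain", "caused", "floods"], ["B", "L", "A"],
                 word_emb, pos_emb, seg_emb)
print(encoded)
```

Summing (rather than concatenating) requires the three tables to share one dimensionality, which is why all toy vectors above have the same length.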
IV-B Word Level
There are two sub-layers in the Transformer block: self-attention and feed-forward networks. For stability and superior performance, we add a layer normalization after the residual connection for each of the sub-layers.
Self-Attention. In this paper, we employ scaled multi-head self-attention, which has several merits compared with RNNs and CNNs. First, the “receptive field” of each token can be extended to the whole sequence without long-distance dependency diffusion, and any significant token can be assigned a high weight. Second, the dot-product and the multiple heads can be optimized for parallelism separately, which is more efficient than an RNN. Finally, the multi-head model aggregates information from different representation sub-spaces. For scaled self-attention, given the matrix of query vectors $Q$, keys $K$, and values $V$, the output attention score is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

where $d_k$ is the dimension of the keys. We take the input vector matrix as the query, key, and value matrices and linearly project them $h$ times, once per head. Formally, the $i$-th head is formulated as:

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V})$$

where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the learned projection matrices. Finally, we concatenate the heads and map them to the output space with $W^{O}$:

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}$$
We apply feed-forward networks after the self-attention sub-layer. Each consists of two linear layers with a ReLU activation between them. Note that $x$ is the output of the previous layer:

$$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$$
where and . We set in our experiments.
The Transformer block is stacked multiple times, and the final output is regarded as the representation of the sentence at the word level. We aim to model each word with both its fine-grained local context and coarse-grained global long-distance dependency information. Thus, our word-level module can acquire not only lexico-syntactic knowledge that manual patterns hardly cover but also lexical semantics among the words.
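To make the word-level attention step concrete, here is a minimal single-head sketch of scaled dot-product self-attention in plain Python. It omits the learned projections, multiple heads, residual connections, and layer normalization described above, and the function names are illustrative assumptions rather than the authors' implementation.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


def scaled_dot_product_attention(
    queries: Matrix, keys: Matrix, values: Matrix
) -> Matrix:
    """Return softmax(Q K^T / sqrt(d_k)) V, computed row by row."""
    d_k = len(keys[0])
    output = []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k)
            for key in keys
        ]
        weights = softmax(scores)
        output.append([
            sum(weight * value[j] for weight, value in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return output


# Self-attention: the same toy matrix serves as queries, keys, and values.
x = [[1.0, 0.0], [0.0, 1.0]]
attended = scaled_dot_product_attention(x, x, x)
print(attended)
```

Each output row is a convex combination of the value rows, so every token can draw on any other token in the sentence regardless of distance.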
IV-C Segment Level
We propose a novel approach to infer causality within sentences at the segment level. The model is named the Self Causal Relation Network (SCRN) because, compared with previous studies of RNs, it focuses on the causal relation within a single sentence.
Dealing with segments. The core idea of Relation Networks is operating on objects. In our task, the sentence is split into three segments (the segment before the AltLex, the AltLex, and the segment after the AltLex) according to the position of the AltLex. The input representation of each segment is a matrix with one row per token in that segment. Because the segment lengths differ, we use a three-column CNN (TC-CNN) to parse the three segments into a set of objects. Note that the representations here employ only word embeddings and segment embeddings because the TC-CNN can capture the position information. TC-CNN convolves each segment through a 1D convolutional layer into feature maps whose width is k, the total number of kernels. The model exploits multi-scale kernels (with varying window sizes) to obtain multi-scale features. As seen in Fig. 2, the feature maps of each segment are reduced to a k-dimensional vector by the max pooling layer after convolution. Finally, this produces a set of three objects, one per segment, for SCRN.
Dealing with the sentence. The input representation of the sentence passes through a bidirectional GRU (bi-GRU), and the final state of the bi-GRU is concatenated to each object-pair.
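The way the TC-CNN turns variable-length segments into fixed-size objects can be sketched as a 1D convolution followed by max pooling over time. The code below is a simplified, dependency-free illustration: the kernels are fixed toy weights, bias terms and non-linearities are omitted, and the zero-padding of segments shorter than the window is our assumption, since the source does not say how short segments are handled.

```python
from typing import List

Matrix = List[List[float]]  # one row per token (or per kernel offset)


def conv_max_pool(segment: Matrix, kernel: Matrix) -> float:
    """Slide one kernel (window x dim weights) over a segment and max-pool."""
    window, dim = len(kernel), len(kernel[0])
    # Assumption: zero-pad segments that are shorter than the kernel window.
    padding = [[0.0] * dim for _ in range(max(0, window - len(segment)))]
    padded = segment + padding
    responses = []
    for start in range(len(padded) - window + 1):
        total = 0.0
        for offset in range(window):
            total += sum(
                w * x for w, x in zip(kernel[offset], padded[start + offset])
            )
        responses.append(total)
    return max(responses)


def segment_to_object(segment: Matrix, kernels: List[Matrix]) -> List[float]:
    """Map a segment of any length to a vector with one entry per kernel."""
    return [conv_max_pool(segment, kernel) for kernel in kernels]


# Two toy kernels with window sizes 2 and 3 over 1-dimensional token vectors.
kernels = [[[1.0], [1.0]], [[1.0], [1.0], [1.0]]]
long_object = segment_to_object([[1.0], [2.0], [3.0]], kernels)
short_object = segment_to_object([[4.0]], kernels)
print(long_object, short_object)
```

Because pooling collapses the time axis, the three segments yield objects of identical size even when the AltLex is a single token and the surrounding segments are long.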
SCRN. We construct four object-pairs, each concatenated with the sentence representation. For causality candidates, the B-AltLex and AltLex-A pairs indicate the relation between cause-effect and the AltLex, while the B-A and A-B pairs infer the direction of causality. The object-pairs matrix is shown as follows:
Here ”;” is a concatenation operation for the object vectors. Consequently, we modify the SCRN architecture in a mathematical formulation and obtain the final output at the segment level:
In general, the model transforms the segments into object-pairs with the TC-CNN and passes the sentence through the bi-GRU to obtain the global representation. Then we integrate the object-pairs with the global representation and make a pair-wise inference to detect the relationship among the segments. Ablation studies show that the proposed SCRN at the segment level has the capacity for relational reasoning and improves the result significantly.
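The pairing step can be sketched as follows. The four pairs (B, AltLex), (AltLex, A), (B, A), and (A, B) follow the description above; aggregating the pair-wise outputs by summation before the final function is borrowed from the original Relation Network formulation and is an assumption here, as are the function names and the caller-supplied `g` and `f`.

```python
from typing import Callable, List

Vector = List[float]


def build_object_pairs(
    obj_before: Vector, obj_altlex: Vector, obj_after: Vector, sentence: Vector
) -> List[Vector]:
    """Concatenate each of the four object pairs with the sentence vector."""
    pairs = [
        (obj_before, obj_altlex),  # B-AltLex
        (obj_altlex, obj_after),   # AltLex-A
        (obj_before, obj_after),   # B-A
        (obj_after, obj_before),   # A-B
    ]
    return [first + second + sentence for first, second in pairs]


def relation_network(
    rows: List[Vector],
    g: Callable[[Vector], Vector],
    f: Callable[[Vector], Vector],
) -> Vector:
    """Apply g to every pair, sum the results element-wise, then apply f."""
    transformed = [g(row) for row in rows]
    summed = [sum(column) for column in zip(*transformed)]
    return f(summed)


def identity(vector: Vector) -> Vector:
    return vector


rows = build_object_pairs([1.0], [2.0], [3.0], [0.5])
segment_repr = relation_network(rows, identity, identity)
print(rows, segment_repr)
```

In a trained model `g` and `f` would be learned non-linear layers; identity functions are used here only so the data flow is easy to follow.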
IV-D Causality Detection
Our model MCDN identifies the causality of each sentence based on the outputs at the word level and at the segment level. The two outputs are concatenated into a unified representation. In this task, we use a 2-layer FFN with ReLU activations, followed by a softmax function, to make the prediction:
In the AltLexes dataset, the number of non-causal examples is over seven times the number of causal examples, which leads to an extreme sample imbalance problem. If we adopted cross-entropy (CE) as the model loss function, the performance would be unsatisfactory. Moreover, the difficulty of detecting each sample differs. For example, a sentence that contains an ambiguous AltLex such as “make” is harder to infer than one that contains “cause”. Consequently, we need to assign a soft weight to the causal and non-causal losses to make the model pay more attention to the examples that are difficult to identify. Motivated by prior work, we introduce the focal loss to improve the normal cross-entropy loss function. The focal loss is formulated as the objective function with the balance weight hyperparameter $\alpha$ and the tunable focusing hyperparameter $\gamma$.
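For readers unfamiliar with the focal loss, a per-example sketch for the binary causal/non-causal case is given below. It follows the commonly used form of the loss, a class-balanced weight multiplied by a modulating factor and the log-likelihood. The default values 0.25 and 2.0 are the defaults popularized by the original focal loss work and are assumptions here; they are not the settings of this paper.

```python
import math


def focal_loss(
    p_causal: float, label: int, alpha: float = 0.25, gamma: float = 2.0
) -> float:
    """Binary focal loss for one example.

    p_causal: predicted probability of the causal class.
    label: 1 for causal, 0 for non-causal.
    alpha, gamma: balance weight and focusing parameter (assumed defaults).
    """
    p_t = p_causal if label == 1 else 1.0 - p_causal
    alpha_t = alpha if label == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.9, 1)  # confident, correct prediction
hard = focal_loss(0.6, 1)  # uncertain prediction on the same label
print(easy, hard)
```

With the focusing parameter set to zero the modulating factor disappears and the expression reduces to class-weighted cross-entropy, which makes the relationship between the two losses easy to check.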
For optimization, we use the Adam optimizer  with , , and clip the gradients norm.
In this section, we are interested in investigating the performance of MCDN that integrates Transformer blocks with SCRN and whether it is essential to incorporate inference ability in the sentence-level causality detection task.
V-A Experiment Settings
Dataset. We use the AltLexes dataset to evaluate the proposed approach. The detailed statistical information about the dataset is listed in Table II. The Bootstrapped set is generated using new AltLexes to identify additional ones based on the Training set, which increased causal examples by about 65 percent. In our experiment, we train MCDN on the Training set and Bootstrapped set separately and finetune hyperparameters on the validation set. The golden annotated set is used as the test set.
Hyperparameters. We set the initial learning rate to , then halve it when the F1-score has stopped increasing for more than two epochs. The batch size in this experiment is 32, and the number of epochs is 20. To avoid overfitting, we employ two types of regularization during training: 1) dropout for the sums of the embeddings, the outputs of each bi-GRU layer except the last, each layer in the FFN, and residual dropout for the Transformer blocks; 2) regularization for all trainable parameters. The dropout rate is set to 0.5 and the regularization coefficient is . In the self-attention module, we set the stack time of Transformer blocks and the number of attention heads . In SCRN, the window sizes of the TC-CNN kernels are 2, 3, and 4, while the sum of kernels . We use a 2-layer bi-GRU with 64 units in each direction. As for the focal loss, we set .
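The learning-rate rule described above (halve the rate once the F1-score has failed to improve for more than two epochs) can be sketched as a small helper. The class name `HalveOnPlateau`, the reset of the counter after each halving, and the initial rate used in the example are assumptions for illustration; the source does not give these details.

```python
class HalveOnPlateau:
    """Halve the learning rate when validation F1 stalls for too long."""

    def __init__(self, initial_lr: float, patience: int = 2) -> None:
        self.lr = initial_lr
        self.patience = patience          # epochs tolerated without improvement
        self.best_f1 = float("-inf")
        self.stale_epochs = 0

    def step(self, f1: float) -> float:
        """Record this epoch's F1 and return the learning rate to use next."""
        if f1 > self.best_f1:
            self.best_f1 = f1
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs > self.patience:
                self.lr *= 0.5
                self.stale_epochs = 0     # assumption: restart counting
        return self.lr


scheduler = HalveOnPlateau(initial_lr=1.0)  # hypothetical starting value
history = [scheduler.step(f1) for f1 in [0.5, 0.6, 0.6, 0.6, 0.6]]
print(history)
```

In the example the F1-score stops improving after the second epoch, so the rate is halved on the third consecutive stale epoch.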
Different evaluation metrics, including accuracy, precision, recall, and F1-score, are adopted to compare MCDN with the baseline methods. To understand our model comprehensively, we employ both the Area Under the Receiver Operating Characteristic curve (AUROC) and the Area Under the Precision-Recall Curve (AUPRC) to evaluate its sensitivity and specificity, especially in the situation where causality is relatively sparse in web text.
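For reference, the threshold-based metrics named above can be computed from gold and predicted labels as in the following sketch, where the causal class (label 1) is treated as the positive class. The function name and the toy labels are illustrative assumptions; AUROC and AUPRC are not covered because they require ranked scores rather than hard predictions.

```python
from typing import Dict, List


def classification_metrics(gold: List[int], predicted: List[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 with label 1 as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)
    tn = sum(1 for g, p in pairs if g == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


metrics = classification_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(metrics)
```

On imbalanced data such as this task, accuracy alone can look strong for a classifier that rarely predicts the causal class, which is why precision, recall, and F1 are reported alongside it.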
V-B Baseline Methods
In this section, we elaborate on 10 baseline methods.
The first five methods are the most common class (MCC), , , , and . , , and represent KL-divergence score, lexical semantic feature, and categorical feature respectively. These methods are used as baselines in Hidey et al.’s work. and acquire the best accuracy and precision on the Training set. and are the best systems with the highest recall and F1-score respectively. The next five are the most commonly used methods in text classification: TextCNN, TextRNN, SASE, DPCNN, and BERT. In our experiment, we reproduced all of them except BERT. For BERT, we use the publicly released pre-trained language model (base) (https://github.com/huggingface/pytorch-pre-trained-BERT) and fine-tune it on each dataset. The detailed information about these baselines is listed as follows:
TextCNN used here has a convolution layer, the window sizes of which are 2, 3, 4 and each have 50 kernels. Then we apply max-overtime-pooling and 2-layer FFN with ReLU activation. The dropout rate is 0.5 and regularization coefficient is .
TextRNN uses the same bidirectional GRU as the sentence encoder in SCRN, applies max pooling across all GRU hidden states to get the sentence embedding vector, and then uses a 2-layer FFN to output the result. The dropout rate and regularization coefficient are the same as for TextCNN.
SASE  uses a 2-D matrix to represent the sentence embedding with a self-attention mechanism and a particular regularization term for the model. It’s an effective sentence level embedding method.
DPCNN  is a low-complexity word-level deep CNN model for sentiment classification and topic categorization. It can make down-sampling without increasing the number of features maps which enables the efficient representation of long-range associations.
BERT presented state-of-the-art results in a wide variety of NLP tasks. It is a pre-trained deep language representation model based on the Transformer and the Masked Language Model. BERT is inspired by transfer learning in the computer vision field: a neural network model is pre-trained on a known task, for instance, ImageNet, and then fine-tuned on a new purpose-specific task.
It’s worth noting that due to data imbalance and for comparison in the same situation, we also used focal loss in the above methods to acquire the best performance.
Table III shows the detection results on the two datasets for our model and the competing methods. First, we can see that MCDN remarkably outperforms all other models when trained on either dataset.
Although MCDN does not obtain the highest precision, it increases the F1-score by 10.2% and 3% compared with the two existing best systems, respectively. Furthermore, the feature-based SVM yields the highest precision on the Training set, though with poor recall and F1-score, because it focuses on the substitutability of connectives, while the parallel examples usually share the same connective and would therefore be estimated as false negatives. It is remarkable that MCDN is more robust across the original Training set and the Bootstrapped set, while the feature-based linear SVM and the neural network-based approaches show a considerable difference between the two, gaining even more than 20 points of F1-score.
Second, deep methods tend to acquire balanced precision and recall scores, except for BERT and MCDN, whose recall is significantly higher than their precision on the Bootstrapped set. Besides, the F1-score of both BERT and MCDN is beyond 80 on the Bootstrapped dataset. All the results above suggest that neural networks are more powerful than the traditional co-occurrence and world knowledge-based methods on this task, as we expected. MCDN has learned various semantic representations of causal relations at the word level and is able to infer causality at the segment level, supported by the concise and effective SCRN. Furthermore, the deep classification methods we employed do not perform as well as MCDN, which demonstrates that causality detection is a much more complex task that requires considerable relational reasoning capacity compared with text classification, although both can be generalized to classification problems.
| Set | 1st | 2nd | 3rd | 4th | 5th |
|---|---|---|---|---|---|
| Test | and, 110 | cause, 58 | lead to, 35 | due to, 35 | as, 23 |
| FP | to develop, 12 | break, 7 | change, 6 | response to, 5 | subsequently, 5 |
| FN | and, 9 | as, 7 | for, 3 | after, 3 | become, 2 |
FP: false positive samples in the test result; FN: false negative samples in the test result.
V-D Ablation Study
To demonstrate the synergy between different components and their contribution to MCDN architecture, we train the different components of MCDN separately and conduct ablation comparison. The results are shown in Table IV. From the table, we can find that the full MCDN can obtain the best result most of the time. SCRN provides most of the performance of the identifier, especially on the relatively small dataset which illustrates the significance of relational reasoning capacity for our task. The Transformer blocks, though not strong individually, supply complementary representation at the word level and enhance the overall performance of MCDN.
V-E Case Study
First, according to the test results, our model correctly identifies some causal relations whose AltLexes hardly appear in the Training set and Bootstrapped set, such as the following Samples (1) and (2):
(1) The transfer was poorly received by some fans owing to a number of technical and format changes that were viewed as detrimental to the show’s presentation. (Causal)
(2) Most of the autosomal dominant familial AD can be attributed to mutations in one of three genes: those encoding amyloid precursor protein (APP) and presenilins 1 and 2. (Causal)
(3) An English cloth business was developed in the fifteenth century, allowing the English also to become wealthier. (Causal)
(4) The UK government placed Ireland under martial law in response to the Easter Rising; although, once the immediate threat of revolution had dissipated, the authorities did try to make concessions to nationalist feeling. (Non-Causal)
Second, the “allowing” in Sample (3) indicates that “developed cloth business” is the cause of “the English became wealthier”. However, other meanings of “allow” are more common in context, so neither the best traditional system nor the deep BERT model could clearly identify the sample as causal. The same holds for Sample (4), where “to make” does not convey causality as it usually does. MCDN gave the correct prediction for both samples, while the compared methods did not. With these two samples, we illustrate that our model offers inference capacity superior to the others. When faced with ambiguous and implicit sentences, the lexical and syntactic information from the word level is not enough to detect causality. However, MCDN can comprehend specific semantic information in the context and reason about the relations among different segments, benefiting from its dedicated neural network at the segment level. Hence, our model achieves the best result in the comparative experiment.
Finally, we investigated the misclassified samples. As seen in Table V, we lemmatize the AltLexes in the test set to a standard form and find that the most frequent AltLexes in the false positive and false negative samples are “to develop” and “and”, respectively. “And” is also the most frequent AltLex in the test set. The prediction accuracy for “lead to” together with its variants is 91%, the same as for “due to”. Most false positive and false negative samples have verbs or conjunctions as the AltLex, which often do not express causality explicitly. In conclusion, the key to improving the performance of MCDN is detecting ambiguous and implicit causal relations more effectively and widely.
V-F Robustness of the Model
We evaluated the robustness of MCDN from two aspects. First, we replace the word representations of MCDN with word embeddings from a different source. Then we test the trained MCDN and BERT directly on a constructed corpus to demonstrate the scalability of our model.
Stability on Different Word Embeddings. As is well-known, BERT uses a word-piece algorithm to split words into sub-words, which is different from the pre-trained word embeddings used by MCDN and other models. Here we want to evaluate the impact of different word embeddings on the performance of MCDN. As Table VI shows, the metrics bias is acceptable, which proves the performance of our model is stable when using different word embeddings.
Scalability on Other Dataset. We constructed a corpus composed of 1340 sentences, half of which contain a causal relation. It is extracted from the SemEval-2010-Task8 dataset by filtering the sentences containing an AltLex from the “Cause-Effect” relation and sampling randomly from the other relations. To investigate the scalability of our model, we train MCDN on the AltLex Training set and Bootstrapped set and then test it on the corpus. Table VII shows that, although the F1-score dropped by 5.5% and 6.6%, respectively, compared with the earlier results, MCDN obtained much better results than the fine-tuned BERT, which shows the scalability of our model. We conjectured that the poor performance of BERT is due to the focal loss function we used and the difference in data distribution between the AltLex dataset and the constructed corpus. Therefore, we divided the corpus into a train set and a test set to fine-tune BERT with the normal cross-entropy loss function. The result is in the last block of Table VII, labeled BERT* and MCDN*. From our point of view, it is to be expected that the overall performance of MCDN is slightly worse than that of the fine-tuned BERT, because MCDN has far fewer parameters and needs much less pre-training time. We could combine them in future work.
In this paper, we propose a multi-level causality detection network (MCDN) for detecting causality in web text, especially implicit and ambiguous causality. We define the causality mining task as a two-step procedure consisting of detection and extraction. The most challenging problem in this work is how to capture ambiguous and implicit relations, which makes the task very different from causality extraction in books and newspapers. MCDN leverages a self-attention mechanism at the word level and a modified Relation Network at the segment level to construct an integrated sentence representation for inference. Our method improves the main metrics significantly over the state-of-the-art models, which use parallel corpora and semantic features. From the comparison with several text classification methods, including pre-trained BERT, we found that if a model is expected to thoroughly understand complicated semantic information such as discourse relations, transitivity rules, and the development process of events, it is useful to combine inference capacity with current methods.
However, extracting causality that is expressed implicitly or across sentences is still a big challenge for researchers, so methods that extract causality automatically and effectively are crucial. Since MCDN has shown promising capacity for causality detection, using an improved MCDN to conduct cause-effect extraction is a promising direction for future work.
-  K. Chan and W. Lam, “Extracting causation knowledge from natural language texts,” International Journal of Intelligent Systems, vol. 20, no. 3, pp. 327–358, 2005.
-  H. Qiu, Y. Liu, N. A. Subrahmanya, and W. Li, “Granger causality for time-series anomaly detection,” in 2012 IEEE 12th International Conference on Data Mining. IEEE, 2012, pp. 1074–1079.
-  K. Budhathoki and J. Vreeken, “Accurate causal inference on discrete data,” in 2018 IEEE International Conference on Data Mining. IEEE, 2018, pp. 881–886.
-  S. Zhao, “Mining medical causality for diagnosis assistance,” in Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 2017, pp. 841–841.
-  N. Asghar, “Automatic extraction of causal relations from natural language texts: a comprehensive survey,” arXiv preprint arXiv:1605.07895, 2016.
-  G. Li, H. Dai, and Y. Tu, “Linear causal model discovery using the mml criterion,” in 2002 IEEE International Conference on Data Mining, 2002. Proceedings. IEEE, 2002, pp. 274–281.
-  P. Mirza and S. Tonelli, “An analysis of causality between events and its relation to temporal information,” in Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 2097–2106.
-  Q. X. Do, Y. S. Chan, and D. Roth, “Minimally supervised event causality identification,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011, pp. 294–303.
-  Q. Ning, Z. Feng, H. Wu, and D. Roth, “Joint reasoning for temporal and causal relations,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2278–2288.
-  S. Zhao, T. Liu, S. Zhao, Y. Chen, and J.-Y. Nie, “Event causality extraction based on connectives analysis,” Neurocomputing, vol. 173, pp. 1943–1950, 2016.
-  C. Hidey and K. McKeown, “Identifying causal relations using parallel wikipedia articles,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2016, pp. 1424–1433.
-  Z. Luo, Y. Sha, K. Q. Zhu, S.-w. Hwang, and Z. Wang, “Commonsense causal reasoning between short texts,” in Fifteenth International Conference on the Principles of Knowledge Representation and Reasoning, 2016.
-  A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
-  J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
-  Z. Wang and L. Chan, “Using bayesian network learning algorithm to discover causal relations in multivariate time series,” in 2011 IEEE 11th International Conference on Data Mining. IEEE, 2011, pp. 814–823.
-  K. Yu, X. Wu, W. Ding, H. Wang, and H. Yao, “Causal associative classification,” in 2011 IEEE 11th International Conference on Data Mining. IEEE, 2011, pp. 914–923.
-  K. Radinsky, S. Davidovich, and S. Markovitch, “Learning causality for news events prediction,” in Proceedings of the 21st international conference on World Wide Web. ACM, 2012, pp. 909–918.
-  C. Hashimoto, K. Torisawa, J. Kloetzer, M. Sano, I. Varga, J.-H. Oh, and Y. Kidawara, “Toward future scenario generation: Extracting event causality exploiting semantic relation, context, and association features,” in Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2014, pp. 987–997.
-  T. N. De Silva, X. Zhibo, Z. Rui, and M. Kezhi, “Causal relation identification using convolutional neural networks and knowledge based features,” World Academy of Science, Engineering and Technology, International Journal of Mechanical and Mechatronics Engineering, vol. 4, no. 6, pp. 697–702, 2017.
-  S. Zhao, Q. Wang, S. Massung, B. Qin, T. Liu, B. Wang, and C. Zhai, “Constructing and embedding abstract event causality networks from text snippets,” in Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 2017, pp. 335–344.
-  J.-H. Oh, K. Torisawa, C. Kruengkrai, R. Iida, and J. Kloetzer, “Multi-column convolutional neural networks with causality-attention for why-question answering,” in Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. ACM, 2017, pp. 415–424.
-  S. Zhao, M. Jiang, M. Liu, B. Qin, and T. Liu, “Causaltriad: Toward pseudo causal relation discovery and hypotheses generation from medical text data,” in Proceedings of the 2018 ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics. ACM, 2018, pp. 184–193.
-  A. Santoro, D. Raposo, D. G. Barrett, M. Malinowski, R. Pascanu, P. Battaglia, and T. Lillicrap, “A simple neural network module for relational reasoning,” in Advances in neural information processing systems, 2017, pp. 4967–4976.
-  R. Palm, U. Paquet, and O. Winther, “Recurrent relational networks,” in Advances in Neural Information Processing Systems, 2018, pp. 3368–3378.
-  J. Pavez, H. Allende, and H. Allende-Cid, “Working memory networks: Augmenting memory networks with a relational reasoning module,” arXiv preprint arXiv:1805.09354, 2018.
-  R. Prasad, N. Dinesh, A. Lee, E. Miltsakaki, L. Robaldo, A. K. Joshi, and B. L. Webber, “The penn discourse treebank 2.0.” in LREC. Citeseer, 2008.
-  Y. Shi, J. Meng, J. Wang, H. Lin, and Y. Li, “A normalized encoder-decoder model for abstractive summarization using focal loss,” in CCF International Conference on Natural Language Processing and Chinese Computing. Springer, 2018, pp. 383–392.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
-  Y. Kim, “Convolutional neural networks for sentence classification,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1746–1751.
-  Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio, “A structured self-attentive sentence embedding,” arXiv preprint arXiv:1703.03130, 2017.
-  R. Johnson and T. Zhang, “Deep pyramid convolutional neural networks for text categorization,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 562–570.