Risk is a core element of the modern financial system [poon2003forecasting]. Company risk prediction is one of the crucial tasks in finance, management, and operations research [kogan2009predicting]. Most early studies on this problem are based on numerical data such as price and volume [kogan2009predicting]. With the advancement of deep learning, researchers have attempted to incorporate unstructured data, such as company annual reports [bao2014simultaneously], into their models. Among these unstructured data, earnings conference calls have drawn many researchers’ attention due to their unique form.
Earnings conference calls are a way for companies to share financial information with interested investors or institutions on a quarterly basis. Participants include executives from the company as well as buy-side analysts. An earnings conference call consists of two parts: the Presentation and the Q&A. In the presentation part, company executives present the previous quarter’s financial performance as well as an outlook for the future. During the Q&A part, analysts ask executives questions from their perspectives. Compared with other financial data, the communication between executives and analysts is more flexible, often reflecting the true picture of the company’s situation, which aids in the prediction of company risk. For example, AMD disclosed a higher-than-expected loss during its earnings conference call in Q1 2017, causing its shares to plunge 16.1% after the bell.
Although earnings conference calls have significant value for company risk prediction, most existing studies design models based on modality fusion and long text modeling without taking the dialogue structure of earnings conference calls into account [qin_what_2019, theil_profet_2019]. Furthermore, company risk is affected not only by the information disclosed during the earnings conference call but also by the person who discloses it. For example, despite knowing more about the company’s internal information, executives may avoid disclosing negative information. Analysts, on the other hand, ask questions from their own or their institution’s perspective, reflecting the market’s concerns about the company. Therefore, incorporating participant role information is beneficial for risk prediction.
Moreover, the relationships between companies are useful for predicting company risk. The flow of risk through the financial system caused the global financial crisis that lasted from 2007 to 2009. Although Lehman Brothers and AIG were not the largest companies in the market, their bankruptcy shocked the entire system due to their close relationships with other institutions. Thus, some researchers have attempted to integrate company networks into company risk prediction. However, they connect companies that held earnings conference calls on different dates in an undirected graph, ignoring the requirement that prediction tasks must not leak temporal information [sawhney2020voltage].
To address the aforementioned issues, we propose a new model, Temporal Virtual Graph Neural Network (TVGNN), which jointly models earnings conference calls and company networks to predict company risk. The contributions of our research are listed below:
For the first time, we incorporate participant role information into the task of company risk prediction.
We propose a new method to construct company networks, which guarantees no temporal information leakage.
We design two new modules to model company networks and market state for company risk prediction.
In addition, we create a new dataset to validate TVGNN, and the experimental results demonstrate the effectiveness and interpretability of our model.
2 Related Work
Our study involves two fields: graph neural networks and company risk prediction.
2.1 Graph Neural Network
Graphs are everywhere in the real world. For example, the supply chain is a natural graph. However, it is difficult for traditional neural networks to deal with graphs. To solve this problem, graph neural networks were proposed.
The concept of graph neural networks was first proposed by Gori et al. [gori2005new] and then developed by Scarselli et al. [scarselli2008graph]. These early studies were based on Banach’s fixed point theorem, updating the node embeddings iteratively until they stabilize, which was inefficient.
Inspired by convolutional neural networks, Bruna et al. proposed graph convolutional networks (GCN) [bruna2013spectral]. Kipf and Welling improved GCN’s computational efficiency by modeling convolutional kernels with Chebyshev polynomials and renormalization techniques [kipf2016semi]. Recently, a series of graph neural networks have been proposed, such as Graph Isomorphism Network [xu2018powerful], Graph Attention Network [velivckovic2017graph], and Graphormer [ying2021transformers].
Traditional graph neural networks are focused on static graphs with fixed nodes and edges, whereas many practical applications involve dynamic graphs with changing nodes or edges. A simple way to deal with dynamic graphs is to convert them into static graphs. Liben-Nowell and Kleinberg converted dynamic graphs to static graphs by adding the adjacency matrices of graphs at different time steps [liben2007link]. Hisano modeled the temporal information using the formation and dissolution matrices of previous time steps [hisano2018semi]. Aside from that, it is more natural to separately model the graphs at different time steps and then aggregate the results. Yao et al. used graph neural networks to model the graph snapshots at each time step, then weighted these snapshots based on their time difference from the current time step [yao2016link].
2.2 Company Risk Prediction
Company risk prediction is an essential task in the capital market. With the advancement of deep learning, unstructured data becomes increasingly important in this task. Kogan et al. predicted the volatility of listed companies’ stock returns using financial reports and digital indicators [kogan2009predicting]. Bao and Datta developed a variant of the latent Dirichlet allocation topic model for simultaneously discovering and quantifying risk types from textual risk disclosures, and they investigated how risk disclosures in 10-K forms affect investors’ risk perceptions [bao2014simultaneously].
Compared to annual reports and news, earnings conference calls contain more information due to their free form, and many researchers have attempted to incorporate them into the task of company risk prediction. Qin and Yang proposed a multimodal deep regression model to capture CEOs’ textual and audio features in earnings conference calls for company risk prediction [qin_what_2019]. Theil et al., on the other hand, used a hierarchical recurrent neural network to model the textual features of the presentation session and the Q&A session separately [theil2019profet]. Li et al. collected textual and audio data from the earnings conference calls of S&P 1500 companies from 2015 to 2018 and aligned the two modalities to construct a dataset called MAEC [li2020maec]. Yang et al. developed a Transformer-based multi-task architecture to model earnings conference call textual and audio features, with an auxiliary task of predicting stock price volatility on a given day [yang2020html]. Yang et al. improved this work in 2021 by introducing two new training tasks, Numeral Category Classification and Magnitude Comparison, which allowed the model to better capture numerical information [yang2022numhtml].
The above studies focus on textual and audio modality modeling and fusion, ignoring the unique dialogue structure of earnings conference calls. Ye et al. proposed a multi-round question-and-answer attention network (MRQA) to model the dialogue structure in earnings conference calls, employing a Reinforced Sentence Selector to identify important sentences in the Q&A session and a Reinforced Bidirectional Attention Network to model the association between the sentences [ye2020financial]. These studies, however, do not take into account the role information of participants in earnings conference calls.
Furthermore, Sawhney et al. were the first to use graph neural networks to incorporate inter-company relationships for company risk prediction [sawhney2020voltage]. However, in this study, companies that held earnings conference calls on different dates were connected in an undirected graph, causing temporal information leakage.
In this section, we formulate our problem of company risk prediction and then present our proposed TVGNN model.
3.1 Problem Definition
Kogan et al. used stock return volatility to measure the risk of listed companies [kogan2009predicting]. The greater the volatility of a company’s stock return, the greater the risk for investors holding this company’s stock. Generally, the stock return volatility from trading day $t$ to $t+\tau$ is defined as follows:

$$v_{[t,\,t+\tau]} = \ln\left(\sqrt{\frac{\sum_{i=0}^{\tau}\left(r_{t+i}-\bar{r}\right)^{2}}{\tau}}\right) \tag{1}$$

where $r_{t+i}$ is the adjusted return of the given stock on trading day $t+i$, $\bar{r}$ is the average of the adjusted returns from $t$ to $t+\tau$, and $\tau$ is the length of the time window used to calculate stock return volatility. The adjusted stock return is defined as:

$$r_{t} = \frac{p_{t}-p_{t-1}}{p_{t-1}} \tag{2}$$

where $p_{t}$ is the adjusted closing price of the stock on trading day $t$, which reflects the actual change in the stock price by removing the effects of stock splits, dividends, and additions.
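As a concrete illustration of how such a volatility label can be computed, the following minimal sketch derives adjusted returns from adjusted closing prices and then a volatility value over one window. It assumes the log-volatility convention of Kogan et al. (natural logarithm of the root of the summed squared deviations divided by the window length); the function names and the sample prices are hypothetical.

```python
import math

def adjusted_returns(prices):
    """Simple returns r_t = p_t / p_{t-1} - 1 from adjusted closing prices."""
    return [prices[k] / prices[k - 1] - 1.0 for k in range(1, len(prices))]

def log_volatility(returns, tau):
    """Log volatility over a window of tau + 1 daily returns (assumed form).

    Computes ln(sqrt(sum((r - mean)^2) / tau)).
    """
    if len(returns) != tau + 1:
        raise ValueError("expected tau + 1 returns")
    mean = sum(returns) / len(returns)
    squared_dev = sum((r - mean) ** 2 for r in returns)
    return math.log(math.sqrt(squared_dev / tau))

# Hypothetical adjusted closing prices for five consecutive trading days.
prices = [100.0, 102.0, 99.0, 101.0, 100.5]
returns = adjusted_returns(prices)        # four daily returns
label = log_volatility(returns, tau=3)    # regression target for tau = 3
```

Because daily return deviations are usually well below 1, the resulting label is typically negative.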
According to earnings momentum, stock prices drift abnormally for a period of time after the release of earnings [ball1968empirical], so we set the time window lengths for calculating stock return volatility to 3, 7, and 15 days.
By quantifying company risk, we define the task of company risk prediction as a supervised regression task. Specifically, the earnings conference call held by a company on a given day can be represented as a sequence of sentences in chronological order. The goal of company risk prediction is to predict the company’s stock return volatility from this sequence of sentences.
3.2 Proposed Model
We propose a new model, Temporal Virtual Graph Neural Network (TVGNN), to incorporate company network and earnings conference calls for company risk prediction, guaranteeing no temporal information leakage. The model has a hierarchical architecture, including three modules:
1) Sentence Encoder: This module extracts the textual information of the earnings conference call and incorporates dialogue structural features and participant role information to output a vector representation for each sentence.
2) Dialogue Encoder: This module captures the contextual relationship of sentences in earnings conference calls and encodes the entire dialogue into a vector.
3) Company Network Encoder: This is the key module of our model, consisting of company network construction, market encoder, and network encoder. First, we construct company networks based on company relationships, guaranteeing no temporal information leakage. The market encoder then models the market state at each time step. Finally, the network encoder employs a graph neural network to fuse all information and update the company representations.
At last, the obtained company representations are fed into an output layer for the downstream task.
3.2.1 Sentence Encoder
We use a pre-trained MPNet model (https://www.sbert.net/) to encode sentences [song2020mpnet]. This yields a sequence of encoded sentence vectors, one for each sentence in the company’s earnings conference call.
Following the position embeddings in BERT [Devlin2019BERTPO], we use four structural embeddings to represent the dialogue structure of an earnings conference call:
1) Position embedding encodes the order of sentences in the dialogue.
2) Utterance embedding encodes which utterance a sentence belongs to. All sentences spoken by a speaker in one turn are called an utterance.
3) Role embedding encodes the speaker’s role information of a given sentence. There are two roles in an earnings conference call: executives and analysts.
4) Part embedding encodes the part in which the sentence appears. An earnings conference call consists of two parts: presentation and Q&A.
Finally, we concatenate the four structural embeddings to the sentence vector. In this way, we obtain a sequence of sentence vectors that incorporates dialogue structural information.
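The concatenation step can be sketched as follows. This is a minimal illustration rather than the actual implementation: the embedding tables are randomly initialized stand-ins for trainable parameters, and the table sizes, the structural embedding width, the 8-dimensional text vector, and the integer coding of roles and parts are all hypothetical.

```python
import random

def make_table(num_rows, dim, rng):
    """Randomly initialized embedding table (stand-in for trainable weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_rows)]

def encode_sentence(text_vec, position, utterance, role, part, tables):
    """Concatenate the text vector with the four structural embeddings."""
    return (text_vec
            + tables["position"][position]
            + tables["utterance"][utterance]
            + tables["role"][role]
            + tables["part"][part])

rng = random.Random(0)
STRUCT_DIM = 4  # hypothetical width of each structural embedding
tables = {
    "position": make_table(512, STRUCT_DIM, rng),
    "utterance": make_table(128, STRUCT_DIM, rng),
    "role": make_table(2, STRUCT_DIM, rng),  # 0 = executive, 1 = analyst
    "part": make_table(2, STRUCT_DIM, rng),  # 0 = presentation, 1 = Q&A
}

text_vec = [0.0] * 8  # stand-in for a pre-trained sentence vector
x = encode_sentence(text_vec, position=5, utterance=2, role=1, part=1, tables=tables)
```

Concatenation keeps the pre-trained text features untouched and lets the downstream encoder decide how much weight to give each structural signal.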
3.2.2 Dialogue Encoder
The key to text modeling is to model contextual relationships. To obtain contextual information, we update the sentence vectors with a Transformer encoder [vaswani2017attention]. Following BERT, we add a trainable [CLS] vector to the input sequence to obtain the representation of a given earnings conference call. After passing the sequence through the stacked Transformer encoder layers, we take the output at the [CLS] position as the representation of the earnings conference call.
3.2.3 Company Network Encoder
The company network encoder contains two submodules: a market encoder and a network encoder, which model the market state and company networks, respectively.
In this section, we first describe how to construct company networks. Then, we introduce how to apply the market encoder and the network encoder on the constructed company network to model the market state and the company relationships.
Company Network Construction. Assume a set of companies hold their earnings conference calls in a given calendar quarter. We treat each company as a node in the graph. If some kind of relationship exists between two companies, and the second company holds its earnings conference call in the given quarter no earlier than the first, a weighted directed edge is added from the first company’s node to the second company’s node. The edge weight, defined in Equation (4), decreases as the interval between the two companies’ earnings conference calls increases. Figure 1 shows an example of a company network constructed in this way.
The constructed company network is a directed static graph. The network’s directed edges prevent temporal information leakage, i.e., information can only flow from temporally preceding nodes to temporally following nodes, which is critical for a temporal prediction task.
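The construction rule can be sketched as follows. This is a minimal illustration under stated assumptions: the reciprocal decay `1 / (1 + gap)` is only a placeholder for a weight that decreases with the interval between calls and is not claimed to match Equation (4), and the tickers, dates, and function name are hypothetical.

```python
from datetime import date

def build_company_network(call_dates, related_pairs):
    """Return directed weighted edges as {(src, dst): weight}.

    call_dates: {ticker: date of the call in the quarter}
    related_pairs: iterable of (ticker_a, ticker_b) undirected relations
    """
    edges = {}
    for a, b in related_pairs:
        for src, dst in ((a, b), (b, a)):
            gap = (call_dates[dst] - call_dates[src]).days
            if gap >= 0:  # dst's call is not earlier than src's call
                edges[(src, dst)] = 1.0 / (1.0 + gap)  # assumed decay form
    return edges

call_dates = {
    "AAA": date(2017, 1, 24),
    "BBB": date(2017, 1, 26),
    "CCC": date(2017, 1, 26),
}
related = [("AAA", "BBB"), ("BBB", "CCC")]
edges = build_company_network(call_dates, related)
```

In this sketch, companies that hold their calls on the same date are connected in both directions, since neither call is earlier than the other, while no edge ever points from a later call to an earlier one.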
Finally, we use the earnings conference call representation obtained from the dialogue encoder as the initial embedding of the corresponding company node.
Market Encoder. The capital asset pricing model points out that the return of an asset consists of the risk-free return of the market and the return of the asset [blume1973new]. Therefore, when predicting the risk of a given company, we should consider the impact of both the company’s own events (earnings conference calls) and the market state. Thus, we design the market encoder to model the market state on each date. The structure of the market encoder is shown in Figure 2.
Suppose the earnings conference calls in a given quarter are held on a number of distinct dates, sorted in chronological order. For each date, we consider the set of companies holding their earnings conference calls on that date.
The market state on a given date can be represented as the sum of all events (earnings conference calls) happening in the market on that date. Thus, we calculate the market state with a global attention module, which computes an attention score for each company node on that date using a trainable parameter and aggregates the node embeddings weighted by these scores.
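A global attention readout of this kind can be sketched as follows. The sketch assumes that each node’s score is the dot product of its embedding with a trainable vector and that scores are normalized with a softmax over the nodes of one date; the function name and the toy embeddings are hypothetical.

```python
import math

def market_state(node_embeddings, attn_param):
    """Attention-weighted sum of the embeddings of one date's company nodes."""
    scores = [sum(w * x for w, x in zip(attn_param, h)) for h in node_embeddings]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(node_embeddings[0])
    state = [sum(a * h[d] for a, h in zip(weights, node_embeddings))
             for d in range(dim)]
    return state, weights

# Two companies holding calls on the same date (toy 2-d embeddings).
embeddings = [[1.0, 0.0], [0.0, 1.0]]
state, weights = market_state(embeddings, attn_param=[0.0, 0.0])
```

With a zero attention vector every node receives the same weight, so the readout reduces to the mean embedding; a trained vector lets the more informative calls dominate the market state.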
The market state is affected not only by the events on the current date but also by past market states. Thus, we use a Gated Recurrent Unit (GRU) network [li2015gated] to model the historical market state. Since the effect of the previous market state on the current market state weakens as the time interval grows, we define a coefficient based on the time interval between the previous date and the current date, and use it to adjust the value of the reset gate in the GRU.
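One way to let the time interval weaken the influence of the previous market state is to scale the GRU reset gate by a decay coefficient. The sketch below illustrates that idea only; the reciprocal decay, the parameter names, and the identity weight matrices are assumptions, not the formulation used in the model.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def time_aware_gru_step(x, h_prev, gap_days, params):
    """One GRU step whose reset gate is scaled by a time-decay coefficient."""
    decay = 1.0 / (1.0 + gap_days)  # assumed form; shrinks as the gap grows
    z = [sigmoid(a + b) for a, b in
         zip(matvec(params["Wz"], x), matvec(params["Uz"], h_prev))]
    r = [decay * sigmoid(a + b) for a, b in
         zip(matvec(params["Wr"], x), matvec(params["Ur"], h_prev))]
    gated = [ri * hi for ri, hi in zip(r, h_prev)]
    cand = [math.tanh(a + b) for a, b in
            zip(matvec(params["Wh"], x), matvec(params["Uh"], gated))]
    # Blend the previous state and the candidate with the update gate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]

identity = [[1.0, 0.0], [0.0, 1.0]]  # toy weights for illustration
params = {name: identity for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
h_near = time_aware_gru_step([0.5, -0.5], [1.0, 1.0], gap_days=0, params=params)
h_far = time_aware_gru_step([0.5, -0.5], [1.0, 1.0], gap_days=30, params=params)
```

As the gap grows, the reset gate approaches zero and the candidate state depends almost entirely on the current date’s events.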
Network Encoder. We use a graph attention network with a symmetric normalized adjacency matrix [wang2021bag] to update the node embeddings so that each company node can capture the topology of the company network. At each layer, Equation (14) adds the company node embedding and the corresponding market state to obtain the input vector for that layer of the graph attention network. The node embeddings are then updated using attention scores that are computed between pairs of connected nodes with trainable parameters. Compared to the vanilla graph attention network, we introduce the edge feature into the formula of the attention score. Because edge features commonly measure the strength of connections, we map the edge feature to a scalar coefficient that adjusts the attention scores.
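The effect of an edge coefficient on attention can be sketched as follows. This is an illustration under assumptions: the raw score follows the usual graph attention form (a LeakyReLU of a learned projection of the concatenated node embeddings), the edge feature is mapped to a coefficient by a single hypothetical scalar `edge_scale`, and the coefficient multiplies the exponentiated score before normalization. The symmetric normalization of the adjacency matrix is omitted for brevity.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def edge_aware_attention(h_dst, neighbours, attn_vec, edge_scale):
    """Attention weights over the in-neighbours of one destination node.

    neighbours: list of (h_src, edge_weight) pairs with positive edge weights
    """
    unnormalized = []
    for h_src, edge_weight in neighbours:
        pair = h_dst + h_src  # concatenate the two embeddings
        raw = leaky_relu(sum(a * v for a, v in zip(attn_vec, pair)))
        coefficient = edge_scale * edge_weight  # scalar mapped from the edge feature
        unnormalized.append(coefficient * math.exp(raw))
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

h_dst = [0.1, 0.2]
neighbours = [([0.3, 0.4], 1.0), ([0.3, 0.4], 0.5)]
weights = edge_aware_attention(h_dst, neighbours,
                               attn_vec=[0.5, -0.5, 0.25, 0.25], edge_scale=2.0)
```

Here the two neighbours have identical embeddings, so the attention weights differ only through the edge weights: the stronger connection receives twice the attention of the weaker one.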
Finally, the company network encoder combines the market encoder and the network encoder. The constructed company network and the company embeddings are input into the company network encoder. In each layer, the company network encoder alternately applies a market encoder and a network encoder to update the company node embeddings. In effect, the company network encoder can be seen as a model that adds temporal virtual nodes to the company nodes and then uses an extended GRU and GAT to update the node embeddings.
3.2.4 Output Layer
We use a multilayer perceptron with trainable parameters and a nonlinear activation function as the output layer to predict stock return volatility.
Because the datasets used in previous studies lack the information of participant role, part segmentation, and company networks, which are required in our proposed model, we construct a new dataset to verify TVGNN.
For earnings conference calls, we collect the transcripts of S&P 500 companies from Seeking Alpha (https://seekingalpha.com/), covering 2008 to 2019. Because the S&P 500 constituents change over time, we use all companies that have entered the index since its creation as target companies for ease of processing (https://en.wikipedia.org/wiki/List_of_S&P_500_companies).
For the construction of the company network, we use the Text-based Network Industry Classifications (TNIC) dataset (https://hobergphillips.tuck.dartmouth.edu/industryclass.htm), which computes company pairwise similarity scores based on the product descriptions in 10-K files [hoberg2016text]. We add an edge between two companies whose similarity score is greater than 0.15. Since the TNIC dataset is updated annually based on the latest 10-K files, we use the TNIC dataset from the previous year to extract the company relationships to avoid temporal information leakage. The statistics of the constructed dataset are shown in Appendix A.
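The relationship extraction described above can be sketched as a simple filter. The row layout `(year, company_a, company_b, score)` and the function name are hypothetical and do not reflect the actual file format of the TNIC dataset; only the 0.15 threshold and the use of the previous year’s scores come from the text.

```python
SIMILARITY_THRESHOLD = 0.15

def related_pairs_for(call_year, tnic_rows):
    """Company pairs whose previous-year similarity exceeds the threshold.

    tnic_rows: iterable of (year, company_a, company_b, score) tuples
    """
    source_year = call_year - 1  # only use scores published before the calls
    pairs = {
        tuple(sorted((a, b)))
        for year, a, b, score in tnic_rows
        if year == source_year and score > SIMILARITY_THRESHOLD
    }
    return sorted(pairs)

rows = [
    (2016, "AAA", "BBB", 0.20),
    (2016, "BBB", "AAA", 0.20),  # same pair listed in the other order
    (2016, "AAA", "CCC", 0.15),  # not strictly greater than the threshold
    (2017, "AAA", "DDD", 0.90),  # same-year score, excluded to avoid leakage
]
pairs_2017 = related_pairs_for(2017, rows)
```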
We split the dataset by time, using samples before 2016 as the training set, samples from 2016 as the validation set, and samples after 2016 as the test set. For each calendar quarter, all samples are constructed as one company network, with each node labeled with the corresponding company’s stock return volatility.
In order to verify the effectiveness of TVGNN, we compare it with baselines proposed in recent related studies. The baselines are shown below.
1) Past volatility [kogan2009predicting]. This is a simple but very effective benchmark model, which directly uses the historical return volatility as the prediction, without using any other information.
2) HAN[yang2016hierarchical]. The model is a widely used long document encoder that employs two BiGRU to capture contextual relationships at the word and sentence levels, as well as a simple attention module to obtain encoding at each level. The obtained document encoding is then used for downstream tasks.
3) ProFET[theil_profet_2019]. This model considers the dialogue structure of earnings conference calls only coarsely, modeling the presentation part and the Q&A part separately, each with a BiLSTM and an attention module. We use the text modeling part of ProFET for experiments.
4) MDRM[qin_what_2019]. The model employs a BiLSTM to extract textual and audio features from earnings conference calls, which are then fused by another BiLSTM to predict the company’s risk. We use the text modeling part of this model for experiments.
5) HTML[yang2020html]. The model uses a pre-trained language model to extract word encodings and Praat to extract audio features. The features are then fused to obtain a multimodal encoding at the sentence level. The model is trained in a multi-task framework with an auxiliary task of predicting the return of the company’s stock on a given day. We use the text modeling and the multi-task part of the model for experiments.
6) MRQA[ye2020financial]. The model uses a BiLSTM to encode the textual features of earnings conference calls. It uses a reinforced sentence selector to select important sentences in the Q&A, and a reinforced bidirectional attention network to capture the interaction between questions and answers. The model directly models the dialogue structure of earnings conference calls.
4.3 Evaluation Metrics
The lower the MSE of a given model on the test set, the better the model. In addition, we use a second metric to reflect the improvement of the model over the past-volatility baseline.
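The two kinds of metric can be sketched as follows. The MSE is standard; the relative improvement shown here, the fraction by which a model reduces the baseline’s MSE, is only one plausible form of an improvement metric and is an assumption, as are the function names and sample values.

```python
def mse(predictions, targets):
    """Mean squared error between predicted and true volatilities."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)

def relative_improvement(model_mse, baseline_mse):
    """Fraction by which the model reduces the baseline's MSE (assumed form)."""
    return (baseline_mse - model_mse) / baseline_mse

# Hypothetical log-volatility predictions and targets.
test_error = mse([-3.5, -3.0], [-3.5, -4.0])
gain = relative_improvement(model_mse=0.8, baseline_mse=1.0)
```

A positive value means the model beats the baseline, and a negative value means it does worse.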
4.4 Experimental Setting
We use the Adam optimizer to train the models [kingma2014adam]. The hyperparameters of TVGNN are tuned on the validation mean squared error (MSE), giving: learning rate 5e-4, weight decay 1e-7, hidden size 64, 2 dialogue encoder layers, 8 dialogue encoder attention heads, 3 company network encoder layers, and 1 company network encoder attention head. For the baselines, we use the hyperparameters from the papers that proposed them. Training is stopped if the model’s MSE on the validation set does not decrease for 10 epochs.
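The stopping rule can be sketched as follows. The sketch operates on a precomputed list of validation MSE values so that it stays self-contained; in practice each value would come from evaluating the model after one training epoch. The function name and sample values are hypothetical.

```python
def early_stopping_epoch(validation_mse_per_epoch, patience=10):
    """Return (best_epoch, best_mse), stopping after `patience` epochs
    without a decrease in validation MSE."""
    best_mse = float("inf")
    best_epoch = -1
    for epoch, val_mse in enumerate(validation_mse_per_epoch):
        if val_mse < best_mse:
            best_mse, best_epoch = val_mse, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` consecutive epochs
    return best_epoch, best_mse

# Validation MSE improves twice, then stalls for more than 10 epochs.
history = [1.0, 0.9] + [0.95] * 20 + [0.1]
best_epoch, best_mse = early_stopping_epoch(history, patience=10)
```

In this example training halts before the final value is ever observed, which shows the usual trade-off of early stopping: it saves computation but cannot see improvements that arrive after the patience window.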
4.5 Experimental Results
Table 1 shows the results of the comparison experiment. TVGNN performs best in the company risk prediction task for all three time windows. Compared to the best baseline models, TVGNN improves by 14.19%, 9.01%, and 6.92% on the three time windows, respectively. In terms of overall performance, TVGNN improves by 13.26% over the best baseline, HTML, demonstrating the effectiveness of TVGNN.
Additionally, as the time window gets longer, the difference between the past-volatility baseline and the other baselines gets smaller. For the longest time window, all other baselines are worse than the past-volatility baseline, indicating that the additional effect of the earnings conference call transcripts gradually diminishes as the time window becomes longer, which is consistent with earnings momentum [ball1968empirical]: the impact of a company’s earnings information on its stock price fades over time. In our experiment, the effect lasts no longer than 15 days.
5 Supplementary Analysis
5.1 Ablation Experiment
There are four modules in TVGNN: sentence encoder, dialogue encoder, market encoder, and network encoder. To test the effectiveness of each module, we gradually add a module and build different variants of TVGNN for the ablation experiment.
The results are shown in Table 2, where the "+" in the table indicates that the variant contains the corresponding module. As we can see:
1) Adding each module can improve the model’s performance in company risk prediction, demonstrating the effectiveness of the four modules in TVGNN.
2) It is difficult for a model to improve on all three tasks without introducing new data. Variant 2 outperforms variant 1 on two of the time windows but underperforms it on the remaining one, whereas variant 3 outperforms variant 2 on all three tasks by incorporating company networks.
3) Modeling the market state can improve the model’s performance on the task . TVGNN with the market encoder outperforms variant 3 on .
5.2 Transductive Learning
While TVGNN performs well on the company risk prediction task, we have to divide the samples by quarter for training and prediction, which is extremely restrictive in real applications. To better use our model in practice, we propose two transductive learning approaches:
1) TVGNN-T: We divide the nodes in a company network into a training set, a validation set, and a test set in chronological order, in the ratio of 7:1:2. The constructed dataset is then used for training and prediction.
2) TVGNN-T(fine-tune): We pre-train TVGNN on samples from the previous period and then fine-tune the pre-trained model on the dataset constructed in 1).
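The chronological 7:1:2 division of the nodes within one company network can be sketched as follows. Integer ratios are used to avoid floating-point rounding; the function name, tickers, and dates are hypothetical.

```python
from datetime import date

def chronological_split(nodes, ratios=(7, 1, 2)):
    """Split (ticker, call_date) pairs into train, validation, and test
    lists by call date."""
    ordered = sorted(nodes, key=lambda item: item[1])
    total = sum(ratios)
    n_train = len(ordered) * ratios[0] // total
    n_val = len(ordered) * ratios[1] // total
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]
    return train, val, test

# Ten hypothetical companies with calls on different days of one quarter.
nodes = [("C%02d" % day, date(2017, 1, day)) for day in (9, 3, 7, 1, 10, 5, 2, 8, 4, 6)]
train, val, test = chronological_split(nodes)
```

Because the nodes are sorted by call date before splitting, every training node precedes every validation and test node in time.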
The two transductive learning approaches are tested on the 2017 Q1 company network, and the results are shown in Table 3. We also compare their performance with TVGNN (the pre-trained model in 2), without fine-tuning).
We can see that TVGNN-T performs poorly because it is trained only on a small dataset, whereas TVGNN and TVGNN-T (fine-tune) benefit from the large pre-training dataset. Furthermore, TVGNN-T (fine-tune) outperforms TVGNN because it can obtain helpful information about the company network and market state for the given quarter.
5.3 Case Study
It is a requirement for models applied in finance to be interpretable because decisions in this field involve millions of dollars. In this section, we use GNNExplainer to understand TVGNN’s predictions. GNNExplainer can identify the important subgraph that influences the prediction of a given GNN model [ying2019gnnexplainer]. Our case study is Oracle, whose cloud business soared and which increased its dividend in 2017 Q1.
|JNPR||Juniper Networks||Communications Equipment||-3.6766|
|HPE||Hewlett Packard Enterprise||Technology Hardware, Storage & Peripherals||-3.0791|
The important 1-hop subgraph of ORCL (Oracle) computed by GNNExplainer is shown in Figure 3. The details of the nodes are shown in Table 4. The widths of the edges in the subgraph represent the edges’ importance scores. Despite the fact that all of the companies in the subgraph are in the information technology sector, not all of them have a significant impact on the risk of other companies. TVGNN’s prediction for ORCL (Oracle) is mainly influenced by CTXS, HPE, CSCO, FTNT, and FFIV. All five companies are Oracle partners, and they all sell Oracle products in their cloud services. Furthermore, each node in the graph has a self-loop, which represents the impact of the company’s own event, i.e., the earnings conference call, on the risk of the company. We can see that the risk of JNPR (Juniper Networks) and FTNT (Fortinet) is influenced more by their own earnings conference calls, whereas the risk of the rest is influenced more by other companies in the company network. That is because JNPR and FTNT released some shocking financial information during their earnings conference calls. JNPR posted a disappointing profit outlook, causing its stock to plunge 7.5% after the earnings conference call. FTNT’s stock, on the other hand, jumped 10.8% after its earnings conference call, as its performance in 2016 gave investors confidence in its 2017 performance.
In this paper, we propose a model called TVGNN for company risk prediction by jointly modeling company networks and earnings conference calls. We make a methodological contribution by designing a new method to construct company networks and developing a new model based on graph neural networks to model earnings conference calls, market state, and company networks. Empirical results on our constructed dataset demonstrate the superiority of our proposed model over competitive baselines from the extant literature. We also conduct supplementary analysis to examine the effects of our model’s modules and interpretability.
Appendix A Dataset Statistics
The quarterly statistics of the constructed dataset are shown in Table 5.
|Quarterly||Average #Utterance||Average #Sentences||#Nodes (Companies)||#Edges|
The label distribution on the training set, validation set, and test set is shown in Figure 4. We can see that the distribution of labels in different splits is consistent.