Stock markets are a symbol of market capitalism, and billions of shares of stock are traded every day. In 2018, stocks worth more than 65 trillion U.S. dollars were traded worldwide, and the market capitalization of domestic companies listed in the U.S. exceeded the country’s GDP (source: https://data.worldbank.org/). Although stock movement prediction is a difficult problem, its solutions are directly applicable in industry. Researchers in both industry and academia have long been interested in predicting future trends in the stock market. Researchers focused on finding profitable patterns in historical data are known as quants in the financial industry and as data scientists in general. Regardless of which term is used, such researchers increasingly rely on systematic trading algorithms to make trading decisions automatically.
Although there is still room for debate, numerous studies have shown that the stock market is predictable to some extent. Existing methods are based on the ideas of fundamentalists or technicians, who have different perspectives on the market.
Fundamentalists believe that the price of securities of a company corresponds to the intrinsic value of the company or entity. If the current price of a company’s stock is lower than its intrinsic value, investors should buy the stock as its price will go up and eventually be the same as its fundamental value. The fundamental analysis of a company involves an in-depth analysis of its performance and profitability. The intrinsic value of the company is based on its product sales, employees, infrastructure, and profitability of its investments.
Technicians, on the other hand, do not consider real world events when predicting future trends in the stock market. For technicians, stock prices are considered as only typical time series data with complex patterns. With appropriate preprocessing and modeling, patterns can be analyzed, from which profitable patterns may be extracted. The information used for technical analysis consists of mainly closing prices, returns, and volumes. The movement of stock prices is known to be stochastic and non-linear. Technical analysis studies focus on reducing stochasticity and capturing consistent patterns.
Some technical analysis works have focused on how to extract meaningful features from raw price data. In the finance industry, features extracted from such data are called technical indicators and include adaptive moving average, relative strength index, stochastics, momentum oscillator, and commodity channel index. Creating meaningful technical indicators is similar to manual feature engineering in general machine learning tasks. As in any machine learning task, extracted features contain important information for models. Hence, some works have utilized indicators to more accurately predict the movement of stock prices.
Technicians are also interested in finding meaningful patterns in raw price data, and numerous works have analyzed the effectiveness of different models. Although the majority of researchers agree that the stock market moves in a non-linear way, many empirical studies have shown that non-linear models do not outperform linear models. These results suggest that, even though deep neural network based models have been successfully applied to many challenging domains, careful consideration is required when designing profitable models for stock market prediction. More recently, a growing number of studies have found that non-linear models outperform linear models, and many have shown that recurrent neural network based models are effective in stock movement prediction.
As the amount of information on the web continues to rapidly increase, it becomes easier to obtain information on securities from different sources. Data scientists with a computer science background have begun to pay attention to unstructured data such as text from the news and Twitter. Models that use text data reflecting the real world events of companies can be categorized as fundamental analysis models. However, text-based stock prediction approaches try to capture investors’ opinions about an event. Based on the assumption that the price of a company’s stock reflects the aggregate of investors’ opinions about the company, some works have focused on reading investors’ opinions about companies. Other research focuses on understanding the impact of events on stock prices.
More recently, computer science research communities have been highly interested in utilizing graph-structured data. Stock market prediction methods using corporate relational data have also been proposed. Chen et al. created a network of companies based on financial investment information. Using a constructed adjacency matrix, they trained a GCN model and compared its prediction performance with that of more conventional network embedding models. Feng et al. developed a more general framework that involves using many different types of relations in a publicly available knowledge database. They also proposed a GNN model that can capture temporal features of stocks. Although these models were the first to integrate relational data for stock market prediction, they can still be improved. The quality of information varies considerably depending on the type of relation. However, no existing work has thoroughly investigated which types of relational data are more beneficial to stock movement prediction or focused on finding an effective way to selectively aggregate information on different relation types.
Furthermore, previous works have focused mainly on node classification. Node classification and graph classification are the two main tasks in graph-based learning. In a stock market network, individual nodes typically represent companies. Predicting future trends in individual stock prices is similar to the node classification task. We argue that previously proposed models can be used as a node representation updating function in the graph classification task which we propose in this work.
To address the limitations mentioned above, in this paper, we study how to effectively utilize graph-based learning methods and relational data in stock market prediction. We used different types of relations and investigated their effect on performance in stock price movement prediction of individual companies. In our experiments, we found that only relevant relations are useful for stock prediction. Information from some irrelevant relations even degraded prediction performance. We propose HATS, a new hierarchical graph attention network method that uses relational data for stock market prediction. HATS selectively aggregates information from different relations and adds the information to the representations of companies. Specifically, node features are initialized with extracted features from the feature extraction module. HATS is used as a relational modeling module to selectively gather information from neighboring nodes. The representations with the added information are then fed into a task-specific prediction layer.
We applied our method to the following two graph related tasks: predicting the movement of individual stocks, which is the same node classification task performed in previous works, and predicting the movement of the market index, which is similar to the graph classification task. This is a new way of adapting graph-based learning to stock market prediction. Since market indices consist of individual stocks, we can predict the movement of a market index using a graph classification based approach. The experimental results on both tasks demonstrate the effectiveness of our proposed method.
The main contributions of this work can be summarized as follows.
We thoroughly investigate the effect of using relations in stock market prediction and find characteristics of more meaningful relation types.
We propose a new method HATS which can selectively aggregate information on different relation types and add the information to each representation.
We propose graph classification based stock prediction methods. Considering the market index as an entire graph and constituent companies as individual nodes, we predict the movement of a market index using a graph pooling method.
We perform extensive experiments on stocks listed in the S&P 500 Index. Our experimental results demonstrate that the performance of our method in terms of Sharpe ratio and F1 score was 19.8% and 3% higher than the existing baselines, respectively.
The remainder of this paper is organized as follows. In Section 2, we provide short preliminaries that can be helpful in understanding our work. Detailed descriptions of our proposed framework are provided in Section 3. In Section 4, we explain how we collected the data used in our experiments. We discuss our experimental results in Section 5 and conclude our work in Section 6.
A graph is a powerful data structure which can be used to deal with relational data. Various methods learn meaningful node representations in graphs. In this section, we provide brief preliminaries on graph-based methods. A graph $G = (V, E)$ consists of a set of vertices (nodes) $V$ and edges $E$. If a node is denoted as $v_i \in V$ and $e_{ij} = (v_i, v_j) \in E$ is an edge connecting nodes $i$ and $j$, the adjacency matrix $A$ is an $n \times n$ matrix with $A_{ij} > 0$ if $e_{ij} \in E$ and $A_{ij} = 0$ otherwise. The degree of a node is the number of edges connected to the node, and is denoted as $D$ where $D_{ii} = \sum_j A_{ij}$. Each node can have node features (attributes) $X$, where $X \in \mathbb{R}^{n \times f}$ is a feature matrix.
The features of nodes change over time in a spatial-temporal graph, which can be defined using a feature matrix $X \in \mathbb{R}^{t \times n \times f}$, where $t$ is the number of time steps.
Graph Neural Networks
With a growing interest in utilizing graph-structured data, a large amount of research has been conducted for learning meaningful representations in graphs. Most graph neural networks (GNNs) can be categorized as spectral or non-spectral.
Spectral graph theory based methods such as GCN utilize convolutional neural networks (CNN) to capture local patterns in graph-structured data. GCN applies a spectral convolution filter to extract information in the Fourier domain.
Equation (2.1) describes a spectral convolution filter $g_\theta$ used for graph data $x \in \mathbb{R}^n$ ($n$ companies) with a diagonal matrix $M$.
$$g_\theta \star x = U M U^\top x \qquad (2.1)$$
$U$ is the eigenvector matrix of the graph Laplacian matrix $L$.
However, for large graphs, computing the eigendecomposition of the graph Laplacian is computationally too expensive. To address this problem, Kipf and Welling approximated spectral filters in terms of Chebyshev polynomials $T_k(x)$ up to order $K$ based on Chebyshev coefficients $\theta_k$, which can be defined as follows.
$$g_\theta \star x \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{L})\, x, \qquad \tilde{L} = \frac{2}{\lambda_{max}} L - I \qquad (2.2)$$
$\lambda_{max}$ denotes the largest eigenvalue of the graph Laplacian.
Additionally, the Chebyshev polynomials can be represented recursively as $T_k(x) = 2x\,T_{k-1}(x) - T_{k-2}(x)$ with $T_0(x) = 1$ and $T_1(x) = x$. GCN has been shown to be effective with the parameter setting of $K=1$. Kipf and Welling also simplified Equation (2.2) into a fully connected layer with a built-in convolution filter.
On the other hand, non-spectral approaches define convolution operations directly on the graph, utilizing spatially close neighbors. For example, Hamilton et al. proposed a general framework for sampling and aggregating features from the local neighborhood of a node to generate embeddings. Specifically, features of neighboring nodes are aggregated iteratively using a learnable aggregation function, which is described as follows.
$$h_i^{k} = f_k\left(h_i^{k-1}, \{h_j^{k-1} : j \in N(i)\}\right)$$
where $h_i^{k}$ denotes the representation of node $i$ at the $k$-th iteration, $N(i)$ is the set of neighbors of node $i$, and $f_k$ is a learnable aggregation function. Many proposed methods can be considered as special types of aggregation functions. For example, Veličković et al. used an attention mechanism to assign different weights when aggregating the features of neighboring nodes.
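As an illustration of iterative neighborhood aggregation, the sketch below performs one aggregation step in plain Python. It is a minimal sketch under stated assumptions: the function name `aggregate_mean`, the fixed mean aggregator, the equal 0.5 weighting between a node and its neighborhood, and the toy graph are all illustrative choices, whereas a learnable aggregator would replace the fixed mean with trained parameters.

```python
def aggregate_mean(h, neighbors):
    """One aggregation step: mix each node vector with the mean of its neighbors.

    h: dict mapping node -> feature vector (list of floats)
    neighbors: dict mapping node -> list of neighboring nodes
    """
    updated = {}
    for node, vec in h.items():
        nbrs = neighbors.get(node, [])
        if not nbrs:
            # Isolated node: keep its own representation unchanged.
            updated[node] = list(vec)
            continue
        mean = [sum(h[j][d] for j in nbrs) / len(nbrs) for d in range(len(vec))]
        # Assumed combination rule: plain average of self and neighborhood summary.
        updated[node] = [0.5 * (a + b) for a, b in zip(vec, mean)]
    return updated


# Toy graph with three nodes (hypothetical data).
h0 = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
nbrs = {"A": ["B", "C"], "B": ["A"], "C": []}
h1 = aggregate_mean(h0, nbrs)
```

Calling `aggregate_mean` repeatedly on its own output lets information travel one additional hop per iteration.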
Updated node representations can be used in both node classification and graph classification tasks. For a graph classification task, additional layers are needed to aggregate individual node representations into a graph representation. Graph pooling is a technique used to build graph representations, and numerous methods that effectively aggregate node features have been proposed.
In this section, we will first explain our entire framework. Our framework is based on many different stock market prediction methodologies that use corporate relational data. Knowing how the general framework functions can help in understanding the importance of using relational data in stock prediction. The overall framework is shown in Fig. 1. After providing a general description of the framework, we will elaborate on the structure of our method HATS, which is a new type of relational modeling module.
3.1 General framework
Feature Extraction Module
A stock market graph is a typical type of spatial-temporal graph. If we regard individual stocks (companies) as nodes, each node feature can represent the current state of each company with respect to price movement. Also, node features can evolve over time. As mentioned in Section 1, numerous types of data (e.g. historical price, text or fundamental analysis based sources) can be used as an indicator for the movement of a stock price. As data such as raw text or price data are not informative enough, we need a feature extraction module for obtaining meaningful representations of individual companies. In this study, we use only historical price data.
A feature extraction module is used to represent the current state of a company based on historical movement patterns. Numerous tools that predict future trends in the stock market using raw price data as their input can be used as a feature extraction module. In this study, we use LSTM and GRU as our feature extraction modules. LSTM is the most widely used framework in time series modeling, and previous works have also used LSTM as their feature extraction module. For a more detailed description of how node feature vectors are extracted from raw price data, we refer readers to those works. We also use GRU as a feature extraction module as it is known to be more efficient than LSTM in time series tasks, and obtains similar performance with appropriate tuning. From our experiments, we found that LSTM performs slightly better than GRU on average. However, it was more difficult to train LSTM, especially when a model had more layers. For this reason, we use LSTM for the individual stock prediction task and GRU for the index movement prediction task where an additional graph pooling layer is needed.
Relational Modeling Module
A relational modeling module is a node updating function. Gilmer et al. considered graph-based learning as information exchange between related nodes. The main function of graph neural networks is information exchange between neighboring nodes. Information from neighboring nodes is aggregated and then added to each node representation. Information collected from different nodes and relation types needs to be effectively combined. To this end, we propose a new GNN based Hierarchical graph Attention Network for Stock market prediction (HATS) method. Each layer is designed to capture the importance of neighboring nodes and relation types. A detailed description of our proposed method HATS is provided in Section 3.2.
Task-Specific Module
After node representations are updated using relational modeling, the node representations are fed into the task-specific module. Since node representations can be used in various tasks with appropriate modeling, the layer is considered "task-specific." In this study, we performed experiments on the following two graph-based learning tasks: individual stock prediction and market index prediction. Individual stock prediction is similar to the node classification task which was performed in previous research. As market indices consist of multiple related stocks, information on the current state of an individual company can be utilized to predict the movement of its index. As recently proposed graph pooling methods can be used to aggregate information of individual nodes to represent an entire graph, they can also be used for the index prediction task. The experimental results in Section 5 demonstrate that the graph pooling methods outperform all baseline methods.
In the next subsections, we describe HATS in more detail. We present our method HATS which aggregates information and adds it to node representations. Then, we explain how we use node representations with added information in two different tasks.
3.2 Hierarchical Attention Network
Let us denote the $f$-dimensional feature vector of a company $i$ at time $t$ obtained from the feature extraction module as $e_i^t \in \mathbb{R}^f$. In Figure 2, we omit the superscript $t$ for simplicity, assuming that all the node representation vectors are calculated at time step $t$. Edges are defined separately for each type of relation. For the graph neural network operation, we have to know the set of neighboring nodes of our target node $i$ for each relation type. Let us denote the set of neighboring nodes of $i$ for relation type $m$ as $N_i^{r_m}$ and the embedding vector of relation type $m$ as $e_{r_m} \in \mathbb{R}^d$. Here, $d$ is the dimension of a relation type embedding vector. Our goal is to selectively gather information on different relations from neighboring nodes. We want our model to filter out information that is not useful for future trend prediction. This process is important because companies have many different types of relationships and some information is not related to movement prediction.
The attention mechanism is widely used to assign different weight values for information selection. With a hierarchically designed attention mechanism, our Hierarchical Attention network for Stock prediction (HATS) selects only meaningful information at each level. This hierarchical attention network is key to improving performance. The architecture of HATS is shown in Fig. 2.
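In the first layer, the state attention layer, HATS selects important information on the same type of relation from a set of neighboring nodes. The attention mechanism is used to calculate different weights based on the current state (representation) of each neighboring node. To calculate the state attention scores, we concatenate the relation type embedding $e_{r_m}$ and the node representations $e_i$ and $e_j$ of nodes $i$ and $j$ into a vector, where $j \in N_i^{r_m}$ is a neighbor of $i$ for relation type $m$. If we denote the concatenated vector as $x_{ij}^{r_m} \in \mathbb{R}^{2f+d}$, the state attention score is calculated as follows:
$$v_{ij} = x_{ij}^{r_m} W_s + b_s \qquad (3.1)$$
$$\alpha_{ij}^{r_m} = \frac{\exp(v_{ij})}{\sum_{k \in N_i^{r_m}} \exp(v_{ik})} \qquad (3.2)$$
where $W_s$ and $b_s$ are learnable parameters used to calculate the state attention scores. With the attention weights calculated using Eq. (3.2), we combine all weighted node representations to calculate a vector representation $s_i^{r_m}$ of relation $m$ for company $i$, as in Eq. (3.3).
$$s_i^{r_m} = \sum_{j \in N_i^{r_m}} \alpha_{ij}^{r_m} e_j \qquad (3.3)$$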
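With the above equation, the representations of all relation types are obtained. We selectively gathered information on specific relations from neighboring nodes. A representation can be considered as summarized information of a relation. The vector $s_i^{r_m}$ contains summarized information from relation $m$. For example, the representation of the industry relation summarizes the general state of the industry of our target company. Like human investors, our model should prioritize trading decisions based on summarized information of each relation. The second layer of HATS is designed to assign importance weights to the summarized information of each relation using another attention mechanism.
We concatenate the summarized relation information vector $s_i^{r_m}$, the representation of the current state of company $i$, $e_i$, and the relation type embedding vector $e_{r_m}$ to use as input for the relation attention layer.
$$\tilde{x}_i^{r_m} = [\, s_i^{r_m} \,\|\, e_i \,\|\, e_{r_m} \,], \qquad \tilde{v}_i^{r_m} = \tilde{x}_i^{r_m} W_r + b_r, \qquad \tilde{\alpha}_i^{r_m} = \frac{\exp(\tilde{v}_i^{r_m})}{\sum_{k} \exp(\tilde{v}_i^{r_k})}$$
The weighted relation vectors are summed into an aggregated relation representation $e_i^{r} = \sum_{m} \tilde{\alpha}_i^{r_m} s_i^{r_m}$.
Finally, the representation of the node is added: $\bar{e}_i = e_i^{r} + e_i$.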
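The two attention levels described above can be sketched in plain Python as follows. This is a minimal sketch, not the authors' implementation: the function names, the use of a single weight vector and scalar bias per attention layer, and the toy inputs are assumptions made for illustration. The first level weights neighbors within one relation type, the second level weights the per-relation summaries, and the result is added to the node's own representation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]


def hats_update(e, i, neighbors_by_rel, rel_emb, w_s, b_s, w_r, b_r):
    """Update node i using state attention, then relation attention (sketch).

    e: dict node -> feature vector; neighbors_by_rel: dict relation -> neighbors of i
    rel_emb: dict relation -> embedding; w_s, w_r: weight vectors; b_s, b_r: scalars
    """
    summaries, rel_scores = [], []
    for rel, nbrs in neighbors_by_rel.items():
        if not nbrs:
            continue
        # State attention: score each neighbor from [relation emb, e_i, e_j].
        scores = [dot(rel_emb[rel] + e[i] + e[j], w_s) + b_s for j in nbrs]
        alpha = softmax(scores)
        summary = weighted_sum(alpha, [e[j] for j in nbrs])
        summaries.append(summary)
        # Relation attention: score the summary from [summary, e_i, relation emb].
        rel_scores.append(dot(summary + e[i] + rel_emb[rel], w_r) + b_r)
    if not summaries:
        return list(e[i])
    beta = softmax(rel_scores)
    e_rel = weighted_sum(beta, summaries)
    # Add the aggregated relational information to the node representation.
    return [a + b for a, b in zip(e[i], e_rel)]


# Toy example (hypothetical data): zero weights give uniform attention.
e = {0: [1.0, 0.0], 1: [0.0, 2.0], 2: [2.0, 0.0]}
rels = {"industry": [1, 2], "owner": [1]}
emb = {"industry": [0.1], "owner": [0.2]}
updated = hats_update(e, 0, rels, emb, [0.0] * 5, 0.0, [0.0] * 5, 0.0)
```

With zero weights both softmax layers are uniform, so the update reduces to adding the average of the per-relation neighbor means, which makes the toy result easy to check by hand.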
In the next two subsections, we describe how updated node representations can be used in different tasks.
3.3 Individual Stock Prediction Layer
Like previous works, our model can be applied to the individual stock prediction task. We performed classification on the following three types of labels: [up, neutral, down]. A detailed description of the task setting is provided in Section 5. For the individual stock prediction task, we added only a simple linear transformation layer.
$$\hat{Y}_i = \mathrm{softmax}(\bar{e}_i W_p + b_p)$$
where $\bar{e}_i$ is the updated representation of company $i$, $W_p \in \mathbb{R}^{f \times l}$, $b_p \in \mathbb{R}^{l}$, and $l$ is the number of movement classes. We trained models on all the corporate relational data using cross-entropy loss.
$$L_{node} = -\sum_{i \in Z_u} \sum_{c=1}^{l} Y_{ic} \ln \hat{Y}_{ic}$$
where $Y_{ic}$ is the ground truth movement class of company $i$ and $Z_u$ denotes all the companies in our dataset.
3.4 Graph Pooling for Index Prediction
A market index consists of multiple stocks chosen based on specific criteria. Let us denote the graph of a specific market index $c$ with $n_c$ companies as $G_c$, where the group of constituent companies of index $c$ is $Z_c$ and their updated node representations are $\bar{e}_i$ for $i \in Z_c$. To obtain the representation of the entire graph, the features of individual nodes need to be aggregated. Recently, numerous graph pooling methods for aggregation have been proposed. Stock market index data has its own historical price patterns which can be used as features. Therefore, we combine features obtained by graph pooling individual nodes and features directly extracted from historical price data.
We used the mean pooling method in our experiments to calculate graph representations as follows:
$$g_c = \frac{1}{n_c} \sum_{i \in Z_c} \bar{e}_i$$
where $\bar{e}_i$ is the updated representation of company $i$. By denoting the target index’s own feature vector extracted using the feature extraction module as $e_c$, the final representation of the entire graph can be obtained by combining the original representation of the graph and the representation obtained by graph pooling as follows.
$$\bar{e}_c = g_c + e_c$$
We also tried concatenating the two representations; however, this did not have a significant impact on performance. As in the individual stock prediction task, we make predictions using a simple linear transformation with $W_c$ and $b_c$, and train models using cross-entropy loss as follows.
$$\hat{Y}_c = \mathrm{softmax}(\bar{e}_c W_c + b_c), \qquad L_{graph} = -\sum_{k=1}^{l} Y_{ck} \ln \hat{Y}_{ck}$$
Note that we use the most basic pooling method as this is the first work to apply graph pooling to the stock prediction task. There exists much room for improvement, which we leave for future work.
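Mean pooling followed by a combination with the index's own features can be sketched as below. This is a minimal sketch under stated assumptions: the function names are hypothetical, and element-wise addition is assumed as the way the pooled vector and the index feature vector are combined.

```python
def mean_pool(node_reprs):
    """Average a list of equally sized node vectors into one graph vector."""
    n = len(node_reprs)
    dim = len(node_reprs[0])
    return [sum(vec[d] for vec in node_reprs) / n for d in range(dim)]


def index_representation(node_reprs, index_feature):
    """Combine pooled constituent features with the index's own feature vector.

    Assumption: the two vectors have the same size and are added element-wise.
    """
    pooled = mean_pool(node_reprs)
    return [p + q for p, q in zip(pooled, index_feature)]


# Toy example with two constituent companies (hypothetical values).
constituents = [[1.0, 2.0], [3.0, 4.0]]
graph_vec = index_representation(constituents, [0.5, -1.0])
```

The resulting vector would then be passed to a linear layer with softmax, in the same way as a single node representation.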
4.1 Price-related data
In this study, we focused on the U.S. stock market, one of the largest markets in the world based on market capitalization. We gathered corporate relational data from a public database which contains information on most of the S&P 500 companies. Among the S&P listed companies, there exist some companies without any type of relation with other companies in the database. After removing such companies, the remaining 431 companies were used as our target companies.
We sampled price data for our study from 2013/02/08 to 2017/10/05 (1174 trading days in total). Figure 3 shows the closing price of the S&P 500 index, which represents the overall market condition. As shown in Figure 3, although the index price has a tendency to go up, there are several crashes in our sample period. We used different experimental settings with varying degrees of volatility to evaluate performance. A more detailed description of the task settings is provided in Section 5.
As we described in Section 3, raw features of price-related data are fed to the feature extraction module. Many different types of raw features such as open price, close price, and volume can be used. In this study, following previous work, we use historical price change rates as our input. Let $P_i^{t}$ and $P_i^{t-1}$ be the closing prices of a company $i$ at time $t$ and $t-1$, respectively. The price change rate at time $t$ is calculated as $R_i^{t} = (P_i^{t} - P_i^{t-1}) / P_i^{t-1}$. As our model can predict the movement of a stock price, the price change rate can also be predicted. Therefore, our model can predict the price change rate of the next day $R_i^{t+1}$, given the sequence of the historical price change rates of a company $[R_i^{t-l+1}, \dots, R_i^{t}]$, where $l$ here denotes the length of the lookback window.
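The price change rate computation and the extraction of an input window can be sketched as follows. The function names and the toy prices are assumptions for illustration; the window length corresponds to the lookback period used as model input.

```python
def price_change_rates(closes):
    """Daily change rates (P_t - P_{t-1}) / P_{t-1} from a list of closing prices."""
    return [(cur - prev) / prev for prev, cur in zip(closes, closes[1:])]


def lookback_window(rates, t, length):
    """Return the `length` most recent rates ending at index t (inclusive)."""
    start = t - length + 1
    if start < 0:
        raise ValueError("not enough history for the requested window")
    return rates[start : t + 1]


# Toy closing prices (hypothetical values).
closes = [100.0, 110.0, 99.0, 99.0]
rates = price_change_rates(closes)
window = lookback_window(rates, t=2, length=2)
```

Note that a series of N closing prices yields N - 1 change rates, so the first trading day has no input feature.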
4.2 Corporate Relation data
The second type of data we used is corporate relational data. Following Feng et al., we collected corporate relational data from Wikidata. Wikidata is a free collaborative knowledge base which contains relations between various types of entities (e.g. person, organization, country). Each entity has an index. If two entities have a relationship, the relationship is expressed as a property. For example, the sentence "Steve Jobs founded Apple" is expressed as a triplet [Apple, Founded by, Steve Jobs]. In terms of graphs, each entity in Wikidata is a node and each property is an edge. Therefore, Wikidata can be understood as a heterogeneous graph with many different types of nodes and edges.
Here, companies are the only node type in which we are interested. However, there are only a few types of direct edges between companies and their connections are very sparse. To address this problem, we utilize meta-paths, which are commonly used to deal with heterogeneous graphs. If Steve Jobs was a board member of Google, we can make the triplet [Steve Jobs, Board member, Google]. Combined with the above mentioned relation [Apple, Founded by, Steve Jobs], the two companies Apple and Google are now connected by the meta-path [Founded by, Board member] and share the node Steve Jobs. In this way, we found that there exist 75 types of relations including direct relations between companies and meta-paths. The entire lists of individual relations and meta-paths used in this study are provided in the appendix.
One of our main goals is to study the effect of using corporate relational data on stock market prediction performance. There are many ways to define a set of neighboring nodes. We used a meta-path with only 2 hops at maximum to convert an originally heterogeneous graph into a homogeneous graph with only company nodes. Still, methods for building a corporate relational network from a large knowledge base can be much improved, which we leave for future work.
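The conversion from knowledge-base triplets to company-to-company edges via 2-hop meta-paths can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the function name is hypothetical, the direction of each triplet is ignored when looking for a shared intermediate entity, and direct company-to-company relations are left out.

```python
from collections import defaultdict


def two_hop_meta_paths(triplets, companies):
    """Connect companies that share an intermediate (non-company) entity.

    triplets: iterable of (head, relation, tail)
    Returns a dict mapping (relation_1, relation_2) -> set of (company_1, company_2).
    """
    # For each intermediate entity, collect the companies linked to it
    # together with the relation used for that link.
    by_entity = defaultdict(list)
    for head, rel, tail in triplets:
        if head in companies and tail not in companies:
            by_entity[tail].append((head, rel))
        elif tail in companies and head not in companies:
            by_entity[head].append((tail, rel))

    edges = defaultdict(set)
    for links in by_entity.values():
        for c1, r1 in links:
            for c2, r2 in links:
                if c1 != c2:
                    edges[(r1, r2)].add((c1, c2))
    return edges


# The example from the text, as triplets.
triplets = [
    ("Apple", "Founded by", "Steve Jobs"),
    ("Steve Jobs", "Board member", "Google"),
]
edges = two_hop_meta_paths(triplets, {"Apple", "Google"})
```

Each key of the returned dictionary is one meta-path relation type, so each key yields its own adjacency matrix over company nodes.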
5.1 Experiment design
As we mentioned in Section 3.3, we divided the training data into the following three classes based on the price change ratio: [up, neutral, down]. Specifically, two threshold values were used to divide the training data into the three classes and to assign labels to evaluation and test data. This labeling strategy labels small movements as neutral and uses only the significant movements as directional labels.
As shown in Figure 3, although the price tends to go up eventually, there exist frequent stock market crashes. To ensure a strategy is profitable, it is important to keep drawdown at a minimum level. Therefore, we should determine whether models are effective even in a highly volatile period. For this purpose, we divided our entire dataset into 8 smaller datasets that went through different phases, following previous work. Each phase consists of 250 days of training, 50 days of evaluation, and 100 days of testing.
For all the models in our experiments, we used a 50-day lookback period. As we used only the price change ratio as our input feature, the length of the input vector is 50. We used LSTM as a feature extraction module for individual stock prediction, and GRU for the index movement prediction task. We optimized all the models using the Adam optimizer, and tuned the hyper-parameters for each model within a certain range. Specifically, we used a learning rate between 1e-3 and 1e-5, weight decay between 1e-4 and 1e-5, and dropout between 0.1 and 0.9. ReLU was used as our activation function. We measured the performance of the models on the evaluation set for each period. We performed early stopping based on F1 score. As the results of the stock prediction task tend to vary widely, all the experiments in this work were repeated five times. The results were averaged to obtain the numbers reported in the tables.
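The two-threshold labeling scheme can be sketched as a small function. The threshold values used below are placeholders chosen only for the example, not the values used in the experiments, and the function name is hypothetical.

```python
def label_movement(rate, lower, upper):
    """Map a price change rate to one of the classes up / neutral / down."""
    if rate > upper:
        return "up"
    if rate < lower:
        return "down"
    return "neutral"


# Placeholder thresholds (assumed for illustration only).
LOWER, UPPER = -0.005, 0.005
labels = [label_movement(r, LOWER, UPPER) for r in [0.02, 0.001, -0.03]]
```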
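One way to lay out the 8 phases over the 1174 trading days is a sliding window, sketched below. The shift between consecutive phases is an assumption: the sketch slides by the length of the test period (100 days), which the text does not state explicitly, and it ignores the extra history needed for the lookback window.

```python
def make_phases(n_days, n_phases=8, train=250, val=50, test=100):
    """Return index ranges (start, end) for train / eval / test in each phase.

    Assumption: consecutive phases are shifted by the test length.
    Ranges are half-open, i.e. they cover days start .. end - 1.
    """
    phases = []
    for k in range(n_phases):
        start = k * test
        end = start + train + val + test
        if end > n_days:
            raise ValueError("not enough trading days for the requested phases")
        phases.append({
            "train": (start, start + train),
            "eval": (start + train, start + train + val),
            "test": (start + train + val, end),
        })
    return phases


phases = make_phases(1174)
```

Under this assumption the last phase ends at day 1100, which fits within the 1174 available trading days.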
To measure the profitability of the models, we used a trading strategy based on movement prediction. Following Fischer and Krauss, we made a neutralized portfolio based on the prediction values obtained by the models. Since there are three classes, the prediction vectors from all the models are three dimensional. Values of each dimension represent the predicted probability of each class. We selected the 15 companies with the highest up class probability and took a long position in them. For the 15 companies with the highest down class probability, a short position was taken. This method is widely used when creating simple trading strategies for prediction models. We implemented our model in TensorFlow. Our source code and data are available at https://github.com/to-be-done.
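The portfolio construction step can be sketched as follows. The function name, the ticker symbols, and the class order (up, neutral, down) within each probability vector are assumptions for illustration.

```python
def build_portfolio(probs, k=15):
    """Pick k stocks to long (highest up probability) and k to short (highest down).

    probs: dict mapping ticker -> (p_up, p_neutral, p_down)
    """
    by_up = sorted(probs, key=lambda s: probs[s][0], reverse=True)
    by_down = sorted(probs, key=lambda s: probs[s][2], reverse=True)
    return by_up[:k], by_down[:k]


# Toy predictions for three hypothetical tickers.
probs = {
    "AAA": (0.7, 0.2, 0.1),
    "BBB": (0.2, 0.3, 0.5),
    "CCC": (0.4, 0.4, 0.2),
}
longs, shorts = build_portfolio(probs, k=1)
```

With a large universe and k = 15, the long and short legs contain the same number of stocks, which is what keeps the portfolio neutralized.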
We evaluated our models based on profitability and classification. In general, creating profitable trading strategies is the ultimate goal of stock movement prediction. Using the trading strategy mentioned above, we used two metrics to calculate profitability.
Return We calculated the return of our portfolio as follows.
$$\mathrm{Return}^{t} = \sum_{i \in F^{t-1}} \frac{p_i^{t} - p_i^{t-1}}{p_i^{t-1}} \cdot (-1)^{\mathrm{Action}_i^{t-1}}$$
where $F^{t-1}$ denotes the set of companies included in the portfolio at time $t-1$, and $p_i^{t}$ denotes the price of stock $i$ at time $t$. $\mathrm{Action}_i^{t-1}$ is a binary value, either 0 or 1. It is 0 if the long position is taken at time $t-1$ for stock $i$; otherwise, it is 1.
Sharpe Ratio The Sharpe ratio is used to measure the performance of an investment compared to its risk. The ratio calculates the earned return in excess of the risk-free rate per unit of volatility (risk) as follows:
$$\mathrm{Sharpe}_a = \frac{E[R_a - R_f]}{\mathrm{std}[R_a - R_f]}$$
where $R_a$ denotes an asset return and $R_f$ is the risk-free rate. We used the 13-week Treasury bill to calculate the risk-free rates.
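The two profitability metrics can be sketched in plain Python. The function names are hypothetical, long positions are encoded as 0 and short positions as 1, and the Sharpe ratio below uses the population standard deviation without annualization, which are assumptions of this sketch.

```python
import statistics


def portfolio_return(prev_prices, prices, actions):
    """Sum of signed price change rates over the stocks held.

    actions: dict ticker -> 0 (long) or 1 (short); the sign is (-1) ** action.
    """
    total = 0.0
    for stock, action in actions.items():
        change = (prices[stock] - prev_prices[stock]) / prev_prices[stock]
        total += change * (-1) ** action
    return total


def sharpe_ratio(returns, risk_free=0.0):
    """Mean excess return divided by the standard deviation of excess returns.

    Raises ZeroDivisionError if all excess returns are identical.
    """
    excess = [r - risk_free for r in returns]
    return statistics.mean(excess) / statistics.pstdev(excess)


# Toy example: long A (+10%), short B (-10%), so both legs gain.
day_return = portfolio_return(
    {"A": 100.0, "B": 50.0}, {"A": 110.0, "B": 45.0}, {"A": 0, "B": 1}
)
ratio = sharpe_ratio([0.01, 0.03])
```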
As price movement prediction is a special type of classification task, we used metrics widely used in classification tasks.
Accuracy and F1-Score These two metrics are the most widely used for measuring classification performance.
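Each prediction can be labeled as True Positive (TP), True Negative (TN), False Positive (FP), or False Negative (FN). Accuracy and F1-Score are calculated as follows.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{F1} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
where $\mathrm{Precision} = TP / (TP + FP)$ and $\mathrm{Recall} = TP / (TP + FN)$.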
After calculating the F1-score of each class, we averaged all the scores to obtain the macro F1-score.
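The macro F1-score for the three movement classes can be computed as in the following sketch. The function name and the toy labels are assumptions; a class with no true or predicted samples is assigned an F1 of 0 here.

```python
def macro_f1(y_true, y_pred, classes=("up", "neutral", "down")):
    """Average of per-class F1 scores, each class treated one-vs-rest."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground truth."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy labels (hypothetical).
y_true = ["up", "up", "down", "neutral"]
y_pred = ["up", "down", "down", "neutral"]
score = macro_f1(y_true, y_pred)
```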
We conducted experiments on the following baseline models and describe the architecture of each below. We tried different combinations of architectures and found that deeper structures generally suffer from overfitting.
Baselines without the relational modeling module.
MLP Basic Multi-Layer Perceptron model. We used an MLP consisting of 2 hidden layers with 16 and 8 hidden dimensions, respectively, and 1 prediction layer.
CNN We used a Convolutional Neural Network as it is known to be fast and as effective as RNN-based models in time series modeling. In our experiments, we used a 4-layer CNN with 2 convolution and 2 pooling operations. The two convolutional layers have filter sizes of 32 and 8, respectively, and 5 kernels are used for each layer.
LSTM Long Short-Term Memory is one of the most powerful deep learning models for time series forecasting. Many previous works have proven the effectiveness of LSTM. We used an LSTM network with 2 layers and a hidden size of 128. To train LSTM, we used the RMSProp optimizer which is known to be suitable for RNN-based models.
Baselines with the relational modeling module.
GCN-Top20 We used the same GCN model, but created the adjacency matrix using edges from only the top 20 types of relations in the experiment described in Section 5.2. Only relations that are manually selected for stock market prediction are included in the adjacency matrix. By comparing GCN-Top20 with vanilla GCN, we analyzed the effect of using relations on stock market prediction performance.
TGC Temporal Graph Convolution module for Relational modeling. Feng et al. proposed a general module for stock prediction. This module assigns values to the neighboring nodes of the target company based on the current state of the company and the relations between the nodes and the company. TGC aggregates all the information of a target company from its neighboring nodes while our HATS model summarizes information on different relation types.
As mentioned in Section 3.1, for all models with a relational modeling module, LSTM is used as a feature extraction module in the individual stock prediction task. In the index movement prediction task, GRU is used as a feature extraction module. The simpler design of GRU makes it easier to train and helps obtain consistent results with deeper model architecture. Therefore, we used GRU as a feature extraction module for all models with the relational modeling module in the index movement prediction task.
Table 1: Relation types and their F1 scores on the test set of Phase 4.
| Relation type | F1 score |
| Industry-Product or material produced | 0.3251 |
| Parent organization-Owner of | 0.325 |
| Founded by-Founded by | 0.3245 |
| Complies with-Complies with | 0.3242 |
| Owner of-Parent organization | 0.3241 |
| Legal form-Instance of | 0.311 |
| Instance of-Legal form | 0.3082 |
| Location of formation-Country | 0.307 |
| Country-Location of formation | 0.3053 |
| Country of origin-Country | 0.2948 |
| Country-Country of origin | 0.2886 |
| Country-Country of origin | 0.2851 |
| Instance of-Instance of | 0.2748 |
| Stock Exchange-Stock Exchange | 0.2665 |
[Table 2: classification results for Phase 1 to Phase 8 and their average; the table values are not available in this text.]
5.2 Analysis of the effect of using relation data
We first conducted experiments to investigate the impact of using different types of relations for stock market prediction. The experiments were performed on the individual stock prediction task. To measure the effect of different relations, we used a basic GCN model that cannot distinguish the types of relations. Following Kipf and Welling, we used a GCN with two convolution layers and one prediction layer, which is defined as follows:
where $\hat{A} = \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$. Here, $\tilde{A}$ is an adjacency matrix with added self-connections, and $\tilde{D}$ is the degree matrix of $\tilde{A}$. Therefore, changing the relation type changes the adjacency matrix that is fed into the GCN. We list the 10 best and 10 worst relations and their F1 scores on the test set of Phase 4 in Table 1.
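As a rough sketch of such a model, assuming the standard propagation rule of Kipf and Welling with symmetric normalization of the self-connected adjacency matrix, the following pure-Python code computes the normalized adjacency matrix and runs two graph convolution layers followed by a softmax prediction layer. The toy graph, the feature values, and the weight matrices `w0`, `w1`, and `w_out` are hypothetical and untrained; they only illustrate the shapes involved.

```python
import math

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix without self-loops."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_tilde]  # degrees of the self-connected graph
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def relu(m):
    return [[max(0.0, v) for v in row] for row in m]

def softmax_rows(m):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def gcn_forward(adj, x, w0, w1, w_out):
    """Two graph convolution layers followed by a dense prediction layer."""
    a_hat = normalize_adjacency(adj)
    h1 = relu(matmul(matmul(a_hat, x), w0))
    h2 = relu(matmul(matmul(a_hat, h1), w1))
    return softmax_rows(matmul(h2, w_out))

# Hypothetical toy graph: three companies connected in a chain 0 - 1 - 2.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
x = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1]]  # two input features per company
w0 = [[0.5, -0.3, 0.2, 0.1], [0.1, 0.4, -0.2, 0.3]]
w1 = [[0.2, -0.1, 0.3, 0.0], [0.1, 0.2, -0.3, 0.4],
      [-0.2, 0.3, 0.1, 0.1], [0.3, 0.0, 0.2, -0.1]]
w_out = [[0.3, -0.2, 0.1], [0.0, 0.4, -0.1], [0.2, 0.1, -0.3], [-0.1, 0.2, 0.3]]

a_hat = normalize_adjacency(adj)
probs = gcn_forward(adj, x, w0, w1, w_out)  # one row of class probabilities per company
```

Swapping the relation type only changes `adj`; the rest of the computation stays the same, which is why this model cannot distinguish relation types.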
Our key findings are as follows.
Using relation data does not always yield good results in stock market prediction. In our worst cases, using relation data significantly decreased performance. On the other hand, some relation information proved to be helpful in prediction. The best performance is 6% higher than the worst performance.
Densely connected networks usually have noise. We confirmed this while analyzing the characteristics of the best and worst relations. Although the number of relations does not affect performance the most, less semantically meaningful relations such as country and stock exchange have very dense networks. Intuitively, densely connected networks carry a considerable amount of noise, which adds irrelevant information to the representations of target nodes.
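To make the notion of a dense network concrete, the sketch below computes the edge density of a relation graph, that is, the fraction of all possible company pairs that are connected. The company names and the two edge lists are hypothetical; they only mimic the contrast between a country-style relation shared by most companies and an ownership-style relation shared by few.

```python
from itertools import combinations

def edge_density(num_nodes, edges):
    """Fraction of possible undirected company pairs that are connected."""
    undirected = {tuple(sorted(e)) for e in edges if e[0] != e[1]}
    possible = num_nodes * (num_nodes - 1) // 2
    return len(undirected) / possible if possible else 0.0

companies = ["A", "B", "C", "D", "E"]          # hypothetical companies

# Four of the five companies share a country, so they form a clique.
country_edges = list(combinations("ABCD", 2))
# Only one pair of companies is linked through ownership.
ownership_edges = [("A", "B")]

country_density = edge_density(len(companies), country_edges)
ownership_density = edge_density(len(companies), ownership_edges)
```

In this toy setting the country graph connects 6 of 10 possible pairs while the ownership graph connects 1 of 10, so every node in the country graph aggregates from many neighbors regardless of how related they are.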
Manually finding optimal relations is laborious. Although semantically meaningful relations generally help improve performance, selecting such relations requires much work and expertise.
Based on the above findings, we can conclude that relational information should be selectively chosen when using it for stock market prediction. Furthermore, the framework should be designed to automatically select useful information to minimize the need for manual feature engineering. We conducted experiments on two different tasks to verify the effectiveness of different relational modeling approaches.
5.3 Individual Stock Prediction
[Table 3: average daily return (%) and annualized Sharpe ratio for Phase 1 to Phase 8 and their average; the table values are not available in this text.]
The classification accuracy results of the experiments on individual stock market prediction are summarized in Table 2. Among the baselines without a relational modeling module, LSTM generally performs the best. Therefore, we compare the results of the models with a relational modeling module against the result of LSTM. In terms of accuracy, all models with a relational modeling module performed better than LSTM. However, not all relational models outperformed LSTM in terms of F1 score. As shown in Table 2, only GCN-Top20 and HATS achieved higher F1 scores. Interestingly, GCN and TGC, both of which obtained lower F1 scores than LSTM, achieved the best accuracy. GCN and TGC tend to make predictions biased toward a specific class, which yields higher accuracy but lower F1 scores. On the other hand, GCN-Top20 and HATS obtained slightly lower accuracy than the two other relational module baselines but higher F1 scores.
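The gap between accuracy and F1 described above can be reproduced on a toy example. The sketch below assumes macro-averaged F1 over three up/neutral/down labels (the averaging scheme is an assumption, not stated in this section) and uses made-up labels and predictions rather than any results from the experiments.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(classes)

classes = ["up", "neutral", "down"]
# Hypothetical, imbalanced ground truth: neutral is the majority class.
y_true = ["up"] * 3 + ["neutral"] * 5 + ["down"] * 2

# A biased model that always predicts the majority class.
biased = ["neutral"] * 10
# A model that spreads its predictions over all three classes.
spread = ["up", "up", "down", "neutral", "up",
          "down", "up", "down", "down", "neutral"]

biased_acc, biased_f1 = accuracy(y_true, biased), macro_f1(y_true, biased, classes)
spread_acc, spread_f1 = accuracy(y_true, spread), macro_f1(y_true, spread, classes)
```

Here the biased predictor reaches the higher accuracy (0.5 versus 0.4) but the lower macro F1, because it scores zero on the two classes it never predicts.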
Selectively aggregating information from different relations can help improve F1 scores. Although TGC performed better than vanilla GCN, TGC was outperformed by GCN-Top20 which was trained on manually selected relational data. In contrast, our proposed model HATS generally outperformed all the baselines in terms of F1 score. These results are consistent with the profitability test results which are provided in the following subsection.
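The idea of selectively aggregating information across relation types can be illustrated with a small attention-style sketch. This is not the HATS architecture, which is defined earlier in the paper; it is a simplified stand-in that assumes each relation type is summarized by the mean of its neighbor vectors and that relation weights come from a softmax over dot products with the target vector. All vectors and relation names below are hypothetical.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]

def softmax(scores):
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_by_relation(target, neighbors_by_relation):
    """Summarize each relation type, then combine the summaries with attention weights."""
    relations = sorted(neighbors_by_relation)
    summaries = [mean_vector(neighbors_by_relation[r]) for r in relations]
    # Assumed scoring function: dot product between target and relation summary.
    scores = [sum(t * s for t, s in zip(target, summary)) for summary in summaries]
    weights = softmax(scores)
    combined = [sum(w * s[k] for w, s in zip(weights, summaries))
                for k in range(len(target))]
    return dict(zip(relations, weights)), combined

target = [1.0, 0.0]  # hypothetical state of the target company
neighbors_by_relation = {
    "parent_organization": [[0.9, 0.1], [1.1, -0.1]],
    "country": [[-0.2, 0.5], [0.1, -0.4], [0.0, 0.2]],
}
weights, combined = aggregate_by_relation(target, neighbors_by_relation)
```

With these made-up inputs, the relation whose neighbors resemble the target receives the larger weight, whereas a plain average over all neighbors would treat both relation types alike.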
The individual stock prediction results on the profitability test are summarized in Table 3. We calculated the daily returns of the neutralized portfolio made using the strategy discussed in Section 5.1, and averaged them for each period. On average, GCN-Top20 and HATS obtained the highest average daily return. As mentioned above, GCN-Top20 and HATS outperformed GCN and TGC in terms of F1 score. TGC performed better than vanilla GCN but worse than LSTM. Surprisingly, the Sharpe ratio of GCN-Top20 was lower than that of LSTM. Without even calculating the Sharpe ratio, we can see in Table 3 that the expected return results of GCN-Top20 have large variance, which may be attributed to GCN-Top20 using relational data statically. Although relations used for GCN-Top20 are manually selected and expected to improve stock prediction, fixed relations may be useful only in a specific market condition. As GCN cannot assign importance to neighboring nodes based on the market condition and current state of a given node, its results vary widely. By selecting useful information based on the market situation, our HATS model obtains good performance in terms of expected return and Sharpe ratio.
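The relationship between average return, variance, and Sharpe ratio can be made concrete with a short sketch. It assumes a zero risk-free rate and 252 trading days per year for annualization, neither of which is stated in this section, and the two return series are invented for illustration.

```python
import math
import statistics

TRADING_DAYS = 252  # assumed number of trading days per year

def average_daily_return(daily_returns):
    return statistics.mean(daily_returns)

def annualized_sharpe(daily_returns):
    """Mean over sample standard deviation of daily returns, scaled by sqrt(252).

    Assumes a zero risk-free rate.
    """
    sd = statistics.stdev(daily_returns)
    if sd == 0:
        return float("inf") if statistics.mean(daily_returns) > 0 else 0.0
    return statistics.mean(daily_returns) / sd * math.sqrt(TRADING_DAYS)

# Two hypothetical strategies with the same average daily return of 0.1%.
steady = [0.002, -0.001, 0.003, 0.000, 0.001]
volatile = [0.010, -0.008, 0.012, -0.009, 0.000]

steady_sharpe = annualized_sharpe(steady)
volatile_sharpe = annualized_sharpe(volatile)
```

Both series have the same mean, but the volatile one has a much lower Sharpe ratio, which mirrors how a model can post a high average daily return and still rank below a steadier baseline.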
5.4 Market Index Prediction
As mentioned in Section 4, we gathered price and relational data for 431 companies listed in the S&P 500. There exist 9 different market indices, each representing an industrial sector. We removed four indices with fewer than 20 constituent companies, which leaves five market indices: S5CONS (S&P 500 Consumer Staples Index), S5FINL (S&P 500 Financials Index), S5INFT (S&P 500 Information Technology Index), S5ENRS (S&P 500 Energy Index), and S5UTIL (S&P 500 Utilities Index). As the graph of constituent companies is already sparse, we do not use GCN-Top20 as a baseline. The results are summarized in Table 5.
Due to space constraints, we provide only the averaged results for each index in Table 5. The experimental results of each phase are provided in the appendix. Furthermore, we did not measure the profitability performance of a neutralized portfolio on the market index prediction task, as it is not reasonable to construct a neutralized portfolio with only five assets as our portfolio selection universe.
On average, models with a relational modeling module outperformed LSTM on the market index prediction task. However, HATS is the only model that achieved significantly better performance than LSTM in terms of F1 score and accuracy. GCN performed slightly better than LSTM and TGC performed worse than LSTM in terms of F1 score. As we used the same pooling operation for all the models, the differences in performance can be mainly attributed to their relational modeling module. This again proves that HATS is effective in learning node representations for a given task. On the market index prediction task, HATS outperforms all the baselines in terms of F1 score and accuracy on average.
Unexpectedly, the other baselines with the relational modeling module did not perform significantly better than LSTM. The baselines cannot easily select information from different relation types and they use a naive structure to obtain graph representations. Many graph pooling methods, such as those of Ying et al. [2018b] and Lee et al. [2019], have already been proposed for learning graph representations and have proven to be more effective in many different tasks. We expect that more advanced pooling methods will further improve performance on the market index prediction task.
5.5 Case Study
Relation attention scores
In this section, we conduct two case studies to further analyze the decision-making mechanism of HATS. As previously mentioned, HATS is designed to gather information from only useful relations. For our first case study, we calculated the attention score of each relation. By analyzing the relation types with the highest and lowest attention scores, we can understand what types of relations are considered to be important. Fig. 5 shows a visualization of the attention scores of all the relations. We calculated the average attention scores on the test sets from all the phases and selected 20 relations with the highest attention scores and 10 relations with the lowest attention scores. The visualization shown in Fig. 5 is based on the average scores calculated in each test phase. As shown in Fig. 5, the relations with the highest attention scores are mostly dominant-subordinate relationships such as parent organization-subsidiary relationships. Some relations with the highest scores represent industrial dependencies. On the other hand, most of the relations with the lowest attention scores are geographical features.
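The selection step described above (average the attention score of each relation over the test phases, then keep the highest and lowest scoring relations) can be sketched as follows. The scores are hypothetical placeholders, and the sketch assumes one dictionary of relation scores per phase.

```python
def average_scores(per_phase_scores):
    """Average each relation's attention score over all phases in which it appears."""
    collected = {}
    for phase in per_phase_scores:
        for relation, score in phase.items():
            collected.setdefault(relation, []).append(score)
    return {relation: sum(v) / len(v) for relation, v in collected.items()}

def top_and_bottom(avg_scores, n_top, n_bottom):
    """Return the n_top highest and n_bottom lowest relations by average score."""
    ranked = sorted(avg_scores, key=avg_scores.get, reverse=True)
    return ranked[:n_top], ranked[-n_bottom:]

# Hypothetical attention scores for two test phases.
per_phase_scores = [
    {"Parent organization-Owner of": 0.30, "Industry-Industry": 0.20, "Country-Country": 0.05},
    {"Parent organization-Owner of": 0.34, "Industry-Industry": 0.18, "Country-Country": 0.07},
]

avg = average_scores(per_phase_scores)
top, bottom = top_and_bottom(avg, n_top=1, n_bottom=1)
```

For the study described in the text, `n_top` would be 20 and `n_bottom` would be 10.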
In studies on graph neural network (GNN) methods, researchers are interested in the representations obtained by the GNN. We present a visualization of the node representations obtained by HATS in Fig. 6. We obtain the representations of all companies on a specific day and use the t-SNE algorithm to map each representation to a two-dimensional space. In Figure 6(a), the movement of a stock on a given day is denoted by one of three colors, which represent the up/neutral/down labels we used in our experiment. In Figure 6(b), industries of companies are denoted by different colors. We can find a rough line that separates companies with up labels from companies with down labels in Figure 6(a). It is also interesting that representations of the neutral movement are widely spread. In Figure 6(b), companies in the same industry form groups of clusters. We can find these clusters in any time phase. Although the prices of two stocks in the same industry do not always move in the same direction, the clusters in Figure 6(b) show that HATS learned meaningful representations.
In this work, we proposed HATS, a model that uses relational data in stock market prediction. HATS is designed to selectively aggregate information on different relation types to learn useful node representations. We evaluated HATS on two graph-related tasks: predicting individual stock prices and predicting market index movement. The experimental results demonstrate the importance of using proper relational data and show that prediction performance can change dramatically depending on the relation type. The results also show that HATS, which automatically selects the information to use, outperformed all the existing models.
There exist many possibilities for future research. First, finding a more effective way to construct a corporate network is an important research objective that could be the focus of future studies. In this study, we define the neighborhood of a company as a cluster of companies connected by direct edges or meta-paths with at most 2 hops. However, the way in which we define it could be improved. Furthermore, we used a single database (WikiData) to create a company network. In future work, we could use another source of data and we could even create knowledge graphs from unstructured text of various sources. Applying more advanced pooling methods to obtain graph representations could improve the overall performance of GNN methods on the market index prediction task.
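One way to picture the neighborhood definition above is to build one edge set per relation type from knowledge-graph triples. The sketch below is an assumption-laden illustration, not the construction used in the paper: it treats a 2-hop meta-path as two companies pointing at the same intermediate entity, names it by joining the two relation labels, and uses invented tickers and triples.

```python
from collections import defaultdict

def build_relation_graphs(triples, companies):
    """Group company-to-company edges by relation type.

    Direct edges: (company, relation, company) triples.
    2-hop meta-paths (assumed form): company -r1-> entity <-r2- company,
    stored under the combined name "r1-r2".
    """
    graphs = defaultdict(set)
    pointing_at = defaultdict(list)  # intermediate entity -> [(relation, company)]
    for head, relation, tail in triples:
        if head in companies and tail in companies:
            graphs[relation].add((head, tail))
        elif head in companies:
            pointing_at[tail].append((relation, head))
    for links in pointing_at.values():
        for r1, c1 in links:
            for r2, c2 in links:
                if c1 != c2:
                    graphs[f"{r1}-{r2}"].add((c1, c2))
    return graphs

companies = {"AAA", "BBB", "CCC"}  # hypothetical tickers
triples = [
    ("AAA", "Owner of", "CCC"),
    ("AAA", "Industry", "semiconductors"),
    ("BBB", "Product or material produced", "semiconductors"),
    ("AAA", "Country", "US"),
    ("BBB", "Country", "US"),
    ("CCC", "Country", "US"),
]
graphs = build_relation_graphs(triples, companies)
```

Even in this tiny example, the country meta-path links every pair of companies, while the ownership and industry-product relations each contribute a single pair.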
This work was supported by the National Research Foundation of Korea (NRF-2017R1A2A1A17069645, NRF-2017M3C4A7065887).
- Adebiyi et al.  Adebiyi, A.A., Adewumi, A.O., Ayo, C.K., 2014. Comparison of arima and artificial neural networks models for stock price prediction. Journal of Applied Mathematics 2014.
- Agrawal et al.  Agrawal, J., Chourasia, V., Mittra, A., 2013. State-of-the-art in stock prediction techniques. International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering 2, 1360–1366.
- Bao et al.  Bao, W., Yue, J., Rao, Y., 2017. A deep learning framework for financial time series using stacked autoencoders and long-short term memory. PloS one 12, e0180944.
- Bollen et al.  Bollen, J., Mao, H., Zeng, X., 2011. Twitter mood predicts the stock market. Journal of computational science 2, 1–8.
- Bollerslev et al.  Bollerslev, T., Marrone, J., Xu, L., Zhou, H., 2014. Stock return predictability and variance risk premia: statistical inference and international evidence. Journal of Financial and Quantitative Analysis 49, 633–661.
- Chen et al. [2018a] Chen, J., Ma, T., Xiao, C., 2018a. FastGCN: Fast learning with graph convolutional networks via importance sampling, in: International Conference on Learning Representations.
- Chen et al. [2018b] Chen, Y., Wei, Z., Huang, X., 2018b. Incorporating corporation relationship via graph convolutional neural networks for stock price prediction, in: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, ACM. pp. 1655–1658.
- Dechow et al.  Dechow, P.M., Hutton, A.P., Meulbroek, L., Sloan, R.G., 2001. Short-sellers, fundamental analysis, and stock returns. Journal of Financial Economics 61, 77–106.
- Dempster et al.  Dempster, M.A., Payne, T.W., Romahi, Y., Thompson, G.W., 2001. Computational learning techniques for intraday fx trading using popular technical indicators. IEEE Transactions on neural networks 12, 744–754.
- Ding et al.  Ding, X., Zhang, Y., Liu, T., Duan, J., 2015. Deep learning for event-driven stock prediction, in: Twenty-Fourth International Joint Conference on Artificial Intelligence.
- Ding et al.  Ding, X., Zhang, Y., Liu, T., Duan, J., 2016. Knowledge-driven event embedding for stock prediction, in: Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pp. 2133–2142.
- Dong et al.  Dong, Y., Chawla, N.V., Swami, A., 2017. metapath2vec: Scalable representation learning for heterogeneous networks, in: Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, ACM. pp. 135–144.
- Feng et al.  Feng, F., He, X., Wang, X., Luo, C., Liu, Y., Chua, T.S., 2019. Temporal relational ranking for stock prediction. ACM Transactions on Information Systems (TOIS) 37, 27.
- Fischer and Krauss  Fischer, T., Krauss, C., 2018. Deep learning with long short-term memory networks for financial market predictions. European Journal of Operational Research 270, 654–669.
- Gilmer et al.  Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E., 2017. Neural message passing for quantum chemistry, in: Proceedings of the 34th International Conference on Machine Learning-Volume 70, JMLR. org. pp. 1263–1272.
- Hamilton et al.  Hamilton, W., Ying, Z., Leskovec, J., 2017. Inductive representation learning on large graphs, in: Advances in Neural Information Processing Systems, pp. 1024–1034.
- Kipf and Welling  Kipf, T.N., Welling, M., 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 .
- Krizhevsky et al.  Krizhevsky, A., Sutskever, I., Hinton, G.E., 2012. Imagenet classification with deep convolutional neural networks, in: Advances in neural information processing systems, pp. 1097–1105.
- Lee et al.  Lee, J., Lee, I., Kang, J., 2019. Self-attention graph pooling. arXiv preprint arXiv:1904.08082 .
- Li et al.  Li, X., Xie, H., Chen, L., Wang, J., Deng, X., 2014. News impact on stock price return via sentiment analysis. Knowledge-Based Systems 69, 14–23.
- Malkiel  Malkiel, B.G., 2003. The efficient market hypothesis and its critics. Journal of economic perspectives 17, 59–82.
- Patel et al.  Patel, J., Shah, S., Thakkar, P., Kotecha, K., 2015. Predicting stock and stock price index movement using trend deterministic data preparation and machine learning techniques. Expert Systems with Applications 42, 259–268.
- Phan et al.  Phan, D.H.B., Sharma, S.S., Narayan, P.K., 2015. Stock return forecasting: some new evidence. International Review of Financial Analysis 40, 38–51.
- Rather et al.  Rather, A.M., Agarwal, A., Sastry, V., 2015. Recurrent neural network and a hybrid model for prediction of stock returns. Expert Systems with Applications 42, 3234–3241.
- Veličković et al.  Veličković, P., Cucurull, G., Casanova, A., Romero, A., Lio, P., Bengio, Y., 2017. Graph attention networks. arXiv preprint arXiv:1710.10903 .
- Vrandečić and Krötzsch  Vrandečić, D., Krötzsch, M., 2014. Wikidata: a free collaborative knowledge base .
- Ying et al. [2018a] Ying, R., He, R., Chen, K., Eksombatchai, P., Hamilton, W.L., Leskovec, J., 2018a. Graph convolutional neural networks for web-scale recommender systems, in: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 974–983.
- Ying et al. [2018b] Ying, Z., You, J., Morris, C., Ren, X., Hamilton, W., Leskovec, J., 2018b. Hierarchical graph representation learning with differentiable pooling, in: Advances in Neural Information Processing Systems, pp. 4800–4810.
- Zhang and Chen  Zhang, M., Chen, Y., 2018. Link prediction based on graph neural networks, in: Advances in Neural Information Processing Systems, pp. 5165–5175.
Appendix A Appendix
|P17||Country||sovereign state of this item; don’t use on humans|
|P112||Founded by||founder or co-founder of this organization, religion or place|
|P121||Item operated||equipment, installation or service operated by the subject|
|P127||Owned by||owner of the subject|
|the item is located on the territory of the following administrative entity.|
|P159||Headquarters location||specific location where an organization’s headquarters is or has been situated.|
|P166||Award received||award or recognition received by a person, organisation or creative work|
|P169||Chief executive officer||highest-ranking corporate officer appointed as the CEO within an organization|
|P176||Manufacturer||manufacturer or producer of this product|
|P355||Subsidiary||subsidiary of a company or organization, opposite of parent organization|
|P361||Part of||object of which the subject is a part|
|P414||Stock Exchange||exchange on which this company is traded|
|P452||Industry||industry of company or organization|
|P463||Member of||organization or club to which the subject belongs|
|P488||Chairperson||presiding member of an organization, group or body|
|P495||Country of origin||country of origin of this item (creative work, food, phrase, product, etc.)|
|P625||Coordinate location||geocoordinates of the subject.|
|P740||Location of formation||location where a group or organization was formed|
|P749||Parent organization||parent organization of an organisation, opposite of subsidiaries (P355)|
|P793||significant event||significant or notable events associated with the subject|
|P1343||Described by source||dictionary, encyclopaedia, etc. where this item is described|
|P1344||Participant of||event a person or an organization was/is a participant in,|
|P1454||Legal form||legal form of an organization|
|P1552||Has quality||the entity has an inherent or distinguishing non-material characteristic|
|P1830||Owner of||entities owned by the subject|
|P1889||Different from||item that is different from another item, with which it is often confused|
|P3320||Board member||member(s) of the board for the organization|
|P5009||Complies with||the product or work complies with a certain norm or passes a test|
|collection that have works of this artist|
|Relation Index||Relation Combination (Code)||Relation Index||Relation Combination (Code)|
|Phase 1||Phase 2||Phase 3||Phase 4||Phase 5||Phase 6||Phase 7||Average|
|Phase 1||Phase 2||Phase 3||Phase 4||Phase 5||Phase 6||Phase 7||Phase 8||Phase 9||Average|