1 Introduction
Graphs are a versatile and succinct way to describe entities through their relationships. The information contained in many knowledge graphs (KG) has been used in several machine learning tasks in natural language understanding Peters et al. (2019); Li et al. (2017), and recommendation systems Wang et al. (2019). The same information can also be used to expand the graph itself via node classification Grover and Leskovec (2016), clustering Perozzi et al. (2014), or link prediction tasks Kazemi and Poole (2018), in both transductive and inductive settings Kipf and Welling (2016).

Recent graph models have focused on learning dense representations that capture the properties of a node and its neighbours. One class of methods generates spectral representations for nodes Bruna et al. (2013); Defferrard et al. (2016). The rigidity of this approach may reduce the adaptability of a model to graphs with structural differences.
Model architectures that reduce the neighbourhood of a node have used pooling Hamilton et al. (2017), convolutions Duvenaud et al. (2015), recurrent neural networks (RNN) Li et al. (2015), and attention Veličković et al. (2017). These approaches often require many computationally expensive message-passing iterations to generate representations for the neighbours of a target node. Sparse Graph Attention Networks (SGAT) Ye and Ji (2019) were proposed to address this inefficiency by producing an edge-sparsified graph. However, this may neglect the importance of local structure when graphs are large or have multiple edge types. Other methods have used objective functions to predict whether a node belongs to a neighbourhood Perozzi et al. (2014); Kazemi and Poole (2018), for example by using noise-contrastive estimation Gutmann and Hyvärinen (2012). However, incorporating additional training objectives into a downstream task can be difficult to optimize, leading to a multi-step training process.

In this work, we present a graph network architecture (GATAS) that can be easily integrated into any model and supports general graph types, such as cyclic, directed, and heterogeneous graphs. The method uses a self-attention mechanism over a multi-step neighbourhood sample, where the transition probability of a neighbour at a given step is parameterized.
We evaluate the proposed method on node classification tasks using the Cora, Citeseer and Pubmed citation networks in a transductive setting, and on a protein-protein interaction (PPI) dataset in an inductive setting. We also evaluate the method on a link prediction task using Twitter and YouTube datasets. Results show that the proposed graph network can achieve better or comparable performance to state-of-the-art architectures.
2 Related Work
The proposed architecture is related to GraphSAGE Hamilton et al. (2017), which also reduces neighbour representations from fixed-size samples. Instead of aggregating uniformly sampled 1-hop neighbours at each depth, we propose a single reduction of multi-step neighbours sampled from parameterized transition probabilities. Such parameterization is akin to the Graph Attention Model Abu-El-Haija et al. (2018), where trainable depth coefficients scale the transition probabilities from each step. Thus, the model can choose the depth of the neighbourhood samples. We further extend this approach to transition probabilities that account for paths with heterogeneous edge types.

We use an attention mechanism similar to the one in Graph Attention Networks (GAT) Veličković et al. (2017). While GAT reduces immediate neighbours iteratively to explore the graph structure in a breadth-first approach that processes all nodes and edges at each step, our method uses multi-step neighbourhood samples to explore the graph structure. Our method also allows each neighbour to have a different representation for the target node, rather than using a single representation as in GAT.
MoNet Monti et al. (2017) generalizes many graph convolutional networks (GCN) as attention mechanisms. More recently, the edge-enhanced graph neural network framework (EGNN) Gong and Cheng (2019) consolidates GCNs and GAT. In MoNet, the attention coefficient function only uses the structure of the nodes; our model, in addition, employs node representations to generate attention weights.
Other approaches use recurrent neural networks Scarselli et al. (2008); Li et al. (2015) to reduce path information between neighbours and generate node representations. The propagation algorithm in Gated Graph Neural Networks (GGNNs) Li et al. (2015) reduces neighbours one step at a time, in a breadth-first fashion. In contrast, we use a depth-first approach where a reduction operation is applied across edges of a path.
3 Model
3.1 Preliminaries
An initial graph is defined as $G' = (V, R', E')$, where $V$ is a set of nodes or vertices, $R'$ is a set of edge types or relations, and $E' \subseteq V \times R' \times V$ is a set of triplets. Directed graphs are represented by having one edge type for each direction, so that each relation $r \in R'$ has an inverse relation in $R'$. To be able to incorporate information about the target node, the graph is augmented with self-loops using a new edge type $r_0$. Thus, a new set of edge types is created such that $R = R' \cup \{r_0\}$, and a new set of triplets $E = E' \cup \{(v_i, r_0, v_i) : v_i \in V\}$ forms a new graph $G = (V, R, E)$.
An edge type path between nodes $v_i$ and $v_j$ is defined as a sequence $p = (r_1, \ldots, r_k)$ with $k \leq K$, where $K$ is the maximum number of steps considered. The number of all possible edge type paths is given by $1 + \sum_{k=1}^{K} |R'|^k$, counting the self-loop path. The set of all possible edge type paths is defined as $P$. The edge type sequences in the set are level-ordered, so that shorter paths precede longer ones, and paths of equal length are ordered lexicographically. As an example, if $R' = \{r_1, r_2\}$ and $K = 2$, then $P$ corresponds to: $[(r_0), (r_1), (r_2), (r_1, r_1), (r_1, r_2), (r_2, r_1), (r_2, r_2)]$. The subset of edge type paths connecting nodes $v_i$ and $v_j$ is defined as $P_{ij} \subseteq P$, where $(r_0)$ represents the extraneous self-loops.
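The level ordering above can be made concrete with a short sketch. The function below enumerates edge type paths in level order; the labels `r0`, `r1`, `r2` are illustrative placeholders, not names from the paper.

```python
from itertools import product

def enumerate_paths(edge_types, max_steps):
    """Enumerate all edge type paths in level order.

    `edge_types` are the original relations; the self-loop type "r0" is
    prepended as its own single-step path, mirroring the augmented graph
    described in Section 3.1. Shorter paths come before longer ones.
    """
    paths = [("r0",)]  # the extraneous self-loop path
    for k in range(1, max_steps + 1):
        # all paths of length k, ordered lexicographically within the level
        paths.extend(product(edge_types, repeat=k))
    return paths

paths = enumerate_paths(["r1", "r2"], max_steps=2)
```

With two edge types and $K = 2$ this yields the seven paths listed in the example above.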
3.2 Neighbour Representations
Graph relations in $R$ are represented by trainable vectors $t_r$. To reflect the position of an edge type in a path, the edge representations can be infused with positional information. Following Vaswani et al. (2017), we assign a sinusoid of a different wavelength to each dimension, each position being a different point:

$PE_{(pos, 2i)} = \sin\big(pos / 10000^{2i/d}\big), \qquad PE_{(pos, 2i+1)} = \cos\big(pos / 10000^{2i/d}\big)$

where $pos$ is the position in an edge type path and $i$ is the index of the dimension.
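A minimal sketch of this sinusoidal position encoding, using the standard Transformer base of 10000 (the paper may use a different base):

```python
import numpy as np

def position_encoding(position, dim):
    """Sinusoidal encoding of a path position, following Vaswani et al. (2017).

    Even dimensions use sine, odd dimensions cosine; each pair shares a
    wavelength that grows geometrically with the dimension index, so every
    position in an edge type path maps to a distinct point.
    """
    enc = np.zeros(dim)
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        enc[i] = np.sin(angle)
        if i + 1 < dim:
            enc[i + 1] = np.cos(angle)
    return enc
```

The resulting vector can be added to the edge type representation $t_r$ at that position.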
We represent nodes as a set of vectors $\{n_i : v_i \in V\}$. Each vector is defined by $n_i = [h_i \,\|\, e_i]$, the concatenation of trainable embedding representations $e_i$ and the features $h_i$ of the node. The learnable embedding representations can capture information to support a neighbourhood sample.
Given an edge type path $p \in P_{ij}$, we generate a neighbour representation $n_{j|p}$ using attention over transformed neighbour representations for each edge type in $p$:

$n_{j|p} = \sum_{k=1}^{|p|} \mathrm{softmax}_k\big(f(n_j, t_{r_k})\big)\, g(n_j, t_{r_k})$ (1)

where $f$ and $g$ are two different learnable transformations. The transformation given by $g$ allows neighbour representations to differ according to the edge type in the path. For the self-loop edge type path $(r_0)$, we set $n_{j|(r_0)} = n_j$.
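One plausible reading of this path-conditioned reduction can be sketched as follows. The transformations `W_score` and `W_value` stand in for the paper's two learnable transformations; their exact form (here, affine maps over the concatenated neighbour and edge vectors) is an assumption.

```python
import numpy as np

def softmax(x):
    x = x - x.max()  # numerical stability
    e = np.exp(x)
    return e / e.sum()

def neighbour_representation(n_j, path_edge_vecs, W_score, W_value):
    """Sketch of a path-conditioned neighbour representation (Section 3.2).

    Attends over the neighbour vector transformed once per edge type in the
    path. `W_score` produces a scalar logit per edge type; `W_value`
    produces the transformed value vector. Both are hypothetical shapes.
    """
    # one transformed view of the neighbour per edge type in the path
    values = np.stack([np.tanh(W_value @ np.concatenate([n_j, t]))
                       for t in path_edge_vecs])
    scores = np.array([W_score @ np.concatenate([n_j, t])
                       for t in path_edge_vecs])
    weights = softmax(scores)
    return weights @ values  # attention-weighted mix over path positions
```

The self-loop path would skip this computation entirely and return $n_j$ unchanged.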
3.3 Transition Tensors
We define transition probability distributions for neighbours and their possible edge type paths within $K$ steps. When there are multiple edge types connecting two nodes, their transition probabilities split. Thus, when computing transition probabilities for random walks starting at $v_i$, it is necessary to track the probability of an edge type path $p$ for each destination vertex $v_j$, effectively computing $p(v_j, p \mid v_i)$. Also, when performing random walks from a starting node $v_i$, we break cycles by disallowing transitions to nodes already visited in previous steps. This reduces $P_{ij}$ to the set of shortest edge type paths possible.

Let $A$ be a sparse adjacency tensor for $G$, where $A_{rij} = 1$ if $(v_i, r, v_j) \in E$ and $A_{rij} = 0$ otherwise. An initial transition tensor $T$ can be computed by normalizing the probability mass over each node's outgoing edges across all edge types:

$T_{rij} = A_{rij} \Big/ \sum_{r'} \sum_{v_u} A_{r' i u}$
Using the order in $P$ to obtain the probabilities for specific edge type paths, the unnormalized sparse transition tensor $\hat{T}^{(k)}$ for $k$ steps can be computed as follows:

$\hat{T}^{(k)}_{pij} = \begin{cases} 0 & \text{if } v_j \text{ is reachable from } v_i \text{ in fewer than } k \text{ steps} \\ \sum_{v_u} \hat{T}^{(k-1)}_{\mathrm{init}(p)\,iu}\; T_{\mathrm{last}(p)\,uj} & \text{otherwise} \end{cases}$ (2)

where $\mathrm{last}(p)$ specifies the edge type for the last step in path $p$, and $\mathrm{init}(p)$ is the edge type path without the last step in $p$. As an example, if $p = (r_1, r_2, r_2)$, then $\mathrm{last}(p)$ corresponds to $r_2$, and $\mathrm{init}(p)$ corresponds to $(r_1, r_2)$.

The conditional function in Equation 2 sets the transition probability to zero if the node $v_j$ can be reached from $v_i$ in fewer than $k$ steps. This procedure effectively breaks cycles and allows only the most relevant and shortest edge type paths to be sampled. A normalized transition tensor $T^{(k)}$ is obtained by normalizing $\hat{T}^{(k)}$ so that, for each source node, the probabilities over destinations and paths sum to one.
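The cycle-breaking recurrence can be illustrated with a simplified single-edge-type sketch, where the path index collapses and only step counts matter. This is an illustration of the idea, not the paper's sparse implementation.

```python
import numpy as np

def multistep_transitions(A, max_steps):
    """Per-step transition matrices with cycles broken (Section 3.3).

    Simplified to a single edge type: `A` is a binary adjacency matrix.
    Step-1 probabilities row-normalize A; at step k, probability mass may
    only reach nodes not reachable in fewer steps, so only shortest paths
    survive. Returns one row-stochastic matrix per step.
    """
    T1 = A / np.maximum(A.sum(axis=1, keepdims=True), 1)
    steps = [T1]
    reached = A > 0
    for _ in range(1, max_steps):
        Tk = steps[-1] @ T1
        Tk[reached] = 0.0          # zero out nodes reachable in fewer steps
        np.fill_diagonal(Tk, 0.0)  # never return to the start node
        row = Tk.sum(axis=1, keepdims=True)
        Tk = np.divide(Tk, row, out=np.zeros_like(Tk), where=row > 0)
        steps.append(Tk)
        reached |= Tk > 0
    return steps
```

On a 3-node path graph 0–1–2, the step-2 distribution from node 0 puts all mass on node 2, since node 1 is already reachable in one step.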
3.4 Neighbourhood Sampling
When considering a neighbourhood, it can be relevant to attend to nodes beyond the first-degree neighbourhood. However, as the number of hops between nodes increases, their relationship weakens. It can also be prohibitive to attend to all nodes within $K$ hops, as the neighbourhood size grows rapidly with $K$.
To overcome these complications, we create a fixed-size neighbourhood sample from an adjustable transition tensor $\hat{T}$. Similar to the work in Abu-El-Haija et al. (2018), we obtain neighbour probabilities by a linear combination of the random walk transition tensors $T^{(k)}$ for each step $k$, with learnable coefficients $c_k$:

$\hat{T} = \sum_{k=0}^{K} c_k\, T^{(k)}$

where $c = \mathrm{softmax}(\theta)$ and $\theta$ is a vector of unbounded parameters. $T^{(0)}$ corresponds to a transition tensor for the added self-loops: $T^{(0)}_{(r_0)ij} = 1$ if $i = j$, and $0$ otherwise.
Depending on the task and graph, the model can control the scope of the neighbourhood by adjusting these coefficients through backpropagation.
To generate a neighbourhood $N_i$ for $v_i$, we sample without replacement from $\hat{T}$ so that $|N_i| \leq S$, where $S$ is the maximum size for a neighbourhood sample.
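The sampling step can be sketched as follows, under the assumption that the step coefficients are a softmax over unbounded parameters, as described above; dense matrices are used here for brevity.

```python
import numpy as np

def neighbour_probabilities(step_tensors, theta):
    """Combine per-step transition matrices into one distribution (Section 3.4).

    `theta` holds unbounded learnable coefficients; a softmax turns them
    into step weights, so training can shift probability mass toward
    nearer or farther neighbours. Index 0 would be the self-loop tensor.
    """
    weights = np.exp(theta - theta.max())
    weights /= weights.sum()
    return sum(w * T for w, T in zip(weights, step_tensors))

def sample_neighbourhood(probs_row, max_size, rng):
    """Sample up to `max_size` neighbours without replacement."""
    candidates = np.flatnonzero(probs_row)
    if len(candidates) <= max_size:
        return candidates
    p = probs_row[candidates] / probs_row[candidates].sum()
    return rng.choice(candidates, size=max_size, replace=False, p=p)
```

In training, gradients would flow back into `theta` through the attention weights rather than through the discrete sample itself.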
3.5 Node Representations
Given a neighbourhood $N_i$ for node $v_i$ and a transition tensor $\hat{T}$, we apply an attention mechanism with attention coefficients given by:

$\alpha_{ijp} = \mathrm{softmax}_{j,p}\big(\hat{T}_{pij}\; a(n_i, n_{j|p})\big)$ (3)

where $a$ is a learnable transformation. The logits produced by $a$ are scaled by the transition probabilities, exerting the importance of the neighbour and allowing the coefficients to be trained. We concatenate multi-head attention layers to create a new node representation $n'_i$:

$n'_i = \big\Vert_{h=1}^{H}\, \sigma\Big(\textstyle\sum_{j,p} \alpha^{(h)}_{ijp}\, W^{(h)} n_{j|p}\Big)$ (4)

where $W$ is another learnable transformation, and $\sigma$ is a nonlinear activation function, such as the ELU Clevert et al. (2015) function. The transformation $W$ allows relevant information for the node to be selected.

3.6 Algorithmic Complexity
The complexity of generating node representations with the proposed algorithm (GATAS) is governed by $O(B \cdot S)$, where $B$ is the batch size and $S$ is the neighbourhood sample size. GraphSAGE Hamilton et al. (2017) has a similar complexity of $O(B \cdot \prod_k S_k)$, where $S_k$ is the sample size at depth $k$. GAT Veličković et al. (2017), on the other hand, has a complexity that is independent of the batch size but processes all nodes and edges. It is given by $O(L(|V| + |E|))$, where $L$ is the number of layers that controls depth. For downstream tasks where only a small subset of nodes is actually used, the overhead complexity of GAT can be overwhelming. Generalizing, our model is more efficient when $B \cdot S < L(|V| + |E|)$.
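The attention reduction of Section 3.5 can be sketched in a single-head form. The shapes of the learnable maps `W_key` and `W_out` are assumptions; the key property shown is that logits are scaled by the neighbour's transition probability before the softmax.

```python
import numpy as np

def attend(target, neighbours, trans_probs, W_key, W_out):
    """Sketch of transition-scaled attention (Section 3.5), single head.

    A logit for each (target, neighbour) pair is scaled by that
    neighbour's transition probability, so unlikely neighbours are
    attenuated and the step coefficients behind `trans_probs` receive
    gradient through the attention weights.
    """
    logits = np.array([W_key @ np.concatenate([target, n]) for n in neighbours])
    logits = logits * trans_probs              # exert neighbour importance
    weights = np.exp(logits - logits.max())    # softmax over neighbours
    weights /= weights.sum()
    mixed = weights @ np.stack(neighbours)     # weighted neighbour average
    return np.tanh(W_out @ mixed)              # learnable output transform
```

A multi-head version would run several independent copies and concatenate their outputs, as in Equation 4.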
4 Evaluation
We evaluate the performance of GATAS using node classification tasks in transductive and inductive settings. To evaluate the performance of the proposed attention mechanism over heterogeneous multi-step neighbours, we rely on a multi-class link prediction task.
For the transductive learning experiments we compare against GAT Veličković et al. (2017) and some of the approaches specified in Kipf and Welling (2016), including a CNN approach that uses Chebyshev approximations of the graph eigendecomposition Defferrard et al. (2016), the Graph Convolutional Network (GCN) Kipf and Welling (2016), MoNet Monti et al. (2017), and the Sparse Graph Attention Network (SGAT) Ye and Ji (2019). We also benchmark against a multilayer perceptron (MLP) that classifies nodes using only their features, without any graph structure.
For the inductive experiments we compare once again against GAT Veličković et al. (2017) and SGAT Ye and Ji (2019). We also compare against GraphSAGE Hamilton et al. (2017), a method that aggregates node representations from fixed-size neighbourhood samples, using different methods such as LSTMs and max-pooling.
GATAS is capable of utilizing edge information, which we consider to be an important advantage. Hence we also conducted link prediction experiments on multiplex heterogeneous network datasets against some of the state-of-the-art models, namely GATNE Cen et al. (2019), MNE Zhang et al. (2018), and MVE Qu et al. (2017). GATNE creates multiple representations for a node under different edge type graphs, aggregates these individual views using reduction operations similar to GraphSAGE, and combines these node representations using attention.
4.1 Datasets
For the transductive node classification tasks we use three standard citation network datasets: Cora, Citeseer, and Pubmed Sen et al. (2008). In these datasets, each node corresponds to a publication and undirected edges represent citations. Training sets contain 20 nodes per class. The validation and test sets have 500 and 1000 unseen nodes respectively.
For the inductive node classification experiments, we use the protein interaction dataset (PPI) in Hamilton et al. (2017). The dataset has multiple graphs, where each node is a protein, and undirected edges represent an interaction between them. Each graph corresponds to a different type of interaction between proteins. 20 graphs are used for training, 2 for validation and another 2 for testing.
For the link prediction task, we use the heterogeneous Higgs Twitter Dataset (http://snap.stanford.edu/data/higgstwitter.html) De Domenico et al. (2013). It is made up of four directional relationships between more than 450,000 Twitter users. We also use a multiplex bidirectional network dataset that consists of five types of interactions between 15,088 YouTube users Tang and Liu (2009); Tang et al. (2009). Using the dataset splits provided by the authors of GATNE (http://github.com/thudm/gatne), we work with subsets of 10,000 and 2,000 nodes for Twitter and YouTube respectively, reserving 5% and 10% of the edges for validation and testing. Each split is augmented with the same number of non-existing edges, which are used as negative samples.
Detailed statistics for these datasets are summarized in Table 4 in Supplementary Materials.
4.2 Experiment Setup
Node features are normalized using layer normalization Ba et al. (2016). These features are then passed through a single dense layer to obtain the input features $h_i$. The Twitter dataset does not provide node features, so $n_i = e_i$. The inductive node classification task does not use learnable node embeddings, so $n_i = h_i$. We define $f$ as a linear transformation, $g$ as a two-layer neural network with a nonlinear hidden layer and a linear output layer, and $a$ and $W$ as one-layer nonlinear neural networks. Nonlinear layers use ELU Clevert et al. (2015) activation functions.

For all models, we set the layer size of these transformations to 50. We experimented with learnable node embedding and edge type embedding sizes of 10 and 50 for the transductive node classification and link prediction tasks respectively. We use an edge type embedding size of 5 for the inductive tasks. The transductive tasks use 8 attention heads, while the other tasks use 10 heads.
For the transductive node classification tasks, the output layer is directly connected to the concatenated attention heads. For the inductive task, the concatenated attention heads are passed through 2 nonlinear layers before going through the output layer. In the link prediction task, the concatenated attention heads are passed through a nonlinear layer and a pair of corresponding node representations are concatenated before they pass through 2 nonlinear layers and an output layer. All these hidden layers have a size of 256.
The optimization objective is the multi-class or multi-label cross-entropy, depending on the task. It is minimized by the Nadam SGD optimizer Dozat (2016). The validation set is used for early stopping and hyperparameter tuning. Since the training sets for the transductive node classification tasks are very small, it is crucial to add noise to the model inputs to prevent overfitting. We mask out input features with 0.9 probability, and apply Dropout Srivastava et al. (2014) with 0.5 probability to the attention coefficients and resulting representations. We also add $L_2$ regularization with $\lambda = 0.05$.
In the node classification and link prediction tasks, neighbourhood candidates can be at most 3 and 2 steps away from the target node, respectively. The unnormalized transition coefficients $\theta$ are initialized with a nonlinear decay over steps, so that closer neighbours start with higher probability. To accommodate the inductive setting, edges across graphs in the protein interaction dataset are treated as the same type and use the same edge type representations. In the link prediction task, we reuse the node representations at test time and rely on the neighbours given by the edges in the training set.
The experiment parameters are summarized in Table 5 in Supplementary Materials. In the transductive node classification experiments, the architecture hyperparameters were optimized on the Cora dataset and are reused for Citeseer and Pubmed. A single experiment can be run on a V100 GPU in under 12 hours. Implementation code is available on GitHub (http://github.com/wattpad/gatas).
Transductive (Accuracy %)  Inductive (MicroF1)  
Model  Cora  Citeseer  Pubmed  PPI 
MLP  
Chebyshev Defferrard et al. (2016)  —  
GCN Kipf and Welling (2016)  —  
MoNet Monti et al. (2017)  —  —  
GraphSAGE Hamilton et al. (2017)  —  —  —  0.768 
GAT Veličković et al. (2017)  
SGAT Ye and Ji (2019)  84.2%  68.2%  77.6%  0.966 
GATAS  
GATAS  
GATAS 
We selected the best results reported by SGAT.
MVE Qu et al. (2017)  MNE Zhang et al. (2018)  GATNE Cen et al. (2019)  GATAS  

Dataset  ROCAUC  F1  ROCAUC  F1  ROCAUC  F1  ROCAUC  F1 
Twitter  72.62  67.40  91.37  84.32  92.30  84.96  95.44  87.13  
YouTube  70.39  65.10  82.30  75.03  84.61  76.83  96.63  83.59 
Results reported for GATNE, MNE and MVE are from the original GATNE paper Cen et al. (2019).
4.3 Results
Table 1 summarizes our results on the node classification tasks. For the transductive tasks, we report the mean classification accuracy and standard deviation over 100 runs. For the inductive task, we report the mean micro-F1 score and standard deviation over 10 runs. We compare against the metrics already reported in Veličković et al. (2017); Kipf and Welling (2016), and use the same dataset splits provided. For GraphSAGE, we report the better results obtained in Veličković et al. (2017).

Using the settings described in the previous section, we provide variations of our method using different neighbourhood sample sizes: 10, 100, and 500. The model achieves comparable performance across tasks, and a new state-of-the-art performance on the PPI dataset in an inductive setting, by a 1.2% margin.
Performance increases with the neighbourhood sample size, as it expands the graph structure covered. However, we do not see substantial improvements for a sample size of 500. Given the average number of neighbours per node for each dataset, as shown in Table 4, we can see that an increased neighbourhood sample size of 500 might not add additional neighbours to the models in the transductive experiments. However, for the PPI dataset, 500 is still significantly below the estimated average neighbourhood size of $\bar{d}^K$, where $\bar{d}$ is the average number of neighbours per node and $K$ is the number of steps considered.
We note that in the PPI dataset, a small neighbourhood sample size of 10 impacts performance considerably more than in the transductive setting. This could be because there is no support from the learnable node representations. As the amount of information provided by neighbours decreases, the model might become more dependent on these parameters. The proposed neighbourhood sampling technique trades a small amount of accuracy for efficiency. As a result, the model can easily be used with large datasets and downstream tasks.
Table 2 summarizes our results on the link prediction task. We report the macro area under the ROC curve and the macro F1 score for a single run. When comparing against GATNE, we use the transductive version of the model since we do not precompute raw features for the nodes and rely on the learned representations during test time.
The results suggest the produced node representations are able to capture path attributes as part of the neighbourhood information. GATAS outperforms GATNE, with a lift of 3.14% and 12.02% in ROC-AUC, as well as 3.17% and 6.76% in the F1 score, on the Twitter and YouTube datasets respectively. To the best of our knowledge, these results establish a new state-of-the-art performance.
4.4 Ablation Study
The proposed architecture has three independent components that have not been considered in previous work: (1) the neighbour sampling technique using transition probabilities with learnable step coefficients that affect the attention weights; (2) the learnable node representations that augment node features with neighbourhood information; and (3) the attention network that allows neighbour representations to adapt to the target node given itself and the path information.
In this section we measure the impact of each component on the Cora and Pubmed datasets in the transductive setting, and on the PPI dataset in the inductive setting. We consider five model variations that test the importance of each component. None of the inductive variations use learnable node representations, due to the nature of that setting. We would also like to test the importance of edge type information, but the link prediction datasets do not provide node features that would allow us to run all variations. We define the following variations:

Base only samples immediate neighbours with uniform probability and the transition probabilities are not part of the attention weights. The model does not adapt neighbour representations and nodes do not include learnable representations. Neighbour representations are reduced using the attention mechanism described in Section 3.5.

GATAS w/o trans is the proposed solution, but all nodes within $K$ steps can be sampled with uniform probability and the transition probabilities are not part of the attention weights.

GATAS w/o embed is the proposed solution, but only node features are used, such that $n_i = h_i$ and $e_i$ is not used. This variation is not available for the inductive task because of its nature.

GATAS w/o paths corresponds to the proposed solution but neighbour representations are not transformed according to the target node and path information.

GATAS is the proposed solution.
We ran experiments using the same settings described in Section 4.2. In all the transductive variations, we use a moderate neighbourhood sample size, since a larger value might attenuate the impact of using learnable representations, interfering with the results of the GATAS w/o embed variation. The variations in the inductive task use a single fixed sample size.
Table 3 shows our results. The largest jump in performance corresponds to the use of the neighbourhood sampling technique and the incorporation of the transition probabilities. The use of learnable embeddings as part of node representations does not seem to have a large impact on performance, but this could be a consequence of a large neighbourhood sample size, which might reduce the need to utilize these parameters. Finally, the use of adaptable neighbour representations does not seem to affect performance for these tasks. We hypothesize that the nature of the tasks might not require such neighbour transformations, but note that edge direction and multiple edge types are not present in these datasets.
Transductive (Accuracy %)  Inductive (MicroF1)  

Model  Cora  Pubmed  PPI 
Base  
GATAS w/o trans  
GATAS w/o embed  —  
GATAS w/o paths  
GATAS 
5 Conclusion
In this paper, we proposed a new neural network architecture for graphs. The algorithm represents nodes by reducing their neighbour representations with attention. Multi-step neighbour representations incorporate different path properties. Neighbours are sampled using learnable depth coefficients.
Our model achieves comparable results across different tasks and various baselines on the benchmark datasets Cora, Citeseer, Pubmed, PPI, Twitter, and YouTube. We successfully retained performance while increasing efficiency on large graphs, achieving state-of-the-art performance on multiple datasets from different tasks. We conducted an ablation study in transductive and inductive settings. The experiments show that sampling neighbourhoods according to weighted transition probabilities achieves the largest performance gain, especially in the inductive setting.
References
 Watch your step: learning node embeddings via graph attention. In Advances in Neural Information Processing Systems, pp. 9180–9190. Cited by: §2, §3.4.
 Layer normalization. arXiv preprint arXiv:1607.06450. Cited by: §4.2.
 Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203. Cited by: §1.
 Representation learning for attributed multiplex heterogeneous network. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 1358–1368. Cited by: Table 2, §4.
 Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289. Cited by: §3.5, §4.2.
 The anatomy of a scientific rumor. Scientific reports 3, pp. 2980. Cited by: §4.1.
 Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in neural information processing systems, pp. 3844–3852. Cited by: §1, Table 1, §4.

 Incorporating Nesterov momentum into Adam. Cited by: §4.2.
 Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems, pp. 2224–2232. Cited by: §1.

 Exploiting edge features for graph neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9211–9219. Cited by: §2.
 Node2vec: scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 855–864. Cited by: §1.
 Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. Journal of Machine Learning Research 13 (Feb), pp. 307–361. Cited by: §1.
 Inductive representation learning on large graphs. In Advances in Neural Information Processing Systems, pp. 1024–1034. Cited by: §1, §2, §3.6, §4.1, Table 1, §4.
 Simple embedding for link prediction in knowledge graphs. In Advances in Neural Information Processing Systems, pp. 4284–4295. Cited by: §1, §1.
 Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907. Cited by: §1, §4.3, Table 1, §4.
 Situation recognition with graph neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4173–4182. Cited by: §1.
 Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493. Cited by: §1, §2.

 Geometric deep learning on graphs and manifolds using mixture model CNNs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5115–5124. Cited by: §2, Table 1, §4.
 DeepWalk: online learning of social representations. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 701–710. Cited by: §1.
 Knowledge enhanced contextual word representations. In EMNLP, Cited by: §1.
 An attention-based collaboration framework for multi-view network representation learning. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, pp. 1767–1776. Cited by: Table 2, §4.
 The graph neural network model. IEEE Transactions on Neural Networks 20 (1), pp. 61–80. Cited by: §2.
 Collective classification in network data. AI magazine 29 (3), pp. 93–93. Cited by: §4.1.
 Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research 15 (1), pp. 1929–1958. Cited by: §4.2.
 Uncovering cross-dimension group structures in multi-dimensional networks. In SDM workshop on Analysis of Dynamic Networks, pp. 568–575. Cited by: §4.1.
 Uncoverning groups via heterogeneous interaction analysis. In 2009 Ninth IEEE International Conference on Data Mining, pp. 503–512. Cited by: §4.1.
 Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008. Cited by: §3.2.
 Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §1, §2, §3.6, §4.3, Table 1, §4, §4.

 Explainable reasoning over knowledge graphs for recommendation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 5329–5336. Cited by: §1.
 Sparse graph attention networks. arXiv preprint arXiv:1912.00552. Cited by: §1, Table 1, §4.
 Scalable multiplex network embedding.. In IJCAI, Vol. 18, pp. 3082–3088. Cited by: Table 2, §4.
Appendix A Supplementary Materials
a.1 Dataset Statistics
Cora  Citeseer  PubMed  PPI  Twitter  YouTube  

Node Classes  7  6  3  121  1  1 
Edge Types  1  1  1  1  4  5 
Node Features  1,433  3,703  500  50  0  0 
Nodes  2,708  3,327  19,717  56,944  456,626  15,088 
Edges  5,429  4,732  44,338  818,716  15,367,315  13,628,895 
Training Nodes  140  120  60  44,906  9,990  2,000 
Training Edges  —  —  —  1,246,382  282,115  1,114,025 
Validation Nodes  500  500  500  6,514  9,891  2,000 
Validation Edges  —  —  —  201,647  16,463  65,512 
Testing Nodes  1,000  1,000  1,000  5,524  9,985  2,000 
Testing Edges  —  —  —  164,319  32,919  131,007 
Neighbours per Node  3.9  2.7  4.4  28.3  28.2  557.0 
Shared nodes across dataset types with access to training set edges only.
Access to all edges.
a.2 Experiment Settings
Node Classification  Link Prediction  
Parameter  Transductive  Inductive  Transductive 
Maximum number of steps ($K$)  3  3  2 
Neighbourhood sample size ($S$)  10/100/500  10/100/500  100 
Layer size  50  50  — 
Node embedding size  10  —  50 
Edge type embedding size  10  5  50 
Number of attention heads  8  10  10 
Input noise rate  0.9  0  0 
Dropout probability  0.5  0  0 
$L_2$ regularization coefficient ($\lambda$)  0.05  0  0 
Learning rate  0.001  0.001  0.001 
Maximum number of epochs  1000  1000  1000 
Early stopping patience  100  10  5 
Batch size ($B$)  5000  100  200 