Software Engineering Event Modeling using Relative Time in Temporal Knowledge Graphs

07/02/2020 · Kian Ahrabian, et al.

We present a multi-relational temporal knowledge graph based on the daily interactions between artifacts on GitHub, one of the largest social coding platforms. Such a representation enables posing many user-activity and project-management questions as link prediction and time queries over the knowledge graph. In particular, we introduce two new datasets for i) interpolated time-conditioned link prediction and ii) extrapolated time-conditioned link/time prediction queries, each with distinguishing properties. Our experiments on these datasets highlight the potential of adapting knowledge graphs to answer broad software engineering questions. They also reveal the unsatisfactory performance of existing temporal models on extrapolated queries and on time prediction queries in general. To overcome these shortcomings, we introduce an extension to current temporal models that uses relative temporal information with regard to past events.


1 Introduction

Hosting over 100 million repositories, GitHub (GH) is one of the biggest social coding platforms (githubBlog:online). Over the past decade, the artifacts hosted on GH have become one of the most important resources for software engineering (SE) researchers studying various aspects of programming, software development, and the characteristics of open source users and ecosystems (cosentino2017systematic). Example questions of interest include when an issue will be closed (kikas2016using; rees2017better), how likely a pull request is to be merged and when (gousios2014dataset; soares2015acceptance), and who should review a pull request (yu2016reviewer; hannebauer2016automatically).

Our aim in this work is to connect the above SE research questions to the literature on learning knowledge graph (KG) embeddings, with a particular emphasis on temporal KGs due to the importance of the temporal component in these questions. Methods for time prediction and time-conditioned link prediction in KGs (kazemi2020representation, Sections 5.1-5.2) are generally based on point process models (trivedi2017know; trivedi2019dyrep; knyazev2019learning) or adaptations of KG embeddings that additionally use time to compute scores (dasgupta2018hyte; leblay2018deriving; garcia2018learning; goel2019diachronic). While point processes are elegant, they are more challenging to work with and require strong assumptions on the underlying intensity functions (see, e.g., trivedi2019dyrep, Equation 1 & Section 4). Thus we focus on KG embedding-based methods, in particular starting from Diachronic Embeddings (DE-X) (goel2019diachronic) for time-varying embeddings and RotatE (sun2019rotate) for scoring facts.

Our contributions are the following:

  • Collecting two new temporal KG datasets (https://zenodo.org/record/3928580) from GH public events that allow casting the above SE questions as time-conditioned prediction queries (Section 2).

  • Benchmarking existing temporal KG embedding methods on the new datasets (Section 4).

  • Based on the observation that existing temporal KG embedding models do not capture well the patterns in relative time that are important to SE applications, particularly in extrapolated settings (e.g., how long will it take from a pull request being opened to it being closed?), we propose a new relative temporal KG embedding inspired by the use of relative time in attention-based neural networks like Music Transformer (huang2018music) and Transformer-XL (dai2019transformer) (Section 3).

In total, our work brings together SE research questions and temporal KG models by introducing new datasets, benchmarking existing recent methods on the datasets, and suggesting a new direction of leveraging Transformer-style relative time modeling in KG embeddings.


Dataset                                     #Nodes       #Edges       #Relations  Time Span (days)  Max Degree  Median Degree
GitHub-SE 1Y                                125,455,982  249,375,075  19          365               3,519,105   2
GitHub-SE 1Y-Repo                           133,335      285,788      18          365               73,345      1
GitHub-SE 1M                                13,690,824   23,124,510   19          31                1,324,179   2
GitHub-SE 1M-Node                           139,804      293,014      14          31                639         2
GitHub (Original) (trivedi2019dyrep)        12,328       771,214      3           366               -           -
GitHub (Subnetwork) (knyazev2019learning)   284          20,726       8           366               4,790       53.5
ICEWS14 (garcia2018learning)                7,128        90,730       230         365               6,083       3
ICEWS05-15 (garcia2018learning)             10,488       461,329      251         4017              52,890      5
YAGO15K (garcia2018learning)                15,403       138,056      34          198               6,611       5
Wikidata (leblay2018deriving)               11,153       150,079      96          328               586         5
GDELT (leetaru2013gdelt)                    500          3,419,607    20          366               53,857      10,336
Table 1: Characteristics comparison of the introduced datasets to existing temporal KG datasets.

2 Dataset

Creation To create the dataset, we retrieved from GH Archive (https://www.gharchive.org) all of the raw public GH events of 2019. The knowledge graph was then constructed from tuples, each of which represents an individual event containing temporal information, based on its type and a predefined set of extraction rules. The properties of the constructed KG are shown in the first row of Table 1, referred to as GitHub-SE 1Y.
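To make this step concrete, below is a minimal sketch of the extraction loop, assuming the standard GH Archive dump layout (one JSON event per line, gzipped). The rule table here is hypothetical and handles a single event type; the actual rules are listed in Table 9.

```python
import gzip
import json

# Hypothetical rule: map an event type to a (head, relation, tail) accessor.
# The real rule set (Table 9) is far larger; this sketch handles one case,
# and the relation code it builds is illustrative, not the paper's exact code.
RULES = {
    "IssuesEvent": lambda e: (e["actor"]["login"],
                              "U_SE_" + e["payload"]["action"].upper() + "_I",
                              e["payload"]["issue"]["id"]),
}

def extract_tuples(path):
    """Yield (subject, relation, object, timestamp) tuples from one
    GH Archive hourly dump, e.g. 2019-01-01-0.json.gz."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:                        # one JSON event per line
            event = json.loads(line)
            rule = RULES.get(event["type"])
            if rule is not None:
                s, r, o = rule(event)
                yield (s, r, o, event["created_at"])
```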

Due to the substantial size of GitHub-SE 1Y and the unrealistic computational demands of training KG embedding models on it, we sampled GitHub-SE 1Y using two distinct strategies, described in more detail in Appendix A.1. The first strategy aims to retain maximum temporal information about particular SE projects. To achieve this, first, an induced sub-graph containing all related nodes was extracted for each node with type Repository. Then, for each sub-graph $G$ a popularity score was calculated as $s = \alpha |G| + \beta \Delta T(G)$, where $|G|$ is the size of the graph, $\Delta T(G)$ is the time span of the graph, and $\alpha$ and $\beta$ are weight values. Finally, from the top three ranked repositories, we selected the Visual Studio Code repository to extract a one-year slice, as it exercised more of the functionality related to the target entities of this work, i.e. issues and pull requests. We name this dataset GitHub-SE 1Y-Repo due to its repository-centric characteristics.

The second strategy aims at preserving the most informative nodes regardless of their type. We used a variation of Snowball Sampling (goodman1961snowball) on all the events in December 2019. This sampled dataset, i.e. GitHub-SE 1M-Node, captures events across various projects and can therefore be used to answer queries such as which project a user starts contributing to at a certain time.

Characteristics In Table 1, we compare the variations of the GitHub-SE KG proposed in this work with commonly used datasets in the literature. Even the sampled-down versions of our datasets are considerably larger in terms of number of nodes. They have much lower edge-to-node ratios, which translates into sparser graphs, but this sparsity level is close to what appears in GitHub as a whole. Additionally, similar to relations, each node in our datasets is also typed.

trivedi2019dyrep also collect a temporal KG dataset from GitHub. However, their dataset is exclusively focused on the social aspects of GitHub, discarding repositories and only including user-user interactions, and it does not appear to be publicly available beyond raw data and a small subnetwork extracted in a follow-up work (knyazev2019learning). To differentiate our datasets, which focus on the SE aspects of GitHub, we append -SE to their names.

The distinguishing characteristics of the proposed datasets, i.e. size, sparsity, node-typing, diversity, focus on SE aspects, and temporal nature, introduce a variety of engineering and theoretical challenges that make these datasets a suitable choice for exploring and exposing the limitations of temporal knowledge graph embedding models.

3 Method

3.1 Existing KG embedding models

We first examine the performance of state-of-the-art KG embedding models on the GitHub-SE 1M-Node and GitHub-SE 1Y-Repo datasets. We select RotatE (sun2019rotate) for the static setting, considering its ability to infer Symmetry, Antisymmetry, Inversion, and Composition relational patterns. Moreover, we use DE-X (goel2019diachronic) for the dynamic setting due to its superior performance on existing benchmarks and the fact that for any static model there exists an equivalent DE-X model, ensuring the ability to learn the aforementioned patterns.

Notationally, a KG is a set of tuples of the form $x = (s, r, o, t)$, respectively representing the subject, relation, object, and timestamp. The diachronic embeddings are defined as

$\mathbf{z}_e^t = \mathbf{E}[e] \,\Vert\, \big(\mathbf{A}[e] \odot \sin(\mathbf{F}[e]\, t + \boldsymbol{\Phi}[e])\big)$

where $\mathbf{E}$, $\mathbf{A}$, $\mathbf{F}$, and $\boldsymbol{\Phi}$ are embedding lookup tables, $\Vert$ denotes concatenation, and the last three tables respectively represent the Amplitude, Frequency, and Phase of a sinusoid. Similar to goel2019diachronic, whenever $t$ consists of multiple sets of numbers rather than one, e.g. year, month, and day, a separate sinusoidal component is defined for each set of numbers, and the resulting values are summed up. Subsequently, the scoring function of the DE-RotatE model is defined as

$\phi(s, r, o, t) = -\big\Vert \mathbf{z}_s^t \circ \mathbf{R}[r] - \mathbf{z}_o^t \big\Vert$

where $\mathbf{R}$ is an embedding lookup table for relations and $\circ$ is the element-wise product in complex space, with unit-modulus relation entries as in RotatE.
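For concreteness, the following PyTorch sketch shows one way to realize the diachronic embedding and the DE-RotatE score above. The table layout, the split into static and temporal dimensions, and all class and parameter names are our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class DERotatE(nn.Module):
    """Sketch: static/diachronic entity embeddings with a RotatE-style score."""

    def __init__(self, n_ent, n_rel, d_static, d_temp):
        super().__init__()
        # each entity table stores real and imaginary halves side by side
        self.E = nn.Embedding(n_ent, 2 * d_static)        # static component
        self.A = nn.Embedding(n_ent, 2 * d_temp)          # Amplitude
        self.F = nn.Embedding(n_ent, 2 * d_temp)          # Frequency
        self.Ph = nn.Embedding(n_ent, 2 * d_temp)         # Phase
        self.R = nn.Embedding(n_rel, d_static + d_temp)   # rotation angles

    def embed(self, e, t):
        """z_e^t = E[e] || A[e] * sin(F[e] * t + Ph[e]), as a complex tensor."""
        t = t.unsqueeze(-1).float()
        sinusoid = self.A(e) * torch.sin(self.F(e) * t + self.Ph(e))
        s_re, s_im = self.E(e).chunk(2, dim=-1)           # static real/imag
        d_re, d_im = sinusoid.chunk(2, dim=-1)            # diachronic real/imag
        return torch.complex(torch.cat([s_re, d_re], -1),
                             torch.cat([s_im, d_im], -1))

    def score(self, s, r, o, t):
        """phi = -|| z_s^t o R[r] - z_o^t || with unit-modulus rotations."""
        z_s, z_o = self.embed(s, t), self.embed(o, t)
        rot = torch.polar(torch.ones_like(self.R(r)), self.R(r))
        return -(z_s * rot - z_o).abs().sum(dim=-1)
```

Since the relation table stores per-dimension rotation angles, every relation embedding has unit modulus by construction, mirroring RotatE's constraint.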

3.2 Relative Temporal Context

The idea of using relative temporal information has been successfully employed in natural language processing (vaswani2017attention; dai2019transformer) and music generation (huang2018music). These models formalize the intuition that the temporal spacing between events is more central than the absolute time at which an event happened. We believe this framing is also appropriate for SE applications of temporal knowledge graphs: to predict whether a pull request is closed at time $t$, it is more important to know how long it has been since the pull request was opened than it is to know the absolute value of $t$.

A challenge is that there are many events, and we do not want to hard-code which durations are relevant. Instead, we would like the model to learn which temporal durations are important for scoring a temporal fact. As the number of facts related to an entity can be as high as a few thousand, we propose to pick a fixed number of facts as the temporal context provided as input to the models.

Let $\tau_{r}(e, t)$ be the set of times associated with facts involving entity $e$ and relation $r$ occurring before time $t$, and let $\delta_{r}(e, t) = t - \max \tau_{r}(e, t)$ be the relative time since a fact involving $e$ and relation $r$ last occurred. Hence, an entity's relative temporal context at query time $t$ is $\boldsymbol{\delta}(e, t) = \big[\delta_{1}(e, t), \dots, \delta_{|\mathcal{R}|}(e, t)\big]^\top$.
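A minimal sketch of computing this context from an event log follows; the index structure and names are illustrative assumptions.

```python
import bisect

def relative_temporal_context(times_by_rel, entity, t, n_relations):
    """delta_r(e, t) = t - max{t' in tau_r(e, t)}: for each relation, the
    time since the most recent fact involving `entity` strictly before t.
    `times_by_rel` maps (entity, relation) -> sorted list of timestamps."""
    deltas = [float("inf")] * n_relations        # sentinel: no past fact
    for r in range(n_relations):
        times = times_by_rel.get((entity, r), [])
        i = bisect.bisect_left(times, t)         # first timestamp >= t
        if i > 0:
            deltas[r] = t - times[i - 1]         # most recent earlier fact
    return deltas
```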

3.3 Relative Time DE-RotatE (RT-DE-Rotate)

We now turn attention to using the relative temporal context as an input to temporal KG embeddings. Our inspiration is the Transformer encoder, which has emerged as a successful substitute for more traditional Recurrent Neural Network approaches in sequential (vaswani2017attention; dai2019transformer; huang2018music) and structural (parmar2018image) tasks. The core idea is to employ a variation of the attention mechanism, called Self-Attention, that assigns importance scores to the elements of the same sequence.

Unlike recurrence mechanisms, positional information is injected into Transformer-style encoders either by 1) adding sine/cosine functions of different frequencies to the input (vaswani2017attention; dai2019transformer), or 2) directly infusing relative distance information into the attention computation in the form of a matrix addition (shaw2018self; huang2018music). vaswani2017attention introduced a positional encoding scheme in the form of sinusoidal vectors defined as $PE(p)[i] = \sin\big(p / 10000^{i/d}\big)$ if $i$ is even and $PE(p)[i] = \cos\big(p / 10000^{(i-1)/d}\big)$ if $i$ is odd, where $p$ is the absolute position and $d$ is the embedding dimension. In the follow-up Transformer-XL model, dai2019transformer introduce a reparameterization of relative attention where the attention score between a query element at position $i$ and a key element at position $j$ is defined as

$A^{\mathrm{rel}}_{i,j} = \mathbf{E}_{x_i}^\top \mathbf{W}_q^\top \mathbf{W}_{k,E}\, \mathbf{E}_{x_j} + \mathbf{E}_{x_i}^\top \mathbf{W}_q^\top \mathbf{W}_{k,R}\, \mathbf{R}_{i-j} + \mathbf{u}^\top \mathbf{W}_{k,E}\, \mathbf{E}_{x_j} + \mathbf{v}^\top \mathbf{W}_{k,R}\, \mathbf{R}_{i-j}$

where $\mathbf{E}_{x_i}$ and $\mathbf{E}_{x_j}$ are the $i$-th and $j$-th element embeddings, $\mathbf{W}_q$, $\mathbf{W}_{k,E}$, $\mathbf{W}_{k,R}$, $\mathbf{u}$, and $\mathbf{v}$ are trainable parameters, and $\mathbf{R}_{i-j}$ is a sinusoidal encoding of the relative position between $i$ and $j$.
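The sinusoidal encoding above is easy to state in code; a short sketch, vectorized over the embedding dimension, follows (function name is ours).

```python
import torch

def sinusoidal_pe(position, dim):
    """PE(p)[i] = sin(p / 10000^(i/dim)) for even i, cos(...) for odd i,
    with each (sin, cos) pair sharing one frequency (vaswani2017attention)."""
    i = torch.arange(dim, dtype=torch.float32)
    pair = torch.div(i, 2, rounding_mode="floor")        # pairs share a frequency
    freq = 1.0 / (10000.0 ** (2.0 * pair / dim))
    angle = position * freq
    return torch.where(i % 2 == 0, torch.sin(angle), torch.cos(angle))
```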


Dataset            Type          Model                HITS@1  HITS@3  HITS@10  MR       MRR
GITHUB-SE 1M-NODE  Interpolated  RotatE               47.58   76.66   88.95    807.40   0.6328
                                 DE-RotatE            47.98   76.92   88.87    779.50   0.6349
                                 RT-DE-RotatE (ours)  49.70   78.67   90.48    773.90   0.6522
                   Extrapolated  RotatE               25.40   49.02   57.54    4762.87  0.3797
                                 DE-RotatE            26.28   48.53   57.33    4840.16  0.3838
                                 RT-DE-RotatE (ours)  26.50   49.54   57.94    4891.81  0.3888
GITHUB-SE 1Y-REPO  Interpolated  RotatE               44.05   57.14   80.95    18.54    0.5460
                                 DE-RotatE            42.17   53.88   76.88    24.67    0.5233
                                 RT-DE-RotatE (ours)  48.93   60.96   78.32    14.47    0.5815
                   Extrapolated  RotatE               2.11    4.82    9.71     1917.03  0.0464
                                 DE-RotatE            1.77    4.08    9.10     1961.75  0.0402
                                 RT-DE-RotatE (ours)  38.25   40.08   64.06    1195.02  0.4345
Table 2: Performance comparison on time-conditioned Link Prediction (HITS@N in %). Results within the 95% confidence interval of the best are bolded.

The main difference in our setting is that the above models compute a score based on a single relative time encoding $\mathbf{R}_{i-j}$, while our relative temporal context contains $|\mathcal{R}|$ relative times for each entity. Our approach is to score a tuple $(s, r, o, t)$ based on the information available at query time $t$. (During training, $t$ is the timestamp of the positive sample; during evaluation, $t$ is set to the maximum timestamp in the training set.) For each entity $e$ we define a positional embeddings matrix of relative times between $t$ and the events in its relative temporal context as

$\mathbf{P}_e^t = \big[ PE(\delta_1(e, t)); \dots; PE(\delta_{|\mathcal{R}|}(e, t)) \big]$

whose rows are sinusoidal encodings of the entries of $\boldsymbol{\delta}(e, t)$. Intuitively, these relative times encode "if the event happened at time $t$, how long would it have been since the events in the relative temporal context." A learned, relation-specific row vector $\mathbf{a}_r$ for relation $r$ chooses which rows of $\mathbf{P}_e^t$ are important, and then $\mathbf{a}_r \mathbf{P}_e^t$, abbreviated $\hat{\mathbf{z}}_{e,r}^t$, embeds the relative temporal context of $e$, replacing $\mathbf{R}_{i-j}$:

$\phi(s, r, o, t) = \underbrace{(\mathbf{z}_s^t)^\top \mathbf{W}_r\, \mathbf{z}_o^t}_{(1)} + \underbrace{(\mathbf{z}_s^t)^\top \mathbf{U}\, \hat{\mathbf{z}}_{o,r}^t}_{(2)} + \underbrace{(\hat{\mathbf{z}}_{s,r}^t)^\top \mathbf{V}\, \mathbf{z}_o^t}_{(3)}$

where $\mathbf{W}_r$ is a relation-specific weight matrix and $\mathbf{U}$ and $\mathbf{V}$ are tuple-agnostic weight matrices; however, this formulation is only suitable for bilinear models. Hence, we derive a translational variation for the DE-RotatE model as

$\phi(s, r, o, t) = -\big\Vert \underbrace{\mathbf{z}_s^t \circ \mathbf{R}[r] - \mathbf{z}_o^t}_{(1)} + \underbrace{\hat{\mathbf{z}}_{s,r}^t\, \mathbf{U}}_{(2)} - \underbrace{\hat{\mathbf{z}}_{o,r}^t\, \mathbf{V}}_{(3)} \big\Vert$

where $\mathbf{W}_r$ is replaced by the relation embedding lookup table $\mathbf{R}$. Intuitively, under this formulation term (1) captures entity compatibility, and terms (2) and (3) capture entity-specific temporal context compatibility. In comparison, the existing models only include term (1), discarding terms (2) and (3).
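The following sketch shows the shape of this translational score as reconstructed above. The projection matrices U and V, and the way the real-valued context is folded into the complex residual, are our assumptions rather than the authors' exact formulation.

```python
import torch

def rt_context_embedding(P_e, a_r):
    """zhat_{e,r}^t = a_r @ P_e^t: a relation-specific row vector a_r (|R|,)
    selects rows of the positional matrix P_e^t (|R|, d_r)."""
    return a_r @ P_e                                    # -> (d_r,)

def rt_de_rotate_score(z_s, z_o, rot, zhat_s, zhat_o, U, V):
    """Term (1): RotatE-style entity term; terms (2), (3): projected
    relative-time context. z_s, z_o, rot: complex (d,); zhat_*: real (d_r,);
    U, V: real (d_r, d). All names here are illustrative."""
    entity_term = z_s * rot - z_o                       # term (1)
    context = (zhat_s @ U) - (zhat_o @ V)               # terms (2) and (3)
    residual = entity_term + context.to(entity_term.dtype)
    return -residual.abs().sum()
```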

4 Experiments

Datasets: We use a 90%-5%-5% split of events to construct the train, validation, and test sets. For the interpolated queries the split was done randomly, whereas for the extrapolated queries we split the data using event timestamps. Table 3 in the Appendix presents the details of the splits.

Queries: For time-conditioned link prediction, we selected events related to the resolution of GitHub issues and pull requests due to their direct impact on software development and maintenance practice. In particular, we used "Who will close issue X at time T?" and "Who will close pull request X at time T?" for evaluation. For time prediction, we used the analogous time queries, i.e. "When will issue X be closed by user Y?" and "When will pull request X be closed by user Y?".

Evaluation and Results: We calculated the standard metrics to evaluate model performance on the test set. For the extrapolated time-conditioned link prediction queries, after using the validation set for hyperparameter tuning, we retrained the selected models on both the training and validation sets before evaluation. We also report model performance without retraining in Appendix Table 5.

In Table 2 we compare model performance on the time-conditioned link prediction queries. On the GITHUB-SE 1M-NODE queries, our model slightly outperforms the existing models in some cases, while the differences are statistically insignificant in others. On GITHUB-SE 1Y-REPO, on the other hand, our RT-DE-RotatE model shows a significant performance boost, particularly on the extrapolated time-conditioned link prediction queries, indicating the importance of using relative time as temporal context.

For the extrapolated time prediction queries on the GITHUB-SE 1Y-REPO dataset, our model performed slightly better on HITS@1, HITS@3, and Mean Reciprocal Rank (MRR) than the other existing models while only marginally surpassing the random baseline on all metrics. These results, detailed in Appendix Table 8, stress the need for further studies on extrapolated time prediction queries.

5 Conclusion

In this work, we build a bridge between SE domain questions and the literature on KG embedding models by introducing two new datasets based on the daily interactions on the GH platform and casting those questions as queries on an appropriate KG. Furthermore, we introduce RT-X, a novel extension to existing KG embedding models that makes use of relative time with regard to past relevant events. Our experiments highlight shortcomings of existing temporal KG embedding models, notably on extrapolated time-conditioned link prediction, and exhibit the advantage of leveraging relative time as introduced in the RT-X model. In total, this work highlights new opportunities for improving temporal KG embedding models on time prediction queries.

Acknowledgements

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC). This research was enabled in part by support provided by Google, Calcul Québec, and Compute Canada. We thank Daniel Johnson for helpful comments.


Appendix A Dataset

a.1 Sampling

Algorithm 1 describes the snowball sampling used to create the GitHub-SE 1M-NODE dataset. This algorithm aims at preserving the most informative nodes regardless of their type. Moreover, Algorithm 2 describes the temporal sampling used to create the GitHub-SE 1Y-REPO dataset. This algorithm aims at preserving maximum temporal information regarding particular repositories.

Require: set of nodes V, set of edges E, sample size n, growth size g, initial sample size k
  I ← top k nodes of V w.r.t. node degree
  S ← ∅; Q ← empty queue
  for v ∈ I do
     Q.put(v)
  end for
  while |S| < n do
     v ← Q.top() w.r.t. node degree
     S ← S ∪ {v}
     N ← random subset of v's neighbours with size g
     for u ∈ N do
        Q.put(u)
     end for
  end while
Algorithm 1: Snowball Sampling strategy used for extracting the GitHub-SE 1M-NODE dataset.
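A runnable Python rendering of Algorithm 1 could look as follows; where the pseudocode is ambiguous (tie-breaking, duplicate handling, neighbour sampling), the choices below are ours.

```python
import heapq
import random

def snowball_sample(adj, n, g, k):
    """adj: node -> list of neighbours; n: sample size; g: growth size;
    k: initial sample size. Returns a set of at most n sampled nodes."""
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    seeds = heapq.nlargest(k, degree, key=degree.get)   # k highest-degree nodes
    queue = [(-degree[v], v) for v in seeds]            # max-heap via negation
    heapq.heapify(queue)
    sampled = set()
    while queue and len(sampled) < n:
        _, v = heapq.heappop(queue)                     # highest-degree frontier node
        if v in sampled:
            continue
        sampled.add(v)
        for u in random.sample(adj[v], min(g, len(adj[v]))):
            if u not in sampled:
                heapq.heappush(queue, (-degree[u], u))
    return sampled
```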
Require: set of nodes V, size importance factor α, time span importance factor β, sample size n
  R ← {v ∈ V : type(v) = Repository}  {Repository node type only}
  for r ∈ R do
     score[r] ← α · |G_r| + β · ΔT(G_r)  {G_r: induced sub-graph of r}
  end for
  θ ← n-th largest value of score
  S ← ∅
  for r ∈ R do
     if score[r] ≥ θ then
        S ← S.union(G_r)
     end if
  end for
Algorithm 2: Temporal Sampling strategy used for extracting the GitHub-SE 1Y-REPO dataset.

a.2 Extraction

Table 9 presents the set of extraction rules used to build the KG from raw events, each representing a relation type. Although 80 extraction rules are defined in Table 9, the raw events that we used only contained 18 of them.

The codes presented in the Relation column of Table 9, when split on underscores, are interpreted as follows: a) the first and the last components respectively represent the entity types of the event's subject and object; b) AO, CO, SE, SO, and HS are abbreviations of information extracted from the raw payloads (https://developer.github.com/webhooks/event-payloads/) serving to distinguish different relation types among entities; and c) the second-to-last component represents the concrete action that triggered the event.
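For illustration, a small parser following this scheme (function name hypothetical):

```python
def parse_relation_code(code):
    """Decode codes like "U_SE_O_I": first/last components are the subject
    and object entity types; the qualifier is one of AO, CO, SE, SO, HS;
    the second-to-last component, when present, is the triggering action."""
    parts = code.split("_")
    return {
        "head": parts[0],
        "qualifier": parts[1],
        "action": parts[-2] if len(parts) == 4 else None,
        "tail": parts[-1],
    }

# parse_relation_code("U_SE_O_I")
# -> {'head': 'U', 'qualifier': 'SE', 'action': 'O', 'tail': 'I'}
```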


Dataset Type #Train #Validation #Test
GITHUB-SE 1M-NODE Interpolated 285,953 3,530 3,531
Extrapolated 281,056 2,104 3,276
Standard Extrapolated 275,805 2,104 3,276
GITHUB-SE 1Y-REPO Interpolated 282,597 1,595 1,595
Extrapolated 269,789 2,281 1,472
Standard Extrapolated 252,845 2,281 1,472
Table 3: Details of train, validation, and test splits.

Appendix B Model

b.1 Complexity

Table 4 presents a time and space complexity comparison between the existing models and the introduced RT-X model. Notice that, while yielding superior performance, the number of free parameters introduced by our extension does not increase linearly with the number of entities, which is one of the bottlenecks in training large KG embedding models.


Model Computational Complexity Free Parameters Complexity
RotatE
DE-RotatE
RT-DE-RotatE (ours)
Table 4: Time and space complexity comparison of the models given static embedding dimension $d$, diachronic embedding dimension $d_t$, relative time embedding dimension $d_r$, entity set $\mathcal{E}$, and relation set $\mathcal{R}$.

Dataset            Model                HITS@1  HITS@3  HITS@10  MR       MRR
GITHUB-SE 1M-NODE  RotatE               19.60   38.37   45.54    6437.30  0.2965
                   DE-RotatE            20.97   38.03   45.21    6504.79  0.3005
                   RT-DE-RotatE (ours)  22.10   38.61   45.54    5782.83  0.3113
GITHUB-SE 1Y-REPO  RotatE               0.41    1.49    2.45     2259.03  0.0141
                   DE-RotatE            5.16    8.83    16.44    1342.25  0.0911
                   RT-DE-RotatE (ours)  38.59   40.01   43.27    1613.70  0.4034
Table 5: Performance comparison on standard extrapolated time-conditioned Link Prediction. Results within the 95% confidence interval of the best are bolded.

b.2 Loss Function

Similar to the self-adversarial negative sampling introduced in sun2019rotate, we use weighted negative samples as

$p(x'_i) = \dfrac{\exp\big(\alpha\, \phi(x'_i)\big)}{\sum_j \exp\big(\alpha\, \phi(x'_j)\big)}$

where $\alpha$ is the sampling temperature and $x'_i$ is the $i$-th negative sample. Hence, the loss function is defined as

$L = -\log \sigma\big(\gamma + \phi(x)\big) - \sum_i p(x'_i)\, \log \sigma\big(-\gamma - \phi(x'_i)\big)$

where $\gamma$ is a fixed margin and $\sigma$ is the sigmoid function.
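In code, the loss above is a few lines; the sketch below is a direct transcription, assuming scores phi are the model outputs and that, as in sun2019rotate, no gradient flows through the sampling weights.

```python
import torch
import torch.nn.functional as F

def self_adversarial_loss(pos_score, neg_scores, gamma, alpha):
    """pos_score: (batch,); neg_scores: (batch, k) negative-sample scores;
    gamma: fixed margin; alpha: sampling temperature."""
    weights = torch.softmax(alpha * neg_scores, dim=-1).detach()   # p(x'_i)
    pos_term = F.logsigmoid(gamma + pos_score)
    neg_term = (weights * F.logsigmoid(-gamma - neg_scores)).sum(dim=-1)
    return -(pos_term + neg_term).mean()
```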

Appendix C Experiments

c.1 Time Prediction

To evaluate the time prediction queries, we consider as the candidate set the dates within the min-max timestamp range of the set being evaluated.

c.2 Model Selection

The best model is selected using the MRR on the validation set, and HITS@N with N = 1, 3, 10, Mean Rank (MR), and MRR on the test set are reported.

c.3 Negative Sampling

We follow the negative sampling scheme employed by dasgupta2018hyte, providing the model with both time-agnostic and time-dependent negative samples.

c.4 Regularization

We apply L3 regularization, parameterized by $\lambda$, as introduced in lacroix2018canonical, on the entity, relation, and relative temporal context embeddings.

c.5 Re-ranking Heuristics

We employed two re-ranking heuristics at evaluation time for time-conditioned link prediction. First, each entity was only evaluated against entities of the same type. Second, we pushed down the ranks of entities that had prior interactions with the given entity.
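A sketch of both heuristics at ranking time (names are illustrative):

```python
def rerank(candidates, scores, query_type, entity_type, interacted):
    """Keep candidates of the queried type only, then sort by score while
    demoting candidates with prior interactions with the query entity."""
    typed = [c for c in candidates if entity_type[c] == query_type]
    # prior-interaction candidates sort after fresh ones; higher score first
    return sorted(typed, key=lambda c: (c in interacted, -scores[c]))
```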


Hyperparameter Range
Dropout
Table 6: Hyperparameter ranges used for experiments.

Model Samples Avg Runtime
RotatE 6400 77s
DE-RotatE 6400 80s
RT-DE-RotatE 6400 87s
Table 7: Average runtime comparison of the models in seconds.

Figure 1: Example heatmap of the absolute importance scores between relations learned as part of the model.

Dataset            Model                HITS@1  HITS@3  HITS@10  MR     MRR
GITHUB-SE 1Y-REPO  RotatE               1.77    18.27   56.05    10.62  0.1675
                   DE-RotatE            3.46    7.27    60.73    9.35   0.1724
                   RT-DE-RotatE (ours)  6.18    19.29   55.91    9.40   0.2073
                   Random               5.26    15.79   52.63    9.5    0.1867
Table 8: Performance comparison on extrapolated Time Prediction.

c.6 Hyperparameters

Initially, we tuned our models using the hyperparameter ranges reported in Table 6, resulting in a total of 72 runs. Then, following the best hyperparameters achieved for the RotatE and DE-RotatE models, we fixed the dropout, time-agnostic and time-dependent negative ratios, batch size, warm-up steps, warm-up decay rate, number of training steps, and number of validation steps for all experiments.

To make a fair comparison, we chose a base embedding size of 128 for all experiments. Subsequently, we only report on the combinations of static embedding dimension $d$ and diachronic embedding dimension $d_t$ presented in Table 6 that sum to this base size. We evenly distribute $d_t$ among all diachronic embeddings to prevent giving any model a distinct advantage in terms of free parameters. As for the relative time embedding dimension $d_r$, we report on all the combinations in Table 6 respecting the stated restriction, resulting in a total of 17 experiments per dataset.

c.7 Runtime

Table 7 presents the average runtime of each model for every 100 steps with the batch size set to 64. All experiments were carried out on servers with 16 CPU cores, 64GB of RAM, and an NVIDIA V100/P100 GPU.

c.8 Standard Error

We use standard error to calculate confidence intervals and detect statistically indistinguishable results.

c.9 Relations Importance Matrix

Figure 1 presents the importance matrix between relations, i.e. the matrix whose $r$-th row is the relation-specific vector $\mathbf{a}_r$ learned as part of the RT-X model. From this figure, it is evident that the learned matrix is not symmetric, indicating that the model learns different importance scores conditioned on the query relation.

c.10 Implementation

We implemented our model using PyTorch (paszke2019pytorch). The source code is publicly available on GitHub (https://github.com/kahrabian/RT-X).

Event Type Head Relation (Code) Tail
Commit Comment User Actor (U_AO_CC) Commit Comment
Fork Repository Fork (R_FO_R) Repository
Issue Comment User Created (U_SO_C_IC) Issue Comment
Edited (U_SO_E_IC)
Deleted (U_SO_D_IC)
Issue Comment Created (IC_AO_C_I) Repository
Edited (IC_AO_E_I)
Deleted (IC_AO_D_I)
Issues User Opened (U_SE_O_I) Issue
Edited (U_SE_E_I)
Deleted (U_SE_D_I)
Pinned (U_SE_P_I)
Unpinned (U_SE_UP_I)
Closed (U_SE_C_I)
Reopened (U_SE_RO_I)
Assigned (U_SE_A_I)
Unassigned (U_SE_UA_I)
Locked (U_SE_LO_I)
Unlocked (U_SE_ULO_I)
Transferred (U_SE_T_I)
User Assigned (U_AO_A_I) Issue
Unassigned (U_AO_UA_I)
Issue Opened (I_AO_O_R) Repository
Edited (I_AO_E_R)
Deleted (I_AO_D_R)
Pinned (I_AO_P_R)
Unpinned (I_AO_UP_R)
Closed (I_AO_C_R)
Reopened (I_AO_RO_R)
Assigned (I_AO_A_R)
Unassigned (I_AO_UA_R)
Locked (I_AO_LO_R)
Unlocked (I_AO_ULO_R)
Transferred (I_AO_T_R)
Member User Added (U_CO_A_R) Repository
Removed (U_CO_E_R)
Edited (U_CO_R_R)
Pull Request Review Comment User Created (U_SO_C_PRC) Pull Request Review Comment
Edited (U_SO_E_PRC)
Deleted (U_SO_D_PRC)
Pull Request Review Comment Created (PRC_AO_C_P) Pull Request
Edited (PRC_AO_E_P)
Deleted (PRC_AO_D_P)
Pull Request Review User Submitted (U_SO_S_PR) Pull Request Review
Edited (U_SO_E_PR)
Dismissed (U_SO_D_PR)
Pull Request Review Submitted (PR_AO_S_P) Pull Request
Edited (PR_AO_E_P)
Dismissed (PR_AO_D_P)
Pull Request User Assigned (U_SO_A_P) Pull Request
Unassigned (U_SO_UA_P)
Review Requested (U_SO_RR_P)
Review Request Removed (U_SO_RRR_P)
Opened (U_SO_O_P)
Edited (U_SO_E_P)
Closed (U_SO_C_P)
Ready for Review (U_SO_RFR_P)
Locked (U_SO_L_P)
Unlocked (U_SO_UL_P)
Reopened (U_SO_R_P)
Synchronize (U_SO_S_P)
User Assigned (U_AO_A_P) Pull Request
Unassigned (U_AO_U_P)
User Review Requested (U_RRO_A_P) Pull Request
Review Request Removed (U_RRO_R_P)
Pull Request Assigned (P_AO_A_R) Repository
Unassigned (P_AO_UA_R)
Review Requested (P_AO_RR_R)
Review Request Removed (P_AO_RRR_R)
Opened (P_AO_O_R)
Edited (P_AO_E_R)
Closed (P_AO_C_R)
Ready for Review (P_AO_RFR_R)
Locked (P_AO_L_R)
Unlocked (P_AO_UL_R)
Reopened (P_AO_R_R)
Synchronize (P_AO_S_R)
Push User Sender (U_SO_C) Repository
Star User Created (U_HS_A_R) Repository
Deleted (U_HS_R_R)
Table 9: Extraction rules used to build the KG from raw events.