Hosting over 100 million repositories, GitHub (GH) is one of the biggest social coding platforms (githubBlog:online). Over the past decade, the artifacts hosted on GH have become one of the most important resources for software engineering (SE) researchers studying various aspects of programming, software development, and the characteristics of open source users and ecosystems (cosentino2017systematic). Example questions of interest include when an issue will be closed (kikas2016using; rees2017better), how likely a pull request is to be merged and when (gousios2014dataset; soares2015acceptance), and who should review a pull request (yu2016reviewer; hannebauer2016automatically).
Our aim in this work is to connect the above SE research questions to the literature on learning knowledge graph (KG) embeddings, with a particular emphasis on temporal KGs due to the importance of the temporal component in these questions. Methods for time prediction and time-conditioned link prediction in KGs (kazemi2020representation, Sections 5.1-5.2) are generally based on point process models (trivedi2017know; trivedi2019dyrep; knyazev2019learning) or adaptations of KG embeddings that additionally use time to compute scores (dasgupta2018hyte; leblay2018deriving; garcia2018learning; goel2019diachronic). While point processes are elegant, they are more challenging to work with and require strong assumptions on the underlying intensity functions (see e.g., trivedi2019dyrep, Equation 1 & Section 4). Thus we focus on KG embedding-based methods, in particular starting with Diachronic Embeddings (DE) (goel2019diachronic) for time-varying embeddings and RotatE (sun2019rotate) for scoring facts.
Our contributions are the following:
- Introducing two new temporal KG datasets based on GitHub events (Section 2).
- Benchmarking existing temporal KG embedding methods on the new datasets (Section 4).
- Proposing a new relative temporal KG embedding: based on the observation that existing temporal KG embedding models do not capture well the patterns in relative time that are important to SE applications, particularly in extrapolated settings (e.g., how long will it take for a pull request to go from opened to closed?), we draw inspiration from the use of relative time in attention-based neural networks such as Music Transformer (huang2018music) and Transformer-XL (dai2019transformer) (Section 3).
In total, our work brings together SE research questions and temporal KG models by introducing new datasets, benchmarking existing recent methods on the datasets, and suggesting a new direction of leveraging Transformer-style relative time modeling in KG embeddings.
|GitHub (Original) (trivedi2019dyrep)||12,328||771,214||3||366||-||-|
|GitHub (Subnetwork) (knyazev2019learning)||284||20,726||8||366||4,790||53.5|
Creation. To create the dataset, we retrieved from GH Archive (https://www.gharchive.org) all of the raw public events on GH in 2019. The knowledge graph was then constructed from tuples, each of which represents an individual event containing temporal information, based on its type and a predefined set of extraction rules. The properties of the constructed KG are shown in the first row of Table 1, referred to as GitHub-SE 1Y.
Due to the substantial size of GitHub-SE 1Y and the unrealistic computational demands of training KG embedding models on this KG, we sampled GitHub-SE 1Y using two distinct strategies, described in more detail in Appendix A.1. The first strategy aims to retain maximum temporal information about particular SE projects. To achieve this, first, an induced sub-graph containing all related nodes was extracted for each node with type Repository. Then, for each sub-graph a popularity score was calculated as $\mathcal{S} = \alpha S + \beta T$, where $S$ is the size of the graph, $T$ is the time-span of the graph, and $\alpha$ and $\beta$ are weight values. Finally, from the top three ranked repositories, we selected the Visual Studio Code repository and extracted a one-year slice, as it exercised more functionality related to the target entities in this work, i.e. issues and pull requests. We name this dataset GitHub-SE 1Y-Repo due to its repository-centric characteristics.
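As a concrete sketch, the ranking step above can be implemented as follows; the default weight values and the linear form of the score are illustrative assumptions, not the exact values used to build the dataset:

```python
def popularity_score(size: int, time_span_days: int,
                     alpha: float = 0.5, beta: float = 0.5) -> float:
    """Popularity score as a weighted combination of a sub-graph's size
    and its time-span (alpha/beta are illustrative weight values)."""
    return alpha * size + beta * time_span_days

def top_k_repositories(subgraphs: dict, k: int = 3) -> list:
    """Rank repository-induced sub-graphs, given as name -> (size, time-span),
    by popularity score and return the top-k repository names."""
    ranked = sorted(subgraphs.items(),
                    key=lambda kv: popularity_score(*kv[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]
```

A one-year slice would then be extracted for one of the returned repositories.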
The second strategy aims at preserving the most informative nodes regardless of their type. We used a variation of Snowball Sampling (goodman1961snowball) on all the events in December 2019. This sampled dataset, GitHub-SE 1M-Node, captures events across various projects and therefore can be used to answer queries such as which project a user starts contributing to at a certain time.
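A minimal sketch of snowball sampling over an event graph, assuming a simple adjacency-list representation and a bounded per-node fanout (the actual Algorithm 1 in Appendix A.1 may differ in its expansion and scoring details):

```python
import random
from collections import deque

def snowball_sample(adjacency, seeds, max_nodes, fanout=3, seed=0):
    """Starting from seed nodes, repeatedly expand a bounded number of
    random unvisited neighbours until the node budget is reached."""
    rng = random.Random(seed)
    visited = set(seeds)
    frontier = deque(seeds)
    while frontier and len(visited) < max_nodes:
        node = frontier.popleft()
        neighbours = [n for n in adjacency.get(node, []) if n not in visited]
        for n in rng.sample(neighbours, min(fanout, len(neighbours))):
            if len(visited) >= max_nodes:
                break
            visited.add(n)
            frontier.append(n)
    return visited
```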
Characteristics. In Table 1, we compare the variations of the GitHub-SE KG proposed in this work with commonly used datasets in the literature. Even the sampled-down versions of our datasets are considerably larger in terms of number of nodes. They have much lower edge-to-node ratios, which translates into sparser graphs; this sparsity level is close to what appears in GitHub as a whole. Additionally, similar to relations, each node in our datasets is also typed.
trivedi2019dyrep also collects a temporal KG dataset from GitHub. However, this dataset is exclusively focused on the social aspects of GitHub, discarding repositories and only including user-user interactions, and it does not appear to be publicly available beyond raw data and a small subnetwork extracted in a follow-up work (knyazev2019learning). To differentiate our datasets, which focus on the SE aspects of GitHub, we append -SE to the dataset names.
The distinguishing characteristics of the proposed datasets, i.e. size, sparsity, node-typing, diversity, focus on SE aspects, and temporal nature, introduce a variety of engineering and theoretical challenges that make these datasets a suitable choice for exploring and exposing the limitations of temporal knowledge graph embedding models.
3.1 Existing KG embedding models
We first examine the performance of state-of-the-art KG embedding models on the GitHub-SE 1M-Node and GitHub-SE 1Y-Repo datasets. We select RotatE (sun2019rotate) for the static setting, considering its ability to infer symmetry, antisymmetry, inversion, and composition relational patterns. Moreover, we use DE-RotatE (goel2019diachronic) for the dynamic setting, due to its superior performance on existing benchmarks and the fact that for any static model there exists an equivalent diachronic model, ensuring the ability to learn the aforementioned patterns.
Notationally, a KG is a set of tuples of the form $(s, r, o, t)$, respectively representing the subject, relation, object, and timestamp. The diachronic embeddings are defined as
$$\mathbf{z}_e^t = \left[\,\mathbf{E}_e \,;\, \mathbf{A}_e \odot \sin(\mathbf{W}_e t + \mathbf{B}_e)\,\right]$$
where $\mathbf{E}$, $\mathbf{A}$, $\mathbf{W}$, and $\mathbf{B}$ are embedding lookup tables, and the last three respectively represent the amplitude, frequency, and phase of a sinusoid. Similar to goel2019diachronic, whenever $t$ consists of multiple sets of numbers rather than one, e.g. year, month, and day, a separate $\mathbf{z}_e^t$ is defined for each set, and the values are summed up. Subsequently, the scoring function of the DE-RotatE model is defined as
$$\phi(s, r, o, t) = -\left\| \mathbf{z}_s^t \circ \mathbf{R}_r - \mathbf{z}_o^t \right\|$$
where $\mathbf{R}$ is an embedding lookup table for relations.
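A minimal numeric sketch of the diachronic embedding and the DE-RotatE score above in plain Python, treating consecutive embedding dimensions as complex pairs for the RotatE rotation (dimension sizes and values are illustrative):

```python
import math

def diachronic_embedding(static, amp, freq, phase, t):
    """Static part concatenated with a time-varying sinusoidal part,
    one (amplitude, frequency, phase) triple per temporal dimension."""
    temporal = [a * math.sin(w * t + b) for a, w, b in zip(amp, freq, phase)]
    return static + temporal

def de_rotate_score(z_s, z_o, rel_phases):
    """DE-RotatE-style score: rotate the subject embedding (viewed as
    complex pairs) by unit-modulus relation phases and take the negated
    distance to the object embedding; higher (less negative) is better."""
    dist = 0.0
    for k, theta in enumerate(rel_phases):
        sr, si = z_s[2 * k], z_s[2 * k + 1]        # subject as complex pair
        rr, ri = math.cos(theta), math.sin(theta)  # |r| = 1 rotation
        hr, hi = sr * rr - si * ri, sr * ri + si * rr
        dist += math.hypot(hr - z_o[2 * k], hi - z_o[2 * k + 1])
    return -dist
```

With a zero rotation and identical embeddings the score is maximal (zero); any rotation away from the object embedding lowers it.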
3.2 Relative Temporal Context
The idea of using relative temporal information has been successfully employed in natural language processing (vaswani2017attention; dai2019transformer) and music generation (huang2018music). These models formalize the intuition that the temporal spacing between events is more central than the absolute time at which an event happened. We believe this framing is also appropriate for SE applications of temporal knowledge graphs: to predict whether a pull request is closed at time $t$, it is more important to know how long it has been since the pull request was opened than it is to know the absolute value of $t$.
A challenge is that there are many events, and we do not want to hard-code which durations are relevant. Instead, we would like the model to learn which temporal durations are important for scoring a temporal fact. As the number of facts related to an entity can be as high as a few thousand, we propose to pick a fixed number of facts as the temporal context provided as input to the models.
Let $\mathcal{T}_{e,r}^{t}$ be the set of times associated with facts involving entity $e$ and relation $r$ occurring before time $t$, and let $\delta_{e,r}^{t} = t - \max \mathcal{T}_{e,r}^{t}$ be the relative time since a fact involving $e$ and relation $r$ has most recently occurred. Hence, an entity's relative temporal context at query time $t$ is $\boldsymbol{\delta}_{e}^{t} = \big[\delta_{e,r_1}^{t}, \ldots, \delta_{e,r_{|R|}}^{t}\big]$.
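The relative temporal context can be computed directly from the event history; this sketch assumes facts are stored as (subject, relation, object, timestamp) tuples and returns None for relations with no earlier fact:

```python
def relative_temporal_context(history, entity, relations, t):
    """For each relation, the time elapsed since the most recent fact
    before time t involving (entity, relation); None if no such fact."""
    context = {}
    for r in relations:
        times = [ts for (s, rel, o, ts) in history
                 if rel == r and entity in (s, o) and ts < t]
        context[r] = (t - max(times)) if times else None
    return context
```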
3.3 Relative Time DE-RotatE (RT-DE-Rotate)
We now turn attention to using the relative temporal context as an input to temporal KG embeddings. Our inspiration is the Transformer encoder, which has emerged as a successful substitute for more traditional Recurrent Neural Network approaches in sequential (vaswani2017attention; dai2019transformer; huang2018music) and structural (parmar2018image) tasks. The core idea is to employ a variation of the attention mechanism, called self-attention, that assigns importance scores to the elements of the same sequence.
Unlike recurrence-based models, Transformer-style encoders inject positional information either by 1) adding sine/cosine functions of different frequencies to the input (vaswani2017attention; dai2019transformer), or 2) directly infusing relative distance information into the attention computation in the form of a matrix addition (shaw2018self; huang2018music). vaswani2017attention introduced a positional encoding scheme in the form of sinusoidal vectors defined as
$$PE_{(pos,\, i)} = \begin{cases} \sin\!\big(pos / 10000^{\,i/d}\big) & \text{if } i \text{ is even} \\ \cos\!\big(pos / 10000^{\,(i-1)/d}\big) & \text{if } i \text{ is odd} \end{cases}$$
where $pos$ is the absolute position and $d$ is the embedding dimension. In the follow-up Transformer-XL model, dai2019transformer introduce a reparameterization of the relative attention, where the attention score between a query element at position $i$ and a key element at position $j$ is defined as
$$\mathbf{A}_{i,j}^{\mathrm{rel}} = \underbrace{\mathbf{E}_{x_i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_{k,E}\,\mathbf{E}_{x_j}}_{(a)} + \underbrace{\mathbf{E}_{x_i}^{\top}\mathbf{W}_q^{\top}\mathbf{W}_{k,R}\,\mathbf{R}_{i-j}}_{(b)} + \underbrace{\mathbf{u}^{\top}\mathbf{W}_{k,E}\,\mathbf{E}_{x_j}}_{(c)} + \underbrace{\mathbf{v}^{\top}\mathbf{W}_{k,R}\,\mathbf{R}_{i-j}}_{(d)}$$
where $\mathbf{E}_{x_i}$ and $\mathbf{E}_{x_j}$ are the $i$-th and $j$-th element embeddings, $\mathbf{W}_q$, $\mathbf{W}_{k,E}$, $\mathbf{W}_{k,R}$, $\mathbf{u}$, and $\mathbf{v}$ are trainable parameters, and $\mathbf{R}_{i-j}$ is the sinusoidal encoding of the relative position between $i$ and $j$.
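The sinusoidal positional encoding above can be sketched as follows; this is the standard formulation, with each even/odd index pair sharing one frequency:

```python
import math

def sinusoidal_pe(pos, dim):
    """Sinusoidal positional encoding: even indices use sine, odd
    indices use cosine of the same frequency 1 / 10000^(2k/dim)."""
    pe = []
    for i in range(dim):
        angle = pos / (10000 ** ((i - i % 2) / dim))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe
```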
Table 2: Performance comparison on time-conditioned link prediction. Results within the 95% confidence interval of the best are bolded.
The main difference in our setting is that the above models compute a score based on a single relative time $i - j$, while our relative temporal context contains one relative time per relation for each entity. Our approach is to score a tuple $(s, r, o, t)$ based on the information available at query time $t$. (During training, $t$ is the timestamp of the positive sample; during evaluation, $t$ is set to the maximum timestamp in the training set.) For each entity $e$ we define a positional embeddings matrix $\mathbf{P}_{e}^{t}$ of relative times between $t$ and the events in its relative temporal context as
$$\mathbf{P}_{e}^{t}[j] = PE\big(t - t_j\big),$$
where $t_j$ is the time of the most recent fact before $t$ involving $e$ and relation $r_j$, and $PE$ is the sinusoidal encoding above. Intuitively, these relative times encode "if the event happened at time $t$, how long would it have been since the events in the relative time context." A learned, relation-specific row vector $\mathbf{a}_r$ chooses which rows of $\mathbf{P}_{e}^{t}$ are important, and then $\mathbf{a}_r \mathbf{P}_{e}^{t}$, abbreviated $\mathbf{p}_{e,r}^{t}$, embeds the relative temporal context of $e$, replacing $\mathbf{R}_{i-j}$:
$$\phi(s, r, o, t) = \underbrace{(\mathbf{z}_s^t)^{\top}\mathbf{W}_r\,\mathbf{z}_o^t}_{(a)} + \underbrace{(\mathbf{z}_s^t)^{\top}\mathbf{W}_1\,(\mathbf{p}_{o,r}^{t})^{\top}}_{(b)} + \underbrace{\mathbf{p}_{s,r}^{t}\,\mathbf{W}_2\,\mathbf{z}_o^t}_{(c)}$$
where $\mathbf{W}_r$ is a relation-specific weight matrix and $\mathbf{W}_1$ and $\mathbf{W}_2$ are tuple-agnostic weight matrices; however, this formulation is only suitable for bilinear models. Hence, we derive a translational variation for the DE-RotatE model as
$$\phi(s, r, o, t) = -\left\| \underbrace{\mathbf{z}_s^t \circ \mathbf{R}_r - \mathbf{z}_o^t}_{(a)} + \underbrace{\mathbf{p}_{o,r}^{t}\,\mathbf{W}_1}_{(b)} - \underbrace{\mathbf{p}_{s,r}^{t}\,\mathbf{W}_2}_{(c)} \right\|$$
where $\mathbf{W}_r$ is replaced by the relation embedding lookup table $\mathbf{R}$. Intuitively, under this formulation term (a) captures entity compatibility, and terms (b) and (c) capture entity-specific temporal context compatibility. In comparison, the existing models only include term (a), discarding terms (b) and (c).
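An illustrative numeric sketch of the translational score: the first term is the DE-RotatE rotation distance, and the context terms add the relation-selected temporal context embeddings. The per-dimension scale vectors `w1` and `w2` are a deliberate simplification of the tuple-agnostic projection matrices, used here only for illustration:

```python
import math

def rt_de_rotate_score(z_s, z_o, rel_phases, p_s, p_o, w1, w2):
    """Translational RT-style score: rotation distance between subject
    and object embeddings, shifted by the (scaled) temporal context
    embeddings p_o and p_s before taking the norm."""
    dist = 0.0
    for k, theta in enumerate(rel_phases):
        sr, si = z_s[2 * k], z_s[2 * k + 1]
        rr, ri = math.cos(theta), math.sin(theta)
        hr, hi = sr * rr - si * ri, sr * ri + si * rr  # rotated subject
        dr = hr - z_o[2 * k] + w1[2 * k] * p_o[2 * k] - w2[2 * k] * p_s[2 * k]
        di = (hi - z_o[2 * k + 1] + w1[2 * k + 1] * p_o[2 * k + 1]
              - w2[2 * k + 1] * p_s[2 * k + 1])
        dist += math.hypot(dr, di)
    return -dist
```

With zero context embeddings this reduces to the plain DE-RotatE score.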
Datasets: We use a 90%-5%-5% event split for constructing the train, validation, and test sets. For the interpolated queries the split was done randomly, whereas for the extrapolated queries we split the data by event timestamps. Table 3 in the Appendix presents details of the splits.
Queries: For time-conditioned link prediction, we selected events related to the resolution of GitHub issues and pull requests due to their direct impact on software development and maintenance practice. In particular, we used “Who will close issue X at time T?” and “Who will close pull request X at time T?” for evaluation. For time prediction, we used the analogous time queries, i.e. “When will issue X be closed by user Y?” and “When will pull request X be closed by user Y?”.
Evaluation and Results:
We calculated the standard metrics to evaluate model performance on the test set. For the extrapolated time-conditioned link prediction queries, after using the validation set for hyperparameter tuning, we retrained the selected models on both the training and validation sets for evaluation. We also report the model performance without retraining in Appendix Table 5.
In Table 2 we compare the model performance on the time-conditioned link prediction queries. On the Github-SE 1M-NODE queries, our model slightly outperforms existing models in some cases, but the difference is statistically insignificant in others. On the Github-SE 1Y-REPO, on the other hand, our RT-DE-ROTATE model shows a significant performance boost, particularly on the extrapolated time-conditioned link prediction queries, indicating the importance of using relative time as temporal context.
For the extrapolated time prediction queries on the GITHUB-SE 1Y-REPO dataset, our model performed slightly better on HITS@1, HITS@3, and Mean Reciprocal Rank (MRR) than the other existing models, while only marginally surpassing the random baseline on all metrics. These results, detailed in Appendix Table 8, stress the need for further study of extrapolated time prediction queries.
In this work, we build a bridge between SE domain questions and the literature on KG embedding models by introducing two new datasets based on daily interactions on the GH platform and casting those questions as queries on an appropriate KG. Furthermore, we introduce RT-X, a novel extension to existing KG embedding models that makes use of relative time with regard to past relevant events. Our experiments highlight shortcomings of existing temporal KG embedding models, notably on extrapolated time-conditioned link prediction, and exhibit the advantage of leveraging relative time as introduced in the RT-X model. In total, this work highlights new opportunities for improving temporal KG embedding models on time prediction queries.
We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC). This research was enabled in part by support provided by Google, Calcul Québec, and Compute Canada. We thank Daniel Johnson for helpful comments.
Appendix A Dataset
Algorithm 1 describes the snowball sampling used to create the GitHub-SE 1M-NODE dataset. This algorithm aims at preserving the most informative nodes regardless of their type. Moreover, Algorithm 2 describes the temporal sampling used to create the GitHub-SE 1Y-REPO dataset. This algorithm aims at preserving maximum temporal information regarding particular repositories.
Table 9 presents the set of extraction rules used to build the KG from raw events, each rule representing a relation type. Although 80 extraction rules are defined in Table 9, the raw events that we used only contained 18 of them.
The codes presented in the Relation column of Table 9, when split on underscores, are interpreted as follows: a) the first and the last components respectively represent the entity types of the event's subject and object; b) AO, CO, SE, SO, and HS are abbreviations of information extracted from raw payloads (https://developer.github.com/webhooks/event-payloads/), serving to distinguish different relation types among entities; and c) the second-to-last component represents the concrete action that triggers the event.
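This convention can be sketched as a small parser; the treatment of three-component codes (which carry no action component) is our illustrative reading, not a rule stated in the paper:

```python
# Payload-derived distinguishers listed in the convention above.
DISTINGUISHERS = {"AO", "CO", "SE", "SO", "HS"}

def parse_relation_code(code: str) -> dict:
    """Split a relation code like "U_SO_C_IC" into head/tail entity
    types, the triggering action, and any payload distinguishers."""
    parts = code.split("_")
    middle = parts[1:-1]
    # The second-to-last component is the action, unless it is a
    # known distinguisher (as in short codes like "U_AO_CC").
    action = middle[-1] if middle and middle[-1] not in DISTINGUISHERS else None
    return {
        "head_type": parts[0],
        "tail_type": parts[-1],
        "action": action,
        "distinguishers": [p for p in middle if p in DISTINGUISHERS],
    }
```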
Appendix B Model
Table 4 presents a time and space complexity comparison between the existing models and the introduced RT-X model. Notice that, while yielding superior performance, the number of free parameters introduced in our extension does not increase linearly with the number of entities, which is one of the bottlenecks of training large KG embedding models.
|Model||Computational Complexity||Free Parameters Complexity|
B.2 Loss Function
Appendix C Experiments
C.1 Time Prediction
To evaluate the time prediction queries, we consider as the candidate set the dates in the min-max timestamp range of the set being evaluated.
C.2 Model Selection
The best model is selected using the MRR on the validation set, and HITS@N, Mean Rank (MR), and MRR on the test set are reported.
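These metrics can all be computed from the rank of the correct entity in each test query, e.g.:

```python
def ranking_metrics(ranks, ns=(1, 3, 10)):
    """Standard link-prediction metrics from a list of ranks of the
    correct entity: Mean Rank (MR), Mean Reciprocal Rank (MRR),
    and HITS@N for each cutoff N."""
    total = len(ranks)
    metrics = {
        "MR": sum(ranks) / total,
        "MRR": sum(1.0 / r for r in ranks) / total,
    }
    for n in ns:
        metrics[f"HITS@{n}"] = sum(r <= n for r in ranks) / total
    return metrics
```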
C.3 Negative Sampling
We follow the negative sampling scheme employed by dasgupta2018hyte, providing the model with both time-agnostic and time-dependent negative samples.
We apply L3 regularization, parameterized by $\lambda$, as introduced in lacroix2018canonical, on the embedding lookup tables.
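A minimal sketch of the L3 regularizer, applied here to flat parameter lists for illustration (in practice it would be added to the training loss over the embedding tensors):

```python
def l3_regularization(params, lam):
    """L3 regularizer in the style of lacroix2018canonical:
    lam * sum of cubed absolute values over all parameter lists."""
    return lam * sum(abs(x) ** 3 for p in params for x in p)
```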
C.5 Re-ranking Heuristics
We employed two re-ranking heuristics at evaluation time for time-conditioned link prediction. First, each entity was only evaluated among entities of the same type. Second, we push down the ranks of entities with prior interactions with the given entity.
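A sketch of the two heuristics, assuming candidate entities are tagged with their type and each typed candidate has a model score; the exact "push down" mechanics may differ in the actual evaluation code — here previously-interacting entities are simply moved below the rest:

```python
def rerank(candidates, query_type, prior_interactions, scores):
    """Apply both heuristics: (1) keep only candidates of the queried
    entity type; (2) move candidates with prior interactions with the
    query entity below all others, each group sorted by model score."""
    typed = [e for e, t in candidates.items() if t == query_type]
    fresh = [e for e in typed if e not in prior_interactions]
    seen = [e for e in typed if e in prior_interactions]
    by_score = lambda e: -scores[e]
    return sorted(fresh, key=by_score) + sorted(seen, key=by_score)
```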
Initially, we tuned our models using the hyperparameter ranges reported in Table 6 for dropout, , , and , resulting in a total of 72 runs. Then, following the best hyperparameters achieved on the RotatE and DE-RotatE models, we used dropout = , , , , , time-agnostic negative ratio = , time-dependent negative ratio = , batch size = , warm-up steps = , warm-up decay rate = , steps = , and validation steps = for all experiments.
To make a fair comparison, we chose a base embedding size of 128 for all experiments. Subsequently, we only report on the combinations of static embedding dimension values and diachronic embedding dimension values presented in Table 6 where . We evenly distribute among all diachronic embeddings to prevent giving any model a distinct advantage in terms of free parameters. As for the relative time embedding dimension , we report on all the combinations in Table 6 with and respecting the stated restriction, resulting in a total of 17 experiments per dataset.
Table 7 presents the average runtime of each model for every 100 steps with the batch size set to 64. All experiments were carried out on servers with 16 CPU cores, 64GB of RAM, and an NVIDIA V100/P100 GPU.
C.8 Standard Error
We use standard error to calculate confidence intervals and detect statistically indistinguishable results.
C.9 Relations Importance Matrix
Figure 1 presents the importance matrix between relations, i.e. , learned as part of the RT-X model. From this figure, it is evident that the learned matrix is not symmetric, indicating that the model learns different importance scores conditioned on the query relation.
We implemented our model using PyTorch (paszke2019pytorch). The source code is publicly available on GitHub (https://github.com/kahrabian/RT-X).
|Event Type||Head||Relation (Code)||Tail|
|Commit Comment||User||Actor (U_AO_CC)||Commit Comment|
|Issue Comment||User||Created (U_SO_C_IC)||Issue Comment|
|Issue Comment||Created (IC_AO_C_I)||Repository|
|Pull Request Review Comment||User||Created (U_SO_C_PRC)||Pull Request Review Comment|
|Pull Request Review Comment||Created (PRC_AO_C_P)||Pull Request|
|Pull Request Review||User||Submitted (U_SO_S_PR)||Pull Request Review|
|Pull Request Review||Submitted (PR_AO_S_P)||Pull Request|
|Pull Request||User||Assigned (U_SO_A_P)||Pull Request|
|Review Requested (U_SO_RR_P)|
|Review Request Removed (U_SO_RRR_P)|
|Ready for Review (U_SO_RFR_P)|
|User||Assigned (U_AO_A_P)||Pull Request|
|User||Review Requested (U_RRO_A_P)||Pull Request|
|Review Request Removed (U_RRO_R_P)|
|Pull Request||Assigned (P_AO_A_R)||Repository|
|Review Requested (P_AO_RR_R)|
|Review Request Removed (P_AO_RRR_R)|
|Ready for Review (P_AO_RFR_R)|