Modern software is becoming increasingly complicated. Its development usually requires the collaboration of a team and depends on a large number of third-party software packages, which has promoted the wide adoption of the social collaborative coding paradigm. Social collaborative coding platforms such as GitHub have emerged to provide developers with abundant functionality for social collaboration and technical development, and they have produced a large number of high-quality open source software packages. According to GitHub (https://github.com/about), more than 56 million developers were collaborating on more than 100 million software projects as of March 2021.
While the explosive growth of open source software packages significantly fuels the prosperity of the software industry, it also exposes developers to the challenge of information overload. Developers often need to spend much time searching for the software packages they are interested in. To address this challenge, it is essential to introduce recommendation systems, which have proven powerful in dealing with the information overload problem in various fields [rendle2012bpr, he2017neural, chen2016minimizing, zhang2021privacy, zhang2019covering].
Recently, conventional recommendation models have been applied to software recommendation [ichii2009software, thung2013automated, he2020diversified, zhang2019location], but they usually consider neither the dynamics of developers’ interests [jiang2020adapting] nor the particular constraints of social coding, such as social influence among developers and dependency relations among software packages. Over a developer’s career, his/her technology interests gradually evolve due to the emergence of new technology or new development requirements. In the environment of social collaborative coding, a new technology choice is usually influenced by the developer’s friends and constrained by the dependency relations among software packages. Taking Figure 1 as an example, frontend developer A focused on the Angular-based technical stack during the first period and changed to the Vue-based technical stack in the following period because Vue makes frontend development more convenient and efficient. In each period, the base technology, i.e., Angular or Vue, constrains developer A’s choices to its own field, and within these constraints developer A consults friends who use the same base technology for recommendations.
In this article, we focus on modeling the dynamic interests of developers with both social influence and dependency constraints, and propose the Session-based Social and Dependency-aware software Recommendation (SSDRec) model. Our main contributions are summarized as follows:
We propose to model the dynamic interests of developers with both social influence from friends and dependency constraints among software packages.
We develop a unified framework to integrate a recurrent neural network and two graph attention networks. The recurrent neural network models the short-term dynamic interests of developers in each session and the two graph attention networks capture social influence from friends and dependency constraints from dependent software packages, respectively.
The rest of this article is organized as follows. In Section 2, we briefly review the important work related to this paper. Section 3 provides some preliminaries of the article. Our proposed model SSDRec is described in Section 4 and the experiment results are shown in Section 5. Finally, we draw conclusions in Section 6.
2 Related work
In this section, we will briefly review the related works in dynamic recommendation, social recommendation and software recommendation.
2.1 Dynamic recommendation
As users’ interests dynamically evolve, conventional recommendation models that capture users’ long-term static interests are no longer applicable, and various dynamic recommendation models have been proposed. For example, an earlier work utilized Markov chains in successive point-of-interest recommendation [cheng2013you]. Recently, recurrent neural networks (RNNs) have been exploited to model users’ dynamic interests from their recent behaviors. Manotumruksa et al. [manotumruksa2017deep] integrated the sequential time information of users’ behaviors into matrix factorization (MF) with an RNN. Dong et al. [dong2018recurrent] further performed joint optimization of RNN and MF with shared parameters in a multi-task learning framework.
Furthermore, instead of modeling the behavior sequence as a whole as the above models do, session-based recommendation models segment users’ behavior sequences into several sessions to model users’ dynamic interests at a finer granularity. Hidasi et al. [hidasi2015session] first proposed an RNN-based approach for session-based recommendation to capture users’ short-term dynamic interests within a session. Then, based on the assumption that a session often serves different purposes, Wang et al. [wang2019modeling] proposed a mixture-channel model with an attention mechanism to detect the purpose of each item. In addition to modeling interactions within a session, information across sessions has also been introduced. Ruocco et al. [ruocco2017inter] exploited two separate RNNs to process the current session and the past sessions, respectively.
2.2 Social recommendation
Social recommendation utilizes social network information to enhance the performance of recommendation models. Conventionally, this information is exploited through hand-crafted features. Ma et al. [2011Recommender] regularized the matrix factorization framework with social network information. Zhao et al. [2014Leveraging] leveraged friends’ interactions as another form of positive feedback for Bayesian Personalized Ranking (BPR).
Recently, deep neural networks have been adopted to process social network information in recommendation models instead of hand-crafted features. Deng et al. [deng2016deep] utilized an autoencoder to initialize vectors in MF and updated them with both social trust ensemble and community effect. With the successful application of graph neural networks (GNNs) in recommendation systems [wang2019neural], Fan et al. [fan2019graph] coherently modeled two graphs and heterogeneous strengths to jointly capture interactions and opinions in the user-item graph. To distinguish the different social influence of different friends in the social network and to model the evolution of social influence, Song et al. [song2019session] unified a recurrent neural network and a graph attention network into one framework.
2.3 Software recommendation
Software recommendation has emerged as an active research field with the rapid development of social collaborative coding platforms such as GitHub and the explosive growth of third-party software packages. Conventional software recommendation usually employed collaborative filtering-based recommendation models [rendle2012bpr, he2017neural]. For example, Ichii et al. [ichii2009software] used collaborative filtering to recommend similar software packages to developers. Thung et al. [thung2013automated] further combined association rules and collaborative filtering to capture deeper relationships between software packages. He et al. [he2020diversified] employed an adaptive weighting mechanism and neighborhood information to neutralize popularity bias in MF, which significantly increased both the diversity and accuracy of the recommendation results.
As social collaborative coding becomes popular and development stages change rapidly, some software recommendation models have begun to employ social influence or developers’ dynamic behaviors. Guy et al. [guy2009personalized] aggregated developers’ familiarity network and similarity network to recommend social software. Jiang et al. [jiang2020adapting] adopted a time decay factor and operation behavior weights to model developers’ dynamic interests in a popular programming platform, Scratch. However, existing works usually model social influence and developers’ dynamic behaviors separately, and they usually ignore the dependency relations among software packages.
In this section, we will first introduce some necessary definitions and formulate the problem. The main notations are summarized in Table 1. Generally, sets, vectors and matrices are denoted as upper-case letters, bold lower-case letters and bold upper-case letters, respectively.
Table 1: Main notations.
|The set of developers|
|The set of software packages|
|The dependency network and the social network|
|The session of a developer during a time period|
|A software package’s one-hop neighbors in the dependency network|
|A developer’s one-hop neighbors in the social network|
|The embeddings of a software package and a developer in a given layer|
|The final embeddings of a software package and a developer|
|The aggregation weight matrices for the dependency and social networks in a given layer|
|The transformation matrices for the dependency and social networks|
Definition 1 (Session)
A session is an ordered set of software packages with which a developer has interacted within a specific time period. Given the set of developers and the set of software packages, the session of a developer during a time period is the ordered set of software packages that the developer watched within that time period.
In the software development community, developers need to keep investigating new software packages as new development requirements arise or new techniques emerge. In particular, developers pay attention to different technical fields during different time periods. Thus, developers’ historical interactions with software packages need to be segmented into groups by time period, both to represent their different interests at a fine granularity and to capture the evolution of those interests.
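The segmentation described above can be sketched as follows. This is a minimal illustration, not the authors’ preprocessing code: the function name `segment_sessions`, the `(developer, package, timestamp)` event layout and the fixed window length `period_days` are all assumptions made for the example.

```python
from collections import defaultdict
from datetime import datetime

def segment_sessions(interactions, period_days=7):
    """Group (developer, package, timestamp) events into per-period sessions.

    Returns {developer: {period_index: [packages in time order]}}.
    The fixed-length window is an assumption made for this sketch.
    """
    if not interactions:
        return {}
    origin = min(ts for _, _, ts in interactions)
    sessions = defaultdict(lambda: defaultdict(list))
    # Sort by time so that each session keeps the order of interactions.
    for dev, pkg, ts in sorted(interactions, key=lambda event: event[2]):
        period = (ts - origin).days // period_days
        sessions[dev][period].append(pkg)
    return {dev: dict(periods) for dev, periods in sessions.items()}

events = [
    ("alice", "angular", datetime(2020, 1, 1)),
    ("alice", "rxjs", datetime(2020, 1, 3)),
    ("alice", "vue", datetime(2020, 1, 10)),
    ("bob", "laravel", datetime(2020, 1, 2)),
]
sessions = segment_sessions(events, period_days=7)
```

With a seven-day window, the first two events of `alice` fall into one session and the third into the next, which mirrors the idea that interests are stable within a period and may shift between periods.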
Definition 2 (Dependency constraint)
The dependency network formed by the dependency relations among software packages constrains a developer’s future choices to the near neighbors of his/her previously chosen software packages.
As the complexity of modern software grows, reusing third-party software packages has become a popular paradigm in modern software engineering. For example, to enable OAuth login using accounts from third-party social network platforms such as Facebook and Twitter, it is better to reuse well-tested third-party OAuth packages such as Laravel Socialite (https://github.com/laravel/socialite), which not only improves the efficiency of development but also helps guarantee stability and security. Thus complicated dependency relations appear and form the dependency network. These dependency relations often constrain the technical choices of developers, because developers often focus on specific technical fields, which are in fact sub-networks of the whole dependency network.
Definition 3 (Social influence)
The choices made by a developer’s friends in the developer social network influence that developer’s future choices.
Similar to the dependency relations among software packages, developers on social collaborative coding platforms like GitHub usually follow other developers and thereby build the social network. The friends’ activities, such as watching or forking a new software project, then appear in the developer’s timeline and influence the developer’s future technical choices.
Problem 1 (Session-based software recommendation)
Given a new session of a developer, the goal of session-based software recommendation is to recommend a set of software packages, drawn from the set of all software packages, that the developer is most likely to watch in the next step within that session.
In this article, we focus on the session-based software recommendation by simultaneously considering both dependency constraint among software packages and social influence among developers.
In this section, we present the proposed Session-based Social and Dependency-aware software Recommendation model SSDRec in detail. The overview of the model is shown in Figure 2. It is composed of four components: (1) Dependency constraint: we use a graph attention network to capture the dependency relations among software packages and obtain the embedding of each software package. (2) Dynamic interest modeling: we use a recurrent neural network to model the sequence of software packages in a session and obtain the embedding of the developer’s dynamic interests in this session. (3) Social influence: we use another graph attention network to capture the social influence of neighbor developers and obtain the final embedding of each developer. (4) Recommendation: we use softmax to estimate the probability that a developer will choose a given software package.
4.1 Dependency constraint
The dependency constraint can be regarded as domain knowledge specific to software recommendation. We encode this domain knowledge in the embeddings of software packages using a graph attention network.
First, the initial embedding of each software package is obtained by multiplying its one-hot encoding by a learnable transformation matrix.
Then the attention mechanism is utilized to distinguish the different dependency constraints imposed by different neighbor software packages. In each layer of the graph attention network, an attention weight is calculated between a software package and each of its one-hop neighbors in the dependency network. Note that we also add a self-connection edge to preserve the embedding of the software package itself.
Finally, the embedding of the software package is updated by a weighted aggregation of the embeddings of its neighbor packages from the previous layer. The aggregated vector is transformed by the aggregation weight matrix of that layer and passed through a nonlinear activation function.
After all layers have aggregated the dependency constraints, the resulting embedding is concatenated with the original embedding and passed through a linear transformation matrix to produce the final embedding of the software package.
The overall procedure is summarized in Algorithm 1.
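One aggregation step of this kind can be sketched in plain Python. The sketch below is illustrative only: the dot-product attention score, the ReLU activation and all names (`attention_layer`, `emb`, `neighbors`, `W`) are assumptions chosen for the example and are not taken from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_layer(emb, neighbors, W):
    """One attention-weighted aggregation step over a graph.

    emb: {node: vector}, neighbors: {node: [neighbor nodes]},
    W: aggregation weight matrix given as a list of rows.
    """
    new_emb = {}
    for node, h in emb.items():
        # The self-connection keeps the node's own embedding in the mix.
        cand = [node] + [n for n in neighbors.get(node, []) if n != node]
        # Assumed scoring function: dot product between embeddings.
        weights = softmax([dot(h, emb[c]) for c in cand])
        agg = [sum(w * emb[c][k] for w, c in zip(weights, cand))
               for k in range(len(h))]
        # Linear map followed by ReLU (assumed activation).
        new_emb[node] = [max(0.0, dot(row, agg)) for row in W]
    return new_emb

emb = {"a": [1.0, 0.0], "b": [0.0, 1.0]}
neighbors = {"a": ["b"]}
identity = [[1.0, 0.0], [0.0, 1.0]]
out = attention_layer(emb, neighbors, identity)
```

Stacking two such calls gives a two-layer network, and concatenating the result with the initial embedding before a final linear map would correspond to the last step described above.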
4.2 Dynamic interest
Developers’ interests gradually evolve as development requirements change and new technology emerges. We capture developers’ dynamic interests in each session by modeling the sequence of software packages within each session using a recurrent neural network (RNN).
Given the session of a developer in a time period, the embeddings of the software packages in this session are first obtained from the previous step and then fed to an RNN. The RNN recurrently learns a hidden representation from the sequence by taking into account both the current input package and the previous input packages, producing a hidden representation of fixed dimension at each step.
There are various RNN models, and we choose the long short-term memory (LSTM) model. Details of the LSTM are given in Equation (6), in which the weight matrices and bias vectors are all model parameters and the gates are combined by element-wise products.
The overall procedure is summarized in Algorithm 2.
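For readers who want to see the recurrence concretely, the following is a minimal sketch of a standard LSTM cell applied to a session. It follows the textbook LSTM formulation rather than the authors’ code; the parameter layout `params[gate] = (W, U, b)` and the function names are assumptions for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W, U, b, x, h):
    """Compute W x + U h + b for list-based matrices and vectors."""
    return [
        sum(w * xi for w, xi in zip(w_row, x))
        + sum(u * hi for u, hi in zip(u_row, h))
        + bi
        for w_row, u_row, bi in zip(W, U, b)
    ]

def lstm_step(params, x, h_prev, c_prev):
    """One LSTM step; params maps gate name to (W, U, b)."""
    i = [sigmoid(v) for v in affine(*params["i"], x, h_prev)]    # input gate
    f = [sigmoid(v) for v in affine(*params["f"], x, h_prev)]    # forget gate
    o = [sigmoid(v) for v in affine(*params["o"], x, h_prev)]    # output gate
    g = [math.tanh(v) for v in affine(*params["g"], x, h_prev)]  # candidate
    # Element-wise products combine the gates with the cell state.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c

def encode_session(params, package_embeddings, hidden_dim):
    """Run the LSTM over a session and return the last hidden state."""
    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    for x in package_embeddings:
        h, c = lstm_step(params, x, h, c)
    return h

zero = ([[0.0]], [[0.0]], [0.0])
params = {"i": zero, "f": zero, "o": zero, "g": ([[0.0]], [[0.0]], [1.0])}
h_last = encode_session(params, [[0.3]], hidden_dim=1)
```

The last hidden state returned by `encode_session` plays the role of the developer’s dynamic interest embedding for that session.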
4.3 Social influence
On social collaborative coding platforms like GitHub, a developer can follow other developers so that their recent activities appear in his/her timeline. Thus, the choices of neighbor developers can influence the developer’s own choices. We capture this social influence by applying a graph attention network to the social network of developers.
First, for a given developer in a given time period, the initial embedding of each neighbor is obtained by combining the neighbor’s one-hot encoding with his/her dynamic interest embedding from the previous time period, each mapped by its own transformation matrix.
For the target developer, in contrast, we initialize the embedding with his/her dynamic interest embedding in that time period.
Then, as with the dependency constraints, the attention mechanism is utilized to distinguish the different social influence of different neighbor developers. In each layer of the graph attention network, an attention weight is calculated between the developer and each of his/her one-hop neighbors in the social network.
Finally, the embedding of the developer in the time period is updated by a weighted aggregation of the embeddings of the neighbor developers from the previous layer. The aggregated vector is transformed by the aggregation weight matrix of that layer and passed through a nonlinear activation function.
After all layers have aggregated the social influence, the resulting embedding is concatenated with the initial embedding and passed through a linear transformation matrix to produce the final embedding of the developer.
The overall procedure is summarized in Algorithm 3.
After obtaining the embeddings of the software packages and of the developer, we employ the softmax function to estimate the probability that the developer will choose each software package, and we recommend the top-K software packages to the developer.
To learn the model parameters, we maximize the log-likelihood of the observed packages in all sessions, summing over every step of every session of every developer up to the length of that session.
The overall training procedure is summarized in Algorithm 4.
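The scoring and training objective can be illustrated with a small sketch. It assumes, purely for the example, that the score fed to the softmax is the dot product between the developer embedding and each package embedding; the function names are also hypothetical.

```python
import math

def choice_probabilities(dev_emb, pkg_embs):
    """Softmax over assumed dot-product scores between developer and packages."""
    scores = {p: sum(a * b for a, b in zip(dev_emb, e))
              for p, e in pkg_embs.items()}
    m = max(scores.values())
    exps = {p: math.exp(s - m) for p, s in scores.items()}
    total = sum(exps.values())
    return {p: v / total for p, v in exps.items()}

def recommend_top_k(dev_emb, pkg_embs, k):
    """Return the k packages with the highest estimated probability."""
    probs = choice_probabilities(dev_emb, pkg_embs)
    return sorted(probs, key=probs.get, reverse=True)[:k]

def log_likelihood(steps):
    """Sum of log-probabilities of the observed packages.

    steps: iterable of (developer embedding, package embeddings, observed package).
    Training would maximize this quantity over all sessions.
    """
    return sum(math.log(choice_probabilities(d, e)[obs]) for d, e, obs in steps)

packages = {"vue": [1.0, 0.0], "angular": [0.0, 1.0], "react": [0.5, 0.5]}
developer = [2.0, 0.0]
top2 = recommend_top_k(developer, packages, k=2)
ll = log_likelihood([(developer, packages, "vue")])
```

In practice the negative of `log_likelihood` would be minimized with a gradient-based optimizer; the sketch only shows how the quantity is assembled.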
In this section, we conduct experiments to answer the following research questions:
RQ1 Does SSDRec outperform the baseline methods on all experimental settings?
RQ2 Do the components of SSDRec enhance its effectiveness by (a) modeling social influence and dependency constraints and (b) capturing friends’ dynamic and static interests?
RQ3 How do some hyper-parameters affect the performance of SSDRec?
In the remainder of this section, we first describe the experimental settings (Section 5.1) and then answer the above research questions (Sections 5.2, 5.3 and 5.4). Finally, several illustrative examples are given to verify our assumptions (Section 5.5).
5.1 Experimental Settings
|Avg. friends per developer||8.59||13.05||16.24|
|Avg. dependencies per package||4.56||5.09||10.09|
|Avg. session length||2.95||3.13||3.25|
Each dataset is then split into training, validation and test sets. The sessions of each developer are sorted by time in ascending order, and the sessions within the last two years (about 104 weeks) are randomly split into validation and test sets. The remaining sessions are used as the training set. Note that when splitting the datasets, we ensure that all packages in the validation and test sets appear in the training set. The detailed statistics of the three datasets are shown in Table 2.
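A split of this kind might look like the sketch below. The `(developer, week_index, packages)` session layout, the even validation/test split and the choice to drop recent sessions that contain unseen packages are assumptions made for illustration; the text does not specify how the constraint is enforced.

```python
import random

def split_sessions(sessions, cutoff_week, rng):
    """Split sessions into train / validation / test by a time cutoff.

    sessions: list of (developer, week_index, packages).
    Sessions at or after cutoff_week are shuffled and divided evenly.
    """
    train = [s for s in sessions if s[1] < cutoff_week]
    recent = [s for s in sessions if s[1] >= cutoff_week]
    # Keep only recent sessions whose packages all occur in training data
    # (an assumed way of enforcing the constraint described in the text).
    known = {p for _, _, pkgs in train for p in pkgs}
    recent = [s for s in recent if all(p in known for p in s[2])]
    rng.shuffle(recent)
    half = len(recent) // 2
    return train, recent[:half], recent[half:]

data = [
    ("a", 0, ["p1", "p2"]),
    ("b", 1, ["p2"]),
    ("a", 10, ["p1"]),
    ("b", 11, ["p2"]),
    ("b", 12, ["p9"]),
]
train, val, test = split_sessions(data, cutoff_week=10, rng=random.Random(0))
```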
To evaluate the performance of the proposed SSDRec, we compare it with the following baselines. As software recommendation usually utilizes conventional recommendation models with specific features, we choose two classic recommendation models, i.e., BPR [rendle2012bpr] and NCF [he2017neural], as baselines. Moreover, considering the dynamic nature of developers’ interests, we also choose two session-based recommendation models, i.e., RNN [hidasi2015session] and DGRec [song2019session]. These baselines are briefly introduced as follows:
BPR [rendle2012bpr]: a classical MF-based method optimized with a ranking objective.
NCF [he2017neural]: replaces the inner product of MF with a neural network to model the relationship between users and items.
RNN [hidasi2015session]: captures users’ session-level dynamic interests with RNN.
DGRec [song2019session]: a state-of-the-art model for session-based social recommendation. It utilizes RNN and GAT to model dynamic interests and dynamic social influence.
5.1.3 Parameter Settings
Our proposed model is implemented using TensorFlow [abadi2016tensorflow]. Adam [kingma2014adam] is chosen as the optimizer, and parameters are initialized as suggested in [abadi2016tensorflow]. The batch size and dropout are set to 200 and 0.2, respectively. The embedding dimensions of both developers and software packages are set to 100 for all models. For both the social network and the dependency network, we employ graph attention networks with 2 layers, and for each node 10 one-hop neighbors and 5 two-hop neighbors are sampled, as recommended in [hamilton2017inductive]. All experiments are run on a machine with a GeForce RTX2080Ti GPU.
5.1.4 Evaluation Metrics
The performance of SSDRec and all baselines is evaluated with two well-known metrics: Hit Rate (HR@K) and Normalized Discounted Cumulative Gain (NDCG@K).
HR@K measures the proportion of users who receive a correct recommendation. For each user, an indicator function compares the top-K recommendation list with the ground-truth list; it is 0 when the two lists have no item in common and 1 otherwise. The indicator values are averaged over all |U| users.
NDCG@K considers the positions of users’ preferred items in the ranked top-K recommendation list, since users usually pay attention to only the top few items recommended by a recommendation system. It is computed from the rank of the highest-ranked ground-truth item of each user in the recommendation list.
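The two metrics can be sketched as follows. The hit-rate function follows the description directly; for NDCG, the gain `1 / log2(rank + 1)` applied to the highest-ranked ground-truth item is a common simplification and is an assumption here, since the exact formula is not reproduced in this text. The function names and the dictionary layout are also illustrative.

```python
import math

def hit_rate_at_k(recommended, ground_truth, k):
    """Fraction of users whose top-k list contains a ground-truth item."""
    hits = sum(
        1 for user, ranked in recommended.items()
        if set(ranked[:k]) & set(ground_truth[user])
    )
    return hits / len(recommended)

def ndcg_at_k(recommended, ground_truth, k):
    """Average discounted gain of the best-ranked ground-truth item (assumed form)."""
    total = 0.0
    for user, ranked in recommended.items():
        truth = set(ground_truth[user])
        for rank, item in enumerate(ranked[:k], start=1):
            if item in truth:
                total += 1.0 / math.log2(rank + 1)
                break  # only the highest-ranked ground-truth item counts
    return total / len(recommended)

recommended = {"u1": ["a", "b", "c"], "u2": ["x", "y", "z"]}
ground_truth = {"u1": ["b"], "u2": ["q"]}
hr3 = hit_rate_at_k(recommended, ground_truth, 3)
ndcg3 = ndcg_at_k(recommended, ground_truth, 3)
```

In this toy case only `u1` receives a hit, and the hit sits at rank 2, so the hit rate is one half and the discounted gain is smaller than the hit rate.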
5.2 Overall performance (RQ1)
|Improvement rate (%)||7.97||9.44||5.02||9.49||9.96||6.98|
|Improvement rate (%)||9.08||10.96||6.02||8.93||10.40||7.43|
|Improvement rate (%)||8.32||9.71||5.05||5.02||7.52||4.92|
5.3 Ablation Studies (RQ2)
As SSDRec is composed of several components, we conduct the ablation studies by comparing the performance of the variants of SSDRec to demonstrate the effectiveness of different components.
5.3.1 Effect of social network and dependency network (RQ2(a))
SSDRec utilizes two graph attention networks to capture the social influence among developers and dependency constraints among software packages. To illustrate their impact on the recommendation performance, we compare the performance of the two variants of SSDRec, i.e., SSDRec-social and SSDRec-dependency with SSDRec. SSDRec-social and SSDRec-dependency exclude the graph attention network for dependency network and social network, respectively. Actually, SSDRec-social only considers the social influence and is identical to DGRec. The detailed modifications of SSDRec variants are shown in Table 4.
5.3.2 Effect of dynamic interests and static interests (RQ2(b))
5.4 Hyperparameter Analysis (RQ3)
SSDRec employs several hyperparameters to keep the model flexible. In this section, we conduct experiments to show how these hyperparameters affect the performance of our model.
5.4.1 Neighborhood sample size
Due to the heterogeneity of both the social and dependency networks, we utilize the sampling technique proposed in [hamilton2017inductive] to ensure the training efficiency of the two graph attention networks. A fixed number of neighbors is sampled for the first layer of the graph attention network for the social network, and likewise for the dependency network, and the neighborhood sample size of each subsequent layer is half that of the previous layer. In the analysis, we measure the recommendation performance using Hit Rate and Normalized Discounted Cumulative Gain under different neighborhood sample sizes, as shown in Figure 3.
Figure 3 shows that the neighborhood sample size of both graph attention networks affects the recommendation performance, and a sample size that is either too large or too small decreases the performance. The optimal neighborhood sample sizes for the social and dependency networks are both 10 for PHP, while they are 8 and 14 for Ruby, respectively.
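The halving schedule and fixed-size sampling can be sketched as below. Sampling with replacement when a node has fewer neighbors than requested is an assumption borrowed from common fixed-size neighborhood sampling practice, and the function names are hypothetical.

```python
import random

def layer_sample_sizes(first_layer_size, num_layers):
    """Sample size per layer, halving at each subsequent layer."""
    return [max(1, first_layer_size // (2 ** layer)) for layer in range(num_layers)]

def sample_neighbors(neighbors, node, size, rng):
    """Draw a fixed-size neighbor sample for one node."""
    pool = neighbors.get(node, [])
    if not pool:
        return []
    if len(pool) >= size:
        return rng.sample(pool, size)  # without replacement
    # Assumed fallback: sample with replacement to reach the fixed size.
    return [rng.choice(pool) for _ in range(size)]

rng = random.Random(42)
graph = {"n": ["a", "b", "c"]}
sizes = layer_sample_sizes(10, 2)
small = sample_neighbors(graph, "n", 2, rng)
padded = sample_neighbors(graph, "n", 5, rng)
```

With a first-layer size of 10 and two layers, the schedule gives 10 one-hop and 5 two-hop neighbors, matching the parameter settings reported earlier.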
5.4.2 Session length
Developers usually focus on a specific technical field during a certain time period, which is modeled as a session. The session length determines the granularity of dynamic interest modeling. In this analysis, we demonstrate the impact of session length by comparing the performance of SSDRec with different session lengths. The result in Figure 4 shows that the performance decreases as the session length increases and eventually plateaus. The reason is that a longer session corresponds to a coarser granularity of dynamic interest. In the extreme setting where all interactions of a developer are placed into one session, the RNN component of SSDRec captures the long-term static interest of the developer instead of the short-term dynamic interest.
5.5 Attention Visualization
Section 5.4.2 has demonstrated that sessions can capture fine-grained interests. In this section, we further verify the hypothesis that developers’ interests are relatively stable within a session but evolve across sessions by visualizing the attention weights of our model from the perspectives of both the overall development community and an individual developer.
5.5.1 The overall development community
5.5.2 Individual developer
In this section, we conduct a case study of an individual developer to show his/her behaviors within and across sessions. This developer has 8 test sessions and at least 5 friends in the PHP community, and the visualization results are shown in Figure 6. The results support the following observations:
(1) The developer’s interest is generally stable within a session, and his/her technical choices are mainly influenced by friends 0 and 5.
(2) The developer’s interest evolves across sessions. During the first three sessions, i.e., sessions 0, 1 and 2, the developer’s technical choices are mainly influenced by friends 0 and 3, whereas in the following three sessions friend 5 begins to influence him/her. A possible explanation is that the developer begins to pay attention to a new technical field in session 3. Finally, the influence of friend 5 disappears in the last two sessions, which may indicate that the developer leaves this technical field after some investigation.
In this article, we study the software recommendation problem and propose the Session-based Social and Dependency-aware software Recommendation model SSDRec to model the dynamic interests of developers with both social influence and dependency constraints in a unified framework. Specifically, an RNN is employed to model the short-term dynamic interests, and two GATs are utilized to capture social influence from friends and dependency constraints from dependent software packages, respectively. The experiments on real-world datasets verify the effectiveness of all three components of our model. In the future, we will consider higher-order relations in both the social and dependency networks.
CRediT authorship contribution statement
Dengcheng Yan: Conceptualization, Methodology, Writing - Original Draft. Tianyi Tang: Software, Validation, Data Curation, Writing - Original Draft. Wenxin Xie: Software, Validation. Yiwen Zhang: Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
This work is supported by the National Natural Science Foundation of China (Grant Nos. 61872002 and U1936220) and the University Natural Science Research Project of Anhui Province (Grant No. KJ2019A0037).