ALFAA: Active Learning Fingerprint Based Anti-Aliasing for Correcting Developer Identity Errors in Version Control Data

01/10/2019
by Sadika Amreen, et al.

Graphs of developer networks are important for software engineering research and practice. For these graphs to represent the networks realistically, accurate developer identities are imperative. We aim to identify developer identity errors in version control system (VCS) data from open source software repositories, investigate the nature of these errors, design corrective algorithms, and estimate the impact of the errors on networks inferred from this data. We investigate these questions using over 1B Git commits with over 23M recorded author identities. By inspecting the author strings that occur most frequently, we group identity errors into categories. We then augment the author strings with three behavioral fingerprints: time-zone frequencies, the set of files modified, and a vector embedding of the commit messages. Using an active learning approach, we create a manually validated set of identities for a subset of OpenStack developers and use it to fit supervised learning models that predict the identities for the remaining author strings in OpenStack. We compare these predictions with a commercial effort and a leading research method. Finally, we compare network measures for file-induced author networks based on corrected and raw data. We find that the major sources of errors are commits made from different environments, misspellings, organizational IDs, default values, and anonymous IDs. We also find that supervised learning methods reduce errors severalfold compared with existing methods, that the active learning approach is an effective way to create validated datasets, and that correcting developer identities has a large impact on the inferred social network. We believe that our proposed Active Learning Fingerprint Based Anti-Aliasing (ALFAA) approach will expedite research progress in the software engineering domain for applications that depend upon graphs of developers or other social networks.
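
To make the fingerprint idea concrete, the sketch below computes pairwise similarity features for two author identities. It is an illustration under stated assumptions, not the paper's implementation: the function names, the dictionary layout, the choice of similarity measures (sequence-matching ratio, cosine, Jaccard), and the sample data are all hypothetical, and the commit-message embedding fingerprint is omitted to keep the example dependency-free. In a setup like the one the abstract describes, feature vectors of this kind would be the input to a supervised model trained on manually validated identity pairs.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import sqrt


def string_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two author strings (case-insensitive)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def timezone_similarity(tz_a: Counter, tz_b: Counter) -> float:
    """Cosine similarity between two time-zone frequency counters."""
    dot = sum(tz_a[k] * tz_b[k] for k in tz_a.keys() & tz_b.keys())
    norm_a = sqrt(sum(v * v for v in tz_a.values()))
    norm_b = sqrt(sum(v * v for v in tz_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def file_similarity(files_a: set, files_b: set) -> float:
    """Jaccard overlap between the sets of files each identity modified."""
    union = files_a | files_b
    return len(files_a & files_b) / len(union) if union else 0.0


def pair_features(a: dict, b: dict) -> list:
    """Feature vector for one candidate pair of author identities.

    Each identity is a dict with hypothetical keys:
    'author' (str), 'timezones' (Counter), 'files' (set).
    """
    return [
        string_similarity(a["author"], b["author"]),
        timezone_similarity(a["timezones"], b["timezones"]),
        file_similarity(a["files"], b["files"]),
    ]


# Made-up identities, for illustration only.
id_1 = {
    "author": "Jane Doe <jane@example.com>",
    "timezones": Counter({"-0500": 40, "-0400": 10}),
    "files": {"a.py", "b.py", "c.py"},
}
id_2 = {
    "author": "jane doe <jdoe@example.org>",
    "timezones": Counter({"-0500": 12}),
    "files": {"b.py", "c.py", "d.py"},
}

features = pair_features(id_1, id_2)
print(features)
```

Each feature lies in [0, 1], so the three values can be fed to a classifier without further scaling. Scoring pairs rather than single identities is a design choice of this sketch, one that frames alias detection as binary classification over candidate pairs.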

