Exploring Software Reusability Metrics with Q A Forum Data
Question and answer (Q A) forums contain valuable information regarding software reuse, but they can be challenging to analyse due to their unstructured free text. Here we introduce a new approach (LANLAN), using word embeddings and machine learning, to harness information available in StackOverflow. Specifically, we consider two different kinds of user communication describing difficulties encountered in software reuse: 'problem reports' point to potential defects, while 'support requests' ask for clarification on software usage. Word embeddings were trained on 1.6 billion tokens from StackOverflow and applied to identify which Q A forum messages (from two large open source projects: Eclipse and Bioconductor) correspond to problem reports or support requests. LANLAN achieved an area under the receiver operator curve (AUROC) of over 0.9; it can be used to explore the relationship between software reusability metrics and difficulties encountered by users, as well as predict the number of difficulties users will face in the future. Q A forum data can help improve understanding of software reuse, and may be harnessed as an additional resource to evaluate software reusability metrics.
READ FULL TEXT