Representation Learning for Stack Overflow Posts: How Far are We?

03/13/2023
by Junda He, et al.

The tremendous success of Stack Overflow has accumulated an extensive corpus of software engineering knowledge, motivating researchers to propose various solutions for analyzing its content. The performance of such solutions hinges significantly on the choice of representation model for Stack Overflow posts. As the volume of literature on Stack Overflow continues to grow, so does the need for specialized representation models that can adeptly capture the intricacies of Stack Overflow posts. The state-of-the-art (SOTA) Stack Overflow post representation models are Post2Vec and BERTOverflow, which are built upon neural architectures such as convolutional neural networks (CNNs) and Transformers (e.g., BERT). Despite their promising results, these representation methods have not been evaluated under the same experimental setting. To fill this research gap, we first empirically compare the representation models designed specifically for Stack Overflow posts (Post2Vec and BERTOverflow) on a wide range of related tasks, i.e., tag recommendation, relatedness prediction, and API recommendation. To find more suitable representation models for the posts, we further explore a diverse set of BERT-based models, including (1) general-domain language models (RoBERTa and Longformer) and (2) language models built with software engineering-related textual artifacts (CodeBERT, GraphCodeBERT, and seBERT). This exploration, however, also illustrates the "No Silver Bullet" concept: none of the models consistently outperforms all the others. Inspired by these findings, we propose SOBERT, which employs a simple-yet-effective strategy to improve the best-performing model by continuing the pre-training phase with textual artifacts from Stack Overflow.
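
The continued pre-training strategy behind SOBERT can be sketched with off-the-shelf tooling. The snippet below is a minimal illustration, not the authors' implementation: it assumes a Hugging Face checkpoint (roberta-base is used here as a stand-in, since the abstract does not name the best-performing base model), a hypothetical so_posts.txt file containing one Stack Overflow post per line, and illustrative hyperparameters. It continues masked language model (MLM) pre-training on the post text; the adapted checkpoint would then be fine-tuned on downstream tasks such as tag recommendation, relatedness prediction, or API recommendation.

```python
# Minimal sketch of continued MLM pre-training on Stack Overflow text.
# Checkpoint, data file, and hyperparameters are illustrative assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_checkpoint = "roberta-base"  # assumed starting model
tokenizer = AutoTokenizer.from_pretrained(base_checkpoint)
model = AutoModelForMaskedLM.from_pretrained(base_checkpoint)

# Hypothetical corpus: one Stack Overflow post (title + body) per line.
posts = load_dataset("text", data_files={"train": "so_posts.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = posts.map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT-style objective: randomly mask 15% of tokens.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="sobert-sketch",
    per_device_train_batch_size=16,
    num_train_epochs=1,
    learning_rate=5e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
)

# Continue pre-training; the saved model can later be fine-tuned
# on Stack Overflow downstream tasks.
trainer.train()
trainer.save_model("sobert-sketch")
```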

Related research

PTM4Tag: Sharpening Tag Recommendation of Stack Overflow Posts with Pre-trained Models (03/21/2022)
Stack Overflow is often viewed as the most influential Software Question...

Aspect-Based API Review Classification: How Far Can Pre-Trained Transformer Model Go? (01/27/2022)
APIs (Application Programming Interfaces) are reusable software librarie...

Are Large Language Models a Threat to Digital Public Goods? Evidence from Activity on Stack Overflow (07/14/2023)
Large language models like ChatGPT efficiently provide users with inform...

BERT based sentiment analysis: A software engineering perspective (06/04/2021)
Sentiment analysis can provide a suitable lead for the tools used in sof...

DeepTagRec: A Content-cum-User based Tag Recommendation Framework for Stack Overflow (03/10/2019)
In this paper, we develop a content-cum-user based deep learning framewo...

Automated Summarization of Stack Overflow Posts (05/26/2023)
Software developers often resort to Stack Overflow (SO) to fill their pr...

The Evolution of Stack Overflow Posts: Reconstruction and Analysis (11/02/2018)
Stack Overflow (SO) is the most popular question-and-answer website for ...