Are We Building on the Rock? On the Importance of Data Preprocessing for Code Summarization

by Lin Shi, et al.

Code summarization, the task of generating useful comments for given code, has long been of interest. Most existing code summarization models are trained and validated on widely used code-comment benchmark datasets. However, little is known about the quality of these benchmark datasets, which are built from real-world projects. Are the benchmark datasets as good as expected? To bridge this gap, we conduct a systematic study to assess and improve the quality of four benchmark datasets widely used for code summarization tasks. First, we propose an automated code-comment cleaning tool that can accurately detect noisy data caused by inappropriate data preprocessing operations in existing benchmark datasets. Then, we apply the tool to assess the data quality of the four benchmark datasets, based on the detected noise. Finally, we conduct comparative experiments to investigate the impact of noisy data on the performance of code summarization models. The results show that such preprocessing noise is widespread in all four benchmark datasets, and that removing the noisy data leads to a significant improvement in the performance of code summarization. We believe that these findings and insights will enable a better understanding of data quality in code summarization tasks, and pave the way for relevant research and practice.
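To make the idea of rule-based cleaning concrete, here is a minimal sketch of how noisy code-comment pairs might be flagged and filtered. It is not the tool described in the abstract: the rule names (`empty_comment`, `html_residue`, `under_development`, `interrogative`, `empty_code`), the regular expressions, and the helpers `find_noise` and `clean` are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical rules for illustration only; not the actual rule set of the paper's tool.
HTML_TAG = re.compile(r"</?[a-zA-Z][^>]*>")
TODO_MARK = re.compile(r"\b(TODO|FIXME|XXX)\b")


def find_noise(code: str, comment: str) -> list[str]:
    """Return the names of the illustrative noise rules that a pair triggers."""
    issues = []
    if not comment.strip():
        issues.append("empty_comment")
    if HTML_TAG.search(comment):
        issues.append("html_residue")
    if TODO_MARK.search(comment):
        issues.append("under_development")
    if comment.strip().endswith("?"):
        issues.append("interrogative")
    if not code.strip():
        issues.append("empty_code")
    return issues


def clean(pairs):
    """Keep only the (code, comment) pairs that trigger no rule."""
    return [(code, comment) for code, comment in pairs if not find_noise(code, comment)]


# Made-up sample pairs, not taken from any benchmark dataset.
pairs = [
    ("int add(int a, int b) { return a + b; }", "Returns the sum of two integers."),
    ("void run() {}", "TODO: implement this"),
    ("int size() { return n; }", "<p>Gets the size.</p>"),
]
kept = clean(pairs)
```

Reporting the triggered rule names, rather than a single yes/no flag, would also allow counting how often each kind of noise occurs in a dataset, which is the sort of assessment the abstract describes.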


Neural Code Summarization: How Far Are We?

Source code summaries are important for the comprehension and maintenanc...

An Empirical Survey on Long Document Summarization: Datasets, Models and Metrics

Long documents such as academic articles and business reports have been ...

On the Reliability and Explainability of Automated Code Generation Approaches

Automatic code generation, the task of generating new code snippets from...

On the Importance of Building High-quality Training Datasets for Neural Code Search

The performance of neural code search is significantly influenced by the...

A Face Preprocessing Approach for Improved DeepFake Detection

Recent advancements in content generation technologies (also widely know...

ConvoSumm: Conversation Summarization Benchmark and Improved Abstractive Summarization with Argument Mining

While online conversations can cover a vast amount of information in man...

TranS^3: A Transformer-based Framework for Unifying Code Summarization and Code Search

Code summarization and code search have been widely adopted in software de...
