The Evolution of Stack Overflow Posts: Reconstruction and Analysis

11/02/2018 ∙ by Sebastian Baltes, et al. ∙ The University of Adelaide

Stack Overflow (SO) is the most popular question-and-answer website for software developers, providing a large amount of code snippets and free-form text on a wide variety of topics. Like other software artifacts, questions and answers on SO evolve over time, for example when bugs in code snippets are fixed, code is updated to work with a more recent library version, or text surrounding a code snippet is edited for clarity. To be able to analyze how content on SO evolves, we built SOTorrent, an open dataset based on the official SO data dump. SOTorrent provides access to the version history of SO content at the level of whole posts and individual text or code blocks. It connects SO posts to other platforms by aggregating URLs from text blocks and comments, and by collecting references from GitHub files to SO posts. In this paper, we describe how we built SOTorrent, and in particular how we evaluated 134 different string similarity metrics regarding their applicability for reconstructing the version history of text and code blocks. Based on different analyses using the dataset, we present: (1) insights into the evolution of SO posts, e.g., that post edits are usually small, happen soon after the initial creation of the post, and that code is rarely changed without also updating the surrounding text; (2) a qualitative study investigating the close relationship between post edits and comments; and (3) a first analysis of code clones on SO together with an investigation of possible licensing risks. Finally, since the initial presentation of the dataset, we improved the post block extraction and our predecessor matching strategy.

1 Introduction

Stack Overflow (SO) is the most popular question-and-answer website for software developers. As of December 2017, its public data dump (Stack Exchange Inc, 2017) lists over 38 million posts and over 8 million registered users. Many answers contain code snippets together with explanations (Yang et al., 2016). Similar to other software artifacts such as source code files and documentation (Lehman, 1980; Chapin et al., 2001; Mens and Demeyer, 2008; Godfrey and German, 2008), text and code snippets on SO evolve over time, e.g., when the SO community fixes bugs in code snippets, clarifies questions and answers, and updates documentation to match new API versions. Since the inception of SO in 2008, a total of 13.9 million SO posts have been edited after their creation—19,708 of them more than ten times. While many SO posts contain code, the evolution of code snippets on SO differs from the evolution of entire software projects: Most snippets are relatively short (on average 12 lines, see Section 6.1) and many of them cannot compile without modification (Yang et al., 2016). In addition, SO does not provide a version control or bug tracking system for post content, forcing users to rely on the commenting function or additional answers to voice concerns about a post.

Recent studies have shown that developers use SO snippets in their software projects, regardless of maintainability, security, and licensing implications (Baltes et al., 2017b; An et al., 2017; Yang et al., 2017; Gharehyazie et al., 2017; Abdalkareem et al., 2017; Xia et al., 2017; Fischer et al., 2017; Acar et al., 2016). Assuming that developers copy and paste snippets from SO without trying to thoroughly understand them, maintenance issues arise. For instance, it may later be more difficult for developers to refactor or debug code that they did not write themselves. Moreover, if no link to the SO post is added to the copied code, it is not possible to check the SO thread for a corrected or improved solution in case problems occur. The same holds for code clones within Stack Overflow, which themselves may have been copied from external sources into Stack Overflow posts. These complicated relationships may not only lead to issues affecting the maintainability of the code snippets on Stack Overflow or their copies in software projects or documentation resources, but also to licensing issues when people do not adhere to the license of the original content.

The SO data dump keeps track of different versions of entire posts, but does not contain information about differences between versions at a more fine-grained level. In particular, it is not trivial to extract different versions of the same code snippet from the history of a post to analyze its evolution or compare code snippets between posts. To address these challenges, we have created the open dataset SOTorrent (Baltes et al., 2018), which enables researchers to analyze the version history of SO posts at the level of whole posts and individual post blocks, and their relation to corresponding source code in GitHub repositories. Besides describing how we created that dataset, we use it to answer four research questions about the evolution of SO posts:

  • RQ1: How do Stack Overflow posts evolve?

  • RQ2: Which posts get edited?

  • RQ3: Which edit and communication patterns exist?

  • RQ4: What are the implications of code clones on Stack Overflow?

While answering the first two questions will further our understanding of the phenomenon of SO post evolution, the third question aims at finding a connection between post edits and other events on the SO platform. The fourth question transfers a well-known software engineering problem affecting the maintainability of software (Juergens et al., 2009; Thummalapenta et al., 2010) to code snippets on SO.

The first two research questions have already been covered in our previous conference paper (Baltes et al., 2018), but we added a comparison of the evolution of questions and answers (end of Section 6.1). The results from our previous work motivated a further investigation of the relationship between post edits and comments, which yielded the edit and communication patterns we present in this paper (Section 7). The fourth research question demonstrates how the SOTorrent dataset can be used to analyze code snippets on SO, but also points to an underexplored phenomenon: code clones on Stack Overflow (Section 8). Finally, we analyzed all false positive and false negative results that our previous post history reconstruction approach yielded for our test dataset (Section 5.5) and revised the matching strategy according to our observations (Section 5.6). In the end, we were able to resolve the matching issues for 90.4% of the affected posts.

We found that SO posts grow over time in terms of their number of text and code blocks, but the size of the individual blocks is relatively stable. Many edits modify just a single line of text or code, and code blocks are rarely changed without also changing the surrounding text content; post edits usually happen shortly after the creation of the post. Our research suggests that comments and post edits are closely related: Some comments might trigger edits, others might be made in response to the edits. We investigated 213 edit and comment events from 58 different SO posts and describe six edit and communication patterns that we observed. Regarding code clones, we used SOTorrent to detect them, qualitatively investigated the source of 50 frequently copied snippets, and started a discussion in the SO community about possible implications and strategies to handle code clones.

2 The SOTorrent Dataset

To answer our research questions, and to support other researchers in answering similar questions, we built SOTorrent, an open dataset based on data from the official SO data dump (Stack Exchange Inc, 2017) and the Google BigQuery GitHub (GH) dataset (Google Cloud Platform, 2018). SOTorrent provides access to the version history of SO content at the level of whole posts and individual post blocks. A post block can either be a text or a code block, depending on how the author formatted the content (see Figure 1 for an example). Besides providing access to the version history, the dataset links SO posts to external resources in two ways: (1) by extracting linked URLs from text blocks and comments on SO and (2) by providing a table with links to SO posts found in the source code of GitHub projects. This table can be used to connect SOTorrent and GH datasets such as GHTorrent (Gousios, 2013). Our dataset is available on Zenodo as a database dump (Baltes and Dumani, 2018a), including instructions on how to import the dataset, and as a public BigQuery dataset (https://bigquery.cloud.google.com/dataset/sotorrent-org:2018_09_23). We also published the source code of the software that we used to build (Baltes and Dumani, 2018c; Baltes, 2018b) and analyze (Baltes, 2018a, f) SOTorrent.

Figure 1: Exemplary Stack Overflow answers with code blocks (top, 3758880) and with inline code (bottom, 4888400). The LocalId represents the position in the post.
Figure 2: Connection of SOTorrent tables to other resources.

SOTorrent release 2018-08-28, for example, contains the version history of all 40,606,950 questions and answers in the official SO data dump published June 5, 2018 (Stack Exchange Inc, 2017). It contains 63,914,798 post versions, 122,673,430 text block versions, and 77,578,494 code block versions, ranging from the creation of the first post on July 31, 2008 until the last edit on June 3, 2018. We extracted links to 11,775,659 distinct URLs from 20,518,181 different post block versions and 4,104,869 distinct URLs from 6,856,777 different comments. Moreover, we identified 6,035,737 links to SO posts in 436,615 public GH repositories. Our project website (http://sotorrent.org) lists all dataset versions and contains more information on the database layout, including the complete database schema. In the following sections, we provide information about SOTorrent’s data storage and collection process, before we use the dataset to answer our research questions.
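As a minimal illustration of how the public BigQuery dataset can be accessed, the following Python sketch counts post versions per post. The table name PostVersion is described in Section 3; the column name PostId is an assumption about the schema, and the setup uses the google-cloud-bigquery package.

    # Sketch: querying the public SOTorrent BigQuery dataset.
    # Requires the google-cloud-bigquery package and a configured GCP project;
    # the column name PostId is an assumption about the schema.
    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT PostId, COUNT(*) AS VersionCount
        FROM `sotorrent-org.2018_09_23.PostVersion`
        GROUP BY PostId
        ORDER BY VersionCount DESC
        LIMIT 10
    """
    for row in client.query(query).result():
        print(row.PostId, row.VersionCount)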

3 Database Schema

Figure 3: Database schema of SOTorrent release 2018-02-16: The tables from the official SO dump (Stack Exchange Community Wiki, 2018-02-27) are marked gray, the additional tables are marked blue. Not all tables from the official SO dump and not all foreign key constraints are shown. The most recent version of the database schema is always available on the SOTorrent project page.

SOTorrent contains all tables from the official Stack Overflow data dump (see database schema in Figure 3). Figure 2 visualizes how the SOTorrent tables are connected to the SO dump, external resources on the web, and projects on GitHub. The official data dump only provides the version history at the level of whole posts as Markdown-formatted text. To analyze how individual text or code blocks evolve, we needed to extract individual blocks from that content. This extraction also enabled us to collect links to external sources from the identified text blocks. In the SO dump, one version of a post corresponds to one row in the table PostHistory. However, that table not only documents changes to the content of a post, but also changes to metadata such as tags or title. Since our goal was to analyze the evolution of SO posts at the level of whole posts and individual post blocks, we had to filter and process the available data. First, we selected edits that changed the content of a SO post, identified by their PostHistoryTypeId (Stack Exchange Community Wiki, 2018-02-27) (2: Initial Body, 5: Edit Body, 8: Rollback Body). We linked each filtered version to its predecessor and successor and stored it in table PostVersion.

The content of a post version is available as Markdown-formatted text. We split the content of each version into text and code blocks and extracted the URLs from all text blocks using a regular expression (table PostVersionUrl). We also extracted the URLs from all comments in the SO data dump (table CommentUrl). Besides the extracted URLs, those tables provide information about the link type (e.g., bare, Markdown, or HTML), link position (top, middle, or end of post/comment), and certain URL components such as the root domain, query string, or fragment identifier (if present).

To reconstruct the version history of individual post blocks, we established a linear predecessor relationship between the post block versions (table PostBlockVersion) using a string similarity metric that we selected after a thorough evaluation (see Section 5.4). For each post block version, we computed the line-based difference to its predecessor, which is available in table PostBlockDiff. We also extracted the version history of question titles from table PostHistory. Table TitleVersion links all title versions to their predecessors and successors and further provides the corresponding Levenshtein distances (columns PredEditDistance and SuccEditDistance).

One row in table PostReferenceGH represents one link from a file in a public GH repository to a post on SO. To extract those references, we utilized Google BigQuery, which allows one to execute SQL queries on various public datasets, including a dataset with all files in the default branch of GH projects (Google Cloud Platform, 2018). To find references to SO, we again applied a regular expression and mapped all extracted URLs to their corresponding sharing link (ending with /q/<id> for questions and /a/<id> for answers), storing that link together with information about the file and the repository in which the link was found in table PostReferenceGH. We ignored other links referring to, e.g., users or tags on SO.
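The following Python sketch illustrates the kind of regular expression and URL mapping this step performs; the exact expression used for SOTorrent may differ, so this is an illustrative approximation only.

    import re

    # Illustrative approximation of the extraction step: find Stack Overflow
    # URLs in source code and map them to their canonical sharing-link form
    # (/q/<id> for questions, /a/<id> for answers). The exact regular
    # expression used for SOTorrent may differ.
    SO_URL = re.compile(
        r"https?://(?:www\.)?stackoverflow\.com/"
        r"(?:questions/\d+/[^/\s#]+/(?P<aid>\d+)"  # answer id in question URL
        r"|questions/(?P<qid>\d+)"                 # question URL
        r"|q/(?P<q>\d+)"                           # question sharing link
        r"|a/(?P<a>\d+))"                          # answer sharing link
    )

    def to_sharing_link(text):
        m = SO_URL.search(text)
        if m is None:
            return None  # e.g., links to users or tags are ignored
        if m.group("a") or m.group("aid"):
            return f"https://stackoverflow.com/a/{m.group('a') or m.group('aid')}"
        return f"https://stackoverflow.com/q/{m.group('q') or m.group('qid')}"

    print(to_sharing_link("See https://stackoverflow.com/questions/1234/foo/5678"))
    # -> https://stackoverflow.com/a/5678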

4 Post Block Extraction

Our goal was to analyze the evolution of individual text and code blocks, for example to trace changes to particular code snippets, to find code clones on SO, or to identify bug fixes for code on SO. Moreover, the differentiation between the two post block types allowed us to extract links to external resources only from text blocks, not from code blocks. The latter may, for example, contain XML namespace links or links to stylesheet files, which we do not consider to be external sources of the post. The first step towards reconstructing the version history of individual post blocks is their extraction from the Markdown-formatted text that SO uses for the content of posts. In our notion, a code block is not a short inline code fragment embedded into a text block (see Figure 1 for an example), but a continuous code snippet. We consider inline code to be part of the surrounding text block. According to SO’s Markdown specification (Stack Exchange Inc, 2018), code blocks are indented by four spaces and inline code is framed by backtick characters. However, as we found during our research, users are free to use other Markdown specifications or HTML tags, which are not officially supported, but correctly parsed and displayed on the SO website. We iteratively tested and refined our post block extraction approach using a random sample of over 100,000 SO posts. We ran the extraction, randomly checked the extracted post blocks, and added a new test case if the result differed from the rendering on the SO website (class PostVersionHistoryTest (Baltes and Dumani, 2018c)). We then updated the extraction such that all test cases passed and re-ran the extraction on the test data. The final version of our post block extraction method was able to detect various notations that SO authors used to mark code blocks, including SO Markdown (indented by 4 spaces), code fencing Markdown (enclosed by three backticks), SO stack snippets (enclosed by <!--begin/end snippet-->), stack snippet language tags (prepended by <!--language:...-->), HTML code tags (enclosed by <pre><code>), and HTML script tags (enclosed by <script>).
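A simplified Python sketch of the core case, indentation-based code blocks, conveys the idea; the actual extraction handles the additional notations listed above and many edge cases covered by our test suite.

    # Simplified sketch of post block extraction: split Markdown content into
    # alternating text and code blocks, handling only the basic SO notation
    # (code lines indented by four spaces). The real implementation also
    # handles code fencing, stack snippets, <pre><code>, <script>, etc.
    def extract_post_blocks(markdown):
        blocks, current_type, current_lines = [], None, []
        for line in markdown.splitlines():
            if line.strip() == "":
                # blank lines belong to the current block (if any)
                if current_lines:
                    current_lines.append(line)
                continue
            line_type = "code" if line.startswith("    ") else "text"
            if line_type != current_type and current_lines:
                blocks.append((current_type, "\n".join(current_lines).strip("\n")))
                current_lines = []
            current_type = line_type
            current_lines.append(line)
        if current_lines:
            blocks.append((current_type, "\n".join(current_lines).strip("\n")))
        return blocks

    post = "Some explanation:\n\n    int x = 42;\n    f(x);\n\nMore text."
    for block_type, content in extract_post_blocks(post):
        print(block_type, repr(content))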

5 Post Block Matching

After successfully extracting the post blocks from a post version, we had to map the extracted post blocks to their predecessors in the previous post version to reconstruct their version history. Since this mapping had to work for text and code content, the latter in various programming languages, we decided to utilize syntax-based similarity metrics. We implemented 134 different string similarity metrics (see Section 5.1), which we evaluated regarding their correctness and performance using the manually validated version history of 600 SO posts (see Sections 5.2 and 5.4). In case of multiple matches, we had to choose between different predecessor candidates. Thus, we developed a matching strategy that considers the location and context of a post block (see Section 5.3).

5.1 Similarity Metrics

Type         Metric
edit         levenshtein, damerauLevenshtein,
             longestCommonSubsequence (LCS), optimalAlignment (OA)
set          nGram{Jaccard, Dice, Overlap}, nShingle{Jaccard, Dice, Overlap},
             token{Jaccard, Dice, Overlap}
profile      cosineNGram{Bool, TF, NormalizedTF}, manhattanNGram,
             cosineNShingle{Bool, TF, NormalizedTF}, manhattanNShingle,
             cosineToken{Bool, TF, NormalizedTF}, manhattanToken
fingerprint  winnowingNGram{Jaccard, Dice, Overlap, LCS, OA}
equal        equal, tokenEqual
Table 1: Overview of all implemented base similarity metrics.
Type         Variants
edit         with/without normalization
set          different n-gram and n-shingle sizes;
             with/without normalization; with/without padding (nGram)
profile      different n-gram and n-shingle sizes;
             with normalization (both) and without (cosine)
fingerprint  different n-gram sizes; with/without normalization
equal        with/without normalization
Table 2: Overview of all evaluated variants of the implemented similarity metrics.

A similarity metric maps two input strings to a value in $[0, 1]$, where $0$ corresponds to inequality and $1$ corresponds to equality. We implemented five different types of similarity metrics: edit-based metrics (e.g., Levenshtein), set-based metrics (e.g., n-grams with Jaccard coefficient), profile-based metrics (e.g., cosine similarity), fingerprint-based metrics (Winnowing), and equality-based metrics, which served as a baseline in the metrics evaluation (see Section 5.4). Our Java implementation of all metrics is available on GitHub (Baltes and Dumani, 2018d). Tables 1 and 2 show all metrics that we implemented and evaluated.

The edit-based metrics define the similarity of two strings based on the number of edit operations needed to transform one string into the other. The longest common subsequence (LCS) distance allows the two operations ‘insertion of one character’ and ‘deletion of one character’. The Levenshtein distance further allows ‘substitution of one character’. The Damerau-Levenshtein distance is similar to Levenshtein, but additionally allows the operation ‘swap two neighboring characters’. Optimal string alignment (OA) can be interpreted as a variant of Damerau-Levenshtein with the additional restriction that each substring can only be modified once (e.g., swapping two characters and then replacing one of them is not possible). The longest common subsequence of two strings is the longest sequence of characters, in the same relative order but not necessarily contiguous, that can be found in both strings. To derive a similarity metric from the number of edit operations and the longest common subsequence, we used the following approaches:

Definition 1 (Edit/LCS Similarity)

Let $s_1$, $s_2$ be two strings, $d(s_1, s_2)$ be the edit distance, and $\mathit{lcs}(s_1, s_2)$ be the longest common subsequence between the two strings. The edit-based and the LCS-based similarity functions are then defined as

$\mathit{sim}_{\mathit{edit}}(s_1, s_2) = 1 - \frac{d(s_1, s_2)}{\max(|s_1|, |s_2|)}$

$\mathit{sim}_{\mathit{lcs}}(s_1, s_2) = \frac{|\mathit{lcs}(s_1, s_2)|}{\max(|s_1|, |s_2|)}$
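A compact Python sketch of the Levenshtein-based variant of Definition 1; this is a minimal illustration, not the tuned Java implementation from our GitHub repository.

    # Minimal sketch of the edit-based similarity from Definition 1,
    # using the Levenshtein distance (dynamic programming, two rows).
    def levenshtein(s1, s2):
        prev = list(range(len(s2) + 1))
        for i, c1 in enumerate(s1, start=1):
            curr = [i]
            for j, c2 in enumerate(s2, start=1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (c1 != c2)))   # substitution
            prev = curr
        return prev[-1]

    def sim_edit(s1, s2):
        if not s1 and not s2:
            return 1.0
        return 1.0 - levenshtein(s1, s2) / max(len(s1), len(s2))

    print(sim_edit("int x = 42;", "int y = 42;"))  # high similarity, one substitution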

The profile-based metrics consider each distinct token, n-gram, or n-shingle in the two input strings as one dimension of a vector space. Tokens can be extracted from a string by tokenization with whitespace as delimiter; n-grams split the string into sequences of n consecutive characters; n-shingles split the string into sequences of n consecutive words or tokens. One input string is then characterized as one vector in the vector space. In the simplest form (bool), the values of the dimensions can either be 1 (token, n-gram, or n-shingle present in the string) or 0 (not present). Alternatively, one can consider the number of occurrences of each token, n-gram, or n-shingle as the value of the corresponding dimension (term frequency). We also considered the BM15 weighting scheme (Manning et al., 2008), which intends to lower the effect of very frequent terms skewing the comparison. The similarity of the two strings is then defined as the cosine or Manhattan distance between the two vectors that have been derived from the strings using one of the three approaches described above.

For the set-based metrics, we considered all distinct tokens, n-grams and n-shingles in the strings as elements of sets. We used three coefficients to compare the resulting sets:

Definition 2 (Similarity Coefficients)

Let $A$, $B$ be sets of tokens, n-grams, or n-shingles. The three coefficients are defined as

$\mathit{jaccard}(A, B) = \frac{|A \cap B|}{|A \cup B|}$

$\mathit{dice}(A, B) = \frac{2\,|A \cap B|}{|A| + |B|}$

$\mathit{overlap}(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)}$
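In Python, the three coefficients over character n-gram sets can be sketched as follows (illustrative only; edge cases for empty sets are handled with common conventions):

    # Minimal sketch of the set-based coefficients from Definition 2,
    # applied to character n-gram sets.
    def ngrams(s, n=4):
        return {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 1.0

    def dice(a, b):
        return 2 * len(a & b) / (len(a) + len(b)) if a or b else 1.0

    def overlap(a, b):
        return len(a & b) / min(len(a), len(b)) if a and b else 0.0

    a, b = ngrams("for (int i = 0;"), ngrams("for (int j = 0;")
    print(jaccard(a, b), dice(a, b), overlap(a, b))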

Figure 4: Histogram and boxplot showing the number of Stack Overflow questions and answers with a certain version count (PostHistoryTypeIds 2, 5, 8); based on the SO data dump 2017-06-12; vertical line is median.

The fingerprint-based metrics apply a hash function to substrings of the input strings and then use the computed hash values to determine the similarity. The Winnowing algorithm is one approach to calculate and compare the fingerprints of two strings (Schleimer et al., 2003; Duric and Gasevic, 2013). Winnowing is often used for plagiarism detection, e.g., in the source code comparison software MOSS (Burrows et al., 2007; Martins et al., 2014; Lancaster and Culwin, 2004). We implemented different variants of the algorithm described by Schleimer et al. (Schleimer et al., 2003), e.g., using different n-gram sizes and different approaches to compare the fingerprints.
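The following Python sketch shows the core of the Winnowing algorithm as described by Schleimer et al. (2003): hash all character n-grams, slide a window over the hash sequence, and keep the minimum hash of each window. The parameters n and window_size are illustrative; the fingerprints of two strings can then be compared with a set-based coefficient such as Dice.

    # Minimal sketch of Winnowing fingerprints (Schleimer et al., 2003).
    def winnowing_fingerprint(s, n=4, window_size=4):
        hashes = [hash(s[i:i + n]) for i in range(max(len(s) - n + 1, 1))]
        if len(hashes) <= window_size:
            return {min(hashes)}
        # keep the minimum hash of every window of consecutive hashes
        return {min(hashes[i:i + window_size])
                for i in range(len(hashes) - window_size + 1)}

    def winnowing_dice(s1, s2):
        f1, f2 = winnowing_fingerprint(s1), winnowing_fingerprint(s2)
        return 2 * len(f1 & f2) / (len(f1) + len(f2))

    print(winnowing_dice("System.out.println(x);", "System.out.println(y);"))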

We implemented each metric in different variations. In the variants with normalized input strings, we used different approaches for different metric types: For the edit metrics, we unified the whitespace characters, i.e., reduced them to a single space, and converted all characters to lower case. For the n-gram metrics, we converted all characters to lower case, removed all whitespace, and removed some special characters ({};) (see Section 5.5 for the characters we later added to this set). For the shingle metrics, we again converted all characters to lower case, unified the whitespace characters, and removed all non-word characters ([^a-zA-Z_0-9]). We used common n-gram and shingle sizes (Burrows et al., 2007) and also implemented an optional n-gram padding that emphasizes the beginning and the end of the input strings. All these variations lead to a total number of 134 different similarity metrics.
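A Python sketch of the three normalization variants described above; the character sets follow the description in this section, while the handling of edge cases is illustrative.

    import re

    # Sketch of the normalization variants described above.
    def normalize_for_edit(s):
        # unify whitespace and lower-case (edit-based metrics)
        return re.sub(r"\s+", " ", s).lower()

    def normalize_for_ngram(s):
        # lower-case, remove all whitespace and the special characters {};
        # (colons, commas, and periods were added later, see Section 5.6)
        return re.sub(r"[{};\s]", "", s.lower())

    def normalize_for_shingle(s):
        # lower-case, unify whitespace, remove non-word characters
        return re.sub(r"[^a-zA-Z_0-9 ]", "", re.sub(r"\s+", " ", s.lower()))

    print(normalize_for_ngram("if (x) { return y; }"))  # -> "if(x)returny"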

Figure 5: App developed to create ground truth for similarity metric evaluation.
Figure 6: Post with multiple equal predecessors (13064858).

5.2 Ground Truth

To evaluate the correctness of the post block mappings retrieved using different string similarity metrics, we created a set of 600 manually validated post version histories. Figure 5 shows a screenshot of the tool we developed to create those manually validated histories (available on GitHub (Dumani and Baltes, 2017)). It visualizes a post version (right) and its predecessor (left). Post blocks with equal content and type that are unique in the two versions are automatically connected. For the other post blocks, the user has to choose a match by clicking on a post block of the same type in each version; the tool then visualizes the line-based difference between the connected blocks. It is also possible to add comments for individual post blocks, e.g., in case the user is not confident in his or her mapping, or in case the post block extraction failed.

We drew four different samples from the SO data dump released June 12, 2017. The first sample with 200 posts was randomly drawn from all SO questions and answers with at least two versions (otherwise no mapping is needed). Since there are many posts with only two versions (see Figure 4), we decided to draw another sample of 200 posts from SO questions and answers with at least seven versions (99% quantile). As the initial focus of our research was on Java, we also drew a sample with 200 Java posts from all SO questions tagged with <java> or <android>, and the corresponding answers. The last sample, which contains 100 posts with multiple possible predecessors, was not used to evaluate the metrics, but to evaluate our matching strategy (see Section 5.3). In this sample, we included posts which had at least two possible matches (two post blocks of the same type with identical content) in two adjacent versions.

The validated version histories of the samples were created by a graduate student and later discussed with two of the authors. The student was introduced to the app and asked to comment on all post blocks where he was not sure about the mapping. Together, we looked at all post blocks with comments indicating an unclear mapping and tried to find a mapping we all agreed on. If that was not possible, we moved the post to a separate sample, which we analyzed independently. After discussing all 38 posts, this separate sample contained 17 posts (4 from the random sample, 8 from the sample with at least seven versions, and 5 from the Java sample). All samples are available on Zenodo (Baltes et al., 2017a).

5.3 Matching Strategy

Our goal was to establish a linear predecessor relationship for all post block versions, thus each post block version can only have one predecessor. The reason for this decision was the fact that we rarely observed splits and merges in the post version histories we manually analyzed. Moreover, even if multiple predecessors have equal or similar content, usually only one of them is the actual predecessor (see Figure 6 for an example). To correctly choose the predecessor from different candidates, we had to develop a matching strategy for post block versions, which we present in this section. In the database, we not only store the matched predecessor, but also the number of possible predecessors and successors, to be later able to identify post version histories that could contain splits or merges. For our analysis (see Section 6), we consider post block lifespans, i.e., chains of connected post block versions that are predecessors of each other. Those lifespans can be easily retrieved from the database, because each post block version has a RootPostBlockVersionId, which is the id of the first post block version in the chain. Those chains can likewise be retrieved using the columns RootPostHistoryId and RootLocalId, which also uniquely identify the first post block version in a post block lifespan.
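For illustration, a post block lifespan can be fetched with a query like the following sketch (run, e.g., with the BigQuery client shown in Section 2); the root id value 42 is a placeholder, and the column names Id and PostHistoryId are assumptions about the schema.

    # Sketch: retrieving one post block lifespan via its RootPostBlockVersionId.
    # The id value 42 is a placeholder; Id and PostHistoryId are assumed columns.
    lifespan_query = """
        SELECT Id, PostHistoryId, LocalId, Content
        FROM `sotorrent-org.2018_09_23.PostBlockVersion`
        WHERE RootPostBlockVersionId = 42
        ORDER BY PostHistoryId
    """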

As mentioned above, we utilized a dedicated sample to evaluate how well our matching strategy can handle posts with multiple possible connections. In case of differences between the ground truth and the results of our approach, we wrote unit tests replicating the issue and then updated the strategy until all unit tests passed. We further used the sample to test the strategy’s scalability. To be able to describe our matching strategy, we define our notation for post versions, post block versions, and possible predecessors:

Definition 3 (Post Version)

Let $p$ be a post with $n$ versions. Then $p_i$ denotes one post version and $m_i$ denotes the number of post blocks in $p_i$, for $i \in \{1, \dots, n\}$.

Definition 4 (Post Block Version)

Let $p_i$ be one post version and $\tau \in \{\text{text}, \text{code}\}$ be a post block type. Then $b_{i,l}^{\tau}$ denotes one post block of type $\tau$ with local id $l$, for $l \in \{1, \dots, m_i\}$. The function $\mathit{ids}_\tau(p_i)$ maps a post version to the local ids of the post blocks of type $\tau$ in that version.

Definition 5 (Possible Predecessors)

Let $b' = b_{i,k}^{\tau}$ be a post block in version $p_i$ and let $B = \{\, b_{i-1,l}^{\tau} \mid l \in \mathit{ids}_\tau(p_{i-1}) \,\}$ be the post blocks of the same type in the preceding version. Further, let $\mathit{equal}(b, b')$ be a function that tests if the post blocks' contents are equal, and $\mathit{sim}(b, b') \in [0, 1]$ be the similarity of the two post blocks' contents according to the similarity metric $\mathit{sim}$. Let $\theta$ be a threshold for $\mathit{sim}$. Then, we define the set of equal predecessors as

$\mathit{EqualPred}(b') = \{\, b \in B \mid \mathit{equal}(b, b') \,\}$

We define the maximum predecessor similarity as

$\mathit{MaxSim}(b') = \max \{\, \mathit{sim}(b, b') \mid b \in B \,\}$

In case no predecessor with a similarity above the threshold $\theta$ exists, we define $\mathit{MaxSim}(b') = -1$. We define the set of matched predecessors as

$\mathit{MatchedPred}(b') = \{\, b \in B \mid \mathit{sim}(b, b') = \mathit{MaxSim}(b') \geq \theta \,\}$

Finally, we define the set of possible predecessors as

$\mathit{PossPred}(b') = \begin{cases} \mathit{EqualPred}(b') & \text{if } \mathit{EqualPred}(b') \neq \emptyset \\ \mathit{MatchedPred}(b') & \text{otherwise} \end{cases}$


The set of possible successors is defined analogously.

As can be seen in the above definition, we need two different similarity metrics ($\mathit{sim}_{\text{text}}$ and $\mathit{sim}_{\text{code}}$) and two different similarity thresholds ($\theta_{\text{text}}$ and $\theta_{\text{code}}$). We only compute the similarity if the content of the post blocks is not equal, because we want to be able to distinguish equal post block versions from post block versions with a similarity of 1.0 according to the metric. Before we describe our matching strategy, we present two methods that we use in case of multiple possible predecessors. Both methods iterate over all post blocks in a post version that do not have a predecessor yet. They follow different strategies for selecting a predecessor:

The first method, setPredContext, tries to select a predecessor using the post blocks before and after $b'$, i.e., the blocks with local ids $k-1$ and $k+1$ (where $k$ is the local id of $b'$). Please note that those blocks usually have a different post block type than $b'$. In case the predecessors of those neighboring blocks are already set and one predecessor candidate $b$ has the predecessors of those two post blocks as neighbors (local ids $l-1$ and $l+1$ in version $p_{i-1}$), the method sets $b$ as predecessor of $b'$ and returns true. If no predecessor has been set, it returns false. In case of parameter ABOVE, only the post block above (local id $k-1$) is taken into account; in case of parameter BELOW, only the post block below (local id $k+1$) is taken into account. Examples for posts that motivated this strategy are answer 32841902 (mapping of version 2 to 1) and answer 37196630 (mapping of version 2 to 1).

The second method, setPredPosition, sets the post block with the smallest local id distance $|l - k|$, i.e., the post block with the local id closest to $k$, as predecessor of $b'$. If two possible predecessors have the same distance, the method chooses the one with the smallest local id. This approach is based on our observation that the order of post blocks rarely changes (see Section 6.1). Examples for posts that motivated this strategy are question 18276636 (mapping of version 2 to 1) and answer 2581754 (mapping of version 3 to 2).
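A Python sketch of this position-based selection, showing the tie-breaking rule as a minimal illustration:

    # Minimal sketch of the position-based strategy: among the possible
    # predecessors (given by their local ids), pick the one whose local id is
    # closest to k; ties are broken in favor of the smaller local id.
    def closest_predecessor(k, candidate_local_ids):
        return min(candidate_local_ids, key=lambda l: (abs(l - k), l))

    print(closest_predecessor(3, [1, 5]))  # both have distance 2 -> 1 wins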

The complete matching strategy that selects (at most) one predecessor for each post block in a post version can be found as pseudo code in Algorithm 1. The actual source code can be found in method processVersionHistory of class PostVersionList in the corresponding GitHub project (Baltes and Dumani, 2018c).

for all post versions $p_i$, $i \in \{2, \dots, n\}$ do
     // set predecessors where only one candidate exists
     for all post blocks $b'$ in $p_i$ do
          if $|\mathit{PossPred}(b')| = 1$ then
               Let $b$ be the equal or similar predecessor
               if $|\mathit{PossSucc}(b)| = 1$ then
                    Set $b$ as predecessor of $b'$
                    continue
               end if
          end if
     end for
     // set predecessors using context
     while setPredContext($p_i$, BOTH) do
     end while
     while setPredContext($p_i$, ABOVE) do
     end while
     while setPredContext($p_i$, BELOW) do
     end while
     // set predecessors using position
     setPredPosition($p_i$)
end for
Algorithm 1 Initial Matching Strategy

5.4 Metrics Evaluation

The matching strategy described above depends on the results of the similarity metrics $\mathit{sim}_{\text{text}}$ and $\mathit{sim}_{\text{code}}$ and the thresholds $\theta_{\text{text}}$ and $\theta_{\text{code}}$. To select the best metrics for reconstructing the version history of post blocks, we evaluated all 134 metrics in different combinations with different thresholds using our three ground truth samples (the random sample, the sample with at least seven versions, and the Java sample). Please note that the correctness of $\mathit{sim}_{\text{text}}$ and $\mathit{sim}_{\text{code}}$ cannot be evaluated independently, because the neighboring post blocks that setPredContext takes into account usually have a different type. To assess the performance, we measured the runtime of the post history extraction for each configuration. To assess the correctness of the extracted post block history, we regarded each metric configuration as a binary classifier that either assigns the predecessor of a post block version correctly or not (compared to the ground truth). To calculate the number of true/false positives/negatives, we consider the set of predecessor connections, i.e., all pairs $(b, b')$ that have been connected with a certain metric configuration. We then compare those connections with the connections from the ground truth:

Definition 6 (Metric Evaluation)

Let $G_\tau$ be the set of predecessor connections of type $\tau$ in the ground truth, $M_\tau$ be the set of predecessor connections of type $\tau$ determined using a certain metric configuration, and $c_\tau$ be the number of possible predecessor connections of type $\tau$. We define the number of true positives $\mathit{TP}$, false positives $\mathit{FP}$, true negatives $\mathit{TN}$, and false negatives $\mathit{FN}$ as:

$\mathit{TP} = |M_\tau \cap G_\tau| \qquad \mathit{FP} = |M_\tau \setminus G_\tau| \qquad \mathit{FN} = |G_\tau \setminus M_\tau| \qquad \mathit{TN} = c_\tau - |M_\tau \cup G_\tau|$

After each comparison run, we ranked the configurations according to their Matthews correlation coefficient (MCC) (Matthews, 1975), which takes $\mathit{TP}$, $\mathit{FP}$, $\mathit{TN}$, and $\mathit{FN}$ into account. If two configurations had the same MCC value, we ranked them according to their runtime. MCC is the preferred measure when evaluating binary classifiers (Chicco, 2017) and should be chosen over evaluation measures such as recall, precision, or F-measure (Powers, 2011). In our case, it correlates the connections from the ground truth and the connections set by a certain metric configuration. The MCC values are in range $[-1, 1]$; a total disagreement is represented by $-1$, a perfect agreement by $1$. The source code of the tool we used for the metrics evaluation is available on GitHub (Baltes and Dumani, 2018b).
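For reference, a small Python sketch of how MCC is computed from the four counts defined above (the example counts are illustrative):

    from math import sqrt

    # Matthews correlation coefficient from the confusion-matrix counts
    # defined above; returns a value in [-1, 1].
    def mcc(tp, fp, tn, fn):
        denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
        if denom == 0:
            return 0.0  # common convention when a marginal is empty
        return (tp * tn - fp * fn) / denom

    print(mcc(tp=90, fp=5, tn=100, fn=10))  # illustrative counts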

In the first comparison run, we configured $\mathit{sim}_{\text{text}} = \mathit{sim}_{\text{code}}$ and chose $\theta_{\text{text}} = \theta_{\text{code}} \in \{0.0, 0.1, \dots, 1.0\}$. This resulted in 1,474 different configurations (134 metrics with 11 thresholds each). The first run took about 24 hours on a regular desktop PC (Intel Core i7-7700, 64 GB RAM, 512 GB SSD).

For the second run, we selected the metrics which, for a particular threshold, achieved an MCC value in the 95% quantile of all three samples either for text or for code blocks. Some metrics cannot be applied to very short strings (e.g., if the string length is smaller than the n-gram size). For the final implementation, we wanted to have a backup metric that works for all input strings. We therefore filtered edit- and token-based metrics and selected the best candidates according to the criterion described above. Finally, we selected 27 regular and 4 backup metrics for the second run. We also added the equal metric as a baseline. We tested those 32 metrics again with $\mathit{sim}_{\text{text}} = \mathit{sim}_{\text{code}}$, but this time we chose $\theta_{\text{text}} = \theta_{\text{code}} \in \{0.00, 0.01, \dots, 1.00\}$. Thus, the second run tested 3,232 different configurations, which took about 20 hours.

As motivated above, the results of the text and code metrics depend on each other. In the third and last run, we tested all combinations of the best (99% quantile) text and code configurations together with the best backup configurations. This was the first run with $\mathit{sim}_{\text{text}} \neq \mathit{sim}_{\text{code}}$ and with a backup metric for text and code blocks. Those backup metrics were only used if the input strings were too short for the configured metrics. The run, which took about 14 hours, tested all combinations of 13 text configurations, 3 text backup configurations, 15 code configurations, and 2 code backup configurations, resulting in 1,170 combinations in total. For the final selection, we ranked the combinations according to the sum of their MCC scores for text and code blocks.

The final configuration that we used to match post block predecessors for the SOTorrent dataset was:

text:  manhattanFourGramNormalized ($\theta_{\text{text}} = 0.17$)
code:  winnowingFourGramDiceNormalized ($\theta_{\text{code}} = 0.23$)

together with an edit- or token-based backup configuration for text and for code blocks, used when the input strings were too short for the selected metrics.
Figure 7: Performance of selected metrics: manhattanFourGramNormalized for text (blue) and winnowingFourGramDiceNormalized for code (red); selected thresholds: 0.17 for text and 0.23 for code (dotted lines).

Figure 7 shows the performance of the selected metrics for different thresholds, compared to the baseline metric equal. The final configuration achieved its highest MCC values at thresholds 0.17 for text and 0.23 for code (dotted lines in the figure).

5.5 Analysis of False Positive and False Negative Predecessor Matches

While the performance of our matching strategy together with the selected metrics was already good, we were eager to further reduce the number of false positives and negatives. Therefore, we added a feature to our ground truth application that enabled us to display the difference between the ground truth and the mapping that our matching strategy with the default metrics produced (see Figure 8 for an example). The source code of this revised application is available on GitHub (Dumani and Baltes, 2018). We then systematically investigated all 31 posts with false positive or negative code block mappings, and then followed a similar approach as before to improve our matching strategy: First, we decided whether an improved matching strategy could solve the observed matching problem and in case we agreed that it could, we created a test case reproducing the error. This systematic approach led to different improvements to the post block extraction, the matching strategy, and the default similarity metrics, which we outline below.

In the end, we were able to solve the matching problem for 30 out of 31 posts. In one post, the predecessor assignment in the ground truth was semantically correct, but syntactically too different to be detectable using our approach. In 14 cases, we (also) updated the ground truth because we considered the metric-based mapping to be more appropriate. Afterwards, we applied the same systematic approach to check the 62 posts with false positives or negatives in text blocks. We noticed that the changes we implemented based on the code block errors also considerably improved the results for text blocks. For 16 posts, our updated matching strategy removed the false positive and the false negative matches. In only 8 text block version comparisons was our strategy unable to achieve the mapping described in the ground truth, because the predecessor assignment of the text blocks was semantically correct, but syntactically too different to be detectable using our approach. We updated the ground truth of 41 posts where we considered the metric-based mapping to be more appropriate. Considering all 83 distinct posts with false positive or false negative matches for either code or text blocks, only eight of them (9.6%) could not be correctly matched by our revised matching strategy, due to this gap between semantic and syntactic similarity. In all other cases, either the revised matching strategy resolved the issues or the ground truth had to be adjusted (dataset available on Zenodo, Baltes et al. (2017a)). Our next step will be to re-run the complete metrics evaluation (Section 5.4) to see if, with our revised matching strategy, adjusted thresholds or different metrics yield even better results.

Figure 8: Issue with previous matching strategy in case the equal match is not available anymore (version 6 and 7 of question 17158055): Orange/blue rectangles are connections in ground truth, lines are connections set by the previous matching strategy in combination with the selected default metrics; the connection between code blocks C’v6 and Cv7-1 is missing, because Cv7-1 has an equal match in the previous version (Cv6), which is not available anymore; C’v6 is very similar, but not equal to Cv7-1.

5.6 Revised Matching Strategy and Post Block Extraction

To address the observed issues, we first changed the post block extraction to also detect code blocks that are formatted as inline code, but are the only content in a line and thus formatted as code blocks (see, for example, code block C’ in Figure 8). We further updated the default similarity metrics as follows: We unified the normalization for edit- and n-gram-based metrics and extended the set of special characters by adding colons, commas, and periods. The reason for this was that the configured metric yielded a similarity of 0 for the strings “to” and “to:”, because they were two different tokens, even after normalization. We noticed this when checking the false negative matches in question 38463455 between versions 3 and 4. In the same post, we further observed a case where the Winnowing algorithm did not trigger the backup metric correctly in case one of the input strings was too short for the configured window size. We fixed this to resolve the corresponding false negative.

The changes to the matching strategy were more complex. One of the main issues was that we only considered equal predecessors or predecessors with maximum similarity as possible predecessors. However, those predecessor candidates may not be available anymore at the time our algorithm reaches a certain post block. Figure 8 shows an exemplary false negative match caused by this behavior. The connection between code blocks C’v6 and Cv7-1 is missing, because Cv7-1 has an equal match in the previous version (Cv6) that is not available anymore at the time the algorithm tries to set the predecessor. Code block C’v6 is very similar to code blocks Cv7-1 and Cv7-2, but not equal. Thus, the set of possible predecessors of Cv7-1 only contains Cv6, but not C’v6. We updated the matching strategy as follows to address the above-mentioned issue:

Definition 7 (Runner-up Predecessors)

Let $b' = b_{i,k}^{\tau}$ be a post block in version $p_i$ and $B$ the post blocks of the same type in the preceding version (as above). Further, let $\mathit{available}(b)$ be a function that tests if a post block is still available, meaning that it has not been assigned as predecessor of a post block in the succeeding version yet. $\mathit{EqualPred}$, $\mathit{MaxSim}$, $\mathit{MatchedPred}$, and $\mathit{PossPred}$ have already been defined above.

We define the set of runner-up predecessors as

$\mathit{RunnerUpPred}(b') = \{\, b \in B \mid \mathit{available}(b) \wedge b \notin \mathit{PossPred}(b') \wedge \mathit{sim}(b, b') \geq \theta \,\}$

We define the best runner-up match as

$\mathit{BestRunnerUp}(b') = \operatorname{arg\,max}_{b \in \mathit{RunnerUpPred}(b')} \mathit{sim}(b, b')$

Using the above definitions, we can now define a new matching strategy that also works in case the optimal match is not available anymore:

The new method, setRunnerUpPred, sets $\mathit{BestRunnerUp}(b')$ as predecessor of $b'$ if $\mathit{RunnerUpPred}(b') \neq \emptyset$. Please note that the successor set of the selected post block is empty, because it did not have the maximum predecessor similarity for any of the post blocks in the succeeding version. If $\mathit{RunnerUpPred}(b') = \emptyset$, the strategy does not set any predecessors.

Algorithm 2 shows the complete revised matching strategy (new parts are marked by a comment). We use the new matching strategy two times in the algorithm: at the beginning, in case a unique match is not available anymore, and at the end, after all other strategies have failed to set a predecessor.

for all post versions $p_i$, $i \in \{2, \dots, n\}$ do
     // set predecessors where only one candidate exists
     for all post blocks $b'$ in $p_i$ do
          if $|\mathit{PossPred}(b')| = 1$ then
               Let $b$ be the equal or similar predecessor
               if $\mathit{available}(b)$ then // new
                    if $|\mathit{PossSucc}(b)| = 1$ then
                         Set $b$ as predecessor of $b'$
                         continue
                    end if
               else // new
                    setRunnerUpPred($b'$) // new
               end if
          end if
     end for
     // set predecessors using context
     while setPredContext($p_i$, BOTH) do
     end while
     while setPredContext($p_i$, ABOVE) do
     end while
     while setPredContext($p_i$, BELOW) do
     end while
     // set predecessors using position
     setPredPosition($p_i$)
     // set runner-up predecessors for the remaining post blocks
     setRunnerUpPred(remaining post blocks in $p_i$) // new
end for
Algorithm 2 Revised Matching Strategy

6 Evolution of Stack Overflow Posts

After describing how we reconstructed the version history for individual text and code blocks, we come back to our initial research questions. We first characterize the phenomenon of SO post evolution, and in particular the evolution of individual post blocks (RQ1). To find out if edited posts share common characteristics, we analyzed if certain measures such as score or number of comments correlate with the number of edits (RQ2). We also investigated if those measures have a temporal relationship with the edits, in particular if comments happen immediately before or after edits and whether their relationship follows patterns (RQ3, see Section 7). Finally, we utilized SOTorrent to analyze code clones on SO (RQ4, see Section 8).

As descriptive statistics, we use the mean ($M$), standard deviation ($SD$), median ($Mdn$), and the first and third quartiles ($Q_1$, $Q_3$). To test for significant differences, we applied the nonparametric two-sided Wilcoxon rank-sum test (Wilcoxon, 1945) and report the corresponding p-value. To measure the effect size, we used Cohen's $d$ (Cohen, 1988; Gibbons et al., 1993). Our interpretation of $d$ is based on the thresholds described by Cohen (Cohen, 1992): negligible effect ($|d| < 0.2$), small effect ($|d| < 0.5$), medium effect ($|d| < 0.8$), otherwise large effect. We used the nonparametric Spearman's rank correlation coefficient ($\rho$) (Spearman, 1904) to test the statistical dependence between two variables. Our interpretation of $\rho$ is based on Hinkle et al.'s scheme (Hinkle et al., 1979): low correlation ($0.3 \leq |\rho| < 0.5$), moderate correlation ($0.5 \leq |\rho| < 0.7$), high correlation ($0.7 \leq |\rho| < 0.9$), and very high correlation ($0.9 \leq |\rho| \leq 1$).

6.1 Quantitative Analysis

In the following, we describe different properties of post blocks and post block versions either for their most recent version in the SOTorrent release 2018-02-16, or for different versions over time:

Post Block Count:

Half of all posts in the SOTorrent dataset contain between one and two text blocks and between zero and two code blocks. There are only a few posts without text blocks, but over a third of all posts do not have code blocks. Examples for such posts include conceptual questions and answers, but also posts with inline code that we considered to be part of the text blocks. If we compare the first and the last version of edited posts, we can observe a statistically significant difference in the number of text and code blocks; posts tend to grow over time. However, the effect is only small.

Post Block Length:

Code blocks tend to be larger than text blocks. Figure 9 visualizes the difference measured in number of lines; the average code block contains 12 lines (cf. Section 1) and is thus considerably longer than the average text block. We compared the length of post blocks in the first and the last version and found no effect. Thus, we can conclude that posts tend to become longer over time in terms of their number of post blocks, but the length of individual post blocks is relatively stable.

Post Block Versions:

For our analysis of post block versions, we retrieved all post block lifespans in the dataset, but only considered the initial versions and later versions where the content of the blocks changed (not all blocks are edited in all versions). We found that about half of all post blocks were edited after their creation (see Figure 10). We analyzed the line-based differences between post block versions and found that many edits modify only a single line of text or code. There is a significant difference in the size of changes when comparing text and code blocks, with a medium effect for the number of added and deleted lines: Changes in code blocks are larger, which is expected due to the larger size of code blocks compared to text blocks.

Post Block Co-change:

We were also interested in the co-change of text and code blocks, i.e., whether text and code are edited together. We found that, in most post versions, text and code blocks were either edited together or just the text blocks were edited; only in a small fraction of all post versions were code blocks changed without also editing text blocks. This could indicate that SO authors document changes to their code snippets in the text blocks or update the description of the modified code.

Figure 9: Boxplots showing the line count of text and code blocks in the latest version of Stack Overflow posts.
Figure 10: Histogram and boxplot showing the number of post block versions (vertical line visualizes the median value 1).
Figure 11: Bar chart visualizing all edit timespans between one and eight weeks; the remaining values are spread over a range of 475 weeks.

Order of Post Blocks:

To check our assumption that the order of post blocks rarely changes, we computed the difference between the local ids of all post block versions and their predecessors. We found that the vast majority of all post block versions have the same local id as their predecessor. Of all absolute differences, two was the most common one, which is expected, because text and code blocks usually alternate. Thus, e.g., swapping two blocks of the same type leads to a local id difference of two in the next version.

Timespan Between Edits:

For the posts that have been edited after their creation, we analyzed the timespan between the edits. Most first post edits happen on the same day as the creation of the post, the majority of the remaining ones within the first week or the first year, and only a minority more than one year after the creation. If we only consider the second or later edits, not much changes. Overall, most edits happen on the same day, i.e., soon after the creation of the post, or within the first week after the creation (see Figure 11).

Post Editors:

On SO, either the author of a post or a moderator, i.e., a SO user with a reputation of at least 2,000, can make edits. We found that the vast majority of all edits were conducted by the post authors themselves and only a small fraction by moderators. We found no effect of the authors’ reputation on the fact that a moderator edits the post. We consider an analysis of typical moderator changes to be an interesting direction for future work.

Questions vs. Answers:

To compare questions and answers, we split the posts according to their post type and then analyzed the three properties Post Block Count, Post Block Length, and Post Block Versions for the most recent version of the posts. Regarding the post block count, we found that answers have significantly fewer text and code blocks than questions; both effects are small. We found no effect when comparing the length of text blocks. However, code blocks in answers tend to be smaller than code blocks in questions; this difference was significant with a small effect, both when measured in lines and in characters. We did not observe a significant difference in the number of versions for questions compared to answers.

6.2 Properties of Edited Posts

Table 3: Correlation table with Spearman's correlation coefficients for different properties of Stack Overflow posts: Versions, Age, Score, Comments, and GHMatches (n = 38.4m posts, n = 137k for GHMatches).

To investigate which properties edited posts possess, we searched for monotonic relationships between the version count of a post and other properties such as the age of the post, its score, comment count, or the number of distinct files on GH referring to the post. Table 3 shows the correlation coefficients for those relationships based on SOTorrent release 2018-02-16. There was no correlation that exceeded the threshold for a low correlation, i.e., all coefficients were below 0.3. However, the relationship between the version count and the number of comments drew our attention as it had the highest correlation coefficient in the table. We decided to explore this relationship using a quasi-experiment: We compared the number of comments of all posts with only one version to all posts with more than one version. The difference was significant and the effect size was medium. We also compared the opposite relationship, i.e., the number of versions of all posts with at most one comment to all posts with more than one comment. Again, the difference was significant, but the effect size was small.

7 Communication and Edit Patterns

Our findings so far suggest that a relationship exists between Stack Overflow post edits and communication events such as comments. To identify common communication and edit patterns in Stack Overflow threads, we first conducted a quantitative analysis of the temporal connection between edits and comments. A follow-up qualitative study motivated the design of a visual analysis tool that we then used to manually annotate a sample of Stack Overflow threads.

7.1 Quantitative Analysis

As a first step in exploring the relationship between comments and post edits, we looked at their temporal connection, i.e., whether comments usually happen before or after edits. First, we aggregated all edits (including post creation) and all comments per post id and day. Thus, our units of observation were all days where posts were either created, edited, or commented. A large share of the comments happened on a day where the post had either been created or edited. We then further focused on those days and calculated the time difference between a comment and the closest edit. If a comment was closer to the creation than to an edit, we assigned the comment to the creation. Comments turned out to be related both to the creation of the post and to later edits; of the latter, some were made before an edit and some afterwards. Moreover, the comments were usually made right before or soon after the edits.
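The assignment of comments to their closest event can be sketched as follows; the timestamps are illustrative.

    # Sketch of the comment-to-event assignment: each comment is assigned to
    # the temporally closest event (post creation or edit).
    from datetime import datetime

    events = [  # (timestamp, kind); the first event is the post creation
        (datetime(2018, 1, 1, 10, 0), "creation"),
        (datetime(2018, 1, 1, 10, 45), "edit"),
    ]
    comments = [datetime(2018, 1, 1, 10, 5), datetime(2018, 1, 1, 10, 40)]

    for c in comments:
        ts, kind = min(events, key=lambda e: abs((e[0] - c).total_seconds()))
        relation = "before" if c < ts else "after"
        print(f"comment at {c:%H:%M} -> {kind} at {ts:%H:%M} ({relation})")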

7.2 Qualitative Analysis

To further investigate the connection between post edits and comments that are made immediately before or after edits, we conducted a qualitative analysis. We drew a random sample of 50 posts: 25 posts for which at least one comment had been made at most 10 minutes before an edit and 25 posts for which at least one comment had been made at most 10 minutes after an edit. We qualitatively analyzed the posts and found that, in the majority of cases, the comments and edits were clearly related (34 of 50 posts in our sample) and that the edit added or modified a code block (30/50). We classified a small set of comments as bug reports (10/50) and found that in some cases, the edit was explicitly documented in the post (11/50, e.g., by prefixing content with “EDIT:”). Comments often asked for additional information (22/50), and in cases where comments happened shortly before the edits, the comment was often a clarifying question (14/25). Answer 15437937 (https://stackoverflow.com/a/15437937) represents a typical example: In a timespan of 35 minutes, a user answered a question, edited the answer three times, and commented on it once in response to three comments from the user asking the question. To analyze such communication structures in more detail, we used SOTorrent to aggregate edit and comment events for whole threads and built a visual analysis tool to identify patterns.

7.3 Visual Analysis Tool

We first aggregated all edit and comment events in SOTorrent release 2018-09-23 as described in a blog post (http://empirical-software.engineering/blog/sotorrent-edithistory). We then drew a random sample of 50 threads with at least one post edit and one comment (see retrieval and analysis scripts on GitHub (Baltes, 2018b, c)). This sample contained 255 edit and 319 comment events from 140 different posts. We qualitatively analyzed 20 of those threads, which means that we investigated the relationship of 101 edit and 112 comment events from 58 different posts in detail. To this end, we utilized a web-based visual analysis tool that we specifically designed to analyze the evolution of Stack Overflow threads. Two authors analyzed a subset of the sample and agreed on an annotation strategy, after which one author continued the analysis.

Figure 12 shows the two main views of our visual analysis tool. The tool provides an overview of the edit and comment events in a thread (upper part of the figure). It displays the question of a thread in the first row and the answers sorted by their creation date below. All edit events (I: initial version, E: body edit, TE: title edit) and comment events (C) are plotted using discrete time, with each new day shown as a vertical line. A circle border in the same color as the circle fill indicates an edit/comment by the post author, a red border indicates an edit/comment by another user. The currently selected event is highlighted using an additional yellow border, and its event id is also shown in the header. When hovering over events, a tooltip shows the exact timestamp of the event. Clicking on an event opens a focused view that uses continuous instead of discrete time, grouped in time frames of one hour (see middle part of Figure 12 and Figure 13). Pressing the ‘alt’ key while clicking on an event in the main view or just clicking on an event in the focused view opens the edit or comment on the Stack Overflow website (see lower part of Figure 12). While the overview enables users to explore the complete evolution of a post, the focused view makes it easier to spot (temporally) related events. In the example shown in Figure 12, the comment and the edit on the left and the agglomeration of edits and comments on the right form two separate groups. The source code of the tool together with our remarks for the 20 analyzed threads can be found on GitHub (Baltes, 2018e). A live demo of the tool is also available (http://research.sbaltes.com/so-edit-viz/).

Figure 12: Post evolution visualization: The so-edit-viz tool enabled us to visually explore the relationship of edits and comments in Stack Overflow threads (here: thread for question 7953840).

7.4 Patterns

Our analysis revealed six communication and edit patterns, which we describe in the following.

Figure 13: Time line of the burst of commenting and editing activity shortly after Stack Overflow question 11252831 was posted.
Figure 14: Excerpt of the comment and edit history of Stack Overflow thread 376732.

Burst of Activity:

Several comments and edits occur within minutes of each other. This pattern was very common in our sample of twenty threads: sixteen of the threads contained at least one burst of activity.

Figure 13 shows part of the time line of Stack Overflow question 11252831 to illustrate this pattern: After the initial question was posted, the thread attracted two answers, seven comments, and one post edit within less than 40 minutes. This burst of activity started with a clarification question posted as a comment to the question, followed by the first answer (posted by a different user), and a response to the clarification question by the user who had started the thread. One minute later, the same user asked a clarification question by commenting on the first answer, in response to which the user who had posted this answer edited it and explained the edit in a comment. Three minutes after that, the second answer was posted, followed by thank-you comments on both answers from the user who had started the thread. Interestingly, in their comment on the second answer, the user referred to the edit of the first answer, before the user who had posted the second answer commented that they were planning to update documentation elsewhere to further clarify the issue.

Comment explains Edit:

A comment is used to explain and/or make others aware of an edit. This pattern occurred five times in our sample of twenty threads.

A comment on Stack Overflow question 8687577 illustrates this pattern: In response to a jQuery-related question by a new user, another user commented “1) Welcome to SO. 2) It’s not clear what you want to know / are trying to do.” The user asking the question then proceeded to edit the question to clarify, and left a comment to make the community aware of the edited content: “I think it should be clearer now [after the] post edit. Thanks again.” A similar example occurred in Stack Overflow thread 24987992: A user asked a question about how to draw a particular line in D3.js, and another user asked for clarification through a comment: “Can you post also some image of your wanted output, it’s hard to imagine what image you want?” In response to this comment, the user who had started the thread then edited the question to add a link to an image showing a sketch of the current output and the desired output, and a few minutes later, posted a comment to increase awareness of the edit: “I upload[ed] the image [url], please take a look”.

Comment triggers Edit:

A post is edited in response to a comment, which happened in four out of the twenty threads in our sample.

For example, Stack Overflow thread 376732, which is visualized in Figure 14, contains two instances of this pattern: The first comment on the question asks “What do you have in your .htaccess?”, in response to which the user who had asked the question edited it, adding a six-line code snippet along with the text “EDIT: This is the current htaccess:”. A similar pattern occurred in the same thread almost three years later: A user commented on the accepted answer, stating “I don’t think it’s a valid solution, because with the 404 error you’ll be serving the page OK but in the header response you’ll see the 404 status code, so it will mess up with your SEO, right?” The next day, the user who had posted the answer updated it in response to the comment, and also left a new comment explaining the edit (cf. previous pattern): “You are right I have changed the example accordingly […]”.

Question Edit triggers Answer:

An answer is posted shortly after the question has been edited. This pattern occurred twice in our sample.

Stack Overflow thread 13864443 is a good example of this pattern. The user who had asked the question did not receive a response right away and proceeded to make various edits to the question, including the addition of an extra tag and an explanation of the particular constraints of their situation. Within minutes of one of these edits, the first answer to the question was posted – more than 15 hours after the question was originally asked.

Overlap between Comment and Edit:

Text and/or code is copied between comments and post edits, which occurred in two out of the twenty threads in our sample.

In Stack Overflow thread 3529744, the user who had originally asked the question copied a clarification comment they had made in response to another comment into the question text itself: “It is stand [alone] code. As is. There is no [query] before or after this code.” A more extreme example of this copy-and-paste pattern occurred in Stack Overflow thread 16245209. The user who had asked the question initially did not include one important code snippet and was asked for this code snippet in both comments and answers. They then proceeded to edit the question to include a 19-line code snippet, and also added the snippet in the form of a comment to both the question and the answer.

Comment announces Edit:

A comment is used to announce a subsequent edit by the same user. We identified two instances of this pattern in our sample.

In both cases, this announcement was made in the context of an ongoing discussion. In Stack Overflow thread 20849332, the user who asked the question commented in response to a suggestion received in a previous comment on the question: “[…] I’ll update the question in a minute with more detail and some output.” They proceeded to make the promised edit nine minutes later. In Stack Overflow thread 17591278, the user who had asked the question commented in response to an answer: “[…] I tried your suggestion with some modification and it worked in a certain way (I’ll edit my post in few minutes) […]”, and the corresponding edit was made less than an hour after this comment.

8 Code Clones on Stack Overflow

Code clones have been extensively studied in the software engineering research community. Juergens et al. found that inconsistent code clones can be a major problem during the development and maintenance of software projects, unless “special care is taken to find and track existing clones and their evolution” (Juergens et al., 2009). Stack Overflow threads frequently serve as crowd-sourced software documentation (Parnin et al., 2012; Treude et al., 2011), often containing code snippets together with explanations (Yang et al., 2016). Although code clones on Stack Overflow can suffer from issues similar to those of code clones in software projects, their role has not been investigated yet. In this section, we present a first analysis of code clones on Stack Overflow, based on the SOTorrent dataset. We focus on duplicates of code snippets copied from external sources into SO and on duplicates of code snippets within SO. The usage and attribution of code snippets copied from SO in open source software projects is already covered by our previous work (Baltes and Diehl, 2018). We were particularly interested in the licensing of snippets copied into Stack Overflow and whether their license status allows redistribution on Stack Overflow.

8.1 Data Retrieval and Quantitative Analysis

To detect code clones on Stack Overflow, we utilized the BigQuery version of SOTorrent release 2018-09-23. First, we selected all code blocks from the most recent post versions and normalized their whitespace. To this end, we: (1) replaced sequences of new lines with a single new line character, (2) removed new lines at the end of the last line, and (3) removed lines only containing brackets (()[]{}). Using this normalized content, we calculated the normalized line count of those code blocks (NLOC). Afterwards, we further normalized the content to only contain letters and digits (character class [a-zA-Z0-9]) and calculated a fingerprint of the normalized code block content using the FARM_FINGERPRINT function. This yielded 43,942,960 distinct fingerprints—that is, normalized code blocks—in total. We then used this fingerprint as a GROUP BY argument to determine the posts using a certain snippet, and finally aggregated that information per thread. The corresponding retrieval script can be found on GitHub (Baltes, 2018b).
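The following sketch illustrates how this normalization and fingerprinting can be expressed in BigQuery standard SQL. The table and column names (PostBlockVersion, PostBlockTypeId, Content) follow the public SOTorrent schema, but the query is a simplified illustration, not our actual retrieval script; in particular, the restriction to the most recent post version is omitted:

-- Sketch of the normalization and fingerprinting step (BigQuery standard SQL).
WITH normalized AS (
  SELECT
    PostId,
    REGEXP_REPLACE(                                -- (3) drop bracket-only lines
      REGEXP_REPLACE(                              -- (2) trim trailing new lines
        REGEXP_REPLACE(Content, r'\n+', '\n'),     -- (1) collapse runs of new lines
        r'\n+$', ''),
      r'(?m)^[()\[\]{}]+$\n?', '') AS NormalizedContent
  FROM `sotorrent-org.2018_09_23.PostBlockVersion`
  WHERE PostBlockTypeId = 2                        -- 2 = code block
    -- plus a filter restricting rows to the most recent post version
)
SELECT
  FARM_FINGERPRINT(
    REGEXP_REPLACE(NormalizedContent, r'[^a-zA-Z0-9]', '')) AS Fingerprint,
  ARRAY_LENGTH(SPLIT(NormalizedContent, '\n')) AS NLOC,
  PostId
FROM normalized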

As a first filtering step, we selected code blocks that are present in at least two different threads, which was true for 909,323 (2.1%) of all distinct fingerprints. Those code clones had an average length of 5.4 normalized lines and were present in 3.5 different threads on average. To select only non-trivial code snippets, we first used a threshold of six normalized lines of code, as proposed by Bellon et al. (Bellon et al., 2007). We ranked the remaining 215,746 code snippets according to the number of threads they were found in and according to their normalized length. Then, we qualitatively analyzed the first 50 snippets in that list. Since we rated 25 of those snippets as non-code (mainly configuration files) or as too trivial, we adjusted the threshold for the normalized line count to 20.
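A query of the following shape can express this filtering step. It is again a hedged sketch: it assumes the output of the previous query has been materialized, together with a ThreadId column, as a table named fingerprints:

-- Sketch of the clone filtering step: keep fingerprints that occur in at
-- least two different threads and meet the NLOC threshold, then rank them.
SELECT
  Fingerprint,
  COUNT(DISTINCT ThreadId) AS ThreadCount,
  MAX(NLOC) AS NLOC
FROM fingerprints
GROUP BY Fingerprint
HAVING COUNT(DISTINCT ThreadId) >= 2
   AND MAX(NLOC) >= 6          -- threshold later raised to 20
ORDER BY ThreadCount DESC, NLOC DESC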

Figure 15: Normalized line count of non-trivial code blocks (≥ 20 NLOC) with at least one clone, i.e., present in at least two threads.
Figure 16: Presence of non-trivial code blocks (≥ 20 NLOC) in multiple threads.

The stricter filtering led to a second sample of 46,818 code snippets. Those snippets had an average length of 42.6 normalized lines and were present in 2.3 different threads on average—13.4% of the snippets were present in more than two threads. Figures 15 and 16 visualize the length and thread count distributions in this sample. We provide the coding for both samples (≥ 6 NLOC and ≥ 20 NLOC) on Zenodo (Baltes, 2018d). The analysis scripts are available on GitHub (Baltes, 2018c).

8.2 Qualitative Analysis

Figure 17: Snippet view of so-clones tool showing a code snippet that has likely been copied from the website androidhive into Stack Overflow.

In the second sample, we again ranked the code snippets according to their thread count and length to qualitatively analyze the first 50 snippets in the resulting list. We also implemented a web tool (http://research.sbaltes.com/so-clones/) that allowed us to explore that list. The tool enables users not only to browse the complete list, but also to focus on a single snippet in a dedicated view. This view (see Figure 17) shows the snippet, its fingerprint, the posts containing the snippet sorted by their creation date, other Stack Overflow posts linked from those posts, and linked external sources. The latter information helped us identify whether and from where a snippet had been copied into Stack Overflow. The source code of the tool is available on GitHub (Baltes, 2018d).

While there were still ten snippets that we categorized as configuration files, 29 snippets were non-trivial source code snippets (mainly Java and VB/VBA). Other categories included XML GUI definitions for Android, JSON/XML examples, and HTML files. Except for two cases, we were able to identify the (or at least a) source of the snippet. Only in four cases did we consider the snippets to be originally from Stack Overflow. The main external sources were a website providing Android tutorials (https://www.androidhive.info/, ten snippets) and the official Android documentation (http://developer.android.com/, four snippets). We identified a possible licensing conflict in 31 cases, either because the website did not provide any license or because the content was distributed under a restrictive license or under restrictive terms of use. In the following, we describe the two main external sources in more detail. The independent Android website androidhive has rather restrictive terms of use (https://www.androidhive.info/terms-of-service/):

“Our Website is also protected under international copyright laws. The copying, redistribution, use or publication by you of any portion of our Website is strictly prohibited.”

Nevertheless, only a few Stack Overflow posts attribute this source (3 out of 45 posts in the example shown in Figure 17). It is unclear whether the snippet was actually copied from this external source (the posts on androidhive and Stack Overflow were both created around April/May 2012). If it was, the 45 copies on Stack Overflow could be problematic. In fact, we identified four more variants of that same code snippet among the 50 snippets we analyzed. On the other hand, if Stack Overflow is the original source, the usage on androidhive does not adhere to Stack Overflow’s CC BY-SA license (Baltes and Diehl, 2018).

The snippets copied from the official Android documentation are licensed under CC BY 2.5 (https://developer.android.com/license). This license allows usage under Stack Overflow’s CC BY-SA license, but only when attributing the original source. However, only in a few cases did users add a link to the Android documentation to their posts. Thus, those usages could also lead to licensing issues.

Leaving the licensing implications aside, the code clones within Stack Overflow are also problematic for the platform’s usability. Those duplicates could indicate that different threads solved a similar problem; however, if there is no link between those threads, the information is scattered and hard for readers to capture. Stack Overflow recommends that users “always quote the most relevant part of an important link, in case the target site is unreachable or goes permanently offline” (https://stackoverflow.com/help/how-to-answer). While it makes sense to quote the main points of an external source or the pseudo code of an algorithm, it is questionable whether it is reasonable to have several independent copies of non-trivial code snippets on Stack Overflow. If the snippet in the reference documentation is updated, all copies on Stack Overflow (14 in this example) must also be updated. Again, only a few Stack Overflow authors link to other posts that already provide the same snippet, making it even harder to update them.

To discuss how to best approach those licensing and usability issues, we created a post on Stack Overflow Meta (Stack Overflow Meta, 2018) to involve the community. We asked, for example, whether it would make sense to point Stack Overflow users to related threads based on the similarity of the code blocks posted in a thread, which could be done before users post a question or integrated into the website for existing posts. The post was upvoted to a score of 28 (as of October 30, 2018) and is being discussed in the comments, but there is no answer yet. Stack Overflow user Martijn Pieters, for example, wrote (https://meta.stackoverflow.com/questions/375761/how-to-handle-code-clones-on-stack-overflow#comment641119_375761):

“I see this a lot in Java (especially Android) code when researching serial plagiarists. There is a lot of example code floating around that is free to copy, but there seems to be an endemic culture that sees copying as a legitimate method of developing software. […] answers should primarily be your own work, not someone else’s.”

One preliminary result of the discussion is that there are comments in favor of adding the missing attribution for the external source to the Stack Overflow posts. However, this would only solve the licensing issue for snippets licensed under a rather permissive license. Moreover, the clones on Stack Overflow would still be isolated from each other. Depending on the outcome of the discussion on Stack Overflow Meta, we plan to implement the approach that the community favors, for example by automatically proposing post edits to add the missing attribution.

9 Discussion

The SOTorrent dataset has allowed us to study the phenomenon of post editing on SO in detail (RQ1). We found that a total of 13.9 million SO posts have been edited at least once. Many of these edits modify only a single line of text or code, and while posts grow over time in terms of the number of text and code blocks they contain, the size of these individual blocks is relatively stable. Interestingly, code blocks are rarely changed without corresponding changes in text blocks of the same post, suggesting that SO users typically update the textual description accompanying code snippets when they are edited. We also found that edits are mostly made shortly after the creation of a post, and that the vast majority of edits are made by post authors—although the remaining edits by other users will be of particular interest for our future work. The number of comments on posts without edits is significantly smaller than the number of comments on posts with edits, suggesting an interplay of these two features (RQ2). We find evidence which suggests that commenting on a post on SO helps to bring attention to it (RQ3). Comments made on the same day as an edit typically occurred (median value) only 18 minutes before or after the edit.

Motivated by this quantitative analysis of the temporal relationship between edits and comments, we conducted a qualitative study and developed a visual analysis tool to explore the communication structure of Stack Overflow threads. Our analysis using this tool revealed several communication and edit patterns (RQ3) that provide further evidence for the connection between post edits and comments. We found comments which explain, trigger, and announce edits as well as content overlap between edits and comments. The fact that Stack Overflow users rely on the commenting feature to make others aware of post edits—and in some cases even duplicate content between comments and posts—suggests that users are worried that content evolution will be missed if it is buried in a comment or has been added to a post later via an edit. At the same time, we found evidence that edits can play a vital role in attracting answers to a question. In our future work, we will explore how changes to Stack Overflow’s user interface could make the evolution of content more explicit and remove the need for users to repurpose the commenting feature as an awareness mechanism.

In addition, we presented a first investigation of code clones on Stack Overflow (RQ4), which revealed that, just like in regular software projects, code clones on Stack Overflow can affect the maintainability of posts and lead to licensing issues. Depending on the outcome of the discussion we started on Stack Overflow Meta, we plan to implement means to add the missing attribution to posts and to mark threads as related based on the similarity of the code blocks they contain.

10 Related Work

Over the past years, there have been various research papers on leveraging knowledge from SO, e.g., to support post edits (Chen et al., 2017), to automate the search (Ponzanelli et al., 2013; Campbell and Treude, 2017), or to augment API documentation (Treude and Robillard, 2016). Regarding the population of SO users, studies described properties such as gender (Vasilescu et al., 2012) and age (Morrison and Murphy-Hill, 2013). Wang et al. (Wang et al., 2013) analyzed the asking and answering behavior of SO users and found that most of them only answer or ask one question. We complement those results with our finding that post edits happen soon after post creation and that comments are closely linked to edits. Xia et al. (Xia et al., 2017) describe that it is common for developers to search for reusable code snippets on the web. Yang et al. (Yang et al., 2016) found that SO Python and JavaScript snippets are more usable in terms of parsability, compilability and runnability, compared to Java and C#. Yang et al. (Yang et al., 2017) analyzed code clones between Python snippets from SO and Python projects on GH and found a considerable number of non-trivial clones, which may have a negative impact on code quality (Abdalkareem et al., 2017). Baltes and Diehl (Baltes and Diehl, 2018) investigated the usage and attribution of SO code snippets in GH projects and found that at most a quarter of the usages are attributed as required by SO’s license. Moreover, they point to possible licensing issues, similar to what we described in Section 8. Other studies aimed at identifying API usage in SO code snippets (Subramanian and Holmes, 2013), describing characteristics of effective code examples (Nasehi et al., 2012), investigating whether SO code snippets are self-explanatory (Treude and Robillard, 2017), or analyzing the impact of copied SO code snippets on application security (Acar et al., 2016; Fischer et al., 2017). There has also been work on the interplay between user activity on SO and GH (Vasilescu et al., 2013; Silvestri et al., 2015; Badashian et al., 2014). SOTorrent enables researchers to further investigate this connection by collecting links from public GH projects to SO posts. To describe topics of SO questions and answers, different methods like manual analysis (Treude et al., 2011) and Latent Dirichlet Allocation (Wang et al., 2013; Allamanis and Sutton, 2013) have been used. Automatically identifying high-quality posts has been another research direction, where metrics based on the number of edits on a question (Yang et al., 2014), author popularity (Ponzanelli et al., 2014), and code readability (Duijn et al., 2015) yielded good results. With our dataset, the evolution of such high-quality posts can easily be analyzed. German et al. (German et al., 2009) investigated how code siblings, code clones that evolve in a different system than the original code, flow between systems with different licenses. Tracing the flow of siblings between GH projects, posts on SO, and external sources is another possible direction for future work that SOTorrent can support. Two fields related to our research are source code plagiarism detection (Lancaster and Culwin, 2004) and code clone detection (Roy et al., 2009), which both rely on determining the similarity of code fragments.

11 Conclusion

In this paper, we described how we reconstructed and analyzed the evolution of Stack Overflow posts. We presented the open dataset SOTorrent that enables researchers to analyze the evolution of SO content at the level of whole posts and individual text and code blocks. We described how we evaluated 134 different string similarity metrics regarding their suitability to match text and code blocks to their predecessor versions. For text blocks, a profile-based metric using the Manhattan distance yielded the best results; for code blocks, a fingerprint-based metric using the Winnowing algorithm (Schleimer et al., 2003; Duric and Gasevic, 2013) outperformed the other metrics. Since multiple predecessor candidates may exist, we also developed a matching strategy that we iteratively refined using random samples of SO posts. After an analysis of false positive and false negative matches, we further improved this strategy.

Our quantitative and qualitative analyses based on the dataset provided new insights into the evolution of SO posts, in particular the relationship between post edits and comments and the presence of code clones on SO. In future work, we want to deepen our understanding of how code snippets are maintained on SO and how code clones affect their maintainability. Moreover, as SOTorrent also collects links from SO posts to other websites and from public GH projects to SO posts, we want to explore how code flows from and to external sources such as blog posts and open source software projects. Besides the investigation of new research questions, we will continue to improve and maintain the dataset, for example by developing means to automatically detect code blocks that are not used for code, but for markup (see, e.g., the second code block in Figure 1). We welcome bug reports and ideas for improvements, especially from researchers who use SOTorrent to investigate the evolution of SO posts and their connection to other platforms and resources. Everyone can provide feedback by creating an issue on GitHub (https://github.com/sotorrent/db-scripts/issues).

Acknowledgements.
The authors would like to thank Florian Reitz for his help with database-related issues and Tobias Zeimetz for creating the post history ground truth.

References

  • Abdalkareem et al. (2017) Abdalkareem R, Shihab E, Rilling J (2017) On code reuse from StackOverflow: An exploratory study on Android apps. Information and Software Technology 88:148–158
  • Acar et al. (2016) Acar Y, Backes M, Fahl S, Kim D, Mazurek ML, Stransky C (2016) You Get Where You’re Looking For: The Impact Of Information Sources on Code Security. In: Locasto M, Shmatikov V, Erlingsson Ú (eds) 2016 IEEE Symposium on Security and Privacy (S&P 2016), IEEE Computer Society, San Jose, CA, USA, pp 289–305
  • Allamanis and Sutton (2013) Allamanis M, Sutton C (2013) Why, when, and what: Analyzing Stack Overflow questions by topic, type, and code. In: Zimmermann T, Di Penta M, Kim S (eds) 10th International Working Conference on Mining Software Repositories (MSR 2013), IEEE, San Francisco, CA, USA, pp 53–56
  • An et al. (2017) An L, Mlouki O, Khomh F, Antoniol G (2017) Stack Overflow: A Code Laundering Platform? In: Pinzger M, Bavota G, Marcus A (eds) 24th IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER 2017), IEEE Computer Society, Klagenfurt, Austria, pp 283–293
  • Badashian et al. (2014) Badashian AS, Esteki A, Gholipour A, Hindle A, Stroulia E (2014) Involvement, Contribution and Influence in GitHub and Stack Overflow. In: Ng J, Li J, Wong K (eds) 24th International Conference on Computer Science and Software Engineering (CASCON 2014), IBM / ACM, Markham, ON, Canada, pp 19–33
  • Baltes (2018a) Baltes S (2018a) SOTorrent: Reconstructing and Analyzing the Evolution of Stack Overflow Posts — Supplementary Material. URL http://doi.org/10.5281/zenodo.1201553
  • Baltes (2018b) Baltes S (2018b) sotorrent/db-scripts on GitHub. URL https://doi.org/10.5281/zenodo.1116346
  • Baltes (2018c) Baltes S (2018c) sotorrent/r-scripts on GitHub. URL https://doi.org/10.5281/zenodo.1048185
  • Baltes (2018d) Baltes S (2018d) sotorrent/so-clones on GitHub. URL https://doi.org/10.5281/zenodo.1472948
  • Baltes (2018e) Baltes S (2018e) sotorrent/so-edit-viz on GitHub. URL https://doi.org/10.5281/zenodo.1474203
  • Baltes (2018f) Baltes S (2018f) Usage and Attribution of Stack Overflow Code Snippets in GitHub Projects — Supplementary Material. URL https://doi.org/10.5281/zenodo.1148069
  • Baltes and Diehl (2018) Baltes S, Diehl S (2018) Usage and Attribution of Stack Overflow Code Snippets in GitHub Projects. Empirical Software Engineering Online First:1–37
  • Baltes and Dumani (2018a) Baltes S, Dumani L (2018a) SOTorrent Data Set Version 2018-02-16. URL http://doi.org/10.5281/zenodo.1196296
  • Baltes and Dumani (2018b) Baltes S, Dumani L (2018b) SOTorrent GitHub Page. URL https://github.com/sotorrent
  • Baltes and Dumani (2018c) Baltes S, Dumani L (2018c) sotorrent/metrics-comparison on GitHub. URL https://doi.org/10.5281/zenodo.1045823
  • Baltes and Dumani (2018d) Baltes S, Dumani L (2018d) sotorrent/so-posthistory-extractor on GitHub. URL https://doi.org/10.5281/zenodo.835046
  • Baltes et al. (2017a) Baltes S, Dumani L, Zeimetz T (2017a) Dataset with manually validated version histories of Stack Overflow posts. URL http://doi.org/10.5281/zenodo.884909
  • Baltes et al. (2017b) Baltes S, Kiefer R, Diehl S (2017b) Attribution required: Stack overflow code snippets in GitHub projects. In: Uchitel S, Orso A, Robillard MP (eds) 39th International Conference on Software Engineering (ICSE 2017), Companion Volume, IEEE Computer Society, Buenos Aires, Argentina, pp 161–163
  • Baltes et al. (2018) Baltes S, Dumani L, Treude C, Diehl S (2018) SOTorrent: Reconstructing and Analyzing the Evolution of Stack Overflow Posts. In: Zaidman A, Hill E, Kamei Y (eds) 15th International Conference on Mining Software Repositories (MSR 2018), ACM, Gothenburg, Sweden, pp 319–330
  • Bellon et al. (2007) Bellon S, Koschke R, Antoniol G, Krinke J, Merlo E (2007) Comparison and evaluation of clone detection tools. IEEE Transactions on Software Engineering 33(9):577–591
  • Burrows et al. (2007) Burrows S, Tahaghoghi SMM, Zobel J (2007) Efficient plagiarism detection for large code repositories. Software—Practice and Experience 37(2):151–176
  • Campbell and Treude (2017) Campbell BA, Treude C (2017) NLP2Code: Code Snippet Content Assist via Natural Language Tasks. In: Mei H, Zhang L, Zimmermann T (eds) 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME 2017), IEEE Computer Society, Shanghai, China, pp 628–632
  • Chapin et al. (2001) Chapin N, Hale JE, Khan KM, Ramil JF, Tan WG (2001) Types of software evolution and software maintenance. Journal of Software Maintenance 13(1):3–30
  • Chen et al. (2017) Chen C, Xing Z, Liu Y (2017) By the Community & For the Community: A Deep Learning Approach to Assist Collaborative Editing in Q&A Sites. Proceedings of the ACM on Human-Computer Interaction 1:32:1–32:21
  • Chicco (2017) Chicco D (2017) Ten quick tips for machine learning in computational biology. BioData Mining 10(1):35

  • Cohen (1988) Cohen J (1988) Statistical Power Analysis for the Behavioral Sciences, 2nd edn. Routledge, Mahwah, NJ, USA
  • Cohen (1992) Cohen J (1992) A power primer. Psychological bulletin 112(1):155
  • Duijn et al. (2015) Duijn M, Kucera A, Bacchelli A (2015) Quality Questions Need Quality Code: Classifying Code Fragments on Stack Overflow. In: Di Penta M, Pinzger M, Robbes R (eds) 12th Working Conference on Mining Software Repositories (MSR 2015), IEEE Computer Society, Florence, Italy, pp 410–413
  • Dumani and Baltes (2017) Dumani L, Baltes S (2017) sotorrent/so-posthistory-gt on GitHub. URL https://doi.org/10.5281/zenodo.1045935
  • Dumani and Baltes (2018) Dumani L, Baltes S (2018) sotorrent/posthistory-comparator-gt-cs on GitHub. URL https://doi.org/10.5281/zenodo.1474238
  • Duric and Gasevic (2013) Duric Z, Gasevic D (2013) A source code similarity system for plagiarism detection. The Computer Journal 56(1):70–86
  • Fischer et al. (2017) Fischer F, Böttinger K, Xiao H, Stransky C, Acar Y, Backes M, Fahl S (2017) Stack Overflow Considered Harmful? The Impact of Copy&Paste on Android Application Security. In: Butler KRB, Erlingsson Ú, Parno B (eds) 2017 IEEE Symposium on Security and Privacy (S&P 2017), IEEE Computer Society, San Jose, CA, USA, pp 121–136
  • German et al. (2009) German DM, Di Penta M, Gueheneuc YG, Antoniol G (2009) Code siblings: Technical and legal implications of copying code between applications. In: Godfrey MW, Whitehead J (eds) 6th International Working Conference on Mining Software Repositories (MSR 2009), IEEE Computer Society, Vancouver, BC, Canada, pp 81–90
  • Gharehyazie et al. (2017) Gharehyazie M, Ray B, Filkov V (2017) Some From Here, Some From There: Cross-Project Code Reuse in GitHub. In: Gonzalez-Barahona JM, Hindle A, Tan L (eds) 14th International Conference on Mining Software Repositories (MSR 2017), IEEE Computer Society, Buenos Aires, Argentina, pp 291–301
  • Gibbons et al. (1993) Gibbons RD, Hedeker DR, Davis JM (1993) Estimation of effect size from a series of experiments involving paired comparisons. Journal of Educational Statistics 18(3):271–279
  • Godfrey and German (2008) Godfrey MW, German DM (2008) The past, present, and future of software evolution. In: Muller H, Tilley S, Wong K (eds) Frontiers of Software Maintenance (FoSM 2008), IEEE, Beijing, China, pp 129–138
  • Google Cloud Platform (2018) Google Cloud Platform (2018) GitHub Data. URL https://cloud.google.com/bigquery/public-data/github
  • Gousios (2013) Gousios G (2013) The GHTorrent dataset and tool suite. In: Zimmermann T, Di Penta M, Kim S (eds) 10th International Working Conference on Mining Software Repositories (MSR 2013), IEEE, San Francisco, CA, USA, pp 233–236
  • Hinkle et al. (1979) Hinkle DE, Wiersma W, Jurs SG (1979) Applied statistics for the behavioral sciences. Rand McNally College Publishing, Skokie, IL, USA
  • Juergens et al. (2009) Juergens E, Deissenboeck F, Hummel B, Wagner S (2009) Do Code Clones Matter? In: Fickas S, Atlee JM, Inverardi P (eds) 31st International Conference on Software Engineering (ICSE 2009), IEEE Computer Society, Vancouver, BC, Canada, pp 485–495
  • Lancaster and Culwin (2004) Lancaster T, Culwin F (2004) A comparison of source code plagiarism detection engines. Computer Science Education 14(2):101–112
  • Lehman (1980) Lehman MM (1980) Programs, life cycles, and laws of software evolution. Proceedings of the IEEE 68(9):1060–1076
  • Manning et al. (2008) Manning CD, Raghavan P, Schutze H (2008) Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA
  • Martins et al. (2014) Martins VT, Fonte D, Henriques PR, Cruz Dd (2014) Plagiarism Detection: A Tool Survey and Comparison. In: Pereira MJV, Leal JP, Simoes A (eds) 3rd Symposium on Languages, Applications and Technologies (SLATE 2014), Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Bragança, Portugal, OpenAccess Series in Informatics (OASIcs), vol 38, pp 143–158
  • Matthews (1975) Matthews BW (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) – Protein Structure 405(2):442–451
  • Mens and Demeyer (2008) Mens T, Demeyer S (eds) (2008) Software Evolution. Springer, Berlin, Germany
  • Morrison and Murphy-Hill (2013) Morrison P, Murphy-Hill E (2013) Is programming knowledge related to age? An exploration of Stack Overflow. In: Zimmermann T, Di Penta M, Kim S (eds) 10th International Working Conference on Mining Software Repositories (MSR 2013), IEEE, San Francisco, CA, USA, pp 69–72
  • Nasehi et al. (2012) Nasehi SM, Sillito J, Maurer F, Burns C (2012) What makes a good code example? A study of programming Q&A in StackOverflow. In: Tonella P, Di Penta M, Maletic JI (eds) 28th IEEE International Conference on Software Maintenance (ICSM 2012), IEEE Computer Society, Trento, Italy, pp 25–34
  • Parnin et al. (2012) Parnin C, Treude C, Grammel L, Storey MA (2012) Crowd documentation: Exploring the coverage and the dynamics of API discussions on Stack Overflow. Georgia Institute of Technology, Technical Report
  • Ponzanelli et al. (2013) Ponzanelli L, Bacchelli A, Lanza M (2013) Seahawk: Stack Overflow in the IDE. In: Notkin D, Cheng BHC, Pohl K (eds) 35th International Conference on Software Engineering (ICSE 2013), IEEE Computer Society, San Francisco, CA, USA, pp 1295–1298
  • Ponzanelli et al. (2014) Ponzanelli L, Mocci A, Bacchelli A, Lanza M (2014) Understanding and classifying the quality of technical forum questions. In: Wong WE, McMillin B (eds) 14th International Conference on Quality Software (QSIC 2014), IEEE, Allen, TX, USA, pp 343–352
  • Powers (2011) Powers DM (2011) Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies 2(1):37–63
  • Roy et al. (2009) Roy CK, Cordy JR, Koschke R (2009) Comparison and evaluation of code clone detection techniques and tools: A qualitative approach. Science of Computer Programming 74(7):470–495
  • Schleimer et al. (2003) Schleimer S, Wilkerson DS, Aiken A (2003) Winnowing: Local algorithms for document fingerprinting. In: Halevy AY, Ives ZG, Doan A (eds) 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD 2003), ACM, San Diego, CA, USA, pp 76–85
  • Silvestri et al. (2015) Silvestri G, Yang J, Bozzon A, Tagarelli A (2015) Linking Accounts across Social Networks: The Case of StackOverflow, GitHub and Twitter. In: Armano G, Bozzon A, Giuliani A (eds) 1st International Workshop on Knowledge Discovery on the WEB (KDWeb 2015), CEUR-WS.org, Cagliari, Italy, CEUR Workshop Proceedings, pp 41–52
  • Spearman (1904) Spearman C (1904) The proof and measurement of association between two things. American Journal of Psychology 15(1):72–101
  • Stack Exchange Community Wiki (2018-02-27) Stack Exchange Community Wiki (2018-02-27) Database schema documentation for the public data dump and SEDE. URL https://meta.stackexchange.com/a/2678
  • Stack Exchange Inc (2017) Stack Exchange Inc (2017) Stack Exchange Data Dump 2017-12-01. URL https://archive.org/details/stackexchange/
  • Stack Exchange Inc (2018) Stack Exchange Inc (2018) Markdown help. URL https://stackoverflow.com/editing-help
  • Stack Overflow Meta (2018) Stack Overflow Meta (2018) How to handle code clones on Stack Overflow? URL https://meta.stackoverflow.com/q/375761
  • Subramanian and Holmes (2013) Subramanian S, Holmes R (2013) Making sense of online code snippets. In: Zimmermann T, Di Penta M, Kim S (eds) 10th International Working Conference on Mining Software Repositories (MSR 2013), IEEE, San Francisco, CA, USA, pp 85–88
  • Thummalapenta et al. (2010) Thummalapenta S, Cerulo L, Aversano L, Di Penta M (2010) An empirical study on the maintenance of source code clones. Empirical Software Engineering 15(1):1–34
  • Treude and Robillard (2016) Treude C, Robillard MP (2016) Augmenting API Documentation with Insights from Stack Overflow. In: Dillon L, Visser W, Williams L (eds) 38th International Conference on Software Engineering (ICSE 2016), ACM, Austin, TX, USA, pp 392–403
  • Treude and Robillard (2017) Treude C, Robillard MP (2017) Understanding Stack Overflow Code Fragments. In: Mei H, Zhang L, Zimmermann T (eds) 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME 2017), IEEE Computer Society, Shanghai, China, pp 509–513
  • Treude et al. (2011) Treude C, Barzilay O, Storey MAD (2011) How do programmers ask and answer questions on the web? In: Taylor RN, Gall HC, Medvidovic N (eds) 33rd International Conference on Software Engineering (ICSE 2011), ACM, Waikiki, Honolulu, pp 804–807
  • Vasilescu et al. (2012) Vasilescu B, Capiluppi A, Serebrenik A (2012) Gender, Representation and Online Participation: A Quantitative Study of StackOverflow. In: Aberer K, Flache A, Jager W, Liu L, Tang J, Gueret C (eds) 4th International Conference on Social Informatics (SocInfo 2012), Springer, Lausanne, Switzerland, Lecture Notes in Computer Science, pp 332–338
  • Vasilescu et al. (2013) Vasilescu B, Filkov V, Serebrenik A (2013) StackOverflow and GitHub: Associations between Software Development and Crowdsourced Knowledge. In: Chang LW, Srivastava J, Zhan J (eds) 2013 International Conference on Social Computing (SocialCom 2013), IEEE Computer Society, Washington, DC, USA, pp 188–195
  • Wang et al. (2013) Wang S, Lo D, Jiang L (2013) An empirical study on developer interactions in StackOverflow. In: Shin SY, Maldonado JC (eds) 28th Annual ACM Symposium on Applied Computing (SAC 2013), ACM, Coimbra, Portugal, pp 1019–1024
  • Wilcoxon (1945) Wilcoxon F (1945) Individual comparisons by ranking methods. Biometrics 1(6):80–83
  • Xia et al. (2017) Xia X, Bao L, Lo D, Kochhar PS, Hassan AE, Xing Z (2017) What do developers search for on the web? Empirical Software Engineering 22(6):3149–3185
  • Yang et al. (2016) Yang D, Hussain A, Lopes CV (2016) From Query to Usable Code: An Analysis of Stack Overflow Code Snippets. In: Kim M, Robbes R, Bird C (eds) 13th International Conference on Mining Software Repositories (MSR 2016), ACM, Austin, TX, USA, pp 391–402
  • Yang et al. (2017) Yang D, Martins P, Saini V, Lopes CV (2017) Stack Overflow in Github: Any Snippets There? In: Gonzalez-Barahona JM, Hindle A, Tan L (eds) 14th International Conference on Mining Software Repositories (MSR 2017), IEEE Computer Society, Buenos Aires, Argentina, pp 280–290
  • Yang et al. (2014) Yang J, Hauff C, Bozzon A, Houben GJ (2014) Asking the right question in collaborative Q&A systems. In: Ferres L, Rossi G, Almeida VAF, Herder E (eds) 25th ACM Conference on Hypertext and Social Media (HT 2014), ACM, Santiago, Chile, pp 179–189