Crowdsourcing the State of the Art(ifacts)

In any field, finding the "leading edge" of research is an ongoing challenge. Researchers cannot appease reviewers, and educators cannot teach to the leading edge of their field, if no one agrees on what the state of the art is. Using a novel crowdsourced "reuse graph" approach, we propose here a new method to learn this state of the art. Our reuse graphs are less effort to build and verify than other community monitoring methods (e.g., artifact tracks or citation-based searches). Based on a study of 170 papers from software engineering (SE) conferences in 2020, we have found over 1,600 instances of reuse; i.e., reuse is rampant in SE research. Prior pessimism about a lack of reuse in SE research may have been a result of using the wrong methods to measure the wrong things.


Introduction

According to Popper (Popper, 2014), the ideas we can most trust are those that have been most tried and most tested. For that reason, many of us are involved in the process called “Science” that produces trusted knowledge by sharing one’s ideas, and trying out and testing others’ ideas. Science and scientists form communities where people do each other the courtesy of curating, clarifying, critiquing and improving a large pool of ideas.

Prior to this study, the standard conclusion was that researchers in the field of software engineering rarely reuse research results (e.g., da Silva et al. reported that from 1994 to 2010, only 72 studies had been replicated by 96 new studies (da Silva et al., 2012)). If true, this is a significant problem since not knowing the state of the art complicates both research and graduate education.

We argue in this paper that, at least in the area of software engineering, this “reuse problem” is more apparent than real. We describe a successful experiment where teams of researchers from around the world read 170 recent (2020) conference papers from software engineering. This work generated the “reuse graph” of Figure 1. In that figure, each edge connects a paper to the prior work that it is (re-)using. As discussed below, when compared to other community monitoring methods (e.g., artifact tracks or bibliometric searches (Mathew et al., 2018; Baldassarre et al., 2019)), these reuse graphs are less effort to build and verify. For example, it took around 12 minutes per paper for our team from Hong Kong, Canada, the United States, Italy, Sweden, Finland, and Australia to apply this reuse graph methodology to software engineering. (That team included the authors of this paper plus Jacky Keung from City University (Hong Kong); Greg Gay from Chalmers University (Sweden); Burak Turhan from Oulu University (Finland); and Aldeida Aleti from Monash University (Australia). We gratefully acknowledge their work, and that of their graduate students. In particular, we call out the work of Afonso Fontes from Chalmers University (Sweden).)

Figure 1. From the web-site https://reuse-dept.org. The 1,635 arrows in this diagram connect reuser to reused. Blue dots denote the 714 sources found with a digital object identifier (a DOI); e.g., any paper from a peer-reviewed source. Red dots denote the 48 papers we found without DOIs (e.g., those from arxiv.org). Gray and green dots denote the 297 websites and 57 GitHub repositories (respectively) reused in this sample. The black squares pull out four examples where a paper has reused material from dozens of other sources. Data collection for this graph is on-going and, at the time of this writing, the data comes from 40% of the papers published in the 2020 technical programs of ICSE, ASE, FSE, ICSME, MSR, and ESEM.
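To make this structure concrete, the sketch below shows one way such a reuse graph could be represented and queried in a few lines of Python. The edge list, node kinds, and networkx calls are our illustrative assumptions (the entries are invented), not the actual data format used at reuse-dept.org.

```python
# Illustrative sketch of a reuse graph: directed edges run from the re-using
# paper to the re-used source (a DOI, an arXiv preprint, a website, or a
# GitHub repository). The entries below are invented placeholders.
import networkx as nx

edges = [
    # (reuser, reused, kind of reused node)
    ("doi:10.1145/paperA", "doi:10.1109/paperB",          "doi"),
    ("doi:10.1145/paperA", "https://github.com/x/tool",   "github"),
    ("doi:10.1145/paperC", "https://example.org/dataset", "website"),
]

G = nx.DiGraph()
for reuser, reused, kind in edges:
    G.add_node(reused, kind=kind)   # record what kind of artifact was reused
    G.add_edge(reuser, reused)      # arrow points from reuser to reused

# Heavily re-used sources are simply the nodes with many incoming edges.
most_reused = sorted(G.in_degree(), key=lambda kv: kv[1], reverse=True)
print(most_reused[:3])
```

With a representation like this, the "black square" papers of Figure 1 are just the nodes with unusually high out-degree, and the most-reused artifacts are the nodes with the highest in-degree.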

There are many other methods to map the structure of SE research, such as (a) manual or automatic citation searches or (b) “artifact evaluation committees” that aim to foster the generation and sharing of research products (for more on artifact evaluation committees, see below). Such studies can lag significantly behind current work. For example, our citation analysis of SE (Mathew et al., 2018) only covered papers up to 2016; the study itself was conducted in 2017 but not fully published until 2018. Given the enormous effort required for that work, we have vowed never to do it again. Reuse graphs, on the other hand, are faster to keep up to date since the work required of any one individual working on these graphs is minimal.

Another reason to favor reuse graphs is that they are community comprehensible, community verifiable, and community correctable. All the data used for our reuse graphs is community-collected. All the data can be audited at https://reuse-dept.org and if errors are detected, issue reports can be raised in our GitHub repository (and the error corrected). The same may not be true for studies based on citation servers run by professional bodies and for-profit organizations (e.g., see Table 1).

EXAMPLE #1: At the time of this writing, one of us has an entry in Google Scholar Metrics (software systems category) saying that their paper “How to” has 80 citations in the last five years at IEEE Transactions on Software Engineering. That link connects to some other paper, not written by any of us. Yet there is no “help” button at Google Scholar where this error can be reported. This is disappointing since we suspect that this mysterious “How to” paper refers to work that might be the most cited from IEEE TSE in the last five years. That would be a significant achievement, if we could document it (but, using Google Scholar, we cannot).
EXAMPLE #2: In our recent large-scale text-mining study of 30,000+ SE papers (Mathew et al., 2018), we found that papers can appear in one citation server but not in others. We also found examples where some papers had twenty times as many citations in one server as in another. Here again, we were unable to contact anyone working on those citation servers to fix those errors.
EXAMPLE #3: Sometimes it is possible to contact the owners of these citation sites, but even then they may not fix errors. For example, there is no accepted convention for how to typeset a hyphen, so different venues add zero or one space before or after it (and some even typeset the hyphen as two dashes). Hence, Zhou et al. found that (a) papers with hyphens in the title get reported as different papers in different venues, which means that (b) those papers get fewer citations (Zhou et al., 2021). Zhou et al. report that when they contacted the owners of these citation servers, rather than fix the errors, those owners started lobbying for the Zhou et al. paper not to be published.
Table 1. Examples of errors in citation servers.

What is the value of a verified, continually updated snapshot of some current research area? Once our reuse graph covers several years (and not just 2020 publications), we foresee several applications:

  1. Graduate students could direct their attention to research areas that are both very new (nodes from recent years) and very productive (nodes with an unusually large number of edges attached);

  2. The organizers of industrial and research conferences could select their keynote speakers from that space of new and productive artifacts.

  3. When applying for promotion or hiring, research faculty or industrial workers could document the impact of their work beyond papers, including tools, datasets, and innovative methods;

  4. Growth patterns might guide federal government funding priorities or departmental hiring plans.

  5. Venture capitalists could use these graphs to detect emergent technologies, perhaps even funding some of those.

  6. Conference organizers could check if their program committees have enough members from currently hot topics.

  7. Further, those same organizers could create new conference tracks and journals sections in order to service active research communities that are under-represented in current publication venues.

  8. Journal editors could find reviewers with relevant experience.

  9. Educators can use the graphs to guide their teaching plan.

Further to the last point, we are planning an immediate application of the Figure 1 graph in our Fall’21 graduate SE classes. There, we will tell students that understanding the current state of the art will be a challenge for the rest of their careers. But, using reuse graphs, it is possible for a community to find and maintain a shared understanding of that state of the art. To demonstrate this, in our Fall’21 classes we are leaving the second half of the lecture plan blank. We will let students find and define what cutting-edge techniques will be discussed there. To do so, their homework for the initial three weeks of class is to first learn this reuse graph approach by performing our standard “reuse graph 101” exercise (https://github.com/bhermann/DoR/blob/main/workflow/training.md); then, in week 2, read some papers to find their reuse (if any); and finally, in week 3, check someone else’s reuse findings from other papers.

Studying Reuse

In our reuse study, we targeted papers from the 2020 technical programs of six major international SE conferences: Software Engineering (ICSE), Automated Software Engineering (ASE), the Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), Software Maintenance and Evolution (ICSME), Mining Software Repositories (MSR), and Empirical Software Engineering and Measurement (ESEM). These conferences were selected using advice from (Mathew et al., 2018), but our vision is to expand; for example, by looking at all top-ranked SE conferences. GitHub issues were used to divide the hundreds of papers from those conferences into “work packets” of ten papers each. Reading teams were drawn from software engineering research groups around the globe: Hong Kong, Istanbul (Turkey), Victoria (Canada), Gothenburg (Sweden), Oulu (Finland), Melbourne (Australia), and Raleigh (USA). Team members would assign themselves work packets and then read the papers looking for the kinds of reuse enumerated below. Once a packet was completed, a second person (from any of our teams) would do the same and check for consistency. Fleiss Kappa statistics were then computed to track the level of reader disagreement. All interaction was done via the GitHub issue system (see Figure 2).

Figure 2. Controlling data collection for building the reuse graphs.

Teams were asked to record six kinds of reuse (a sketch of one possible record format follows this list):

  1. Most papers have to benchmark their new ideas against some prior recent state-of-the-art paper. That is, they reuse old papers as stepping stones towards new results.

  2. Statistical methods are also frequently reused. Here we do not mean “we use a two-tailed t-test” or some other decades-old, widely-used statistical method. Rather, we refer to recent papers that propose statistical guidance for the kinds of analysis seen in SE. Perhaps because this kind of guidance is very rare, such papers are exceedingly highly cited; e.g.

    • A 2008 paper Benchmarking Classification Models for Software Defect Prediction (Lessmann et al., 2008) currently has 1,178 citations;

    • A 2011 paper A practical guide for using statistical tests to assess randomized algorithms in software engineering (Arcuri and Briand, 2011) currently has 778 citations.

  3. Metrics and Method descriptions (which may be guidelines, with no tools);

  4. Data sets;

  5. Sanity checks (justification for why a particular approach works or is reasonable to avoid bad data);

  6. and, indeed, the software packages of the kind currently being reviewed by SE conference artifact evaluation committees (tools and replications).
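To make the recording task concrete, here is a minimal sketch of how a single observed instance of reuse could be captured as a structured record. The field names and Python types are illustrative assumptions on our part, not the actual GitHub issue template used in the DoR repository.

```python
# Illustrative only: one possible record for a single observed instance of
# reuse. Field names are assumptions, not the project's actual issue template.
from dataclasses import dataclass

REUSE_KINDS = (
    "stepping-stone paper", "statistical method", "metric/method description",
    "data set", "sanity check", "software package",
)

@dataclass
class ReuseRecord:
    reusing_paper: str   # e.g., DOI of the 2020 conference paper being read
    reused_item: str     # DOI, URL, or repository of the thing it reuses
    kind: str            # one of the six REUSE_KINDS above
    reader: str          # handle of the reader who recorded the observation

    def __post_init__(self):
        if self.kind not in REUSE_KINDS:
            raise ValueError(f"unknown reuse kind: {self.kind}")
```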

We can report that it is not difficult to read papers in order to detect these kinds of reuse:

  • It is fast to find the above six kinds of reuse. Our graduate students report that reading their first paper might take up to an hour. But after two or three papers, the median reading time drops to around 12 minutes (see Figure 3(a)).

  • When we compare the reuse reported by different readers, we get Figure 3(b). In our current results, the median Fleiss Kappa score (for reviewer agreement) is one; i.e., very good agreement (a sketch of how such scores can be computed appears after Figure 3).

Figure 3. Reading times (left); Fleiss Kappa agreement scores (center); years in which the reused works were created (right).
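As noted in the bullet above, here is a minimal sketch of how two readers’ labels for the same candidate reuse items could be compared with a Fleiss Kappa score. The label encoding and the statsmodels pipeline are our illustrative assumptions, not the project’s actual tooling.

```python
# Sketch (not the project's actual tooling): compare two readers' labels for
# the same work packet using Fleiss' kappa from statsmodels.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# One row per (paper, candidate reused item); the two columns are the readers.
# 0 = "no reuse", 1..6 = the six reuse kinds listed above (illustrative codes).
labels = np.array([
    [1, 1],   # both readers: stepping-stone baseline paper
    [4, 4],   # both readers: data set
    [6, 6],   # both readers: software package
    [2, 3],   # disagreement: statistical method vs. metric/method description
    [0, 0],   # both readers: no reuse found for this candidate
])

table, _ = aggregate_raters(labels)   # (n_items, n_categories) count matrix
print("Fleiss kappa:", round(fleiss_kappa(table), 3))
```

A score of 1 means the readers agreed on every item; in our data the median score across work packets is one.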

Of course, there are many other items being reused than the six listed above (see https://pasteboard.co/Ke4tKgO.png). It is an open question, worthy of future work, whether those other items can be collected in this way.

Related Work

Apart from software engineering, many other disciplines are actively engaged in artifact creation, sharing, and re-use. Artifacts are useful for building a culture of replication and reproducibility, already acknowledged as important in SE (Santos et al., 2021). Fields such as psychology have had many early results thrown into doubt because of a failure to replicate the original findings (Schimmack, 2020). Sharing research protocols and data allows for other research teams to conduct severe tests of the original studies (Mayo, 2018), strengthening (or rejecting) these initial findings. In medicine, drug companies are mandated to share the research protocols and outcomes of their drug trials, something that has become of vital recent importance (albeit not without challenges (DeVito et al., 2020)). In physics and astronomy, artifact sharing is so commonplace that large community infrastructures exist solely to ensure data sharing, not least because governments which fund these costly experiments insist on it.

Furthermore, software plays such a vital role in this enterprise that many fields have begun training and developing science-specific research software engineers, for example at the UK’s Software Sustainability Institute (https://www.software.ac.uk) or the USA’s NSF-funded Molecular Science Software Institute (https://molssi.org). Indeed, a founding impetus for the World Wide Web was the need for CERN (the European Organization for Nuclear Research) to facilitate knowledge sharing (Berners-Lee, 1990), a mission now continued by Zenodo, also operated by CERN. (Tim Berners-Lee himself has said, “Had the technology been proprietary, and in my total control, it would probably not have taken off.” Another argument for artifacts!)

In more theoretical areas of CS, pioneering use of preprint servers has enabled ‘reuse’ of proofs, essential to progress. In machine learning, replication is focused on stepping-stones, enabled by highly successful benchmarks such as ImageNet (Russakovsky et al., 2015). However, recent advances with extremely costly training regimens have called replicability into question (https://www.technologyreview.com/2020/11/12/1011944/artificial-intelligence-replication-crisis-science-big-tech-google-deepmind-facebook-openai/).

In the specific case of software engineering research, prior to this paper, there was little recorded and verified evidence of reuse. Many researchers have conducted citation studies that find links to highly cited papers (e.g. (Mathew et al., 2018)). As stated in our introduction, such studies can lag behind the latest results. Also, recalling Table 1, we have cause to doubt the conclusions from such citation studies.

What about Artifact Evaluation Committees?

Another practice that is becoming increasingly common is for conferences to run artifact evaluation committees. The authors of accepted conference papers submit software packages that, in theory, let others re-execute that work. These evaluation committees award “badges” as shown in Table 2.

Available: archived in a public repository with a long-term retention policy; a DOI must be provided.
Functional: artifacts are documented, consistent, complete, exercisable, and include appropriate evidence of verification and validation.
Reusable: only available to artifacts already qualifying for the Functional badge; must significantly exceed minimal functionality.
Reproduced: the results of the paper have been reproduced by a different team using the original artifact.
Replicated: the results of the paper have been replicated by a different team without the original artifact.
Table 2. Badges currently awarded at ACM conferences (Association for Computing Machinery, 2020). This table is for the ACM and analogous tables are used at other conferences.
Figure 4. Artifact evaluation committee sizes 2011-2019. From Hermann et al. (Hermann et al., 2020)

Artifact evaluation is something of a “growth industry” in the SE (and programming languages, PL) community. Figure 4 shows the increasing number of people evaluating artifacts from 2011 to 2019. As for more recent data: (a) at PLDI’20, 61 of 77 papers offered artifacts (Donaldson and Torlak, 2020), and at ECOOP’21, all 20 research papers offered software artifacts (Møller and Sridharan, 2021); (b) ASE’21 has a 60-person artifact evaluation track.

The question has to be asked: are all the people of Figure 4 making the best use of their time? Perhaps not. We note that most artifacts are assigned the badges requested by the authors. Given that, it might be safe to ask some of the personnel from Figure 4 to (e.g.) spend less time on evaluating conference artifacts and spend more time working on Figure 1.

Also, we suspect that the badges of Table 2 need refinement, since much time can be squandered on minor issues with little practical effect. For example, it can be hard to distinguish “functional” from “reusable” (in fact, some artifact evaluation committees just ignore the “functional” badge; e.g., see the artifact evaluation processes at ICSE’21, ICSME’21, and ASE’21).

Further, checking for “functional” and “reusable” requires downloading, then installing, then running the software. This can be a very long process (taking hours to days), especially for quirky research prototypes where (e.g.) the scripts have one-letter typos and/or the install instructions are missing small but crucial pieces of information.

But most importantly, it is not clear that the artifact evaluation process is creating reused artifacts. If we query ACM Portal for “software engineering” and “artifacts” in the range 2015 to 2020, we find that most of the recorded artifacts are not reused in replications or reproductions. (As of December 10, 2020, that search returns 2,535 SE papers with an artifact badge; of these, 43%, 30%, 20%, 5%, and 2% are available, functional, reusable, reproduced, and replicated artifacts, respectively.) Specifically, only one in twenty are reproduced and only one in fifty are replicated. One possible explanation for these results, which we can quickly discount, is that SE artifacts are a new idea that will take a while to catch on. If this were the case, we would expect that older SE artifacts are reproduced more, since they have been around for longer. But this is not the case: looking at Figure 3c, we see that most of the reused artifacts were created very recently.

Perhaps we need to change the definition of the badges and say an artifact is “reusable” if it is reused (and not before). Also, it might be useful to reflect more on what is actually being reused (as we have done, above).

Next Steps for Reuse Graphs

While Figure 1 is a promising start, to scale up from here we need to organize a larger reading population. Our goal is to analyze 200, 2,000, and 5,000 papers in 2021, 2022, and 2023 (respectively), by which time we would have covered most of the major SE conferences of the last five years. After that, our maintenance goal would be to read roughly 500 papers per year to keep up to date with the conferences. Based on Figure 3a, and assuming each paper is read by two people, that maintenance goal would be achievable by a team of twenty people working two hours per month on this task (a back-of-the-envelope check appears after Table 3). To organize this work, we have created the “ROSE initiative”; see Table 3.

Researchers that reuse the most from other papers will be applauded and awarded an “R-index” (reuse index).
Researchers that build the artifacts that are most reused will be applauded (even louder) and be awarded an “R+-index” indicating that they are the people producing the artifacts that are most used by the rest of the community.
In between each conference, the ROSE initiative will co-ordinate an international team of volunteers incrementally updating the SE reuse graph.
This reuse graph will be displayed at a publicly available web site (reuse-dept.org). Individual researchers can browse that site, check their entries, and propose corrections and extensions.
All reports of reuse will be double-checked, and disputed claims will then be triple-checked.
All the tools used to create that web site will be freely available for download. Hence, if the SE community does not like how we are running these reuse graphs, they can take all our code and data and do something else.
Also, researchers from other disciplines can take our tools and apply them to their own community.
Table 3. The ROSE initiative: Recognizing and Rewarding Open Science in Software Engineering: an international multi-conference workshop that will continually report updates to the SE reuse graphs.
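As mentioned above, here is a back-of-the-envelope check of the maintenance estimate. It simply restates the figures quoted earlier (roughly 500 papers per year, two readers per paper, a 12-minute median reading time from Figure 3a, and twenty volunteers at two hours per month); the calculation is a sketch, not a project commitment.

```python
# Sanity check of the maintenance estimate quoted in the text.
papers_per_year   = 500
readers_per_paper = 2
minutes_per_read  = 12    # median reading time from Figure 3a

hours_needed    = papers_per_year * readers_per_paper * minutes_per_read / 60  # = 200
hours_available = 20 * 2 * 12    # twenty people, two hours per month           # = 480

print(f"needed: {hours_needed:.0f} h/yr, available: {hours_available} h/yr")
assert hours_available >= hours_needed
```

Under those assumptions, roughly 200 reading hours per year are needed while about 480 volunteer hours are available, so the estimate leaves comfortable slack for coordination overhead and disputed-claim re-checks.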

If that work interests you, then there are many ways you can get involved:

  • If you are a researcher and wish to check that we have accurately recorded your contribution, please visit https://reuse-dept.org.

  • If you want to apply reuse graphs to your community, please use our tools at https://github.com/bhermann/DoR/.

  • If you are interested in joining this initiative and contributing to an up-to-the-minute snapshot of SE research, then please (a) take our how-to-read-for-reuse tutorial (https://github.com/bhermann/DoR/blob/main/workflow/training.md); and (b) visit the dashboard shown in Figure 2, find an issue with no one’s face on it, and assign yourself a task.

  • Better yet, if you are an educator teaching a graduate SE class, then get your students to do the three-week reading assignments shown in the introduction. As a result, students will join an international team exploring reuse in SE, which will keep them informed about the state of the art in SE for many years to come. As a side-effect, they will also see first-hand the benefit of open source tools that can be shared by teams working around the globe.

We see this effort as one part of the broader open science effort, in addition to helping the community identify the state of the art (e.g., patterns of growth in the reuse graph). Among the goals of open science are to increase confidence in published results, and to acknowledge that science produces more types of artifacts than just publications: researchers also produce method innovations, new datasets, and improved tools. If we take an agile view of SE science, then as researchers we should focus on generating these artifacts and rapidly securing critique, curation, and clarification from our peers and the public.

References

  • A. Arcuri and L. Briand (2011) A practical guide for using statistical tests to assess randomized algorithms in software engineering. In Proceedings of the 33rd International Conference on Software Engineering (ICSE ’11), New York, NY, USA, pp. 1–10.
  • Association for Computing Machinery (2020) Artifact Review and Badging. https://www.acm.org/publications/policies/artifact-review-and-badging-current. Accessed: 2020-12-08.
  • M. T. Baldassarre, D. Caivano, S. Romano, and G. Scanniello (2019) Software models for source code maintainability: a systematic literature review. In 2019 45th Euromicro Conference on Software Engineering and Advanced Applications (SEAA), pp. 252–259.
  • T. Berners-Lee (1990).
  • F. Q. B. da Silva, M. Suassuna, A. C. C. França, A. M. Grubb, T. B. Gouveia, C. V. F. Monteiro, and I. E. dos Santos (2012) Replication of empirical studies in software engineering research: a systematic mapping study. Empirical Software Engineering.
  • N. J. DeVito, S. Bacon, and B. Goldacre (2020) Compliance with legal requirement to report clinical trial results on ClinicalTrials.gov: a cohort study. The Lancet 395 (10221), pp. 361–369.
  • A. F. Donaldson and E. Torlak (Eds.) (2020) PLDI 2020: Proceedings of the 41st ACM SIGPLAN Conference on Programming Language Design and Implementation. Association for Computing Machinery, New York, NY, USA.
  • B. Hermann, S. Winter, and J. Siegmund (2020) Community expectations for research artifacts and evaluation processes. In Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE 2020), New York, NY, USA, pp. 469–480.
  • S. Lessmann, B. Baesens, C. Mues, and S. Pietsch (2008) Benchmarking classification models for software defect prediction: a proposed framework and novel findings. IEEE Transactions on Software Engineering 34 (4), pp. 485–496.
  • G. Mathew, A. Agrawal, and T. Menzies (2018) Finding trends in software research. IEEE Transactions on Software Engineering, pp. 1–1.
  • D. G. Mayo (2018) Statistical inference as severe testing: how to get beyond the statistics wars. Cambridge University Press.
  • A. Møller and M. Sridharan (Eds.) (2021) Front Matter, Table of Contents, Preface, Conference Organization. In 35th European Conference on Object-Oriented Programming (ECOOP 2021), Leibniz International Proceedings in Informatics (LIPIcs), Vol. 194, Dagstuhl, Germany, pp. 0:i–0:xxiv.
  • K. Popper (2014) Conjectures and refutations: the growth of scientific knowledge. Routledge.
  • O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
  • A. Santos, S. Vegas, M. Oivo, and N. Juristo (2021) Comparing the results of replications in software engineering. Empirical Software Engineering 26 (2).
  • U. Schimmack (2020) A meta-psychological perspective on the decade of replication failures in social psychology. Canadian Psychology/Psychologie canadienne 61 (4), pp. 364–376.
  • Z. Q. Zhou, T. H. Tse, and M. Witheridge (2021) Metamorphic robustness testing: exposing hidden defects in citation statistics and journal impact factors. IEEE Transactions on Software Engineering 47 (6), pp. 1164–1183.