Software and their Dependencies in Research Citation Graphs

06/14/2019 ∙ by Stephan Druskat, et al.

Software is essential for a lot of research, but it is not featured in citation graphs which have the potential to assign credit for software contributions. This is due to a traditionalistic focus on textual research products. In this paper, I propose an updated model for citation graphs that include software. This model takes into account intrinsic properties of software, and requirements for a robust system of software citation. I also give an outlook on future work to implement transitive credit, which is at the core of a fair system of academic citation which accounts for software on par with other research products.


1 Introduction

Research does not happen in a vacuum; it builds on other research. Research products cite and reference the research they build upon. This facilitates understanding of research in the context of its predecessors and precedents; it enables traceability of outcomes, both over the past to understand how present knowledge was established, and into the future to understand how present knowledge is being used. Citation has also been used as a metric for calculating the impact of single research communications, and summary impact of research publications such as journals. Furthermore, cited references can be used to trace the evolution of research in a specific area, or the impact specific ideas and advances have had within a research field. Citation can also provide credit for the researchers who have contributed to a piece of research. And finally, citation metrics can establish summary authority of research groups and institutions within a field.

Traditionally, citation has covered textual research communications such as papers and monographs. This is changing, as modern research increasingly relies on digital research outcomes and products, such as software and data, which are increasingly acknowledged as valid research products in their own right. To enable the different uses of citation for them, they, too, must be recorded as references, and duly cited.

Together, different research products, and the reference relations between them, form an abstract “research citation graph” (RCG). To open research citation graphs up for effective analyses, e.g., bibliographic and epistemological studies, back- and forward-tracking of research knowledge, analyses of contributing parties, and through them the provision of academic acknowledgment and credit, they need to be made accessible to - and retrievable through - computational methods.

This is particularly important for the citation of software, on which modern research is increasingly based, but for which a robust citation system is not yet fully established. Unless the importance of software to research can be quantified through automated analyses of the research citation graph, which in turn informs the current system of academic credit, the activities necessary to create and maintain research software cannot be stimulated for lack of incentive. A system of software citation that implements the provision of credit for direct and indirect contributions to research software, on the other hand, can help drive a culture change towards better recognition of the essential role of software for research, and of the people that create and maintain research software. Better recognition, in turn, will likely lead to increased activity around research software, which will lead to better software-enabled research [1].

In this paper, I describe research citation graphs and how software and their dependencies feature in them. In the process, I lay out specific properties of software in contrast to other research products, and how they affect potential instantiations of a research citation graph. Finally, I describe future work to model software-specific sub-graphs of the research citation graph to support the provision of credit for direct and indirect contributions to research software.

2 Research Citation Graphs

Research citation graphs record references between research products (i.e., papers, books, software, data, etc.), and can be used to analyze, e.g., the context and predecessors of research, or the influence a researcher or institution has had on research via a research product. Influence metrics in turn are used to provide academic credit.

In an initial iteration based on a traditional understanding of research products as papers or monographs, a research citation graph (RCG) can be modeled as a directed graph $G = (V, E)$, where $V$ is a set of vertices (or “nodes”), and $E$ is a set of ordered pairs of nodes, called directed edges. The nodes in $V$ represent three different types of objects: research products, researchers, and institutions that researchers are affiliated with. The edges in $E$ at this stage represent three different types of relations between nodes: reference (cite) to another research product, authorship (auth) of a research product, and affiliation (affil) of an author with an institution.
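The graph model described above can be sketched in a few lines of Python. This is a minimal illustration, not an implementation from the paper: the class name `RCG`, its methods, and all node identifiers (`paper_a`, `alice`, `uni_x`, etc.) are hypothetical. It assumes that auth edges point from a product to its author and affil edges from an author to an institution.

```python
from dataclasses import dataclass, field

# The three node types and three edge types named in the text.
NODE_TYPES = {"product", "researcher", "institution"}
EDGE_TYPES = {"cite", "auth", "affil"}


@dataclass
class RCG:
    """A directed graph with typed nodes and typed edges (hypothetical sketch)."""
    nodes: dict = field(default_factory=dict)  # node id -> node type
    edges: set = field(default_factory=set)    # (source, target, edge type)

    def add_node(self, node_id, node_type):
        if node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {node_type}")
        self.nodes[node_id] = node_type

    def add_edge(self, source, target, edge_type):
        if edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {edge_type}")
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        self.edges.add((source, target, edge_type))

    def out_edges(self, node_id, edge_type):
        """Targets of outgoing edges of the given type, sorted for stable output."""
        return sorted(t for s, t, e in self.edges if s == node_id and e == edge_type)

    def in_edges(self, node_id, edge_type):
        """Sources of incoming edges of the given type, sorted for stable output."""
        return sorted(s for s, t, e in self.edges if t == node_id and e == edge_type)


# A paper that references two other papers (all identifiers are made up).
rcg = RCG()
for product in ("paper_a", "paper_b", "paper_c"):
    rcg.add_node(product, "product")
for person in ("alice", "bob"):
    rcg.add_node(person, "researcher")
rcg.add_node("uni_x", "institution")

rcg.add_edge("paper_a", "paper_b", "cite")
rcg.add_edge("paper_a", "paper_c", "cite")
rcg.add_edge("paper_a", "alice", "auth")
rcg.add_edge("paper_b", "bob", "auth")
rcg.add_edge("alice", "uni_x", "affil")
rcg.add_edge("bob", "uni_x", "affil")
```

Storing edges as typed triples keeps the sketch close to the definition as a set of ordered pairs, with the relation type as an added label.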

An example of a simple RCG model is given in Figure 1. It models the outgoing relations of a single research product, which references two other products. Even at this level of simplicity, the RCG enables different analyses:

Fig. 1: Simplified RCG for a paper which references two other papers. Nodes represent research products, authors, and affiliated institutions. Edges: cite = “citation” relation, auth = “authored by” relation, affil = “affiliated with” relation.

  1. References to other research products can be traced by traversing the RCG, starting from the citing product and only following outgoing edges of type cite.

  2. Citations can be tracked by traversing the RCG from either of the cited products, and following incoming edges of type cite.

  3. Self-citation can be analyzed by finding author nodes with more than one incoming edge of type auth in the RCG, and checking whether the sources of those edges are connected directly by an edge of type cite.

  4. Influence of an institution on the research represented in the RCG can be analyzed by finding institution nodes, and counting their incoming edges of type affil.
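The four analyses listed above can be sketched as simple traversals over a list of typed edges. The following is an illustrative sketch under stated assumptions, not the paper's method: the edge list, the function names, and all identifiers are hypothetical, and auth edges are assumed to point from product to author, affil edges from author to institution.

```python
from collections import deque

# Hypothetical RCG as (source, target, edge type) triples.
EDGES = [
    ("paper_a", "paper_b", "cite"),
    ("paper_a", "paper_c", "cite"),
    ("paper_a", "alice", "auth"),
    ("paper_a", "bob", "auth"),
    ("paper_b", "alice", "auth"),
    ("paper_c", "carol", "auth"),
    ("alice", "uni_x", "affil"),
    ("bob", "uni_x", "affil"),
    ("carol", "uni_y", "affil"),
]


def successors(edges, node, edge_type):
    return [t for s, t, e in edges if s == node and e == edge_type]


def predecessors(edges, node, edge_type):
    return [s for s, t, e in edges if t == node and e == edge_type]


def references_of(edges, product):
    """Analysis 1: products reachable by following outgoing cite edges."""
    seen, queue = set(), deque([product])
    while queue:
        current = queue.popleft()
        for target in successors(edges, current, "cite"):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


def cited_by(edges, product):
    """Analysis 2: products with a cite edge pointing at the given product."""
    return set(predecessors(edges, product, "cite"))


def self_citations(edges, author):
    """Analysis 3: cite edges whose source and target share the given author."""
    products = predecessors(edges, author, "auth")
    cites = {(s, t) for s, t, e in edges if e == "cite"}
    return {(a, b) for a in products for b in products if (a, b) in cites}


def affiliation_count(edges, institution):
    """Analysis 4: number of incoming affil edges of an institution node."""
    return len(predecessors(edges, institution, "affil"))
```

In this toy graph, `self_citations(EDGES, "alice")` finds the edge from `paper_a` to `paper_b`, because both products have an auth edge to the same author node.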

More generally, an RCG helps understand what other research products a specific product relied on (“back-tracking”), or has led to (“forward-tracking”), in order to build on this understanding in research, or conduct evaluations or measurements.

For most analyses, a specific node in the RCG will act as the entry point for the analysis. Both back-tracking and forward-tracking from this node are simple, provided suitable metadata: Both consist of traversal of the RCG through the incoming and/or outgoing edges of a node.

2.1 Challenges for the Retrieval of Research Citation Graphs

Automation of analyses such as those mentioned above requires automated retrieval of RCGs. Automated retrieval relies on the existence of machine-readable and machine-actionable metadata for each of the objects represented in an RCG. To retrieve an RCG, software must be able to (a) retrieve the metadata for each element in the graph, (b) identify unique elements, and (c) identify and follow relations between nodes.

For textual research products such as papers and monographs, the citation-relevant metadata is identified and established, and often exists in machine-readable form. Additionally, the widespread allocation of unique identifiers for digital products, such as DOIs, makes it theoretically possible to identify and follow reference relations to other products. Unfortunately, a lot of citation data for research products such as papers is not openly accessible, or not accessible via a unified interface. Related research therefore exists on automated extraction of citation data [2], and on the provision of an open repository of citation data [3].

For software, the citation and metadata situation is different. Although the principles of software citation have been established [4], there is not yet a commonly acknowledged practice of software citation which is based on available metadata.

This starts with the lack of a common publication practice for software. The publication process for textual research products is well-established, and involves discrete steps: text creation, submission, and peer-review, and, on acceptance of a text, typesetting, compilation and completion of citation-relevant metadata, and the assignment of an identifier.

No such process is yet in place for software, where peer-review of code is considered a best practice but not enforced, and sufficient metadata is not standardly provided or elicited. In general, the software creation process is usually iterative. Software may be versioned and “released”, but not published in the academic sense, as a standardized way of providing the public with a uniquely identifiable version of the software intended for use and citation. An authoritative set of citation metadata is not reliably compiled at any stage. Different approaches to alleviate this situation include

  • the creation of software journals, which apply the established workflow for papers to software, e.g., the Journal of Open Source Software [5], or the Journal of Open Research Software [6];

  • technical solutions allowing creators of software to automatically publish releases of their software to a data repository, which provides a DOI and landing page for the software publication, e.g., the Zenodo repository automating archival of software releases on GitHub (guides.github.com/activities/citable-code/);

  • reliance on software archives harvesting open source code repositories, and providing unique identifiers for artifacts, e.g., Software Heritage [7].

Of these options, only software journals seem to provide a reliable way of generating curated, correct metadata suitable for the citation of software papers published in them. On the other hand, these metadata only include the primary reference to the described software itself, and do not provide reference links [8], so reference relations between research product nodes cannot be retrieved (cf. section 3.1). All of the approaches include assignment of unique identifiers. Both source code repository links to archival repositories, and archives harvesting source code repositories, potentially provide resolvable reference links, although not natively: The reference links will have to be provided in citation metadata in the origin source code repository and made accessible.

Even if software is published in a way that makes citation possible, it is often still not cited at all [9], or not cited according to the software citation principles, i.e., like any other research product [4, 10]. Also, software hardly ever publicizes its own references as citation data.

The lack of unique and machine-actionable identifiers is a problem for all types of research product. This obviously applies to historical research products. It still applies to the many researchers and institutions that have not been allocated an ISO 27729 International Standard Name Identifier, such as an ORCID iD (support.orcid.org/hc/en-us/articles/360006897674). And it applies to the majority of software, which is not allocated an identifier during publication. This is either because formal publication was not deemed a necessary part of the release workflow, as is probably the case for most software projects not aimed at research, or because a common good practice for research software publication is not yet established.

In short, the lack of complete and machine-actionable metadata for research products currently impedes the retrieval of complete research citation graphs. To alleviate this, a robust system for citation must be implemented. Additionally, a change in practice towards rigorous documentation of citation-relevant information in machine-actionable metadata is needed.

Progress regarding the implementation of robust software citation, and initiating the culture change necessary to establish an accepted software citation practice, is made in research community initiatives such as the FORCE11 Software Citation Implementation Working Group (www.force11.org/group/software-citation-implementation-working-group). A format for software citation metadata files [11], and a software metadata exchange format [12] have already been, and continue to be, developed. If applied widely, these metadata formats, together with a better publication practice for software, can enable retrieval of more complete RCGs. These RCGs will be suitable to record software and other digital products, and thus reflect the current state of research practice more accurately.

In order to leverage the above-mentioned optimized RCGs for academic credit, the citation metadata that the RCG builds on also needs to record references from software to other software, i.e., software dependencies. A “dependency” of a software is a software component to which that software exhibits a degree of coupling. Phrased differently, a software relies on functions of another software, its dependency, without which it will not function as intended by its authors. In practice, this usually means that the software calls functions from that component, or uses the component’s API in another way, e.g., by inheriting from its classes or interfaces. Dependencies can take different forms, such as libraries, code fragments, or algorithms. The defining quality of a dependency is that it is not part of the original, directly contributed, source code of a software. Therefore, functions defined in one file that are called from functions defined in another file are part of a dependency iff the first file is not part of the same codebase as the second. This may include that the two files have different authors. Hence it can be said that original source code and dependency source code form the common codebase of a software research product. At runtime, dependencies are treated as part of the same “software object”, as execution paths transcend boundaries between a software and its dependencies. This poses the following question: When source codes by different authors form the common codebase for a software product, who are the authors of a publication of that product? Dependencies therefore also influence the concept of software as a research product, and challenge the current system of providing academic credit, along with other software-intrinsic properties.

In the following sections, I will therefore discuss the concept of software as a research product (section 3.1), and outline how to model RCGs (a) for the inclusion of software based on software-intrinsic properties (section 3.3), and (b) for support of a fair system to provide academic credit for software (section 3.2.2).

One of the challenges that cannot be solved by better metadata alone is the complexity of retrieving RCGs for forward-tracking. Given optimal metadata, retrieving a back-tracking RCG is straightforward: the metadata for each object that goes into the set of RCG nodes (research products, researchers, institutions) must be retrieved exactly once. Likewise, identifying and following links to other relevant objects, where these links go into the set of RCG edges, must be done only once.

Retrieving an RCG for forward-tracking, on the other hand, can be much more complex, and must be supported by forward-citing data such as the data collected by Crossref through Cited-by Linking (www.crossref.org/services/cited-by/), as forward-citing data is not part of the default citation metadata for a research product.
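The asymmetry between back-tracking and forward-tracking can be illustrated with a small sketch. It assumes a hypothetical in-memory metadata store (`METADATA`) in which each record lists only its own outgoing references, standing in for whatever metadata source is actually available. The function names and identifiers are invented for illustration.

```python
from collections import deque

# Hypothetical metadata records: each product only knows what it references.
METADATA = {
    "paper_a": {"references": ["paper_b", "software_s"]},
    "paper_b": {"references": ["software_s"]},
    "software_s": {"references": []},
    "paper_d": {"references": ["paper_a"]},
}

fetch_log = []  # records every metadata lookup, to show each happens once


def fetch(node):
    """Stand-in for a metadata lookup; a real system would query a service."""
    fetch_log.append(node)
    return METADATA[node]


def retrieve_backward(start):
    """Back-tracking: follow references, fetching each record exactly once."""
    fetched, edges, queue = set(), set(), deque([start])
    while queue:
        node = queue.popleft()
        if node in fetched:
            continue  # already retrieved; do not fetch again
        fetched.add(node)
        for ref in fetch(node)["references"]:
            edges.add((node, ref))
            queue.append(ref)
    return fetched, edges


def build_cited_by_index(metadata):
    """Forward-tracking needs an inverted index over ALL known records."""
    index = {}
    for product, record in metadata.items():
        for ref in record["references"]:
            index.setdefault(ref, set()).add(product)
    return index


backward_nodes, backward_edges = retrieve_backward("paper_a")
cited_by_index = build_cited_by_index(METADATA)
```

Back-tracking from `paper_a` never discovers `paper_d`, because nothing in the metadata of `paper_a` points to it. Only the inverted index, which requires access to every record or to a forward-citing data service, reveals that `paper_d` cites `paper_a`.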

3 Software and their Dependencies in Research Citation Graphs

The first iteration of a model for research citation graphs presented in section 2 is not optimal in at least two ways: (1) For all research products, it does not map all relations that can exist between its nodes. It also does not make explicit the metadata objects needed for software citation. Furthermore, being based on a traditional notion of authorship, it fails to acknowledge other contributions to research products, at least in its terminology. (2) It is not suitable for the inclusion of software, or the provision of academic credit for software. The reasons for this lie in the specific properties of software as compared to other research products, and in the state of acceptance of software as a valid research product.

3.1 Software as a Research Product in Software Citation

Research software, i.e., software that embeds research knowledge, implements algorithms, models, and research methods, and enables research, presents a significant, and increasingly vital, intellectual contribution to research. It should therefore be treated as a research product on par with papers and monographs. In terms of citation, this is reflected in the first software citation principle, “Importance”:

Software should be considered a legitimate and citable product of research. Software citations should be accorded the same importance in the scholarly record as citations of other research products, such as publications and data; they should be included in the metadata of the citing work, for example in the reference list of a journal article, and should not be omitted or separated. Software should be cited on the same basis as any other research product such as a paper or a book, that is, authors should cite the appropriate set of software products just as they cite the appropriate set of papers. [4, p. 1]

Adherence to this principle allows authors of research software to receive academic credit for their contributions to a research product. This, in turn, may help further the careers of software contributors, and incentivize the development of new, or improvement of existing, research software, which enables more and better research, cf. [1, p. 3].

Contributions to a research product can be active or passive, and direct or indirect. In direct active contributions to textual research products, contributors influence the contents and form of the product directly through the contribution of text, analyses, ideas, etc. In the same type of contribution to software, contributions can take the form of source code, code comments, documentation, architectural design, API design, UI design, tests, code reviews, bug reports, etc. With direct passive contributions, a textual research product uses another product or parts thereof, by building on it, refuting it, refining its analyses, contextualizing its findings, etc. Direct passive contributions to a research software product are its dependencies. Indirect contributions to a research product are direct or indirect contributions to passive contributions to that product. Indirect software contributions to software are the latter’s transitive dependencies.

In academia, direct active contributions are recognized through authorship, and direct passive contributions through citation. Indirect contributions are not formally recognized in current practice, but they can be discovered in RCGs.
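For software, the distinction between direct passive contributions (dependencies) and indirect contributions (transitive dependencies) can be sketched as a reachability computation over a dependency map. The map `DEPENDS_ON` and all package names below are hypothetical. As a simplifying assumption, a component that is both a direct and a transitive dependency is counted as direct only.

```python
# Hypothetical dependency map: software -> its direct dependencies.
DEPENDS_ON = {
    "analysis_tool": ["stats_lib", "plot_lib"],
    "stats_lib": ["array_lib"],
    "plot_lib": ["array_lib", "image_lib"],
    "array_lib": [],
    "image_lib": [],
}


def direct_dependencies(graph, software):
    """Direct passive contributions: the software's own dependencies."""
    return set(graph.get(software, []))


def transitive_dependencies(graph, software):
    """Indirect contributions: dependencies reached only via other dependencies."""
    direct = direct_dependencies(graph, software)
    seen = set()
    stack = list(direct)
    while stack:
        current = stack.pop()
        for dep in graph.get(current, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    # Assumption: anything already direct is not reported again as indirect.
    return seen - direct - {software}


direct = direct_dependencies(DEPENDS_ON, "analysis_tool")
indirect = transitive_dependencies(DEPENDS_ON, "analysis_tool")
```

Here `array_lib` is an indirect contribution to `analysis_tool` through both `stats_lib` and `plot_lib`, which is exactly the kind of relation that current citation practice does not formally recognize.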

3.1.1 Dependencies as Contributions to Research Products

Recent efforts in software citation, such as the work of the FORCE11 Software Citation Implementation Working Group, address the implementation of citation practices around software as direct passive contributions, mostly to text publications, and the realization of academic credit for software authorship. In this paper, I aim to contribute a focus on dependencies, in addition to that on software products, as direct passive and indirect contributions to research software, and how they may be recognized as indirect contributions to other research products, and be included in research citation graphs.

3.2 Principles of and Requirements for Software Citation

Treating research software the same as other research products implies that research software must provide reliable metadata to enable correct citation in other products. It must also cite, i.e., provide correct references to, the indirect contributions, similar to a list of references in a paper. What makes metadata reliable, and references correct, is defined in the software citation principles. For the purposes of citation discussed here – i.e., use software for a paper, use software in/with new software, get credit for software development, cf. [4, p. 6] – the relevant principles are specificity, unique identification, and attribution and credit. A more detailed resolution of these principles explains the differences between software as a research product, and other research products.

3.2.1 Specificity and Unique Identification

To understand how specificity and unique identification should be applied to software for the purposes of citation, we must define the meaning of “software” as a product. Academic papers as products are perceived as single, static objects. They are available in a “final version”, i.e., the version that has been accepted for publication, and has been typeset, and published together with its metadata. These single public artifacts exist in a single state, which is used in research, i.e., read and cited. This is notwithstanding the fact that this single state can be interpreted in different ways, which is of no consequence for citation purposes. The single state of a paper can be identified through a single unique identifier such as a DOI. In contrast, software is dynamic with respect to its use, and to its public artifacts.

Defined as a “set of instructions that direct a computer to do a specific task” [13, p. 2], software is “functionally active” [14, p. 2], i.e., it performs functions on data. Regarding use, a software may have different states, and execute along different paths at runtime. The actual (final) states and execution paths depend on the configuration, user interaction, and perhaps also the data that is being processed. States and execution paths therefore arguably define the “product” that is used to perform a specific software task. It is the task of future research to solve the complex problem of unique identification of a software “product” that takes into account states and execution paths. Research on reproducibility, e.g., [15], suggests that one approach to solve this is to record configurations and processed data sets along with the software that has been used to conduct the research for a publication, and provide them in virtual machines or containers. This approach will still need to solve the existing issues in software citation and credit, as creating another layer of “product” (here: containers or virtual machines) may in fact obfuscate the objects that make up a product.

The software citation principles acknowledge that “information such as configurations and platform issues are also needed” in addition to primary product metadata to achieve reproducibility [4, p. 3]. This places the specification of the exact runtime software used for a research task outside of the scope of software citation. While this counteracts a more general principle of specificity, it is also argued in [4, p. 1] that “software identification should be as specific as necessary […]” rather than as specific as possible. As this paper focuses on aspects of reference, attribution and credit in software citation rather than reproducibility, as does [4], arguably, levels of specificity beyond versions or commits will not be further considered here.

In terms of public artifacts, the definition of “software” as a citable product is equally complex. Software development processes produce artifacts at different stages. When software is developed in the open, using a public repository for version control, for example, each commit (or “revision”) to the version control system produces a public artifact. Commits may also be tagged as “versions” or “releases” of a software, but both version or release identifiers and commits may not be permanent. Build processes may also produce binary artifacts that are published. Build products may contain different “phenotypes” of the software for different target operating systems. Releases may consist of collections of source code files, of single or multiple different binary artifacts, or a mixture of both (as in GitHub releases, cf. help.github.com/en/articles/about-releases). Finally, “software” may also refer to the “concept” of a software, rather than a specific software artifact. “Microsoft Excel”, for example, refers to the concept of a spreadsheet application rather than to a single version, commit, or artifact. Some archival platforms such as Zenodo provide unique identifiers for concepts, which work as a parent for the realizations of the concept in versions of a software.

In terms of artifacts, the software citation principles suggest that when possible, a specific version of the source code should be cited via a DOI, but they also acknowledge that there is a need for a way to cite the concept of a software [4, p. 13]. In fact, this also applies to papers, which may be published both as a preprint and as a “full” publication. Some preprint platforms, such as arXiv (arxiv.org/) also support versions of preprints.

There is a difference in relationships between the versions and publications of textual research products, and those of software products. Versions of preprints and “full” text publications may or may not differ in content: A preprint may have the same content as the accepted publication, including metadata apart from the unique identifier, and differ only in typesetting, layout, and the existence of copyright or other intellectual property information. There may also be a difference in content between the only or latest version of a preprint and a “full” publication. Additionally, a “full” publication may be assumed to be more trustworthy or of higher quality than a preprint, as it has undergone peer-review. In contrast, a publication of a software will in most cases represent a specific version of the source code of this software. Also, current software publication practice does not assume that the software has undergone a peer-review process.

There is also a difference between text and software research products, in that while the concept of a text product can be said to exist across preprints and publications, it will likely not be cited, even in bibliographic or bibliometric research. Software concepts, on the other hand, may be cited, e.g., to provide a framework to understand the development of a software.

3.2.2 Attribution and Credit

Citation should attribute contributions to research products to all respective contributors. It should also provide academic credit to all contributors to a research product. There is increasing acknowledgment of the fact that direct contributions to research products can take different forms, other than text production, and that citation should represent different contributions. This is the case for all types of research product, and specifically for software, where creditable contributions include not just the writing of source code, but also contributions to the architecture, design, documentation, engineering, management, verification, validation, repair, maintenance, etc., of a software. A robust citation system for software should reflect these contributions, and they should be recorded in the citation metadata. The same is true for other types of research product, where the diversity of roles participating in the production should be better reflected. One initiative that works on providing a taxonomy of contribution roles for research products is CASRAI CRediT [16]. It is a task for future research to investigate whether the CRediT taxonomy covers all contribution roles for research software, and to improve the taxonomy if necessary.

There is one further significant difference between software and other research products, with respect to their respective relationship to the direct passive contributions, and indirect contributions, that they build upon. Direct passive contributions to a paper, for example, are recognized through citation and inclusion in a list of cited references in the paper. They do not become part of the paper itself, and the relation between papers and their references remains one of reference. Indirect contributions to a paper are direct and indirect contributions to the cited references of that paper, and therefore have an indirect reference relation to the paper. They are not formally recognized in the paper at all.

The direct passive contributions to a software include its dependencies, and possibly other research products, e.g., publications that describe algorithms, models or methods implemented in the software. According to this contribution type, all of these contributions should be cited in the software. Reference lists and in-text citations are not usually a constituent of research software under the same formal criteria that apply to text publications. Instead, references to these contributions can be recorded in reference maps in machine-actionable citation metadata files. The equivalent of in-text citations to text publications in software is an inclusion of the reference in close proximity to the places in the source code where it is used, e.g., in a comment of a function; in-text citations to other software, i.e., dependencies, are essentially realized as calls of the API of these dependencies.

However, while direct passive text contributions to a software have the same reference relation to the product using them as they would to a paper, the same is not true for dependencies. Instead, dependencies become part of the research product at build time or runtime at the latest, and have a part-of relation to the software product. In fact, the same is true for transitive dependencies. The higher degree of coupling in part-of relations, as compared to reference relations, supports the argument that transitive dependencies should also be recorded in the credit map of a software. The question for both direct and transitive dependencies is: Should they be recognized through authorship or citation? This seems to depend on whether citation looks at software as a static object, i.e., the software’s source code, where the contribution of dependencies and transitive dependencies is clearly passive, or a dynamic object, i.e., the software at runtime, where the contribution is arguably direct through a part-of relationship. The software citation principles suggest that software citation should generally address the source code of software, which in the case of dependencies is a passive contribution. Under this assumption, dependencies should be cited.

The question can also be addressed from another angle: the definition of authorship. The International Committee of Medical Journal Editors suggests that authorship should be based on four criteria [17]:

  • Substantial contributions to the conception or design of the work; or the acquisition, analysis, or interpretation of data for the work; AND

  • Drafting the work or revising it critically for important intellectual content; AND

  • Final approval of the version to be published; AND

  • Agreement to be accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved.

Dependencies arguably make a substantial contribution to a software, as they are essential for the software to work as intended. While authors of dependencies may not draft or critically revise the source code of the depending software, they do so for their own source code, which – if we assume a direct contribution at runtime at least – is a part of the work. A software will never be finally approved directly by the authors of a dependency. Even if the license of the dependency permits use of it in another software, the “version to be published” is the software that uses a dependency, whether the latter is regarded as direct or indirect contribution. Also, authors of dependencies may be accountable for aspects of their own work, the dependency, although especially open source licenses tend to proactively exclude accountability for use of a software by third parties. In any case, they will never be accountable for all aspects of the software using their dependency, whether in source code form or at runtime. Authorship of research software products for authors of its dependencies is therefore categorically not an option. Instead, dependencies should be cited in a software product, which enables the provision of credit for their authors.

Providing credit for software and their dependencies

As mentioned above, software that contributes to a research product should be cited, and a reference to it provided, in the research product. In textual research products, the latter can be achieved by listing software in a list of references. The reference must include all necessary metadata to identify and access the software used to produce the research product. This necessary metadata includes [4, p. 6]:

  • the unique identifier for the software,

  • the name of the software,

  • the authors of the software,

  • the version number of the software,

  • the release date of the version of the software,

  • the location where the software can be accessed.

In software research products, the same metadata must be provided for the software’s dependencies. This metadata should be provided in a machine-actionable metadata file, such as a CITATION.cff file in the Citation File Format [11]. This file should also hold the primary citation metadata for the software product itself.
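As an illustration of the metadata listed above, the following is a minimal sketch of a reference record that also carries the same metadata for its dependencies. It is not the Citation File Format schema: the class name `SoftwareReference`, its field names, the `missing_fields` helper, and all example values are assumptions made for this sketch.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SoftwareReference:
    """Citation metadata for one software version (hypothetical structure)."""
    identifier: str      # unique identifier, e.g. a DOI
    name: str
    authors: List[str]
    version: str
    release_date: str    # ISO 8601 date string
    location: str        # where the software can be accessed
    dependencies: List["SoftwareReference"] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of required metadata fields that are empty."""
        required = ["identifier", "name", "authors",
                    "version", "release_date", "location"]
        return [name for name in required if not getattr(self, name)]


# Hypothetical example data; none of these names or identifiers are real.
dep = SoftwareReference(
    identifier="doi:10.0000/example.dep",
    name="example-dependency",
    authors=["A. Author"],
    version="1.2.0",
    release_date="2019-01-15",
    location="https://example.org/example-dependency",
)
tool = SoftwareReference(
    identifier="doi:10.0000/example.tool",
    name="example-tool",
    authors=["B. Builder"],
    version="0.3.1",
    release_date="2019-04-01",
    location="",  # deliberately left empty to show the completeness check
    dependencies=[dep],
)

print(tool.missing_fields())
print([d.name for d in tool.dependencies])
```

A check such as `missing_fields` could be run before a release to make sure that the reference for the software, and for each of its dependencies, is complete enough to identify and access the cited version.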

When software is cited in a textual research product, the provision of credit for that software works the same way as for other research products, given the acknowledgment that software is a valid research product. Credit can for example be assigned based on the number of citations a software has accumulated. However, this acknowledgment is not yet universally present in the academic system. Therefore, software is not yet reliably cited, and the contribution of software to research remains largely hidden. In other words, software is rarely recorded in research citation graphs, and therefore does not generate credit for its creators and contributors. A consequence of this is a lack of incentive to produce software for and in research. Given the increasing acknowledgment that research today can often not be conducted without software, this lack of incentive has the potential to drastically impede research endeavors.

A more robust system is transitive credit [1]. It records the contributions of digital products such as software to research and makes it possible to track the usage of these products, which in turn makes it possible to attribute contributions of, and assign credit for, software. The idea of transitive credit is to assign fractional credit to all contributions to a research product, thus creating a credit map for that research product. The credit map for a research product then feeds into the credit maps of all research products that use it. This way, the contributors to a product can also be given credit for the products that use it.

Applied to software, this makes it possible to record software contributions to research that were hitherto hidden, including dependencies which may not have been originally intended for use in research. Transitive credit also makes it possible to quantify the contribution of software and their creators and contributors to research in general, and bring the influence and impact of software on research to light.

In practice, fractional credit for direct contributions to a research product can be recorded in the machine-actionable citation metadata for that research product, cf. [18]. To do this, textual reference formats must be updated to record fractional credit, as must citation metadata formats for software and other digital products.

[Figure 2: graph diagram. Nodes carry [meta] labels; edges are labeled cite, impl, eng, test, realize, and precede.]
Fig. 2: Updated RCG, kept incomplete for reasons of legibility.

Initially, it may be helpful for software citation metadata to also record indirect contributions, i.e., transitive dependencies, and their respective fractional credit. This can support a culture change towards better recognition of software in research, and would help to increase the visibility of software that is currently hidden and may not originally have been created for use in research, as well as the visibility of its creators. Recording complete transitive credit graphs for software dependencies in a software research product also enables forward-tracking in research citation graphs, which is not yet easily possible for software research products. Future work towards this is described below in Section 4.

At a later stage, once software products provide machine-actionable references to their direct dependencies, transitive dependencies may no longer have to be recorded in the metadata for a software product, but can instead be retrieved from an RCG that includes that software product.

The following section describes how the discussed properties of software as a research product, and the resulting requirements for software citation, can be modeled in research citation graphs.

3.3 Optimizing the Research Citation Graph Model for Software

The properties and principles discussed in Section 3.1 inform the requirements for research citation graphs that can model software citation as follows.

  1. A suitable RCG model must include model elements to represent the necessary metadata for citation of research products, including software.

  2. RCGs must be able to include both versions of a research product, and product concepts. Versions here can be software or preprint versions identified by a version identifier, or source code commits identified by a commit hash, revision number, or similar. RCGs must also define relations between versions and concepts.

  3. For relations between versions such as preprints and formal publications where an overarching concept is not identifiable via a unique identifier such as a concept DOI, the model can provide edges that express the relation between versions. Alternatively, this can be left to uses of the RCG where these relations are extracted from the versions’ metadata, e.g., via release dates.

  4. RCGs must be able to record fractional credit values for all contributions to a research product.

  5. An RCG model should implicitly or explicitly include contribution types beyond the traditional concept of authorship to make sure that all contributions can be credited. Also, multiple contributions by the same contributor to the same research product should be possible.

Concerning 1: Metadata objects for the different objects in the RCG can be modeled as labels on the respective RCG elements (represented by [meta] labels in Figure 2). The metadata objects for research products contain the credit map for the product, i.e., both the primary citation data for the product itself, and the reference and credit map for the cited products which constitute contributions to the product. The metadata objects for researchers and institutions contain the metadata needed to uniquely identify the respective person or institution, including a machine-actionable unique identifier for the person or institution.

Concerning 2: Versions and concepts are both modeled as research product nodes. The edges between versions and concepts are modeled as realization relations, where versions are realizations of a concept (realize). Different analyses can choose to include either one or more of the versions of a product (e.g., a single software version for a reproduction of a research task, preprints and formal publication of a paper for a bibliographic study), or a concept (e.g., to calculate summary attribution and credit over all versions).

Concerning 3: Version nodes can be connected by order relations, representing one version as the predecessor of another version (precede).

Concerning 4: Fractional credit of a research product for another research product can be modeled as a label on edges representing citation relations or contribution relations.

Concerning 5: Edges representing authorship relations (auth) should be diversified to represent the different possible contribution types to research products. The labels for these edges should come from a machine-actionable taxonomy which includes all relevant contribution types to research products. It should be possible to have two contribution edges between a person and a research product to represent multiple contributions of different types by the same person.
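The five modeling decisions above can be sketched as a small labeled directed graph. This is only one possible encoding under stated assumptions, not a reference implementation of the model: the class names `Node`, `Edge`, and `RCG`, the edge directions, the helper methods, and all example nodes and credit values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Relations that are not contributions by a person or institution.
STRUCTURAL = {"cite", "realize", "precede"}


@dataclass
class Node:
    node_id: str
    kind: str  # "product", "person", or "institution"
    meta: Dict[str, str] = field(default_factory=dict)  # the [meta] label


@dataclass
class Edge:
    source: str
    target: str
    relation: str  # "cite", "realize", "precede", or a contribution type
    credit: Optional[float] = None  # fractional credit label, if any


@dataclass
class RCG:
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: List[Edge] = field(default_factory=list)

    def add_node(self, node_id: str, kind: str, **meta: str) -> None:
        self.nodes[node_id] = Node(node_id, kind, dict(meta))

    def add_edge(self, source: str, target: str, relation: str,
                 credit: Optional[float] = None) -> None:
        for end in (source, target):
            if end not in self.nodes:
                raise KeyError(f"unknown node: {end}")
        self.edges.append(Edge(source, target, relation, credit))

    def versions_of(self, concept_id: str) -> List[str]:
        """Versions are the sources of realize edges pointing at a concept."""
        return [e.source for e in self.edges
                if e.relation == "realize" and e.target == concept_id]

    def contributions(self, person_id: str, product_id: str) -> List[Edge]:
        """All typed contribution edges between one person and one product."""
        return [e for e in self.edges
                if e.source == person_id and e.target == product_id
                and e.relation not in STRUCTURAL]


# Hypothetical example graph; identifiers and credit values are invented.
g = RCG()
g.add_node("tool", "product", name="example-tool")  # software concept
g.add_node("tool-1.0", "product", version="1.0")
g.add_node("tool-1.1", "product", version="1.1")
g.add_node("lib-2.0", "product", version="2.0")
g.add_node("dev", "person", name="D. Developer")

g.add_edge("tool-1.0", "tool", "realize")
g.add_edge("tool-1.1", "tool", "realize")
g.add_edge("tool-1.0", "tool-1.1", "precede")
g.add_edge("dev", "tool-1.1", "eng", credit=0.25)
g.add_edge("dev", "tool-1.1", "impl", credit=0.5)
g.add_edge("tool-1.1", "lib-2.0", "cite", credit=0.25)

print(g.versions_of("tool"))
print([(e.relation, e.credit) for e in g.contributions("dev", "tool-1.1")])
```

Keeping edges in a plain list permits two contribution edges between the same person and product, which is what requirement 5 asks for; a dictionary keyed by node pairs would not.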

An example for the updated model is presented in Figure 2. and are papers, is a preprint of . , , , , , are versions of a software, is a software concept. and are both software versions that realize , and precedes . is the engineer and implementer of . For the engineering contribution, is credited with fractional credit for , for the implementation contribution with fractional credit, adding up to a total fractional credit of . is the tester of , and is credited for this work with fractional credit for . Together, and hold fractional credit for . cites three research products, of which the direct dependency takes fractional credit for , and the direct dependency takes fractional credit. The paper , which describes a novel algorithm implemented in a smaller function of takes fractional credit for . and are both credited with fractional credit for their implementation contributions to , and the direct dependency of , takes fractional credit. Traversing the graph, transitive credit for the direct and indirect contributions to a research product can now be calculated.

as a transitive dependency of , for example, takes credit for , which in turn takes credit for , so the indirect contribution of to is , or . The fractional credit for the direct and transitive dependencies of , as well as for itself, for can be calculated in the same way. takes fractional credit. takes fractional credit. takes fractional credit. Similarly for the contributors to, e.g., : takes fractional credit for , takes fractional credit, etc.
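The traversal described above can be sketched as follows. The credit maps in this sketch are hypothetical and do not reproduce the values from Figure 2. The function `transitive_credit` and the names in `CREDIT_MAPS` are assumptions made for illustration. The only rule taken from the text is that indirect credit is the product of the fractional credits along a path.

```python
from collections import defaultdict
from typing import Dict

# Hypothetical credit maps: for each research product, the fractional credit
# taken by each direct contributor (a person or another research product).
CREDIT_MAPS: Dict[str, Dict[str, float]] = {
    "paper": {"author": 0.75, "tool": 0.25},
    "tool": {"dev": 0.5, "lib": 0.5},
    "lib": {"maintainer": 1.0},
}


def transitive_credit(product: str,
                      maps: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Return direct and indirect fractional credit for `product`.

    Credit along a path is the product of the fractions on that path;
    credit reaching the same contributor over several paths is summed.
    """
    totals: Dict[str, float] = defaultdict(float)

    def walk(node: str, weight: float, path: frozenset) -> None:
        # Contributors without a credit map of their own (persons) are leaves.
        for contributor, fraction in maps.get(node, {}).items():
            if contributor in path:
                raise ValueError(f"cyclic credit relation at {contributor}")
            share = weight * fraction
            totals[contributor] += share
            walk(contributor, share, path | {contributor})

    walk(product, 1.0, frozenset({product}))
    return dict(totals)


credit = transitive_credit("paper", CREDIT_MAPS)
for name, value in sorted(credit.items()):
    print(f"{name}: {value}")
```

In this example, the transitive dependency `lib` takes 0.5 of `tool`, which takes 0.25 of `paper`, so its indirect contribution to `paper` is 0.5 × 0.25 = 0.125. The credit assigned to persons (`author`, `dev`, `maintainer`) sums to 1.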

One of the main challenges in the implementation of a transitive credit system is the determination of appropriate fractional credit values for single contributions. Contributions may be potentially or effectively non-quantifiable. Additionally, personal or political factors can skew values. Dependencies are somewhat immune to the latter effect, as their contributions to a software research product are quantifiable to a certain degree.

4 Future work

In future work, I will aim to quantify the creditable impact of dependencies on a software product through static and dynamic analyses of the software and its dependencies. To this end, dependency trees can be resolved through mining software repositories, which is also a valid method to retrieve part of the metadata needed for identifying dependencies for citation purposes. Through static analyses such as the extraction of call graphs, the frequency of calls to dependencies and transitive dependencies in a software product can be determined. The frequency metric can subsequently be refined with complexity metrics for software such as cyclomatic, branching, data flow, and decisional complexity. Further work in this context includes the extension of the Citation File Format [11] with a schema to represent fractional credit metrics, and the implementation of refined contribution roles.
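To make the call frequency idea more concrete, the following is a deliberately naive sketch that counts call sites per imported top-level package in Python source code, using the standard library `ast` module. It is not the analysis planned above: it only inspects a single file, ignores transitive dependencies and complexity metrics, and treats relative call frequency as a stand-in for a credit metric. The function names and the sample source are assumptions.

```python
import ast
from collections import Counter
from typing import Dict


def count_dependency_calls(source: str) -> Dict[str, int]:
    """Count call sites whose callee is rooted in an imported name."""
    tree = ast.parse(source)

    # Map each locally bound name to the top-level package it comes from.
    imported: Dict[str, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                top = alias.name.split(".")[0]
                imported[alias.asname or top] = top
        elif (isinstance(node, ast.ImportFrom)
              and node.module and node.level == 0):
            top = node.module.split(".")[0]
            for alias in node.names:
                imported[alias.asname or alias.name] = top

    counts: Counter = Counter()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            func = node.func
            # Follow attribute chains such as pkg.sub.func to the root name.
            while isinstance(func, ast.Attribute):
                func = func.value
            if isinstance(func, ast.Name) and func.id in imported:
                counts[imported[func.id]] += 1
    return dict(counts)


def relative_frequency(counts: Dict[str, int]) -> Dict[str, float]:
    """Normalize call counts so that they sum to 1."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()} if total else {}


# Hypothetical sample source; it is only parsed, never executed.
SAMPLE = """
import numpy as np
from json import dumps
import os.path

def run(values):
    arr = np.array(values)
    total = np.sum(arr)
    print(os.path.basename("x/y"))
    return dumps({"total": float(total)})
"""

calls = count_dependency_calls(SAMPLE)
print(calls)
print(relative_frequency(calls))
```

Calls to built-ins such as `print` and `float` are not counted, because they are not rooted in an imported name. A realistic analysis would also have to resolve re-exports, dynamic dispatch, and calls made inside the dependencies themselves.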

5 Conclusion

In this paper, I have introduced a directed graph model to instantiate research citation graphs. As a model based on traditional abstractions of citation graphs is not suitable to represent the impact that software has on modern research – due to its focus on textual research products – I have proposed an updated model that takes into account properties of software, and requirements for a more robust system of software citation. The model includes the representation of transitive credit, which can help incentivize contributions to research through software, as it brings to light hitherto hidden contributions, and enables the provision of credit for the respective contributors. Finally, I have given an outlook on future work to quantify transitive credit metrics for software dependencies through software engineering methods.

Acknowledgments

I would like to thank the discussion group on citation and rewarding systems at the Workshop on Sustainable Software Sustainability 2019 on 25 April 2019 in The Hague, Netherlands (www.software.ac.uk/wosss19). Discussion within the group has helped me to better understand the context for embedding software in the citation graph of research. The members of this group were: Neil Chue Hong, Gerard Coen, James Davenport, Leyla Garcia, Robert Haines, Catherine Jones, Adriaan Klinkenberg, Rachael Kotarski, Mateusz Kuzak, Brett Olivier, Esther Plomp, Shoaib Sufi, Stephanie van de Sandt, and Bettine van Willigen.

References

  • [1] D. Katz, “Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products,” Journal of Open Research Software, vol. 2, no. 1, p. e20, Jul. 2014. [Online]. Available: http://openresearchsoftware.metajnl.com/articles/10.5334/jors.be/
  • [2] Z. Nasar, S. W. Jaffry, and M. K. Malik, “Information extraction from scientific articles: A survey,” Scientometrics, vol. 117, no. 3, pp. 1931–1990, Dec. 2018. [Online]. Available: https://doi.org/10.1007/s11192-018-2921-5
  • [3] D. Shotton, A. Dutton, S. Peroni, and T. Gray, “Setting our bibliographic references free: Towards open citation data,” Journal of Documentation, vol. 71, no. 2, pp. 253–277, Feb. 2015. [Online]. Available: https://www.emeraldinsight.com/doi/full/10.1108/JD-12-2013-0166
  • [4] A. M. Smith, D. S. Katz, K. E. Niemeyer, and FORCE11 Software Citation Working Group, “Software citation principles,” PeerJ Computer Science, vol. 2, no. e86, 2016. [Online]. Available: https://doi.org/10.7717/peerj-cs.86
  • [5] A. M. Smith, K. E. Niemeyer, D. S. Katz, L. A. Barba, G. Githinji, M. Gymrek, K. D. Huff, C. R. Madan, A. C. Mayes, K. M. Moerman, P. Prins, K. Ram, A. Rokem, T. K. Teal, R. V. Guimera, and J. T. Vanderplas, “Journal of Open Source Software (JOSS): Design and first-year review,” PeerJ Computer Science, vol. 4, p. e147, Feb. 2018. [Online]. Available: https://peerj.com/articles/cs-147
  • [6] “Journal of Open Research Software.” [Online]. Available: http://openresearchsoftware.metajnl.com/
  • [7] J.-F. Abramatic, R. Di Cosmo, and S. Zacchiroli, “Building the Universal Archive of Source Code,” Commun. ACM, vol. 61, no. 10, pp. 29–31, Sep. 2018. [Online]. Available: http://doi.acm.org/10.1145/3183558
  • [8] B. Plale, M. Jones, and D. Thain, “Software in Science: A Report of Outcomes of the 2014 National Science Foundation Software Infrastructure for Sustained Innovation (SI2) Meeting,” p. 8. [Online]. Available: http://ccl.cse.nd.edu/research/papers/software-nsf-2014.pdf
  • [9] H. Park and D. Wolfram, “Research software citation in the Data Citation Index: Current practices and implications for research software sharing and reuse,” Journal of Informetrics, vol. 13, no. 2, pp. 574–582, May 2019. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1751157718302372
  • [10] J. Howison and J. Bullard, “Software in the scientific literature: Problems with seeing, finding, and using software mentioned in the biology literature,” Journal of the Association for Information Science and Technology, vol. 67, no. 9, pp. 2137–2155, 2016. [Online]. Available: http://dx.doi.org/10.1002/asi.23538
  • [11] S. Druskat, N. Chue Hong, R. Haines, and J. Baker, “Citation File Format (CFF) - Specifications,” Aug. 2018. [Online]. Available: https://doi.org/10.5281/zenodo.1003149
  • [12] M. B. Jones, C. Boettiger, A. C. Mayes, A. Smith, P. Slaughter, K. Niemeyer, Y. Gil, M. Fenner, K. Nowak, M. Hahnel, L. Coy, A. Allen, M. Crosas, A. Sands, N. C. Hong, P. Cruse, D. Katz, and C. Goble, CodeMeta: An Exchange Schema for Software Metadata. Version 2.0, 2017, published: KNB Data Repository. [Online]. Available: https://doi.org/10.5063/schema/codemeta-2.0
  • [13] W. H. K. Chun, “On Software, or the Persistence of Visual Knowledge,” Grey Room, vol. 18, pp. 26–51, Jan. 2005. [Online]. Available: https://www.mitpressjournals.org/doi/10.1162/1526381043320741
  • [14] D. S. Katz, K. E. Niemeyer, A. M. Smith, W. L. Anderson, C. Boettiger, K. Hinsen, R. Hooft, M. Hucka, A. Lee, F. Löffler, T. Pollard, and F. Rios, “Software vs. data in the context of citation,” PeerJ Inc., Tech. Rep. e2630v1, Dec. 2016. [Online]. Available: https://peerj.com/preprints/2630
  • [15] S. R. Piccolo and M. B. Frampton, “Tools and techniques for computational reproducibility,” GigaScience, vol. 5, Jul. 2016. [Online]. Available: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4940747/
  • [16] A. Brand, L. Allen, M. Altman, M. Hlava, and J. Scott, “Beyond authorship: Attribution, contribution, collaboration, and credit,” Learned Publishing, vol. 28, no. 2, pp. 151–155, 2015. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1087/20150211
  • [17] International Committee of Medical Journal Editors, “ICMJE Recommendations: Defining the Role of Authors and Contributors.” [Online]. Available: http://www.icmje.org/recommendations/browse/roles-and-responsibilities/defining-the-role-of-authors-and-contributors.html#two
  • [18] D. S. Katz and A. M. Smith, “Transitive Credit and JSON-LD,” Journal of Open Research Software, vol. 3, no. 1, p. e7, Nov. 2015. [Online]. Available: http://openresearchsoftware.metajnl.com/articles/10.5334/jors.by/