Aspirations and Practice of Model Documentation: Moving the Needle with Nudging and Traceability

Machine learning models have been widely developed, released, and adopted in numerous applications. Meanwhile, the documentation practice for machine learning models often falls short of established practices for traditional software components, which impedes model accountability, inadvertently abets inappropriate use or misuse of models, and may trigger negative social impact. Recently, model cards, a template for documenting machine learning models, have attracted notable attention, but their impact on the practice of model documentation is unclear. In this work, we examine publicly available model cards and other similar documentation. Our analysis reveals a substantial gap between the suggestions made in the original model card work and the content of actual documentation. Motivated by this observation and by literature on fields such as software documentation, interaction design, and traceability, we propose a set of design guidelines that aim to support the documentation practice for machine learning models, including (1) the collocation of the documentation environment with the coding environment, (2) nudging the consideration of model card sections during model development, and (3) documentation derived from and traced to the source. We designed a prototype tool named DocML following these guidelines to support model development in computational notebooks. A lab study reveals the benefit of our tool in shifting the behavior of data scientists towards documentation quality and accountability.


1. Introduction

Documentation serves as the primary resource to understand, evaluate, and use software components when adopting them in developing applications (Aghajani et al., 2020). Machine-learned (ML) models are increasingly integrated as components into software systems and would benefit from similar documentation. Stakeholders, including data scientists, AI engineers, domain experts, and software engineers, rely on the documentation to answer questions such as what use cases are supported, what performance to expect, and what ethical and safety impacts to consider once the model is deployed in applications at scale. Nevertheless, ML models shared as pretrained models or services are often poorly documented. Still, they are reused in many applications, sometimes in applications for which they were not designed. Serious issues related to the misuse of ML models have been observed in various applications, notably in face recognition and tracking (Buolamwini and Gebru, 2018), recruitment (Dastin, 2018), and criminal risk assessment (Dressel and Farid, 2018), leading to broader concerns about their impact on social justice.

As a reaction to observed problems in model reuse and accountability, significant efforts towards model (Mitchell et al., 2019; Arnold et al., 2019) and data (Gebru et al., 2018) reporting have been proposed. This work has attracted considerable attention; for example, the model cards paper published at FAccT 2019 (Mitchell et al., 2019) is heavily cited, and the popular model hosting site Hugging Face has adopted the term model card in its user interface and guides its users to provide documentation (Face, 2021b). Yet, it is largely unknown how these proposals have impacted the practice of documenting ML models and datasets.

In this work, we take a close look at the gap between the recommendations for model documentation and its practice. While past work has already shown that documentation during model development is often limited, such as few markdown cells in public notebooks and often missing README files in notebook repositories on GitHub (Rule et al., 2018; Pimentel et al., 2019), we focus on external documentation of reusable models and ML services. We start by investigating the quality of the documentation for publicly available ML models, in particular, how it meets the model cards proposal. Our study reveals that despite adopting the model card terminology, most model development teams fail to provide meaningful and comprehensive documentation that can support scrutiny for model adoption. Certain aspects of documentation are especially limited, such as information regarding the data collection process, explanations of evaluation statistics, and concrete ethical measurements, across different contexts of model development (i.e., open source and proprietary).

Given the observed low adoption rate of documentation proposals and the frequently low documentation quality even when concrete proposals are followed, we explore how we could improve adoption and encourage good documentation practices. To this end, it is important to look at the context and process in which such documentation is created and updated, the incentives and tools involved, and how to provide more effective documentation support. Building on the rich work on software documentation, digital nudging, and traceability, we propose a set of design guidelines for ML documentation tools. As a prototype, we design and implement a documentation tool for data scientists, named DocML, that supports creating and updating the documentation of models while they develop the models within a computational notebook. A user study with 16 participants demonstrates that using DocML changes behavior towards better quality and maintainability of model documentation as well as greater consideration of ethical impacts.

Our work makes the following contributions to understanding and supporting ML model documentation practice:

  1. Results from our empirical study that delineate the current practice of public model documentation and highlight a clear gap between the information needed by model users and the information provided by model developers;

  2. A rubric for evaluating documentation, developed and used in our study, which can be adopted by model developers and users as a documentation guideline or quality assessment tool;

  3. Concrete design guidelines for improving documentation practice and model accountability that can be adopted across the machine learning development pipeline, drawn from previous work on technology design, software documentation quality, and digital nudging;

  4. A JupyterLab extension, DocML, implementing these design guidelines to support data scientists in creating more comprehensive and accurate documentation, evaluated in a user study.

The artifacts created in this study, including the rubric, the list of assessed model cards, the user study design, and the DocML source code, are shared as supplementary materials (https://anonymous.4open.science/r/MLDoc-0D03/) to support future investigation on improving ML documentation.

2. Background and Related Work

2.1. Software Documentation

The study of software documentation concentrates mostly on documentation properties and quality (Aghajani et al., 2019; Prana et al., 2019; Arya et al., 2020), documentation search and discovery (Stylos and Myers, 2006; Stolee et al., 2016), content augmentation (Treude and Robillard, 2016; Robillard et al., 2017), and documentation creation support (Moreno et al., 2013; Head et al., 2020; Hellman et al., 2021). Among these, our work is most relevant to the previous inquiries into documentation quality and interactive documentation creation support.

The understanding of software documentation quality is mainly acquired by investigating how software stakeholders consider and discuss documentation-related problems. Through a survey study with 323 software professionals at IBM, Uddin and Robillard identified ten common API documentation problems that manifested in practice. Among those problems, incompleteness and ambiguity were considered the most frequent and the most severe in impact. A recent study by Aghajani et al. examined documentation problems through a data-driven approach (Aghajani et al., 2019). They developed a taxonomy of documentation issues by analyzing documentation-related discussions from various sources, including mailing lists, Stack Overflow discussions, issue repositories, and pull requests. Completeness and up-to-dateness are frequently mentioned. Together with correctness, they constitute the category of issues concerning documentation content. At the same time, issues beyond documentation content are extremely common and have profound implications for documentation writers, readers, and maintainers, such as how the content of the documentation is written and organized (e.g., documentation usability and maintenance), the documentation process (e.g., traceability and contribution), and documentation tools (e.g., bugs, support, and improper tool usage). This taxonomy illustrates the complexity of documentation concerns and calls for a consideration of documentation within the context of software development.

A large body of work on supporting documentation creation aims to automate content generation. Examples include generating progress-related documentation such as commit messages (Cortés-Coy et al., 2014) and summarizing methods (McBurney and McMillan, 2014), files (Moreno et al., 2013), or even whole projects (Hellman et al., 2021). This work normally relies on heuristic or machine learning methods to extract or synthesize content from the input artifacts and inevitably introduces both errors and biases. Since our work emphasizes documentation quality, a more interactive approach, in which the documentation writer has full control over the content being created, is more relevant. The work by Head et al. on the interaction aspect for tutorial writers is an example (Head et al., 2020). We adopt a similar approach. Motivated by our empirical observations of model documentation quality, we seek to address human needs during model documentation through interaction design.

2.2. Documentation for Machine Learning

The interest in ML documentation mostly concerns the data and the model (Boyd, 2021), concentrating on the content of the documentation. On the data side, works such as the Dataset Nutrition Label (Holland et al., 2018) and Datasheets (Gebru et al., 2018) propose standards for documenting information related to the data, such as provenance, statistics, and accountable parties. On the model side, the model cards work proposed by Mitchell et al. has attracted the most interest from both academia and industry. The model cards proposal aims to standardize ML model reporting, in particular with respect to ethical considerations. Many companies, such as Google, Nvidia, and Salesforce, have adopted the model cards proposal for their public models. Hugging Face, the open source ML model hosting platform, also encourages its users to adopt model cards when sharing their models. Our work provides a more critical view of current model card adoption. We set out to understand the concrete impact of the model cards proposal on documentation quality.

Previous work on the documentation process is relatively scarce. One such effort is FactSheets, proposed by Richards et al., which asks the primary question of how to generate high-quality AI documentation (Richards et al., 2020). FactSheets is a methodology with which stakeholders can instrument their documentation generation at each stage of the AI development life cycle by asking concrete questions relevant to that stage. While the outcome of using FactSheets might resemble the model cards proposal, it provides more support for planning the documentation effort and for collaboration between different roles within the organization towards AI documentation.

Several previous works present tools in the form of libraries or plug-ins to support documenting models. For example, Google has introduced the Model Card Toolkit, which allows TensorFlow users to generate a model card by manually filling out a predefined JSON schema through the provided APIs (Google, 2020). Like other model metadata gathering tools, the Model Card Toolkit requires users to manually prepare their code to extract information and does not provide any additional support for Jupyter Notebooks. Moreover, the code related to the toolkit can bloat the original model development code. Based on our investigation, this toolkit has not seen general adoption for model documentation on GitHub.
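
For context, the sketch below illustrates the typical usage pattern of the Model Card Toolkit: a card object backed by the predefined JSON schema is scaffolded, filled in programmatically, and rendered. The field values are placeholders, and the exact API surface may differ across toolkit versions.

```python
# Sketch of the Model Card Toolkit workflow; field values are placeholders
# and API details may differ between toolkit versions.
import model_card_toolkit as mctlib

# The toolkit writes its assets (schema, template, rendered card) to a directory.
toolkit = mctlib.ModelCardToolkit(output_dir="model_card_assets")

# Scaffold an empty model card object backed by the predefined JSON schema.
model_card = toolkit.scaffold_assets()
model_card.model_details.name = "Example sentiment classifier"
model_card.model_details.overview = "A toy model used to illustrate the API."

# Persist the filled-in fields and render the card (e.g., as HTML).
toolkit.update_model_card(model_card)
html = toolkit.export_format()
```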

Wang et al. proposed a tool called Themisto that combines deep-learning and information-retrieval approaches to generate documentation for code cells in notebooks (Wang et al., 2021). Themisto targets the same population as our tool, i.e., data scientists using notebooks, but with a different objective: Themisto aims to generate documentation that better supports the computational narrative of the notebook, while our tool focuses on encouraging data scientists to follow the documentation standard and consider the ethical aspects of their model development.

3. Model Documentation Practice

The number of ML models being published and reused is increasing at an astounding speed. Hugging Face, one of the popular platforms for sharing and hosting reusable models, is used by more than 5,000 organizations and currently hosts more than 19 thousand machine-learned models (Face, 2021a). The top-ranked model on Hugging Face, BERT base model (uncased), is downloaded more than 19M times per month (https://huggingface.co/bert-base-uncased, accessed on 10-29-2021). Many organizations also offer proprietary models for a wide range of tasks through public APIs, from Big Tech companies such as Google and AWS to many startups.

The technical steps for reusing models and incorporating them as components into one's own applications for various predictive tasks are easy, typically by downloading the trained model binary or calling a REST API. However, understanding the scope and quality of a model is often not obvious. Incomplete documentation of ML models can cause serious trouble for potential model adopters trying to properly set up the models within their own applications. More importantly, without information about the model development process and its impact on performance and ethics in the application domain, models might be misused or used without proper care, causing various harms to end-users (Blodgett et al., 2020; Buolamwini and Gebru, 2018).

To understand the current practice in documenting reusable models and the gap between recommendations and practice, we conduct an empirical study on model documentation. We start by collecting a dataset of models and corresponding documentation that explicitly or implicitly indicates adoption of the idea of model cards, a model documentation template proposed by Mitchell et al. (Mitchell et al., 2019). We then examine the models' documentation quality using a rubric we created based on the original model card work (Mitchell et al., 2019). Our analysis is entirely manual and mixes qualitative and quantitative aspects to ensure the reliability of our evaluation. We developed and validated our rubric iteratively and release it publicly as a potential foundation for documentation guidelines or quality assessment tools. Finally, we discuss the implications of the model documentation quality evaluation results.

3.1. Dataset Curation

We curate a dataset of model documentation from four sources (summarized in Table 1). We intentionally stratify our sample to cover mostly models that adopt the idea of model cards and to cover both commercial and research models. The first source for collecting model documentation is Hugging Face (Face, 2021a). As mentioned previously, Hugging Face has a large user base. They formally adopt the model card by providing documentation (Face, 2021b) and training materials (https://huggingface.co/course/chapter4/4?fw=pt) and show a model's README file under the label "Model Card" on the landing page of each hosted model. The content of the README, however, is not enforced when publishing models. We collected model cards from Hugging Face to observe how effective model card promotion can be. From its website, we collected all 370 models with more than 5,000 monthly downloads. We then randomly sampled 20 of these model cards from the top 100 by monthly downloads, representing the most popular models on Hugging Face, and 30 from the remaining 270 model cards, representing other popular models. This resulted in a representative sample of model cards for popular models on Hugging Face.
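
A minimal sketch of this stratified sampling step is shown below; the model identifiers are hypothetical, standing in for the actual list collected from the Hugging Face website.

```python
import random

# Hypothetical list of the 370 collected model IDs, ordered by monthly downloads.
models_by_downloads = [f"model-{i}" for i in range(370)]

random.seed(0)  # fixed seed so the sample can be reproduced
top_sample = random.sample(models_by_downloads[:100], 20)   # most popular models
rest_sample = random.sample(models_by_downloads[100:], 30)  # other popular models
sampled_model_cards = top_sample + rest_sample
```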

Our second source is GitHub, a popular platform for sharing ML models along with the code to train them. Some authors have adopted the model cards proposal for documenting their models in their README files. Therefore, we included GitHub as one of the sources to analyze the practice of adopting the model cards proposal for open source ML models. We used two search queries to identify candidate repositories. First, we used code search with the query "import model_card_toolkit" to identify repositories that used the Model Card Toolkit (Google, 2020), an open source Python model card API. Second, we searched for the query "model card" in the README files of all repositories. We then manually validated all results, removing any repositories that did not contain actual model cards, were included in the Hugging Face data, or were duplicates of another GitHub repository. After validation, we identified only a single repository using the Model Card Toolkit and 23 repositories recognizably using the model card concept in their README. This process yielded a relatively complete set of model cards for ML models shared on GitHub.

As the third source of model documentation, we searched for models offered as APIs by companies (from Big Tech to startups). To this end, we relied on Google search with the keywords "model card" and "model card [company name]," using company names such as Nvidia, Microsoft, Google, Facebook, OpenAI, DeepMind, and Amazon. We manually inspected the top results, discarding false positives, resulting in 28 model cards for commercial models. While the resulting set is not exhaustive, its size is comparable to the size of the documentation sets from the other sources.

As a baseline for our analysis, we further included a sample of ML models hosted on GitHub that do not claim to have followed the model card concept. To identify relevant repositories, we searched GitHub for three common and popular machine learning tasks for which models are commonly shared and nontrivial reuse questions arise (including ethical questions): object detection, sentiment analysis, and face generation. Among the identified candidate repositories, we sampled 30 (10 per task) that meet the following criteria: they have a README, release their pre-trained machine learning models, but do not mention model cards.


Source                     Subcategory                        # of Samples
Hugging Face               Top 100 Most Downloaded            20
Hugging Face               Top 101-370 Most Downloaded        30
GitHub Model Card          Model Card Toolkit                 1
GitHub Model Card          "Model Card" in Project README     23
Companies                                                     28
GitHub Non "Model Card"                                       30
Total                                                         132

Table 1. Model documentation collection used in our empirical study on documentation quality.

3.2. Assessing Model Documentation Quality

To evaluate the quality of model documentation, we created a rubric to judge how different aspects of the model cards proposal were documented (Mitchell et al., 2019). We realized in early phases of the rubric design that it would be very difficult to judge reliably how well certain aspects of a model are documented, for example, if the scope of a model is described accurately and clearly. Such judgement is highly subjective; we found low inter-rater reliability and found it challenging to define and describe levels of a measure. Hence, we converged on an approach that measures more reliably whether certain topics are covered in the documentation with more concrete yes/no questions at the cost of capturing only the presence of information in the documentation but not its comprehensiveness or correctness.

Concretely, starting from the description of each component in the model card, we converted each aspect to be covered in the recommended model card structure into a set of concrete yes/no questions. For example, for the section "Primary Intended Uses" of model cards, our rubric includes the question "Does this model card (or equivalent model documentation) explain scenarios in which to use the model?" For certain sections recommended by the model card paper, the original paper does not elaborate on what kind of content should be provided. Therefore, we added questions to resolve potential ambiguity in interpreting those components. We developed those questions iteratively based on our own observations while inspecting documentation in our dataset. For example, we added two more concrete questions concerning the section "Ethical Considerations": "Does the documentation discuss the process used for considering ethical issues with the model?" and "Does the documentation provide concrete measurements to support the discussed ethical considerations?"
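
For illustration, such a rubric can be represented as a simple mapping from model card sections to yes/no questions; the excerpt below is a hypothetical sketch, not the full rubric, which is included in the supplementary material.

```python
# Hypothetical excerpt of the rubric as a section -> yes/no questions mapping.
RUBRIC = {
    "Primary Intended Uses": [
        "Does the documentation explain scenarios in which to use the model?",
    ],
    "Ethical Considerations": [
        "Does the documentation discuss the process used for considering ethical issues?",
        "Does the documentation provide concrete measurements to support the ethical considerations?",
    ],
}

def coverage(answers: dict) -> float:
    """Fraction of rubric questions answered 'yes' for one model's documentation."""
    flat = [a for section in answers.values() for a in section]
    return sum(flat) / len(flat)
```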

The rubric was created iteratively and underwent three rounds of inter-rater reliability assessment. An initial rubric was designed and used by six individual raters during the first round, each evaluating the same ten model documentations (five from companies and five from Hugging Face). We then updated our rubric after investigating and resolving inconsistencies among the raters. We followed the same process during the second round on a new set of 15 model cards, with an inter-rater reliability of 0.59 using Cohen's Kappa (Cohen, 1960). In a final round, we focused on the three questions that yielded the most disagreement: questions about the target distribution of the model and the description of the training data (Q8, Q17, and Q19 in Table 2). After clarifying and refining the rubric for those questions, we reached an inter-rater reliability of 0.73 using Cohen's Kappa for those questions on an additional 15 model cards.
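
Inter-rater agreement of this kind can be computed with off-the-shelf implementations; below is a minimal sketch using scikit-learn, with made-up rater judgments.

```python
from sklearn.metrics import cohen_kappa_score

# Made-up yes/no judgments from two raters on the same set of rubric questions.
rater_a = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
rater_b = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(rater_a, rater_b)  # Cohen's Kappa (Cohen, 1960)
print(f"Cohen's Kappa: {kappa:.2f}")
```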

As part of the rubric, we instructed raters to only look at the primary documentation (model card or README file), and not to follow links to papers or external data unless the main documentation makes it clear what specific information can be found there (e.g., "for more graphs demonstrating the model's performance, see link"). This was an intentional choice to focus on the core documentation that a user might read rather than evaluating what information users could acquire by digging through academic papers or conducting their own experiments. It also aligns with the style of the model card proposal, which suggests collecting all important information in a compact description.

After establishing the reliability of the assessment rubric, one author manually rated the quality of all of the model cards in our dataset using the rubric. The complete rubric is included in the supplementary material for future reuse and refinement by model developers and other model stakeholders.

The rubric covers 22 items grouped into seven sections:

Model Description: 1. Contact Information; 2. Model Type; 3. Model Date/Version; 4. Model License
Intended Usages: 5. Intended Uses; 6. Out of Scope Uses; 7. How to Use
Target Distribution: 8. Target Distribution Description; 9. Target Distribution Examples
Evaluation Metrics: 10. Evaluation Statistics Reported; 11. Evaluation Statistics Explained; 12. Model Performance Visuals
Evaluation Process: 13. Evaluation Process Explained; 14. Evaluation Data Explained; 15. Evaluation Data Available
Training Process: 16. Training Process Explained; 17. Data Properties Explained; 18. Data Collection/Creation Explained; 19. Training Data Available
Ethical Considerations: 20. Ethical Considerations Discussed; 21. Ethical Issue Mitigation Process; 22. Concrete Ethical Measurements

Table 2. Evaluation results for model documentation using our model card rubric, broken down by data source (GitHub No Model Card, Hugging Face, GitHub "Model Card", Company) and question; each bar shows what percentage of model documentations include the information relating to the question, and a vertical bar indicates the mean across all data sources. The concrete rubric, including the wording of each question, can be found in the appendix.

3.3. Threats to Validity

As discussed, our rubric does not assess the completeness or correctness of information, but only whether documentation includes information related to the model card structure. We also largely excluded linked documents and papers from what we consider the primary documentation. Our results should be interpreted with this in mind. Whereas we could analyze a (near) complete set of models with model cards from corporations and GitHub, for Hugging Face and baseline GitHub models without model cards we had to sample. Our samples were not truly random; to keep the analysis manageable, we stratified the sample among popular Hugging Face projects and focused on models for three tasks. Finally, our analysis was manual and relied on some subjective judgement despite our best attempts to clarify and validate the rubric.

3.4. Model Documentation Assessment

3.4.1. Result and Observations

In Table 2, we summarize what information we found in each model card based on our rubric. The first notable observation is that model cards provided by companies with their models and model cards in GitHub repositories tend to include more information corresponding to model card sections. In contrast, the model cards on Hugging Face are less likely to include information related to most questions (all but Q2, Q3, Q10, and Q19, by Dunn's Kruskal-Wallis multiple comparison test). We also found no significant differences between the two strata of our Hugging Face sample, suggesting that the most popular models are not documented more comprehensively than less popular ones. Our baseline of model documentation in GitHub repositories that do not mention model cards draws a more mixed picture: regarding some information, such as the intended use (Q5 and Q7), they rate similarly to or even higher than the companies' model cards and GitHub model cards, but they evidently fall short on questions related to ethical considerations (Q20-Q22), where they rarely included information, similar to Hugging Face model cards.
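
As a sketch of how such a comparison can be run, the Kruskal-Wallis test and Dunn's post-hoc test are available in scipy and the third-party scikit-posthocs package; the scores below are placeholders, not our data.

```python
import pandas as pd
from scipy.stats import kruskal
import scikit_posthocs as sp

# Placeholder per-model scores (1 = information present, 0 = absent) for one rubric question.
df = pd.DataFrame({
    "score":  [1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1],
    "source": ["Hugging Face"] * 4 + ["GitHub Model Card"] * 4 + ["Company"] * 4,
})

groups = [g["score"].values for _, g in df.groupby("source")]
print(kruskal(*groups))  # omnibus Kruskal-Wallis test across data sources

# Dunn's multiple comparison test with Holm correction between pairs of sources.
print(sp.posthoc_dunn(df, val_col="score", group_col="source", p_adjust="holm"))
```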

Overall, we found that 18 of the 22 pieces of information covered by our questions were included by less than half of the models’ documentation, including most related to reasoning about the impact of model adoption, such as on unforeseen scenarios and on minority populations. Only questions about model type (Q2), model date and version (Q3), intended uses (Q5), and target distribution description (Q8) are included by more than half of the models’ documentation. Q6 about the situations where a user should not use a model, however, was only documented in 32% of the models’ documentation. Similarly, merely 35% of the models’ documentation included some discussion of bias or ethics (Q20). The ethical issue mitigation process (Q21) and measurement (Q22) were included in less than 10% of the models’ documentation, suggesting only a shallow (public) engagement with fairness issues.

The results suggest that Hugging Face's attempt to promote the concept of model cards with documentation and by labeling READMEs as model cards in its interface is not sufficient to encourage adoption of the model card concept. Indeed, on average, documentation of models on Hugging Face includes similar information to models published on GitHub without any reference to model cards. On the other hand, developers who voluntarily adopt model cards in GitHub repositories or for companies' published models tend to include information such as out-of-scope uses, evaluation results, and ethical considerations much more systematically. However, even those who voluntarily adopt the format tend to skip many sections of the recommended model card structure, most commonly information about the data collection process and concrete measures or steps taken regarding ethical issues.

Our empirical study suggests that an aspiring documentation proposal alone is insufficient to systematically improve model documentation quality, and that just labeling something a model card may even water down the concept rather than actually encouraging better documentation practices. Despite Hugging Face's effort to provide model card instructions and to name the model documentation "model card," the vast majority of its users clearly do not take the time to read and follow the instructions Hugging Face provides; only a single one included ethical considerations or any discussion of bias despite instructions recommending doing so. In fact, during our analysis, we found that at least four of the Hugging Face model cards contain a disclaimer stating that the model card was created by the Hugging Face team on behalf of the model creators. Currently, other factors such as incentives and intrinsic motivation might play a bigger role in improving model documentation.

4. DocML Design

To accelerate wide adoption of model cards and improve compliance with them, we present an interactive documentation tool that can be integrated into the model development routine of data scientists. In this section, we discuss the major design considerations for such a tool and the implementation of our prototype.

4.1. Design Motivation

The design of DocML followed a theory-driven yet iterative approach. Synthesizing the previous literature on program understanding, digital nudging, and software documentation, we first derived a set of concrete design guidelines for tools that support ML documentation. Then, through an exploratory study, we further enriched and finalized the design specifications for such tools. Below, we introduce each design guideline and its rationale.

4.1.1. Design Guidelines

Generating accurate documentation requires a profound understanding of the source code. When the content is less familiar, programmers have to spend considerable effort seeking, relating, and collecting the relevant information to support their comprehension (Ko et al., 2006). Even when the code was written by the same programmer, a context change introduces unfamiliarity and increases cognitive load (Sedano et al., 2017). A more task-focused interface that facilitates the display and management of task contexts provides an opportunity to improve the flow of knowledge-intensive work (Kersten and Murphy, 2015). For the task of documentation, therefore, it is preferable to conjoin the environment for writing and maintaining documentation with the environment of the documentation target, i.e., machine learning model development in our case. Such an idea is not unprecedented. The early design of JavaDoc (now mirrored in almost every programming language), the documentation generation tool for Java, also aligns with this principle: it uses source code comments with special notation as the location for API specifications (Kramer, 1999).

Design Guideline 1: The documentation environment should be collocated with the coding environment where the related context is directly accessible.

Our second guideline is derived from the theory of digital nudging. The idea of nudging originated in the study of behavioral economics (Thaler and Sunstein, 2009). It suggests that small changes in the environment can influence the behavior and decision-making of groups or individuals in predictable ways. Instead of relying solely on education and policy, the positive reinforcement brought by nudging can effectively encourage people to act rationally in numerous scenarios such as consumer choices, fundraising, and retirement saving. Nudging is part of the digital interface designer's toolbox as well. Recent work by Caraban et al. categorizes nudging mechanisms in technology design for health, sustainability, and privacy (Caraban et al., 2019). The six categories are facilitate, confront, deceive, social influence, fear, and reinforce. Among them, facilitate and reinforce have been adopted for encouraging software engineering behaviors (Brown and Parnin, 2021; Kramer, 1999). Inspired by this line of work, we propose to use nudging mechanisms in the documentation tool so that model developers more consciously consider and document model usage and ethical issues during development.

Design Guideline 2: Important but often overlooked content for ML model documentation should be prompted during model development and documentation writing.

Moreover, a practical and effective documentation tool should prevent inconsistent or stale information. Given that machine learning models are often improved in an iterative and continuous process (Patel et al., 2008; Kery et al., 2018; Siebert et al., 2021), often with additional data over time, there is a risk that model documentation and actual model properties drift apart. In a recent study on documentation issues discussed in developer mailing lists, Stack Overflow, issue repositories, and pull requests (Aghajani et al., 2019), Aghajani et al. observed that issues related to up-to-dateness (including code-doc inconsistency as a subcategory) are the second most frequent, right after documentation completeness. One strategy to prevent such problems is to follow the principle of a "single source of truth." This implies that if the same information appears in different contexts, one should be derived from the other depending on which one is the source of the information, rather than each maintaining a separate copy. Moreover, the documentation tool should support maintaining traceability links between the documentation and the corresponding source code. Traceability is an important property for developing safety-critical software (Cleland-Huang et al.) and is suggested to improve AI accountability (Raji et al., 2020). Explicit traceability links between code and documentation, in this case, can support examining the consistency between those two sets of artifacts and therefore improve the accuracy of the documentation and the accountability of the machine learning models.

Design Guideline 3: The content in the documentation should be derived from and/or traced to the origin of the information, such as source code, comments, or markdown cells in the notebook.

4.1.2. Additional Design Consideration

We approached our prototype design by following the three design guidelines discussed above while iterating on users' feedback from different stages of the tool design. Through this process, we added additional considerations to our prototype design, including:


  • Minimal distraction: users prefer a documentation tool with simple and intuitive functions that reduce the overhead of using it.

  • Customizable: the sections in the documentation should be configurable to reflect the priority of the project.

  • Access to explanations and exemplars: some sections in the model card are not easy to understand from their titles. Direct access to explanations and exemplars helps users understand how to document those sections properly.

4.2. DocML User Interface

In this section, we describe how the DocML interface supports several scenarios in which data scientists develop and maintain their models and the corresponding documentation in the notebook environment.

Figure 1. While developing models, data scientists can activate DocML by clicking the tool button. The notebook and the documentation panel are then shown side by side in JupyterLab.

4.2.1. Creating the documentation within the notebook environment

DocML is designed and implemented as an extension for JupyterLab, one of the common notebook environments used by data scientists. When activated, the interactive documentation panel expands alongside the notebook in JupyterLab (as shown in Fig. 1). The pre-defined documentation sections are shown on this panel so that users are reminded of them during model development and documentation writing. We use the model card sections plus additional sections reflecting the model development context, such as library use. Users can customize the section titles and their order through an explicit configuration file.

Figure 2. The user starts editing the documentation for each section by clicking the edit button. The content is created and maintained within the notebook (1). DocML presents the description of each section when the cursor hovers over the section title (2) and provides documentation examples through hyperlinks next to the section title (3).

When users click the edit button next to a section title on the DocML panel, they can start filling in the content for that section. Instead of maintaining a separate data store for the documentation, we redirect users to an automatically created markdown cell in the notebook with the section title (shown as (1) in Fig. 2). To differentiate the user-oriented model documentation from other markdown cells in the notebook that serve different purposes, we insert a special set of HTML comments indicating their role in the model card. Users can view the latest version of the documentation at any time by clicking the Refresh button on the panel (see Fig. 1). The complete documentation can be exported to a markdown file for sharing by clicking the Export to MD button.
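
The exact HTML comment format is internal to DocML; the sketch below illustrates the general mechanism with a hypothetical marker, scanning a notebook's JSON for markdown cells tagged as model card sections.

```python
import json
import re

# Hypothetical marker format; DocML's actual HTML comments may differ.
SECTION_MARKER = re.compile(r"<!--\s*docml-section:\s*(?P<title>.+?)\s*-->")

def collect_model_card_sections(notebook_path: str) -> dict:
    """Map model card section titles to the content of their tagged markdown cells."""
    with open(notebook_path, encoding="utf-8") as f:
        nb = json.load(f)

    sections = {}
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = "".join(cell.get("source", []))
        match = SECTION_MARKER.search(source)
        if match:
            # Strip the marker itself and keep the user-written documentation.
            sections[match.group("title")] = SECTION_MARKER.sub("", source).strip()
    return sections
```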

4.2.2. Nudging the adoption of model cards

To nudge data scientists to consider and follow the model cards proposal more effectively, we designed and implemented several features. First, when users hover the cursor over the title of a section, we provide a concise description of that section to remind them what content is appropriate (indicated by (2) in Fig. 2). Moreover, we provide explicit links to one or more examples next to the section title that users can consult when necessary (indicated by (3) in Fig. 2). In this case, we selected several high-quality model cards from our empirical study discussed in Section 3.4 as exemplars. The section titles, their descriptions, and the example links are also customizable through the configuration file.

Once users finish writing the documentation, they can export it to a markdown file for sharing. At this point, DocML performs a lightweight completion check and informs them if any pre-defined sections are still empty, as shown in Fig. 3. This extra step serves as an encouragement to complete missing model card sections.
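
A minimal sketch of such a completion check, building on the hypothetical section mapping above, could look as follows; the expected section list is illustrative and, in DocML, would come from the configuration file.

```python
# Illustrative list of expected sections; in DocML this comes from the configuration file.
EXPECTED_SECTIONS = [
    "Model Details", "Intended Use", "Factors", "Metrics",
    "Training Data", "Evaluation Data", "Ethical Considerations",
    "Caveats and Recommendations",
]

def missing_sections(sections: dict) -> list:
    """Return the pre-defined sections that are absent or still empty."""
    return [title for title in EXPECTED_SECTIONS
            if not sections.get(title, "").strip()]

# Warn the user before exporting to markdown (notebook path is hypothetical).
empty = missing_sections(collect_model_card_sections("model.ipynb"))
if empty:
    print("The following model card sections are still empty:", ", ".join(empty))
```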

Figure 3. DocML performs a completion check before exporting and suggests that users add content to sections that are still empty.

4.2.3. Traceability and Navigation across notebook and documentation

Certain sections in the model documentation directly describe the purpose and outcome of code cells in the notebook, such as the training process and the evaluation process. In this case, users can link code cells to the corresponding sections in the model documentation so that the content can be easily referenced and analyzed during documentation creation and maintenance. Currently, we define six stages that represent a common machine learning pipeline for data scientists (Amershi et al., 2019), i.e., data cleaning, preprocessing, hyperparameter tuning, model training, and model evaluation. To alleviate the manual effort required to select the stages, we automatically identify the stages for common libraries used by data scientists, including scikit-learn (https://scikit-learn.org/stable/), numpy (https://numpy.org), pandas (https://pandas.pydata.org), and matplotlib (https://matplotlib.org), by constructing a knowledge base of API usage. Support for other libraries can be added by extending the knowledge base. In case of incorrectly identified or missed code cells due to the automatic detection, users can correct the stages manually. Under the hood, the trace links are maintained through the code cell metadata, which cannot be easily viewed in the notebook environment. To make the information explicit, we automatically generate code comments to indicate those trace links, as shown in Fig. 4. Once the trace links are established, users can navigate to the code cells related to the corresponding model documentation through a navigation bar, which also indicates the relative location of the code cell within the notebook (see Fig. 5).
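
The sketch below illustrates the general idea of the stage knowledge base and the cell-metadata trace links; the matching rules and metadata key are hypothetical and may differ from DocML's actual implementation.

```python
# Hypothetical knowledge base: API usage fragments mapped to ML pipeline stages.
STAGE_KNOWLEDGE_BASE = {
    "preprocessing": ["StandardScaler", "train_test_split", "fillna("],
    "hyperparameter tuning": ["GridSearchCV", "RandomizedSearchCV"],
    "model training": ["fit(", "LogisticRegression", "GradientBoostingClassifier"],
    "model evaluation": ["score(", "accuracy_score", "confusion_matrix"],
}

def detect_stage(cell_source: str):
    """Guess the ML stage of a code cell by matching known API usage patterns."""
    for stage, patterns in STAGE_KNOWLEDGE_BASE.items():
        if any(p in cell_source for p in patterns):
            return stage
    return None  # unclassified; the user can set the stage manually

def attach_trace_link(cell: dict, stage: str) -> None:
    """Record the stage in the cell metadata and surface it as a code comment."""
    cell.setdefault("metadata", {})["docml_stage"] = stage  # hypothetical metadata key
    comment = f"# DocML stage: {stage}\n"
    source = "".join(cell.get("source", []))
    if not source.startswith(comment):
        cell["source"] = comment + source
```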

Figure 4. Users can explicitly select the machine learning stage corresponding to the model documentation. Once selected, the stage is indicated through a code comment.
Figure 5. To help users navigate to the corresponding cells in the notebook, DocML provides a navigation bar for each machine learning stage specified in the notebook.

4.3. Implementation

The architecture of DocML consists of a back-end extractor module, written in Python and JavaScript, and a front-end JupyterLab plugin written in React. It supports multiple model card interfaces at the same time with a single back-end. The front-end module follows the interface design specifications described in Section 4.1. The back-end module analyzes the code in the notebook, building on existing tools for program analysis (https://github.com/andrewhead/python-program-analysis), ML stage analysis for notebooks (https://github.com/yjiang2cmu/Jupyter-Notebook-Project), and cell dependency analysis (https://github.com/jerry-lu/cell-dependencies).

Once a request is made from the front-end plugin, the content of the notebook is sent to the back-end, which obtains the cell dependencies and clusters the cells that belong to the same stage. Mappings to the relevant stage name are then added either based on manual input by the user or based on knowledge base matching rules that classify various scikit-learn, numpy, pandas, and matplotlib code lines into the stages.

The configuration file consists of a list of JSON objects, which record customizable content such as the names of the sections, their descriptions, and any examples that showcase the suggested way of documenting them. The JSON objects are populated and displayed in the front-end when DocML is initialized. All markdown cells from the Jupyter notebook are parsed to retrieve any model documentation marked with the specific HTML tags for display on the tool panel.
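
For illustration, a configuration entry could look like the following; the field names are hypothetical and may differ from DocML's actual schema.

```python
import json

# Hypothetical configuration: one JSON object per model card section.
config = [
    {
        "title": "Intended Use",
        "description": "Scenarios the model was designed for, and out-of-scope uses.",
        "examples": ["https://huggingface.co/bert-base-uncased"],
    },
    {
        "title": "Ethical Considerations",
        "description": "Potential harms, affected groups, and mitigation steps.",
        "examples": [],
    },
]

with open("docml_config.json", "w", encoding="utf-8") as f:
    json.dump(config, f, indent=2)
```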

5. User Study

We conducted a user study with participants who had previous experience developing models in notebooks to investigate to what extent DocML can encourage creating more comprehensive documentation and support understanding and improving documentation quality during model creation and maintenance.

5.1. Method

5.1.1. Recruitment and Participants

We recruited by sharing invitations on Twitter and LinkedIn and through personal networks, followed by screening candidates for an appropriate background. The selection criteria included having sufficient experience with the notebook environment and having shared at least one machine learning model prior to the study. Suitable candidates were then invited to fill in a pre-study survey, participate in a remote lab study of around one hour, and take part in a post-study interview. In the end, 16 participants were recruited for our study. Among them, 14 had at least one year of experience using computational notebooks. All participants had used Jupyter Notebook in academic settings (e.g., assignments) and five had also used it in professional settings (e.g., as professional data scientists).

5.1.2. Study Process

The 16 participants were randomly divided into two groups, one proceeding through the study with our tool (the experimental condition) and the other without it (the control condition); we refer to them as the experimental and control group, respectively. In the pre-study survey, all participants answered questions about their data science background and documentation practices, and were informed about the model cards proposal. Participants in the experimental group were additionally asked to watch a short tutorial video introducing the functions of DocML and to get familiar with DocML's interface through a provided remote server.

Due to restrictions on user studies during the pandemic, the lab study was performed remotely through Microsoft Teams. At the beginning of the study, we provided the participants with an incomplete notebook and a model card in the form of a README.md file with only some sections filled in, such as information about the model, intended use, and preprocessing. Moreover, the information at three places in the model card was inconsistent with the notebook: the library used, the hyperparameters, and the dataset feature description. We deliberately chose not to inform the participants of these concrete quality problems to mimic the situation in practice.

The participants from both groups were asked to perform two tasks, around 20 minutes each, representing common activities during model development and maintenance. The first task was to choose one of the two candidate models we provided in the notebook and complete the documentation for the model of their choice. The participants were encouraged to make any updates to the existing code and documentation to improve their accuracy, completeness, or other quality attributes. In the second task, the participants were asked to extend the model development by building a new model on the same dataset using different features and to update the documentation accordingly.

Upon completion of the two tasks, we interviewed the participants about their experience with the model documentation. For the experimental group, we asked the participants to evaluate six major features of DocML. For the control group, we sought their opinion on potential support that could improve their documentation experience.

5.2. Threats to Validity

The study is limited by the lab setting, in which the time constraint is a major factor shaping the documentation experience. Task complexity, target domain, and other factors might have a bigger impact on documentation practice in the field. Moreover, despite being given the tutorial and tool access prior to the study, participants in the experimental group still experienced a certain learning curve when using the tool. Therefore, our study might not reflect the documentation quality produced by users who are familiar with the interface. Finally, the long-term impact of deploying the tool, especially during continuous model and documentation evolution, cannot be observed with the current study design.

5.3. Results and Observations

5.3.1. Documentation Creation and Maintenance

Participants from both groups generally started their tasks by glancing through the provided notebook to get familiar with its structure and content. How they approached the documentation diverged depending on whether DocML was present. We discuss the observations for the two groups below.

Most participants in the experimental group tended to perform a more careful examination of the provided content, following the order of the model card template. For example, one participant devoted effort to reworking the content of almost every section in the provided model card to make it more precise and informative. Notably, all but two participants from the experimental group read the prompted descriptions of the model card sections to aid their understanding, in particular for the sections Factors, Fairness Considerations, and Caveats and Recommendations. Some of them further clicked the example links to model cards. In the end, three participants from the experimental group added documentation related to ethical considerations to their model card.

All but one participant from the experimental group used DocML's navigation support enabled by the code-doc trace links to help them examine the model card content (due to a technical issue during the study, DocML only became accessible to that participant at the end of the second task). Some of them made edits to the trace links by modifying existing links from code cells and/or adding new links during the first task. In the second task, during which they were asked to develop their own models, two participants devoted considerable time to creating trace links for their own code cells. They tended to consider the trace links an innate component of the documentation.

On the other hand, almost all participants from the control group (with one exception) devoted most of their effort to adding documentation about the algorithmic aspects of the model and its training and testing processes. They spent little time examining and updating other sections of the provided documentation. When they did, navigation appeared to be a struggle. For example, two participants manually scrolled the notebook extensively to locate the relevant content and switched back and forth between the notebook and the README file. The extra effort might explain why, after the initial inspection, they tended to consider the provided content to be of sufficient quality. No participant from this group added any information related to ethical considerations.

Regarding the inconsistencies in the provided documentation, six participants from the experimental group and seven from the control group fixed one inconsistency (the incorrect library name), either by replacing the content or by rewriting the whole section. Three participants fixed two of the inconsistencies during the study.

Overall, the behavior and the delivered model cards of the control group resemble the current common practice of data scientists. At the same time, even under the strict time constraint of our study, most participants from the experimental group chose to put more effort into documentation activities that are missing from current practice, including understanding the model card sections (especially ethics-related considerations), improving the writing of existing content, and creating and maintaining doc-code trace links. Such behavior change can potentially bring non-negligible benefits to machine learning documentation in the long term.

5.3.2. User Evaluation and Feedback

Figure 6. User evaluation of the necessity and ease of use of the features in DocML. Both aspects were evaluated on a Likert scale. Neutral and abstained input is omitted from the figure. The x-axis shows the number of participants from the experimental group.

Fig. 6 summarizes how the participants from the experimental group rated the necessity and ease of use of the different features in DocML. Most participants agreed or strongly agreed that these features are necessary. They particularly commented on the importance of having the model card template to structure their documentation during model development. Most participants also found the navigation function supported by trace links especially effective, so that they "do not have to see hundreds of lines of code for going to the [model card] section".

At the same time, the participants expected more control over the section order and titles in the model card. They preferred direct modification of the model card template through the UI panel rather than the configuration file, indicating a tension between customization and standardization that needs to be carefully balanced in practice. Moreover, the participants hoped that the markdown cells representing the model card content could be inserted next to the corresponding code cells; DocML currently relies on users to appropriately place the newly added markdown cells in the notebook. Two participants also thought trace link creation could be made more intuitive in future versions.

Participants from the control group voiced a need for functions similar to those provided by DocML, including the documentation template and code-documentation trace links. Moreover, participants from both groups expressed a preference for collaboration features in the documentation tool. Future work is needed to understand how to fit documentation tools into the development toolchain of different stakeholders and into team dynamics.

6. Discussion

Model cards are still scarcely adopted in practice. In all of GitHub, with millions of public notebooks (Pimentel et al., 2019; Rule et al., 2018; Psallidas et al., 2019) and many repositories sharing learning code and learned models, we found only 24 models documented explicitly with model cards. Our best effort at finding model cards published by companies resulted in only 28 models. Even when model cards are adopted, that is hardly an indication of documentation quality. During our assessment, we observed strong variance in what information the model cards provide according to our rubric. Moreover, the extent to which the documentation answers the questions in the rubric also varies drastically. For example, the majority of the model cards we examined fail to provide more than vague or generic information related to the target distribution and ethical considerations. They are often not self-contained and sometimes direct readers to additional resources such as research papers. Even when digging into those papers, we often failed to find information related to the model card sections. In general, model documentation in practice still seems to be an afterthought at best. Such practice calls for more effective tools that support model creators from early on.

Nudging and traceability enable a behavior shift for documentation. As observed in the user study, when our tool is not available, data scientists are overwhelmed with navigating and documenting the details of the model development code, leaving no room for considering the model cards recommendations. On the other hand, when the documentation environment is integrated with the coding environment in a meaningful way, data scientists devote more effort to improving documentation quality and maintainability. They approach documentation in a more iterative manner and actively use and maintain the trace links between documentation and source code. Data scientists further spend more time considering the context and impact of model development and deployment when the explanations and examples of model card sections are nudged in their model development environment. Such behavior change is critical to the comprehensiveness of model documentation, in particular along the ethical axis.

7. Conclusion

In this work, we investigated how publicly available ML models are documented, especially when they adopt the model cards proposal. Our assessment of the quality of that documentation reveals a clear gap between the proposal and practice. As an effort to move the needle towards meaningful model card adoption and improved documentation practice, we proposed a set of design guidelines for model documentation tools drawn from the literature on software documentation, interaction design, and traceability. Following those guidelines and an iterative approach, we implemented a documentation tool for data scientists using computational notebooks. As demonstrated in the user study, the tool nudges data scientists to create documentation and to consider ethical implications during model development. It also encourages the construction and maintenance of trace links between documentation and source code, which supports model accountability.

References

  • E. Aghajani, C. Nagy, M. Linares-Vásquez, L. Moreno, G. Bavota, M. Lanza, and D. C. Shepherd (2020) Software documentation: the practitioners’ perspective. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, ICSE ’20, New York, NY, USA, pp. 590–601. External Links: ISBN 9781450371216, Link, Document Cited by: §1.
  • E. Aghajani, C. Nagy, O. L. Vega-Márquez, M. Linares-Vásquez, L. Moreno, G. Bavota, and M. Lanza (2019) Software documentation issues unveiled. In 2019 IEEE/ACM 41st International Conference on Software Engineering (ICSE), pp. 1199–1210. Cited by: §2.1, §2.1, §4.1.1.
  • S. Amershi, A. Begel, C. Bird, R. DeLine, H. Gall, E. Kamar, N. Nagappan, B. Nushi, and T. Zimmermann (2019) Software engineering for machine learning: a case study. In 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), Vol. , pp. 291–300. External Links: Document Cited by: §4.2.3.
  • M. Arnold, R. K. Bellamy, M. Hind, S. Houde, S. Mehta, A. Mojsilović, R. Nair, K. N. Ramamurthy, A. Olteanu, D. Piorkowski, et al. (2019) FactSheets: increasing trust in ai services through supplier’s declarations of conformity. IBM Journal of Research and Development 63 (4/5), pp. 6–1. Cited by: §1.
  • D. M. Arya, J. L. Guo, and M. P. Robillard (2020) Information correspondence between types of documentation for apis. Empirical Software Engineering 25 (5), pp. 4069–4096. Cited by: §2.1.
  • S. L. Blodgett, S. Barocas, H. Daumé III, and H. Wallach (2020) Language (technology) is power: a critical survey of “bias” in NLP. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 5454–5476. External Links: Link, Document Cited by: §3.
  • K. L. Boyd (2021) Datasheets for datasets help ml engineers notice and understand ethical issues in training data. Proceedings of the ACM on Human-Computer Interaction 5 (CSCW2), pp. 1–27. Cited by: §2.2.
  • C. Brown and C. Parnin (2021) Nudging students toward better software engineering behaviors. arXiv preprint arXiv:2103.09685. Cited by: §4.1.1.
  • J. Buolamwini and T. Gebru (2018) Gender shades: intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, pp. 77–91. Cited by: §1, §3.
  • A. Caraban, E. Karapanos, D. Gonçalves, and P. Campos (2019) 23 ways to nudge: a review of technology-mediated nudging in human-computer interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 1–15. External Links: ISBN 9781450359702, Link Cited by: §4.1.1.
  • J. Cleland-Huang, O. Gotel, A. Zisman, et al. (2012) Software and systems traceability. Vol. 2, Springer. Cited by: §4.1.1.
  • J. Cohen (1960) A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20 (1), pp. 37–46. External Links: Document, Link Cited by: §3.2.
  • L. F. Cortés-Coy, M. Linares-Vásquez, J. Aponte, and D. Poshyvanyk (2014) On automatically generating commit messages via summarization of source code changes. In 2014 IEEE 14th International Working Conference on Source Code Analysis and Manipulation, Vol. , pp. 275–284. External Links: Document Cited by: §2.1.
  • J. Dastin (2018) Amazon scraps secret ai recruiting tool that showed bias against women. Reuters. External Links: Link Cited by: §1.
  • J. Dressel and H. Farid (2018) The accuracy, fairness, and limits of predicting recidivism. Science advances 4 (1), pp. eaao5580. Cited by: §1.
  • Hugging Face (2021a) Hugging face – the ai community building the future. External Links: Link Cited by: §3.1, §3.
  • Hugging Face (2021b) Hugging face documentation: model repos docs. External Links: Link Cited by: §1, §3.1.
  • T. Gebru, J. Morgenstern, B. Vecchione, J. W. Vaughan, H. Wallach, H. Daumé III, and K. Crawford (2018) Datasheets for datasets. arXiv preprint arXiv:1803.09010. Cited by: §1, §2.2.
  • Google (2020) Model-card-toolkit. GitHub. Note: https://github.com/tensorflow/model-card-toolkit Cited by: §2.2, §3.1.
  • A. Head, J. Jiang, J. Smith, M. A. Hearst, and B. Hartmann (2020) Composing flexibly-organized step-by-step tutorials from linked source code, snippets, and outputs. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, CHI ’20, New York, NY, USA, pp. 1–12. External Links: ISBN 9781450367080, Link, Document Cited by: §2.1, §2.1.
  • J. Hellman, E. Jang, C. Treude, C. Huang, and J. L. Guo (2021) Generating github repository descriptions: a comparison of manual and automated approaches. arXiv preprint arXiv:2110.13283. Cited by: §2.1, §2.1.
  • S. Holland, A. Hosny, S. Newman, J. Joseph, and K. Chmielinski (2018) The dataset nutrition label: a framework to drive higher data quality standards. arXiv preprint arXiv:1805.03677. Cited by: §2.2.
  • M. Kersten and G. C. Murphy (2015) Reducing friction for knowledge workers with task context. AI Magazine 36 (2), pp. 33–41. Cited by: §4.1.1.
  • M. B. Kery, M. Radensky, M. Arya, B. E. John, and B. A. Myers (2018) The story in the notebook: exploratory data science using a literate programming tool. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1–11. Cited by: §4.1.1.
  • A. J. Ko, B. A. Myers, M. J. Coblenz, and H. H. Aung (2006) An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks. IEEE Transactions on Software Engineering 32 (12), pp. 971–987. External Links: Document Cited by: §4.1.1.
  • D. Kramer (1999) API documentation from source code comments: a case study of javadoc. In Proceedings of the 17th annual international conference on Computer documentation, pp. 147–153. Cited by: §4.1.1, §4.1.1.
  • P. W. McBurney and C. McMillan (2014) Automatic documentation generation via source code summarization of method context. In Proceedings of the 22nd International Conference on Program Comprehension, pp. 279–290. Cited by: §2.1.
  • M. Mitchell, S. Wu, A. Zaldivar, P. Barnes, L. Vasserman, B. Hutchinson, E. Spitzer, I. D. Raji, and T. Gebru (2019) Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, FAT* ’19, New York, NY, USA, pp. 220–229. External Links: ISBN 9781450361255, Link, Document Cited by: §1, §3.2, §3.
  • L. Moreno, J. Aponte, G. Sridhara, A. Marcus, L. Pollock, and K. Vijay-Shanker (2013) Automatic generation of natural language summaries for java classes. In 2013 21st International Conference on Program Comprehension (ICPC), pp. 23–32. Cited by: §2.1, §2.1.
  • K. Patel, J. Fogarty, J. A. Landay, and B. Harrison (2008) Investigating statistical machine learning as a tool for software development. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 667–676. Cited by: §4.1.1.
  • J. F. Pimentel, L. Murta, V. Braganholo, and J. Freire (2019) A large-scale study about quality and reproducibility of jupyter notebooks. In 2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR), pp. 507–517. Cited by: §1, §6.
  • G. A. A. Prana, C. Treude, F. Thung, T. Atapattu, and D. Lo (2019) Categorizing the content of github readme files. Empirical Software Engineering 24 (3), pp. 1296–1327. Cited by: §2.1.
  • F. Psallidas, Y. Zhu, B. Karlas, M. Interlandi, A. Floratou, K. Karanasos, W. Wu, C. Zhang, S. Krishnan, C. Curino, et al. (2019) Data science through the looking glass and what we found there. arXiv preprint arXiv:1912.09536. Cited by: §6.
  • I. D. Raji, A. Smart, R. N. White, M. Mitchell, T. Gebru, B. Hutchinson, J. Smith-Loud, D. Theron, and P. Barnes (2020) Closing the ai accountability gap: defining an end-to-end framework for internal algorithmic auditing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, FAT* ’20, New York, NY, USA, pp. 33–44. External Links: ISBN 9781450369367, Link, Document Cited by: §4.1.1.
  • J. Richards, D. Piorkowski, M. Hind, S. Houde, and A. Mojsilović (2020) A methodology for creating ai factsheets. arXiv preprint arXiv:2006.13796. Cited by: §2.2.
  • M. P. Robillard, A. Marcus, C. Treude, G. Bavota, O. Chaparro, N. Ernst, M. A. Gerosa, M. Godfrey, M. Lanza, M. Linares-Vásquez, et al. (2017) On-demand developer documentation. In 2017 IEEE International conference on software maintenance and evolution (ICSME), pp. 479–483. Cited by: §2.1.
  • A. Rule, A. Tabard, and J. D. Hollan (2018) Exploration and explanation in computational notebooks. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 1–12. Cited by: §1, §6.
  • T. Sedano, P. Ralph, and C. Péraire (2017) Software development waste. In 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE), pp. 130–140. Cited by: §4.1.1.
  • J. Siebert, L. Joeckel, J. Heidrich, A. Trendowicz, K. Nakamichi, K. Ohashi, I. Namba, R. Yamamoto, and M. Aoyama (2021) Construction of a quality model for machine learning systems. Software Quality Journal, pp. 1–29. Cited by: §4.1.1.
  • K. T. Stolee, S. Elbaum, and M. B. Dwyer (2016) Code search with input/output queries: generalizing, ranking, and assessment. Journal of Systems and Software 116, pp. 35–48. External Links: ISSN 0164-1212, Document, Link Cited by: §2.1.
  • J. Stylos and B. A. Myers (2006) Mica: a web-search tool for finding api components and examples. In Visual Languages and Human-Centric Computing (VL/HCC’06), pp. 195–202. Cited by: §2.1.
  • R. H. Thaler and C. R. Sunstein (2009) Nudge: improving decisions about health, wealth, and happiness. New York: Penguin Books. Cited by: §4.1.1.
  • C. Treude and M. P. Robillard (2016) Augmenting api documentation with insights from stack overflow. In 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE), pp. 392–403. Cited by: §2.1.
  • A. Y. Wang, D. Wang, J. Drozdal, M. Muller, S. Park, J. D. Weisz, X. Liu, L. Wu, and C. Dugan (2021) Documentation matters: human-centered ai system to assist data science code documentation in computational notebooks. arXiv preprint arXiv:2102.12592. Cited by: §2.2.