Data science often refers to the process of leveraging modern machine learning techniques to identify insights from data (Kim et al., 2016; Muller et al., 2019b; Kaggle, 2018). In recent years, with more organizations adopting a “data-centered” approach to decision-making (Ufford et al., 2018; Dobrin and Analytics, 2017), more and more teams of data science workers have formed to work collaboratively on larger data sets, more structured code pipelines, and more consequential decisions and products. Meanwhile, research around data science topics has also increased rapidly within the HCI and CSCW community in the past several years (Guo et al., 2011; Kim et al., 2016; Muller et al., 2019b; Kross and Guo, 2019; Rule et al., 2018; Kery et al., 2018; Wang et al., 2019b, a, 2020).
From existing literature, we have learned that the data science workflow often consists of multiple phases (Muller et al., 2019b; Wang et al., 2019b; Kross and Guo, 2019). For example, Wang et al. describe the data science workflow as containing 3 major phases (i.e., Preparation, Modeling, and Deployment) and 10 more fine-grained steps (Wang et al., 2019b). Various tools have also been built to support data science work, including programming languages such as Python and R, statistical analysis tools such as SAS (Littell et al., 2006) and SPSS (Mikut and Reischl, 2011), the integrated development environment (IDE) Jupyter Notebook (Kluyver et al., 2016; Granger et al., 2017), and automated model building systems such as AutoML (Google, ; Liu et al., 2019b) and AutoAI (Wang et al., 2019b). From empirical studies, we also know how individual data scientists use these tools (Kery et al., 2018, 2019; Rule et al., 2018), and what features could be added to improve the tools for users working alone (Wang et al., 2019a).
However, a growing body of recent literature has hinted that data science projects consist of complex tasks that require multiple skills (Kim et al., 2016; Matsudaira, 2015). These requirements often lead participants to juggle multiple roles in a project, or to work with others who have distinct skills. For instance, in addition to the well-studied role of data scientist (Lee, 2014; Miller, 2014), who engages in technical activities such as cleaning data, extracting or designing features, analyzing/modeling data, and evaluating results, there is also the role of project manager, who engages in less technical activities such as reporting results (Hou and Wang, 2017; Lawton, 2018; Nov and Ye, 2010; Patil, 2011). The 2017 Kaggle survey reported additional roles involved in data science (Hayes, 2018), but without addressing topics of collaboration. We limited the roles in our survey to activities and relationships that were mentioned in the interviews reported in (Muller et al., 2019b).
Unfortunately, most of today’s understanding of data science collaboration only focuses on the perspective of the data scientist, and how to build tools to support distant and asynchronous collaborations among data scientists, such as version control of code. The technical collaborations afforded by such tools (Wu et al., 2014) only scratch the surface of the many ways that collaborations may happen within a data science team, such as when stakeholders discuss the framing of an initial problem before any code is written or data collected (Pine and Liboiron, 2015). However, we have little empirical data to characterize the many potential forms of data science collaboration.
Indeed, we should not assume data science team collaboration is the same as the activities from a conventional software development team, as argued by various previous literature (Guo et al., 2011; Kross and Guo, 2019). Data science is engaged as an “exploration” process more than an “engineering” process (Heer and Shneiderman, 2012; Kery et al., 2018; Muller et al., 2019b). “Engineering” work is oftentimes assumed to involve extended blocks of solitude, without the benefit of colleagues’ expertise while engaging with data and code (Parnin, 2013). While this perspective on engineering is still evolving (Storey et al., 2006), there is no doubt that “exploration” work requires deep domain knowledge that oftentimes only resides in domain experts’ minds (Kery et al., 2018; Muller et al., 2019b). And due to the distinct skills and knowledge residing within different roles in a data science team, more challenges with collaboration can arise (Mao et al., 2020; Hou and Wang, 2017).
In this paper, we aim to deepen our current understanding of the collaborative practices of data science teams from not only the perspective of technical team members (e.g., data scientists and engineers), but also the understudied perspective of non-technical team members (e.g., team managers and domain experts). Our study covers both a large scale of users—we designed an online survey and recruited 183 participants with experience working in data science teams—and an in-depth investigation—our survey questions dive into 5 major roles in a data science team (engineer/analyst/programmer, researcher/scientist, domain expert, manager/executive, and communicator), and 6 stages (understand problem and create plan, access and clean data, select and engineer features, train and apply models, evaluate model outcomes, and communicate with clients or stakeholders) in a data science workflow. In particular, we report what other roles a team member works with, in which step(s) of the workflow, and using what tools.
In what follows, we first review literature on the topic of data science work practices and tooling; then we present the research method and the design of the survey; we report survey results following the order of Overview of Collaboration, Collaboration Roles, Collaborative Tools, and Collaborative Practices. Based on these findings, we discuss implications and suggest designs of future collaborative tools for data science.
2. Related Work
Our research contributes to the existing literature on how data science teams work. We start this section by reviewing recent HCI and CSCW research on data science work practices; then we take an overview of the systems and features designed to support data science work practices. Finally, we highlight specific literature that aims to understand and support particularly the collaborative aspect of data science teamwork.
2.1. Data Science Work Practices
Jonathan Grudin describes the current cycle of the popularity of AI-related topics in industry and in academia as an “AI Summer” (Grudin, 2009). In the hype surrounding AI, many fancy technology demos mark key milestones, such as IBM DeepBlue (Campbell et al., 2002), which defeated a human chess champion for the first time, and Google’s AlphaGo demo, which defeated the world champion in Go (Wang et al., 2016). With these advances in AI and machine learning technologies, more and more organizations are trying to apply machine learning models to business decision-making processes. People refer to this collection of work as “data science” (Guo et al., 2011; Kross and Guo, 2019; Kaggle, 2018; Muller et al., 2019b) and the various workers who participate in this process as “data scientists” or “data engineers”.
HCI researchers are interested in data science practices. Studies have been conducted to understand data science work practices (Passi and Jackson, 2017, 2018; Guo et al., 2011; Kross and Guo, 2019; Kery et al., 2018; Rule et al., 2018; Hou and Wang, 2017; Mao et al., 2020; Muller et al., 2019b), sometimes using the label of Human Centered Data Science (Aragon et al., 2016; Muller et al., 2019a). For example, Wang et al. proposed a framework of 3 stages and 10 steps to characterize the data science workflow by synthesizing existing literature (Figure 1) (Wang et al., 2019b). The stages consist of Preparation, Modeling, and Deployment, and at a finer-grained level, the framework has 10 steps from Data Acquisition to Model Runtime Monitoring and Improvement. This workflow framework is built on top of Muller et al.’s work (Muller et al., 2019b), which mostly focused on the Preparation steps, and decomposed the data science workflow into 4 stages, based on interviews with professional data scientists: Data Acquisition, Data Cleaning, Feature Engineering, and Model Building and Selection.
Given these workflow frameworks and terminology (Muller et al., 2019b; Wang et al., 2019b), we can position existing empirical work within a data science workflow. For example, researchers suggested that 80% of a data science project is spent in the Preparation stage (Zöller and Huber, 2019; Muller et al., 2019b; Guo et al., 2011; Kandel et al., 2011; Rattenbury et al., 2017). As a result, data scientists often do not have enough time to complete a comprehensive data analysis in the Modeling stage (Sutton et al., 2018). Passi and Jackson dug deeper into the Preparation stage, showing that when data scientists pre-process their data, it is often a rule-based, not rule-bound, process (Passi and Jackson, 2017). Pine and Liboiron further investigated how data scientists made those pre-processing rules (Pine and Liboiron, 2015).
However, most of the literature focuses only on a single data scientist’s perspective, despite many interviewees reporting that “data science is a team sport” (Wang et al., 2019b). Even the workflows of Muller et al. (Muller et al., 2019b) and Wang et al. (Wang et al., 2019b) focus only on the activities that involve data and code, which were most likely performed by the technical roles in the data science team. The voices of the non-technical collaborators within a data science team are missing, including an understanding of whom they worked with, when, and what tools they used.
In contrast to the current literature in data science, software engineering has built a strong literature on collaborative practices in software development (Treude et al., 2009; Kraut and Streeter, 1995; Herbsleb et al., 2001), including in both open source communities (Bird et al., 2008; Dabbish et al., 2012) and industry teams (Begel, 2008). As teams working on a single code base can often be large, cross-site, and globally dispersed, much research has focused on the challenges and potential solutions for communication and coordination (Herbsleb et al., 2001). These challenges can be exacerbated by cultural differences between team members of different backgrounds (Halverson et al., 2006; Huang and Trauth, 2007) and by the different roles of team members such as project manager (Zhang et al., 2007) or operator (Tessem and Iden, 2008). Many of the tools used by software engineering teams are also used by data science teams (e.g., GitHub (Dabbish et al., 2012), Slack (Park et al., 2018)), and the lessons learned from this work can also inform the design of collaborative tools for data science. However, there are also important differences when it comes to data science in particular, such as the types of roles and technical expertise of data science collaborators as well as a greater emphasis on exploration, data management, and communicating insights in data science projects.
Many of the papers about solitary data science work practices adopted the interview research method (Wang et al., 2019a; Guo et al., 2011; Kross and Guo, 2019; Kery et al., 2018; Wang et al., 2019b; Muller et al., 2019b). An interview research method is well-suited for the exploratory nature of these empirical works in understanding a new practice, but it also falls short in generating a representative and generalizable understanding from a larger user population. Thus, we decided to leverage a large-scale online survey to complement the existing qualitative narratives.
2.2. Collaboration in Data Science
Only recently have some CSCW researchers begun to investigate the collaborative aspect of data science work (Grappiolo et al., 2019; Passi and Jackson, 2017; Stein et al., 2017; Borgman et al., 2012; Viaene, 2013). For example, Hou and Wang (Hou and Wang, 2017) conducted an ethnographic study to explore collaboration in a civic data hackathon event where data science workers help non-profit organizations develop insights from their data. Mao et al. (Mao et al., 2020) interviewed biomedical domain experts and data scientists who worked on the same data science projects. Their findings partially echo previous results (Muller et al., 2019b; Guo et al., 2011; Kross and Guo, 2019) that suggest data science workflows have many steps. More importantly, their findings are similar to the software engineering work cited above (Treude et al., 2009; Kraut and Streeter, 1995; Herbsleb et al., 2001), opening the possibility that data science may also be a highly collaborative effort where domain experts and data scientists need to work closely together to advance along the workflow.
Researchers also observed challenges in collaborations within data science teams that were not as common in conventional software engineering teams. Bopp et al. showed that “big data” could become a burden to non-profit organizations who lack staff to make use of those data resources (Bopp et al., 2017). Hou and Wang provided a possible reason: the technical data workers “speak a different language” than the non-technical problem owners, such as a non-profit organization (NPO) client, in the civic data hackathon that they studied (Hou and Wang, 2017). The non-technical NPO clients could only describe their business questions in natural language, e.g., “why is this phenomenon happening?” But data science workers do not necessarily know how to translate this business question into a data science question. A good practice researchers observed in this context was “brokering activity”, where a special group of organizers who understand both data science and the context serve as translators to turn business questions into data science questions (see Williams’ earlier HCI work on the importance of “translators” who understand and mediate between multiple domains (Williams and Begg, 1993)). Also, once the data workers generated the results, the “brokers” helped to interpret the data science findings into business insights.
The aforementioned collaboration challenges (Hou and Wang, 2017) are not unique to the civic data hackathon context. Mao et al. (Mao et al., 2020) interviewed both data scientists and bio-medical scientists who worked together in research projects. They found that these two different roles often do not have common ground about the project’s progress. For example, the goal of bio-medical scientists is to discover new knowledge; thus, when they ask a research question, that question is often tentative. Once there is an intermediate result, bio-medical scientists often need to revise their research question or ask a new question, because their scientific journey is to “ask the right question”. Data scientists instead are used to transferring a research question into a well-defined data science question so they can optimize machine learning models and increase performance. This behavior of bio-medical scientists was perceived by the data scientist collaborators as “wasting our time”, as they had worked hard to “find the answer to the question” that later was discarded. Mao et al. argued that the constant re-calibration of common ground might help to ease tensions and support cross-discipline data science work.
However, these related projects focused only on a civic data hackathon (Hou and Wang, 2017) and on the collaborative projects between data scientists and bio-medical scientists in scientific discovery projects (Mao et al., 2020). Also, both of them used ethnographic research methods aiming for in-depth understanding of the context. In this work, we want to target a more commonly available scenario—data science teams’ work practices in corporations—as this scenario is where most data science professionals work. We also want to gather a broader user perspective through the deployment of an online survey.
2.3. Data Science Tools
Based on the empirical findings and design suggestions from previous literature (Hou and Wang, 2017; Mao et al., 2020; Muller et al., 2019b; Grappiolo et al., 2019; Passi and Jackson, 2017; Stein et al., 2017; Borgman et al., 2012; Viaene, 2013), some designers and system builders have proposed human-in-the-loop design principles for science tools (Amershi et al., 2019, 2011; Wang et al., 2019a; Kery et al., 2018, 2019; Gil et al., 2019). For example, Gil et al. surveyed papers about building machine learning systems and developed a set of design guidelines for building human-centered machine learning systems (Gil et al., 2019). Amershi et al. in parallel reviewed a broader spectrum of AI applications and proposed a set of design suggestions for AI systems in general (Amershi et al., 2019).
With these design principles and guidelines (Gil et al., 2019; Amershi et al., 2019), many systems and features have been proposed to support aspects of data science work practices. One notable system is Jupyter Notebook (Jupyter, ) and its variations such as Google Colab (Google, ) and Jupyter-Lab (Jupyter, ). Jupyter Notebook is an integrated code development environment tailored for data science work practices. It has a graphical user interface that supports three key functionalities—coding, documenting a narrative, and observing execution results (Kross and Guo, 2019)—that are central to data science work (Kery et al., 2018). Moreover, the ability to easily switch between code and output cells allows data scientists to quickly iterate on their model-crafting and testing steps (Kery et al., 2018; Muller et al., 2019b).
However, only a few recent works have started to look at designing specific collaborative features to support data science teams beyond the individual data scientist’s perspective (Wang et al., 2019a, 2020; Rule et al., 2018; Chang et al., 2018; Muller and Wang, 2018). For example, Jupyter Notebook’s narrative cell feature is designed to allow data scientists to leave human-readable annotations so that when another data scientist re-uses the code, they can better understand it. However, Rule et al. found a very low usage of these narrative cells (markdown cells) among a million Jupyter notebooks that they sampled from GitHub (Rule et al., 2018). Data scientists were not writing their notebooks with a future collaborator or re-user in mind.
Wang and colleagues at the University of Michigan published a series of papers that are closest to our eventual research goal. Their 2019 study (Wang et al., 2019a) asks whether and how data scientists would use a new Jupyter Notebook feature that allows multiple data scientists to write code synchronously (e.g., as many people do in Google Docs today (Olson et al., 2017)). They implemented an experimental feature on top of Jupyter Notebook as a prototype, and they observed 12 pairs of data scientists using it. They found that the proposed feature can encourage more exploration and reduce communication costs, while also promoting unbalanced participation and slacker behaviors.
In their 2020 paper (Wang et al., 2020), Wang et al. took up a related challenge, namely the documentation of informal conversations and decisions that take place during data science projects. Following an extensive analysis of chat messages related to such projects, they built Callisto, an integration of synchronous chat with a Jupyter notebook (similar in spirit to (Chang et al., 2018; Muller and Wang, 2018)). In tests with 33 data science practitioners, Wang et al. showed the importance of automatic computation of the reference point-in-code, to anchor that chat discussion in the code. Users of a version of Callisto that supported this feature were able to achieve significantly better performance when trying to understand previously written code and previous coding decisions.
In sum, almost all of the proposed tools and features in data science focus only on the technical users’ scenarios (e.g., data scientists and data engineers), such as how to better understand and wrangle data (Heer et al., 2007; Dang et al., 2018), or how to better write and share code (Wang et al., 2019a, 2020; Rule et al., 2018; Kery et al., 2019). In this work, we want to present an account that covers both the technical roles and the non-technical roles of a professional data science team in corporations, so that we can better propose design suggestions from a multi-disciplinary perspective.
Participants were a self-selected convenience sample of employees in IBM who read or contributed to Slack channels about data science (e.g., channel-names such as “deeplearning”, “data-science-at-ibm”, “ibm-nlp”, and similar). Participants worked in diverse roles in research, engineering, health sciences, management, and related line-of-business organizations.
We estimate that the Slack channels were read by approximately 1000 employees. Thus, the 183 people who provided data constituted a participation rate of roughly 18%.
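As a quick check of the arithmetic, the exact ratio of the two figures above can be computed directly. The readership of 1000 is itself an estimate, so the result should be read as approximate; the variable names are ours.

```python
# Figures taken from the text; the readership is the authors' estimate.
respondents = 183
estimated_readers = 1000

participation_rate = respondents / estimated_readers
print(f"Participation rate: {participation_rate:.1%}")  # prints "Participation rate: 18.3%"
```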
Participants had the option to complete the survey anonymously. Therefore, our knowledge of the participants is derived from their responses to survey items about their roles on data science teams (Figure 2).
3.2. Survey Questions
The survey asked participants to describe a recent data science project, focusing on collaborations (if any), the roles and skills among the data science team (if appropriate), and the role of collaborators at different stages of the data science workflow. Next, we asked open-ended questions about the tools participants used to collaborate, including at different workflow stages. (Open-text responses were coded by two of the authors. We agreed on a set of coding guidelines in advance, and we resolved any disagreements through discussion.) Finally, we asked participants to describe their collaborative practices around sharing and re-using code and data, including their expectations around their own work and their documentation practices.
3.3. Survey Distribution
We posted requests to participate in relevant IBM internal Slack channels during January 2019. Responses began to arrive in January. We wrote 2–4 reminder posts, depending on the size and activity of each Slack channel. We collected the last response on 3 April 2019.
The 183 people who responded to the anonymous survey described themselves as having varied experience in data science, primarily 0–5 years (Figure 2A). Some respondents evidently saw connections between contemporary data science and earlier projects involving statistical modeling, which would explain the longer reported years of experience. Respondents worked primarily in smaller teams of six people or fewer (Figure 2B). A few appeared to have solo data science practices.
Respondents reported that they often acted in multiple roles in their teams, and this may be due to the fact that most of them have a relatively small team. Figure 2C is a heatmap showing the number of times in our survey a respondent stated they acted in both roles out of a possible pair (with the two roles defined by a cell’s position along the x-axis and y-axis). For cells along the diagonal, we report the number of respondents who stated they only performed that one role and no other. As can be seen, this was relatively rare, except in the case of the Engineer/Analyst/Programmer role.
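The counting rule behind such a role-overlap heatmap can be sketched as follows. This is a minimal illustration under our own assumptions, not the analysis code used for Figure 2C: the function name `role_overlap` and the sample responses are hypothetical, and each response is assumed to be the set of roles a respondent selected.

```python
from collections import Counter
from itertools import combinations


def role_overlap(responses):
    """Count role co-occurrences across respondents.

    Off-diagonal cells (a, b) count respondents who reported both roles.
    Diagonal cells (a, a) count respondents who reported only role a.
    """
    counts = Counter()
    for roles in responses:
        roles = set(roles)
        if len(roles) == 1:
            (only,) = roles
            counts[(only, only)] += 1
        # Record each unordered pair in both directions so the matrix is symmetric.
        for a, b in combinations(sorted(roles), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1
    return counts


# Hypothetical responses, for illustration only.
responses = [
    {"Engineer/Analyst/Programmer"},
    {"Engineer/Analyst/Programmer", "Researcher/Scientist"},
    {"Researcher/Scientist", "Domain Expert", "Communicator"},
]
overlap = role_overlap(responses)
```

Because a `Counter` returns zero for missing keys, role pairs that no respondent reported (for example, Manager/Executive with Communicator in the sample above) read as empty cells without special handling.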
Unsurprisingly, there was considerable role-overlap among Engineers/Analysts/Programmers and Researchers/Scientists (i.e., the technical roles). These two roles also served—to a lesser extent—in the roles of Communicators and Domain Experts.
By contrast, people in the role of Manager/Executive reported little overlap with other roles. From the roles-overlap heatmap of Figure 2C, it appears that functional leadership (i.e., working in multiple roles) occurred primarily in technical roles (Engineer/Analyst/Programmer and Researcher/Scientist). These patterns may reflect IBM’s culture of defining managers as people-managers, rather than as technical team leaders.
4.1. Do Data Science Workers Collaborate?
Figure 3 shows patterns of self-reported collaborations across different roles in data science projects. First, we begin answering one of the overall research questions: What is the extent of collaboration on data science teams?
Table 1. Percent of respondents reporting collaboration, by role.
4.1.1. Rates of Collaboration
The data behind Figure 3 allow us to see the extent of collaboration for each self-reported role among the data science workers (Table 1). Among the five data science roles of Figure 3, three roles reported collaboration at rates of 95% or higher. The lowest collaboration-rate was among Domain Experts, who collectively reported a collaboration percentage of 87%. In IBM, data science is practiced in a collaborative manner, with very high reported percentages of collaboration. In the following subsections, we explore the patterns and supports for these collaborations.
4.1.2. Who Collaborates with Whom?
The stacked bar chart to the left in Figure 3 reflects the raw numbers of people in each role who responded to our survey and stated that they collaborated with another role. The heatmap to the right of Figure 3 shows a similar view of the collaboration relationship—with whom they believe they collaborate—as the chart on the left, except that the cells are now normalized by the total volume in each column. The columns (and the horizontal axis) represent the reporter of a collaborative relationship. The rows (and the vertical axis) represent the collaboration partner who is mentioned by the reporter at the base of each column. Lighter colors in the heatmap indicate more frequently-reported collaboration partnerships.
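The column normalization described above can be sketched as follows. This is an illustrative sketch only: the function name `normalize_columns` and the counts are hypothetical, and we assume the raw data are stored as `counts[partner][reporter]` to match the row (partner) and column (reporter) layout of the heatmap.

```python
def normalize_columns(counts):
    """Divide each cell by its column total.

    counts[partner][reporter] is the number of reporters in a role who
    named that partner role. The result gives, for each reporter column,
    the share of its reports that went to each partner row.
    """
    reporters = {r for row in counts.values() for r in row}
    totals = {r: sum(row.get(r, 0) for row in counts.values()) for r in reporters}
    return {
        partner: {
            r: (row.get(r, 0) / totals[r] if totals[r] else 0.0)
            for r in reporters
        }
        for partner, row in counts.items()
    }


# Hypothetical raw counts, for illustration only.
counts = {
    "Manager/Executive": {"Communicator": 6, "Manager/Executive": 3},
    "Communicator": {"Communicator": 4, "Manager/Executive": 1},
}
normalized = normalize_columns(counts)
```

After normalization every column sums to 1, so columns for roles with few respondents can be compared with columns for roles with many.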
When we examine a square heatmap with reciprocal rows and columns, we may look for asymmetries around the major diagonal. For each pair of roles (A and B), do the informants report a similar proportion of collaboration in each direction—i.e., does A report about the same level of collaboration with B, as B reports about A?
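One way to make this comparison concrete is to subtract each cell from its mirror image across the diagonal. The sketch below assumes a column-normalized matrix stored as `norm[partner][reporter]`; the function name `reporting_asymmetry` and the values are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def reporting_asymmetry(norm):
    """Compare mirrored cells of a square role-by-role matrix.

    norm[partner][reporter] is the proportion of the reporter's reports
    that name the partner. For each pair (a, b) the result is
    (a's report about b) minus (b's report about a); values far from
    zero indicate that the two roles disagree about the relationship.
    """
    roles = sorted(norm)
    return {(a, b): norm[b][a] - norm[a][b] for a, b in combinations(roles, 2)}


# Hypothetical normalized proportions, for illustration only.
norm = {
    "Communicator": {"Communicator": 0.4, "Manager/Executive": 0.25},
    "Manager/Executive": {"Communicator": 0.6, "Manager/Executive": 0.75},
}
asymmetry = reporting_asymmetry(norm)
```

In this made-up example, the positive value for the pair (Communicator, Manager/Executive) means that Communicators name Managers/Executives more often than Managers/Executives name Communicators.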
Surprisingly, we see a disagreement about collaborations in relation to the role of Communicator. Communicators report strong collaborations with Managers and with Domain Experts, as shown in the Communicator column of Figure 3. However, these reports are not fully reciprocated by those collaboration partners. As shown in the row for Communicators, most roles reported little collaboration with Communicators relative to the other roles. A particularly striking difference is that Communicators report (in their column) relatively strong collaboration with Managers/Executives, but the Managers/Executives (in their own column) report the least collaboration with Communicators. There is a similar, less severe, asymmetry between Communicators and Domain Experts. We interpret these findings later, in the Discussion (Section 5.2.3).
4.1.3. Are there “Hub” Collaborator Roles?
Are certain roles dominant in the collaboration network of Figure 3? Figure 4 shows the reports of collaboration from Figure 3 as a network graph. Each report of collaboration takes the form of a directed arc from one role to another. An arc A→B can be interpreted as “A reports collaboration with B.” The thickness of each arc represents the proportion of people who report each directed type of collaboration. To avoid distortions due to different numbers of people reporting in each role, we normalized the width of each arc as the number of reported collaborations divided by the number of people reporting from that role. Self-arcs represent cases in which the two collaborators were in the same role (e.g., an engineer who reports collaborating with another engineer).
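The arc-width normalization can be written out as a short sketch. The function name `arc_widths`, the data layout, and all numbers are hypothetical; only the rule itself (reported collaborations divided by the number of people reporting from that role) comes from the description above.

```python
def arc_widths(reports, respondents_per_role):
    """Normalize directed collaboration counts by the size of the reporting role.

    reports[(reporter, partner)] is the number of people in the reporter
    role who said they collaborate with the partner role. Self-arcs use
    the same role for reporter and partner.
    """
    return {
        (reporter, partner): n / respondents_per_role[reporter]
        for (reporter, partner), n in reports.items()
        if respondents_per_role.get(reporter)  # skip roles with no respondents
    }


# Hypothetical counts, for illustration only.
reports = {
    ("Communicator", "Manager/Executive"): 8,
    ("Manager/Executive", "Communicator"): 5,
    ("Engineer/Analyst/Programmer", "Engineer/Analyst/Programmer"): 30,
}
respondents_per_role = {
    "Communicator": 10,
    "Manager/Executive": 25,
    "Engineer/Analyst/Programmer": 40,
}
widths = arc_widths(reports, respondents_per_role)
```

With these made-up numbers, the arc from Communicator to Manager/Executive (8 of 10) is four times as wide as the reverse arc (5 of 25), even though the raw counts are of similar size.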
With one exception, this view shows relatively egalitarian strengths of role-to-role collaboration. While we might expect to find Managers/Executives as the dominant or “hub” role, their collaborative relations are generally similar to those of Engineers and Researchers. Domain Experts are only slightly less engaged in collaborations.
The exception occurs, as noted above, in relation to Communicators. Communicators in this graph clearly believe that they are collaborating strongly with other roles (thick arrows), but the other roles report less collaboration toward Communicators (thin arrows).
The self-loop arrows are also suggestive. These arrows appear to show strong intra-role collaborations among Engineers, Researchers, and Communicators. By contrast, Managers/Executives and Domain Experts appear to collaborate less with other members of their own roles.
4.2. Collaborator Roles in Different Stages of the Data Science Workflow
As reviewed above in relation to Figure 1, data science projects are often thought to follow a series of steps or stages—even if these sequences serve more as mental models than as guides to daily practice (Muller et al., 2019b; Passi and Jackson, 2017). We now consider how the roles of data science workers interact with those stages.
Figure 5 shows the relative participation of each role as a collaborator in the stages of a data science workflow. As motivated in the Related Work section, in this paper we adopted a six-step view of a reference-model data science workflow, beginning with creating a measurement plan (Pine and Liboiron, 2015), and moving through technical stages to an eventual stage of delivering an analysis, model, or working system. Some organizations also perform a check for bias and/or discrimination during the technical development (Bellamy et al., 2019). However, because that step is not yet accepted by all data science projects and may happen at different stages, we have listed that step separately at the end of the horizontal axis of the stacked bar chart in Figure 5.
4.2.1. Where do Non-Technical Roles Work?
The data for Figure 5 show highly significant differences from one column to the next column (p < .001).
Through a close examination of Figure 5, we found that the degree of involvement by Managers/Executives and by Communicators is roughly synchronized—despite their seeming lack of collaboration patterns as seen in Figures 3 and 4. Each of these roles is relatively strongly engaged in the first stage (measurement plan) and the last two stages (evaluate, communicate), but largely absent from the technical work stages (access data, features, model). Perhaps each of these roles is engaged with relatively humanistic aspects of the work, but with different and perhaps unconnected humanistic aspects for each of their distinct roles.
4.2.2. Where do Domain Experts Work?
The involvement of Domain Experts is similar to that of Managers and Communicators, but to a lesser extent. Domain experts are active at every stage, in contrast to Communicators, who appear to drop out during the modeling stage. Domain experts are also more engaged (by self-report) during stages in which Managers have very little engagement. Thus, it appears that Domain Experts are either directly involved in the core technical work, or are strongly engaged in consultation during data-centric activities such as data-access and feature-extraction. They take on even more prominent roles during later stages of evaluating and communicating.
4.2.3. Where do Technical Roles Work?
There is an opposite pattern of engagement for the core technical work, done by Engineers/Analysts/Programmers, who are most active while the Managers and Communicators are less involved.
The degree of involvement by Researchers/Scientists seems to be relatively stable and strongly engaged across all stages. This finding may clarify the “hub” results of Figure 4, which suggested relatively egalitarian collaboration relations. Figure 5 suggests that perhaps Researchers/Scientists actively guide the project through all of its stages.
4.2.4. Who Checks AI Fairness and Bias?
The stage of assessment and mitigation of bias appears to be treated largely as a technical matter. Communicators and Managers have minimal involvement. Unsurprisingly, Domain Experts play a role in this stage, presumably because they know more about how bias may creep into work in their own domains.
From the analyses in this section, we begin to see data science work as a convergence of several analytic dimensions: people in roles, roles in collaboration, and roles in a sequence of project activities. The next section adds a fourth dimension, namely the tools used by data scientists.
|Tool Category||Tools Mentioned by Respondents (number of times mentioned)|
|asynchronous discussion||Slack (86), email (55), Microsoft Teams (1)|
|synchronous discussion||meeting (13), e-meeting (12), phone (1)|
|project management||Jira (8), ZenHub (2), Trello (1)|
|code management||GitHub (56), Git (5)|
|code||Python (42), R (9), Java (3), scripts (3)|
|code editor||Visual Studio Code (11), PyCharm (11), RStudio (8), Eclipse (1), Atom (1)|
|interactive code environment||Jupyter Notebook (66), SQL (6), terminal (4), Google Colab (4)|
|software package||Scikit-learn (3), Shiny App (2), Pandas (2)|
|analytics/visualization||SPSS (27), Watson Analytics (22), Cognos (7), ElasticSearch (4), Apache Spark (3), Graphana (2), Tableau (2), Logstash (2), Kibana (1)|
|spreadsheet||Microsoft Excel (22), spreadsheets (3), Google Sheets (1)|
|document editing||wiki (2), LaTeX (2), Microsoft Word (2), Dropbox Paper (2), Google Docs (1)|
|filesharing||Box (43), cloud (5), NFS (2), Dropbox (1), Filezilla (1)|
|presentation software||Microsoft Powerpoint (18), Prezi (1)|
Note: code allows programmers to write algorithms for data science. code editor and interactive code environment provide a user experience for writing that code. code management is where the code may be stored and shared. By contrast, analytics/visualization provides “macro-level” tools that can invoke entire steps or modular actions in a data science pipeline.
4.3. Tooling for Collaboration
We asked respondents to describe the tools that they used in the stages of a data science project (i.e., the same stages as in the preceding section). We provided free-text fields for them to list their tools, so that we could capture the range of tools used. We then collaboratively converted the free-text responses into sets of tools for each response, before iteratively classifying the greater set of tools from all responses into 13 higher-level categories, as shown in Table 2. [Footnote 2: We discussed the classification scheme repeatedly until we were in agreement about which tool fit into each category. We postponed all statistical analyses until we had completed our social process of classification.]
When we examined the pattern of tool usage across project stages (Figure 6), we found highly significant differences across the project stages (p < .001). As above, we summarize trends that suggest interesting properties of data science collaboration:
4.3.1. Coding and Discussing
The use of coding resources was as anticipated. Coding resources were used during intense work with data, and during intense work with models. Code may serve as a form of asynchronous discussion (e.g., (Brothers et al., 1990)): Respondents tended to decrease their use of asynchronous discussion during project stages in which they made relatively heavier use of coding resources.
4.3.2. Documentation of Work
We were interested to see whether and how respondents documented their work. Respondents reported some document-editing during the activities leading to a measurement plan. There was also a small use of presentation software, which can of course serve as a form of documentation. [Footnote 3: We consider the use of spreadsheets to be ambiguous in terms of documentation. Spreadsheets function both as records and as discardable scratch-pads and sandboxes. Baker et al. summarize the arguments for treating spreadsheets not as documentation, but rather as tools that are in need of external documentation (e.g., (Davis, 1996)), which is often lacking (Baker et al., 2006).] The use of these tools returned during the stage of delivery to clients.
4.3.3. Gaps in Documentation for Feature Engineering
In contrast, we were surprised that there was little use of documents during the phase of feature-extraction and feature-engineering. This stage is an important site for the design of data (Feinberg, 2017). The meaning and nature of the data may be changed (Feinberg, 2017; Muller et al., 2019b) during this time-consuming step (Zöller and Huber, 2019; Muller et al., 2019b; Guo et al., 2011; Kandel et al., 2011; Rattenbury et al., 2017). During this phase, the use of synchronous discussion tools dropped to nearly zero, and the use of asynchronous discussion tools was relatively low. There was relatively little use of filesharing tools. It appears that these teams were not explicitly recording their decisions. Thus, important human decisions may be inscribed into the data and the data science pipeline, while simultaneously becoming invisible (Muller et al., 2019b; Pine and Liboiron, 2015). The implications for subsequent re-analysis and revision may be severe.
4.3.4. Gaps in Documentation for Bias Mitigation
We were similarly surprised that the stage of bias detection and mitigation also seemed to lack documentation, except perhaps through filesharing. We anticipate that organizations will begin to require documentation of bias mitigation as bias issues become more important.
In Section 4.1, we showed that data science workers engage in extensive collaboration. Then in Section 4.2 we showed that collaboration is pervasive across all stages of data science work, and that members of data science teams are intensely involved in those collaborative activities. By contrast, this section shows gaps in the usage of collaborative tools. We propose that a new generation of data science tools should be created with collaboration “baked in” to the design.
4.4. Collaborative Practices around Code and Data
Finally, we sought to understand how tool usage by a respondent relates to their practices around code reading, re-use, and documentation, as well as data sharing, re-use, and documentation. Particularly, if technical team members must collaborate with non-technical team members, then tools and practices to support documentation will be key.
To begin, we clustered the survey respondents into different clusters according to their self-reported tool usage. To create a “tools profile” for each respondent, we used the questions regarding tool usage, described in Section 4.3, and summed up all the mentions of each tool from all the open-ended questions on tool usage. Thus, if a respondent answered only “GitHub” for all 7 stages of their data science project, then they would have a count of 7 under the tool “GitHub” and a count of 0 elsewhere.
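The construction of a “tools profile” described above can be sketched as follows. This is a minimal illustration, not the authors’ code: the function names `build_tools_profile` and `to_vector`, the input layout (one list of tool mentions per project stage), and the short vocabulary are all assumptions made for the example.

```python
from collections import Counter


def build_tools_profile(stage_answers):
    """Sum tool mentions across all stage questions for one respondent.

    stage_answers is a hypothetical layout: one list of tool names per
    project stage, already extracted from the free-text responses.
    """
    profile = Counter()
    for mentions in stage_answers:
        profile.update(mentions)
    return profile


def to_vector(profile, vocabulary):
    """Turn a profile into a fixed-order count vector (0 for unmentioned tools)."""
    return [profile.get(tool, 0) for tool in vocabulary]


# The example from the text: a respondent who answers only "GitHub"
# for all 7 stages of their data science project.
github_only = [["GitHub"] for _ in range(7)]
profile = build_tools_profile(github_only)

# Illustrative vocabulary; the survey's full tool list is much longer.
vocabulary = ["GitHub", "Slack", "Jupyter Notebook"]
vector = to_vector(profile, vocabulary)
print(vector)  # [7, 0, 0]
```

Each respondent then becomes one row of counts, which is the form a clustering algorithm expects as input.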
Using the k-means clustering algorithm in the Scikit-learn Python library, we found that k=3 clusters resulted in the highest average silhouette coefficient of 0.254. This resulted in the clusters described in Table 3. We only included respondents who had mentioned at least one tool across all the tool usage questions; as the questions were optional, and we experienced some dropout partway through the survey, we had 76 respondents to cluster.
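The selection of k by average silhouette coefficient can be sketched with Scikit-learn as shown here. The data are hypothetical mention counts invented for illustration (they are not the survey data), and the helper name `pick_k`, the candidate range of k, and the `n_init` and `random_state` settings are assumptions; the paper does not report these details.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical tools profiles, one row per respondent.
# Columns: GitHub, Slack, Jupyter Notebook, Python, SPSS.
profiles = np.array([
    [7, 6, 0, 0, 0], [6, 7, 0, 1, 0], [7, 5, 1, 0, 0], [5, 6, 0, 0, 0],
    [1, 0, 7, 0, 0], [0, 1, 6, 1, 0], [1, 1, 7, 0, 0], [0, 0, 5, 0, 0],
    [0, 0, 0, 6, 5], [1, 0, 0, 5, 6], [0, 1, 1, 7, 4], [0, 0, 0, 4, 6],
])


def pick_k(X, candidate_ks, seed=0):
    """Fit k-means for each candidate k and keep the mean silhouette score."""
    scores = {}
    for k in candidate_ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed)
        labels = model.fit_predict(X)
        scores[k] = silhouette_score(X, labels)
    # Choose the k whose clustering has the highest average silhouette.
    best_k = max(scores, key=scores.get)
    return best_k, scores


best_k, scores = pick_k(profiles, range(2, 6))
for k, score in scores.items():
    print(f"k={k}: mean silhouette = {score:.3f}")
print("selected k:", best_k)
```

A silhouette coefficient near 1 indicates compact, well-separated clusters, while values near 0 indicate overlapping ones, so the reported value of 0.254 suggests that the three tool profiles are distinguishable but not sharply separated.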
We saw that the respondents in the first cluster (Cluster 0) mentioned using both GitHub and Slack at multiple points in their data science workflow, as well as email and Box to a lesser extent. Given these tools’ features for project management, including code management, issue tracking, and team coordination, we characterize this cluster of respondents as project managed. In contrast, respondents in Cluster 1 mentioned using Jupyter Notebook repeatedly, and only occasionally mentioned other tools; thus we designate the cluster’s respondents as using interactive tools due to Jupyter Notebook’s interactive coding environment. Finally, Cluster 2 had the most respondents and a longer tail of mentioned tools. However, the tools most mentioned were Python and SPSS; thus, we characterize this cluster of respondents as using scripted tools. We also noticed that Cluster 2 was predominately made up of self-reported Engineers/Analysts/Programmers at 80%. Meanwhile, Researchers/Scientists had the greatest prevalence in Cluster 0 and Cluster 1, with 84.2% and 84.6%, respectively.
|Respondent Clusters||Number of People Per Cluster||Tools Frequently Mentioned (number of times mentioned across questions)|
|0 (project managed)||19||GitHub (86), Slack (79), email (47), Box (26)|
|1 (interactive)||13||Jupyter Notebook (82), GitHub (44), Slack (22)|
|2 (scripted)||44||Python (50), SPSS (44), GitHub (27), Jupyter notebook (27), Slack (24)|
4.4.1. Reading and Re-using Others’ Code and Data
In Figure 7, we report the answers in the affirmative to questions asking respondents whether they read other people’s code and data and re-used other people’s code and data, separated and normalized by cluster. One finding that stood out is the overall lower levels of collaborative practices around data as opposed to code. This was observed across all three clusters of tool profiles, despite the ability in some tools, such as GitHub, to store and share data.
When comparing across the stages of planning, coding, and testing of code, there were few noticeable differences between clusters except in the stage of testing code. Here, we saw that Clusters 1 (interactive) and 2 (scripted) had relatively fewer respondents reading others’ code in the testing phase (and Cluster 2 had few respondents re-using others’ code in the testing phase). It may be that in an interactive notebook or scripting environment, there is relatively less testing, in contrast to the practice of writing unit tests in larger software projects, and as a result, a relatively lower need for alignment with others’ code when it comes to testing. We also saw that Cluster 0 (project-managed) had no respondents that did not read other people’s code or did not re-use other people’s code, which suggests that workers in this cluster are coordinating their code with others, using tools like GitHub and Slack.
4.4.2. Expectations Around One’s Own Code and Data Being Re-used
|Expectation||Cluster 0 (Project managed)||Cluster 1 (Interactive)||Cluster 2 (Scripted)||All|
|Expect that their code will be re-used||68.4%||84.6%||80.9%||78.8%|
|Expect that their data will be re-used||73.7%||46.1%||50%||59.6%|
In Table 4, we report respondents’ expectations for how their own code or data will be used by others. Respondents were more likely to state that they expected others to re-use their code as opposed to their data. In the case of code re-use, people’s expectations were slightly lower for respondents in Cluster 0 and slightly higher for respondents in the other clusters, though this was not significant. We also saw that the expectation that data would be re-used was more prevalent in Cluster 0 while relatively low in Clusters 1 and 2. This may be because the native features for displaying structured data or coordinating the sharing of data, such as version control, are more rudimentary within tools like Jupyter Notebook, although a few recent works have developed prototypes examined in a lab environment (e.g., (Kery et al., 2019; Wang et al., 2019a, 2020)).
|Documentation Practice||Cluster 0 (Project managed)||Cluster 1 (Interactive)||Cluster 2 (Scripted)|
|Longer blocks of comments in the code||68.4%||30.8%||38.1%|
|Markdown cells in notebooks||63.2%||92.3%||38.1%|
|Extra text in JSON schema (or similar)||27.8%||18.2%||15%|
4.4.3. Code and Data Documentation Practices
In Table 5, we show respondents’ answers to how they document their code as well as their data, broken down by cluster. Overall, we see higher rates of self-reported code documentation in Clusters 0 (project managed) and 1 (interactive) compared to Cluster 2 (scripted). For instance, 100% of members of Cluster 0 said they used in-line comments to document their code. Cluster 2 also had high rates of using in-line comments, though other practices were infrequently used. Unsurprisingly, the use of markdown cells in notebooks was most prevalent in Cluster 1 (interactive), while longer blocks of comments in the code were least used there (30.8%), likely because markdown cells perform that function. A lack of code documentation besides in-line comments in Cluster 2 suggests that it may be more difficult for collaborators to work with code written by members of this cluster. We note that this is not due to a low expectation within Cluster 2 that code would be re-used.
We found that data science workers overall performed less documentation when it comes to data as opposed to code, perhaps due to their perceptions around re-use. Even something basic like adding column labels to data sets was not performed by a third to a quarter of members of each cluster, as shown in Table 5. Instead, the most prevalent practice within any of the clusters was the use of external documents by Cluster 0 (project managed) at 77.8%. While external documents allow data set curators to add extensive documentation about their data, one major downside is that they are uncoupled—there is little ability to directly reference, link to, or display annotations on top of the data itself. This may lead to issues where the documentation can be lost, not noticed by a collaborator, or misunderstood out of context.
In this section, we examined the collaborative practices of data science workers in relation to the kinds of tools they use. Through clustering, we identified three main “tools profiles”. The first makes heavy use of GitHub and Slack and is relatively active in reading other people’s code, re-using other people’s code, expecting that others would use one’s code and data, and documenting code. Out of all the three clusters, workers using this tool profile seem to have the healthiest collaborative practices. However, even this cluster has relatively low rates of collaboration and documentation around data.
The second cluster primarily uses Jupyter Notebook for data science work. While people in this cluster were generally active in code collaboration and code documentation, we notice a lower rate of reading others’ code while testing one’s code as well as a low expectation that one’s data would be re-used.
The third cluster had a greater variety of tool usage but more emphasis on writing scripts in Python or SPSS. This cluster had low rates of code documentation outside of in-line comments, signaling potential difficulties for non-technical collaborators.
We began our Results by asking “Do data science workers collaborate?” The answer from this survey dataset is clearly “yes.” These results are in agreement with (Grappiolo et al., 2019; Hou and Wang, 2017; Passi and Jackson, 2017; Stein et al., 2017; Borgman et al., 2012; Mao et al., 2020; Viaene, 2013; Wang et al., 2019b). In this paper, we provide a greater depth of collaboration information by exploring the interactions of team roles, tools, project stages, and documentation practices. One of our strong findings is that people in most roles report extensive collaboration during each stage of a data science project. These findings suggest new needs among data science teams and communities, and encourage us to think about a new generation of “collaboration-friendly” data science tools and environments.
5.1. Possible Collaborative Features
5.1.1. Provenance of data
We stated a concern earlier that there seemed to be insufficient use of documentation during multiple stages of data science projects and fewer practices of documentation for data as opposed to code. Partly this may be due to a lack of expectations that one’s data will ever be re-used by another. In addition, there are now mature tools for collaborating on code due to over a decade of research and practice on this topic in the field of software engineering (Treude et al., 2009; Dabbish et al., 2012); however, fewer tools exist for data, and those that exist are not yet widely adopted. This absence of documentation may obscure the source of datasets as well as computations performed over datasets in the steps involving data cleaning or transformation. The problem can be compounded if there is a need to combine datasets for richer records. When teams of data science workers share data, then the knowledge of one person may be obscured, and the organizational knowledge of one team may not be passed along to a second team. Thus, there is a need for a method to record data provenance. Embedding this information within the data themselves would be a better outcome than recording it in external documents. As one example, the DataHub project (Bhardwaj et al., 2014) replicates GitHub-like features of version control and provenance management but for datasets. In a similar vein, the ModelDB project provides version control and captures metadata about machine learning models over the course of their development (Vartak et al., 2016).
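One way to picture provenance that travels with the data, rather than living in an external document, is the following minimal sketch. It is not the DataHub or ModelDB API; the dictionary layout, the field names, and the helper `add_provenance_step` are hypothetical choices made only to illustrate the idea.

```python
import hashlib
import json
from datetime import datetime, timezone


def add_provenance_step(dataset, description, author):
    """Return a new dataset dict with one more provenance entry appended.

    The dataset layout ({"rows": [...], "provenance": [...]}) is a
    hypothetical format chosen for this sketch.
    """
    rows_json = json.dumps(dataset["rows"], sort_keys=True)
    entry = {
        "description": description,
        "author": author,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        # A content hash lets a later reader check that the rows still
        # match the state that this entry describes.
        "rows_sha256": hashlib.sha256(rows_json.encode("utf-8")).hexdigest(),
    }
    return {
        "rows": dataset["rows"],
        "provenance": dataset.get("provenance", []) + [entry],
    }


# Illustrative data: record where the rows came from, then a cleaning step.
raw = add_provenance_step(
    {"rows": [{"age": 34, "income": 52000}, {"age": None, "income": 61000}]},
    description="Loaded from a hypothetical customer export",
    author="analyst_a",
)
cleaned_rows = [row for row in raw["rows"] if row["age"] is not None]
cleaned = add_provenance_step(
    {"rows": cleaned_rows, "provenance": raw["provenance"]},
    description="Dropped rows with missing age",
    author="analyst_b",
)

print(json.dumps(cleaned["provenance"], indent=2))
```

Because the history is stored in the same object as the rows, a second team that receives the file also receives the record of who transformed the data and why.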
5.1.2. Provenance of code
There also remain subtle issues in the provenance of code. At first, it seems as if the reliance of data science on code packages and code libraries should obviate any need for documentation of code in data science. However, in Section 4.3.3, we discussed the invisibility of much of the work on feature extraction and feature engineering. The code for these activities is generally not based on a well-known and well-maintained software package or product. If this code becomes lost, the important knowledge about the nature and meaning of the features (Muller et al., 2019a, b) may also be lost.
However, lack of motivation to document lower-level decision-making may be a limiting factor towards stronger documentation practices, particularly in an “exploration” mindset. In the software engineering realm, tools have been proposed to support more lightweight ways for programmers to externalize their thought processes, such as social tagging of code (Storey et al., 2006) or clipping rationales from the web (Liu et al., 2019a). Other tools embed and automatically capture context while programmers are foraging for information to guide decisions, such as search (Brandt et al., 2010) and browsing history (Hartmann et al., 2011; Fourney and Morris, 2013). Similar ideas could be applied in the case of data science, where users may be weighing the use of different code libraries or statistical methods. Other decisions such as around feature engineering may result from conversations between team members (Park et al., 2018) that then could be linked in the code.
In addition, we notice a drop-off in collaborative code practices when it came to testing already-written code. This has important implications for developing standards around testing for data and model issues of bias, which will only be more important in years to come. Thus, preserving the provenance of code may also be important to keep the data processing steps transparent and accountable.
More broadly, data science projects may inadvertently involve many assumptions, improvisations, and hidden decisions. Some of these undocumented commitments may arise through the assumption that everyone on the team shares certain knowledge—but what about the next team that “inherits” code or data from a prior project? As we just noted, this kind of transparent transmission of knowledge may be important with regard to the design of features (Feinberg, 2017; Muller et al., 2019b). It can also be important for the earlier step of establishing a data management plan, which may define, in part, what qualifies as data in this project (Pine and Liboiron, 2015).
We advocate making invisible activities more visible, and thus discussable and (when necessary) debatable and accountable. This argument for transparency is related to but distinct from the ongoing “Explainable AI” (XAI) initiative: XAI emphasizes using various techniques (e.g., visualization (Weidele et al., 2020)) and designs to make machine learning algorithms understandable by non-technical users (Amershi et al., 2019; Drozdal et al., 2020; Heer, 2019; Liao et al., 2020; Zhang et al., 2020), whereas we argue for the explanation of decisions among the various data science creators of machine learning algorithms. Recent work in this space similarly argues for more documentation to improve transparency, as well as greater standardization around documentation (Mitchell et al., 2019; Gebru et al., 2018), which is particularly important when it comes to publicly-released datasets and models.
5.2. Collaborating with Whom? and When?
These concerns for provenance and transparency may be important to multiple stakeholders. Team members are obvious beneficiaries of good record-keeping. In the language of value sensitive design (Friedman et al., 2013), team members are direct stakeholders—i.e., people who directly interact with data science tools in general, and the project’s code in particular. Again using concepts from value sensitive design, there are likely to be multiple indirect stakeholders—i.e., people who are affected by the data science system, or by its code, or by its data.
5.2.1. Other Data Science Teams
5.2.2. Data Science Social Implications
For data science projects that affect bank loans (Bruckner, 2018; O’Neil, 2016), prison sentences (Picard et al., ), or community policing (Verma and Dombrowski, 2018), the public are also indirect stakeholders as they worry about the possibility of inequitable treatment. Finally, another possible beneficiary of provenance and transparency is one’s own future self, who may return to a data science project after a year or two of other engagements, only to discover that the team has been dispersed, personal memories have faded, and the project needs to be learned like any unfamiliar data science resource.
5.2.3. “Imbalanced” Collaboration
In Section 4.1.2, we observed that there is a mismatch around perceived collaborations between different roles. For example, Communicators believed they collaborated a lot with Managers/Executives, but the Managers/Executives perceived that they collaborated the least with Communicators. This result is based on the normalized proportions of the reported collaborations from each role in Figure 3, so it is possible that Managers/Executives collaborate a lot with all other roles and that their collaboration with Communicators simply has the smallest proportion among these collaborations.
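The effect of normalization can be made concrete with a small worked example. The counts are invented for illustration and are not taken from the survey; the point is only that the same raw number of reported collaborations can be the largest share for one role and the smallest share for another.

```python
def normalize(counts):
    """Convert raw collaboration counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {partner: n / total for partner, n in counts.items()}


# Hypothetical raw counts of reported collaborations, per reporting role.
communicator_reports = {
    "Managers/Executives": 8,
    "Engineers/Analysts/Programmers": 4,
    "Researchers/Scientists": 4,
    "Domain Experts": 4,
}
manager_reports = {
    "Communicators": 8,
    "Engineers/Analysts/Programmers": 20,
    "Researchers/Scientists": 20,
    "Domain Experts": 12,
}

communicator_share = normalize(communicator_reports)
manager_share = normalize(manager_reports)

# Both roles report the same raw count (8) for each other, but the
# normalized shares differ: 8/20 = 0.40 versus 8/60, roughly 0.13.
print(round(communicator_share["Managers/Executives"], 2))  # 0.4
print(round(manager_share["Communicators"], 2))             # 0.13
```

In this illustration, Managers/Executives are the top partner for Communicators while Communicators are the bottom partner for Managers/Executives, even though the underlying count is identical in both directions.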
We speculate that the collaborations reported by our informants may have been highly directional. Communicators may have received information from other roles—or may simply have read shared documents or observed meetings—to find the information that they needed to communicate. Their role may have been largely to receive information, and they are likely to have been aware of their dependencies on other members of the team. By contrast, the other roles may have perceived Communicators as relatively passive team members. These other roles may have considered that they themselves received little from the Communicators, and may have down-reported their collaborations accordingly.
Different roles also reported different intra-role collaboration patterns in Section 4.1.3. These patterns suggest that the people in these roles may have different relationships with their own communities of practice (Duguid, 2005; Wenger, 2011). There may be stronger peer communities for each of Engineers, Researchers, and Communicators, and there may be weaker peer communities for each of Managers and Domain Experts. It may be that Domain Experts are focused within their own domain, and may not collaborate much with Domain Experts who work in other domains. It may be that Managers/Executives focus on one data science project at-a-time, and do not consult with their peers about the technical details within each project.
This paper reports a large-scale quantitative survey of data science workers. A future paper should take a more qualitative approach to examine the relations within these types of teams.
5.3. AI Fairness and Bias
The detection, assessment, and mitigation of bias in data science systems is inherently complex and multidisciplinary, involving expertise in prediction and modeling, statistical assessments in particular domains, domain knowledge of an area of possible harms, and aspects of regulations and law. There may also be roles for advocates for particular affected groups, and possibly advocates for commercial parties who favor maintaining the status quo.
In these settings, a data science pipeline becomes an object of contention. Making sense of the data science pipeline requires multiple interpreters from diverse perspectives, including adversarial interpreters (Friedman et al., 2013). All of the issues raised above regarding provenance and transparency are relevant.
Our results in Section 4.2.4 confirmed that activities around AI fairness, including bias detection and mitigation, take place within the data science workflow, and that these activities appear to be treated largely as a technical matter. For example, data scientists and engineers need to be involved in the process because they keep up with the latest algorithms for detecting and fixing bias. Our results also suggest that Domain Experts play a role in the bias detection and mitigation process, presumably because they know more about how bias may creep into work in their own domains.
In addition, we did not see much involvement from Communicators and Managers/Executives. This is surprising, as Communicators and Managers are the ones who may know the most about policy requirements and worry the most about the negative consequences of a biased AI algorithm. We speculate that Managers may become more involved in this stage in the future, as bias issues become more salient in industry and academia (Abiteboul et al., 2016; Garcia, 2016; Hajian et al., 2016).
5.4. Limitations and Future Directions
Our survey respondents were all recruited from IBM, a large, multinational technology company, and their views may not be fully representative of the larger population of professionals working on data science projects.
One example of how our results might be skewed comes from the fact that almost all of our respondents worked in small teams, typically with 5 or 6 collaborators in a team. While this number is consistent with what previous literature reported (e.g., (Wang et al., 2019b) reported 2-3 data scientists in a team, and our work also counts managers, communicators, researchers, and engineers), in other contexts the size of data science teams may vary. Also, because all of these respondents are from the same company, their preferences in selecting tools and in how to use those tools may be dominated by the company culture. The findings may be different if we study data science teams’ collaborative practices in different scenarios, such as in offline data hackathons (Hou and Wang, 2017).
Another limitation is that our findings are based on self-reported data from an online survey. Although this research method can cover a broader user population, survey respondents are known to show bias when answering behavioral questions. We see this as a promising new research direction rather than as a limitation, and we look forward to conducting further studies with Contextual Inquiry (Siek et al., 2014), Participatory Analysis (Muller, 2001), or Value Sensitive Design (Friedman et al., 2013) to observe and track how people actually behave in a data science team collaboration.
We should also note that data science teams may not always appreciate the proposed features that increase transparency and accountability of each team member’s contribution, as they may have negative effects. In the co-editing activities enabled by Google Docs-like features, writers sometimes do not want such a high level of transparency, for various reasons (Wang et al., 2015). Thus, we need further user evaluation of those proposed collaboration-support features before deploying them into the real world.
In this paper, we presented results of a large-scale survey of data science workers at a major corporation that examined how data science workers collaborate. We find that not only do data science workers collaborate extensively, they perform a variety of roles, and work with a variety of stakeholders during different stages of the data science project workflow. We also investigated the tools that data scientists use when collaborating, and how tool usage relates to collaborative practices such as code and data documentation. From this analysis, we present directions for future research and development of data science collaboration tools.
In summary, we hope we have made the following contributions:
- The first large in-depth survey about data science collaborative practices, and the first large study to provide roles-based analyses of collaborations.
- The first large-scale study of data science activities during specific stages of data science projects.
- The first analysis of collaborative tools usage across the stages of data science projects.
- The first large-scale analysis of documentation practices in data science.
Acknowledgements. We thank all the informants for participating in our online survey. This work is generously supported by the MIT-IBM Watson AI Lab under the “Human-in-the-loop Automated Machine Learning” project.
- Data, responsibly (dagstuhl seminar 16291). In Dagstuhl Reports, Vol. 6. Cited by: §5.3.
Human-guided machine learning for fast and accurate network alarm triage.
Twenty-Second International Joint Conference on Artificial Intelligence, Cited by: §2.3.
- Guidelines for human-ai interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 3. Cited by: §2.3, §2.3, §5.1.3.
- Developing a research agenda for human-centered data science. In Proceedings of the 19th ACM Conference on Computer Supported Cooperative Work and Social Computing Companion, pp. 529–535. Cited by: §2.1.
- A survey of mba spreadsheet users. Spreadsheet Engineering Research Project. Tuck School of Business 9. Cited by: footnote 3.
- Effecting change: coordination in large-scale software development. In Proceedings of the 2008 international workshop on Cooperative and human aspects of software engineering, pp. 17–20. Cited by: §2.1.
- AI fairness 360: an extensible toolkit for detecting and mitigating algorithmic bias. IBM Journal of Research and Development 63 (4/5), pp. 4–1. Cited by: §4.2.
- Datahub: collaborative data science & dataset version management at scale. arXiv preprint arXiv:1409.0798. Cited by: §5.1.1.
- Latent social structure in open source projects. In Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering, pp. 24–35. Cited by: §2.1.
- Disempowered by data: nonprofits, social enterprises, and the consequences of data-driven work. In Proceedings of the 2017 CHI conference on human factors in computing systems, pp. 3608–3619. Cited by: §2.2.
- Who’s got the data? interdependencies in science and technology collaborations. Computer Supported Cooperative Work (CSCW) 21 (6), pp. 485–523. Cited by: §2.2, §2.3, §5.
- Example-centric programming: integrating web search into the development environment. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 513–522. Cited by: §5.1.2.
- ICICLE: groupware for code inspection. In Proceedings of the 1990 ACM conference on Computer-supported cooperative work, pp. 169–181. Cited by: §4.3.1.
- The promise and perils of algorithmic lenders’ use of big data. Chi.-Kent L. Rev. 93, pp. 3. Cited by: §5.2.2.
- Deep Blue. Artificial Intelligence 134 (1-2), pp. 57–83. Cited by: §2.1.
- Designing comments. Note: Poster at JupyterCon 2018. Cited by: §2.3, §2.3.
- Social coding in GitHub: transparency and collaboration in an open software repository. In Proceedings of the ACM 2012 conference on computer supported cooperative work, pp. 1277–1286. Cited by: §2.1, §5.1.1.
- Predict saturated thickness using TensorBoard visualization. In Proceedings of the Workshop on Visualisation in Environmental Sciences, pp. 35–39. Cited by: §2.3.
- Tools for spreadsheet auditing. International Journal of Human-Computer Studies 45 (4), pp. 429–442. Cited by: footnote 3.
- How IBM builds an effective data science team. VentureBeat. Cited by: §1.
- Exploring information needs for establishing trust in automated data science systems. In IUI’20, pp. in press. Cited by: §5.1.3.
- “The art of knowing”: social and tacit dimensions of knowledge and the limits of the community of practice. The information society 21 (2), pp. 109–118. Cited by: §5.2.3.
- A design perspective on data. In Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, pp. 2952–2963. Cited by: §4.3.3, §5.1.3, §5.2.1.
- Enhancing technical Q&A forums with CiteHistory. In Seventh International AAAI Conference on Weblogs and Social Media. Cited by: §5.1.2.
- Value sensitive design and information systems. In Early engagement and new technologies: Opening up the laboratory, pp. 55–95. Cited by: §5.2, §5.3, §5.4.
- Racist in the machine: the disturbing implications of algorithmic bias. World Policy Journal 33 (4), pp. 111–117. Cited by: §5.3.
- Datasheets for datasets. arXiv e-prints, pp. arXiv:1803.09010. Cited by: §5.1.3.
- Towards human-guided machine learning. In Proceedings of the 24th International Conference on Intelligent User Interfaces, pp. 614–624. Cited by: §2.3, §2.3.
- (Website) Cited by: §1.
- (Website) Cited by: §2.3.
- JupyterLab: the next generation Jupyter frontend. JupyterCon 2017. Cited by: §1.
- The semantic snake charmer search engine: a tool to facilitate data science in high-tech industry domains. In Proceedings of the 2019 Conference on Human Information Interaction and Retrieval, pp. 355–359. Cited by: §2.2, §2.3, §5.
- AI and HCI: two fields divided by a common focus. AI Magazine 30 (4), pp. 48–48. Cited by: §2.1.
- Proactive wrangling: mixed-initiative end-user programming of data transformation scripts. In Proceedings of the 24th annual ACM symposium on User interface software and technology, pp. 65–74. Cited by: §1, §1, §2.1, §2.1, §2.1, §2.1, §2.2, §4.3.3.
- Algorithmic bias: from discrimination discovery to fairness-aware data mining. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 2125–2126. Cited by: §5.3.
- Designing task visualizations to support the coordination of work in software development. In Proceedings of the 2006 20th anniversary conference on Computer supported cooperative work, pp. 39–48. Cited by: §2.1.
- HyperSource: bridging the gap between source and code-related web sites. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 2207–2210. Cited by: §5.1.2.
- Top 10 challenges to practicing data science at work. Note: http://businessoverbroadway.com/top-10-challengesto-practicing-data-science-at-work Cited by: §1.
- Interactive dynamics for visual analysis. Queue 10 (2), pp. 30. Cited by: §1.
- Voyagers and voyeurs: supporting asynchronous collaborative information visualization. In Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 1029–1038. Cited by: §2.3.
- Agency plus automation: designing artificial intelligence into interactive systems. Proceedings of the National Academy of Sciences 116 (6), pp. 1844–1850. Cited by: §5.1.3.
- An empirical study of global software development: distance and speed. In Proceedings of the 23rd international conference on software engineering, pp. 81–90. Cited by: §2.1, §2.2.
- Hacking with NPOs: collaborative analytics and broker roles in civic data hackathons. Proceedings of the ACM on Human-Computer Interaction 1 (CSCW), pp. 53. Cited by: §1, §1, §2.1, §2.2, §2.2, §2.2, §2.2, §2.3, §5.4, §5.
- Cultural influences and globally distributed information systems development: experiences from Chinese IT professionals. In Proceedings of the 2007 ACM SIGMIS CPR conference on Computer personnel research: The global information technology workforce, pp. 36–45. Cited by: §2.1.
- (Website) Cited by: §2.3.
- (Website) Cited by: §2.3.
- Cited by: §1, §2.1.
- Wrangler: interactive visual specification of data transformation scripts. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 3363–3372. Cited by: §2.1, §4.3.3.
- Towards effective foraging by data scientists to find past analysis choices. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 92. Cited by: §1, §2.3, §2.3, §4.4.2.
- The story in the notebook: exploratory data science using a literate programming tool. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 174. Cited by: §1, §1, §1, §2.1, §2.1, §2.3, §2.3.
- The emerging role of data scientists on software development teams. In Proceedings of the 38th International Conference on Software Engineering, pp. 96–107. Cited by: §1, §1.
- Jupyter notebooks: a publishing format for reproducible computational workflows. In ELPUB, pp. 87–90. Cited by: §1.
- Coordination in software development. Communications of the ACM 38 (3), pp. 69–82. Cited by: §2.1, §2.2.
- Practitioners teaching data science in industry and academia: expectations, workflows, and challenges. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, pp. 263. Cited by: §1, §1, §1, §2.1, §2.1, §2.1, §2.2, §2.3.
- The nine roles you need on your data science research team. Note: TechTarget. https://searchcio.techtarget.com/news/252445605/The-nine-roles-you-need-on-your-data-science-research-team Cited by: §1.
- Data career paths: data analyst vs. data scientist vs. data engineer: 3 data careers decoded and what it means for you. Note: Udacity. https://blog.udacity.com/2014/12/data-analyst-vs-data-scientist-vs-data-engineer.html Cited by: §1.
- Questioning the AI: informing design practices for explainable AI user experiences. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems. Cited by: §5.1.3.
- SAS for mixed models 2nd edition. SAS Institute, Cary, North Carolina, USA. Cited by: §1.
- Unakite: scaffolding developers’ decision-making using the web. In Proceedings of the 32nd Annual ACM Symposium on User Interface Software and Technology, pp. 67–80. Cited by: §5.1.2.
- An ADMM based framework for AutoML pipeline configuration. Cited by: §1.
- How data scientists work together with domain experts in scientific collaborations. In Proceedings of the 2020 ACM conference on GROUP. Cited by: §1, §2.1, §2.2, §2.2, §2.2, §2.3, §5.
- The science of managing data science. Queue 13 (4), pp. 30. Cited by: §1.
- Data mining tools. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 1 (5), pp. 431–443. Cited by: §1.
- Collaborative approaches needed to close the big data skills gap. Journal of Organization design 3 (1), pp. 26–30. Cited by: §1.
- Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 220–229. Cited by: §5.1.3.
- Human-centered study of data science work practices. In Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, pp. W15. Cited by: §2.1, §5.1.2.
- Layered participatory analysis: new developments in the card technique. In Proceedings of the SIGCHI conference on Human factors in computing systems, pp. 90–97. Cited by: §5.4.
- How data science workers work with data: discovery, capture, curation, design, creation. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems, CHI ’19, New York, NY, USA, pp. Forthcoming. Cited by: §1, §1, §1, §1, §2.1, §2.1, §2.1, §2.1, §2.1, §2.2, §2.3, §2.3, §4.2, §4.3.3, §5.1.2, §5.1.3, §5.2.1.
- Explore new features with us. Note: Lab demo at JupyterCon 2018. Cited by: §2.3, §2.3.
- Why do people tag?: motivations for photo tagging. Communications of the ACM 53 (7), pp. 128–131. Cited by: §1.
- Weapons of math destruction: how big data increases inequality and threatens democracy. Broadway Books. Cited by: §5.2.2.
- How people write together now: beginning the investigation with advanced undergraduates in a project course. ACM Transactions on Computer-Human Interaction (TOCHI) 24 (1), pp. 4. Cited by: §2.3.
- Post-literate programming: linking discussion and code in software development teams. In The 31st Annual ACM Symposium on User Interface Software and Technology Adjunct Proceedings, UIST ’18 Adjunct, New York, NY, USA, pp. 51–53. Cited by: §2.1, §5.1.2.
- Programmer, interrupted. In 2013 IEEE Symposium on Visual Languages and Human Centric Computing, pp. 171–172. Cited by: §1.
- Trust in data science: collaboration, translation, and accountability in corporate data science projects. Proceedings of the ACM on Human-Computer Interaction 2 (CSCW), pp. 136. Cited by: §2.1.
- Data vision: learning to see through algorithmic abstraction. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, pp. 2436–2447. Cited by: §2.1, §2.1, §2.2, §2.3, §4.2, §5.
- Building data science teams. Note: Stanford University. http://web.stanford.edu/group/mmds/slides2012/s-patil1.pdf Cited by: §1.
- Beyond the algorithm. Cited by: §5.2.2.
- The politics of measurement and action. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 3147–3156. Cited by: §1, §2.1, §4.2, §4.3.3, §5.1.3, §5.2.1.
- Principles of data wrangling: practical techniques for data preparation. O’Reilly Media, Inc. Cited by: §2.1, §4.3.3.
- Exploration and explanation in computational notebooks. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 32. Cited by: §1, §1, §2.1, §2.3, §2.3.
- Field deployments: knowing from using in context. In Ways of Knowing in HCI, pp. 119–142. Cited by: §5.4.
- How to make sense of team sport data: from acquisition to data modeling and research aspects. Data 2 (1), pp. 2. Cited by: §2.2, §2.3, §5.
- Shared waypoints and social tagging to support collaboration in software development. In Proceedings of the 2006 20th anniversary conference on Computer supported cooperative work, pp. 195–198. Cited by: §1, §5.1.2.
- Data diff: interpretable, executable summaries of changes in distributions for data wrangling. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2279–2288. Cited by: §2.1.
- Cooperation between developers and operations in software engineering projects. In Proceedings of the 2008 international workshop on Cooperative and human aspects of software engineering, pp. 105–108. Cited by: §2.1.
- Empirical studies on collaboration in software development: a systematic literature review. Cited by: §2.1, §2.2, §5.1.1.
- Beyond interactive: notebook innovation at Netflix. Retrieved from Medium, The Netflix Tech Blog: https://medium.com …. Cited by: §1.
- ModelDB: a system for machine learning model management. In Proceedings of the Workshop on Human-In-the-Loop Data Analytics, pp. 14. Cited by: §5.1.1.
- Confronting social criticisms: challenges when adopting data-driven policing strategies. In Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, pp. 469. Cited by: §5.2.2.
- Data scientists aren’t domain experts. IT Professional 15 (6), pp. 12–17. Cited by: §2.2, §2.3, §5.
- How data scientists use computational notebooks for real-time collaboration. In Proceedings of the 2019 CHI Conference Extended Abstracts on Human Factors in Computing Systems, pp. article 39. Cited by: §1, §1, §2.1, §2.3, §2.3, §2.3, §2.3, §4.4.2.
- Callisto: capturing the “why” by connecting conversations with computational narratives. In Proceedings of the 2020 CHI Conference Extended Abstracts on Human Factors in Computing Systems, pp. in press. Cited by: §1, §2.3, §2.3, §2.3, §4.4.2.
- DocuViz: visualizing collaborative writing. In Proceedings of CHI’15, New York, NY, USA, pp. 1865–1874. Cited by: §5.4.
- Human-AI collaboration in data science: exploring data scientists’ perceptions of automated AI. To appear in Computer Supported Cooperative Work (CSCW). Cited by: §1, §1, Figure 1, §2.1, §2.1, §2.1, §2.1, §5.4, §5.
- Where does AlphaGo go: from Church-Turing thesis to AlphaGo thesis and beyond. IEEE/CAA Journal of Automatica Sinica 3 (2), pp. 113–120. Cited by: §2.1.
- AutoAIViz: opening the blackbox of automated artificial intelligence with conditional parallel coordinates. In IUI’20, pp. in press. Cited by: §5.1.3.
- Communities of practice: a brief introduction. Cited by: §5.2.3.
- Translation between software designers and users. Communications of the ACM 36 (6), pp. 102–104. Cited by: §2.2.
- Exploring the ecosystem of software developers on GitHub and other platforms. In Proceedings of the companion publication of the 17th ACM conference on Computer supported cooperative work & social computing, pp. 265–268. Cited by: §1.
- Managing collaborative activities in project management. In Proceedings of the 2007 symposium on Computer human interaction for the management of information technology, pp. 3. Cited by: §2.1.
- Effect of confidence and explanation on accuracy and trust calibration in AI-assisted decision making. In Proceedings of the Conference on Fairness, Accountability, and Transparency. Cited by: §5.1.3.
- Survey on automated machine learning. arXiv preprint arXiv:1904.12054. Cited by: §2.1, §4.3.3.