State of the Art
Researchers and practitioners have discussed whether and how ML changes software engineering with the introduction of learned models as components in software systems, e.g., [40, 4, 63, 96, 73, 82, 75, 104]. To lay the foundation for our interview study and inform the questions we ask, we first provide an overview of the related work and existing theories on collaboration in traditional software engineering and discuss how ML may change this.
Collaboration in Software Engineering
Most software projects exceed the capacity of a single developer, requiring multiple developers and teams to collaborate (“work together”) and coordinate (“align goals”). Collaboration happens across teams, often in a more formal and structured form, and within teams, where familiarity with other team members and frequent colocation foster informal communication. At a technical level, to allow multiple developers to work together, abstraction and a divide-and-conquer strategy are essential. Dividing software into components (modules, functions, subsystems) and hiding internals behind interfaces is a key principle of modular software development that allows teams to divide work and to work mostly independently until the final system is integrated [65, 58]. Teams within an organization tend to align with the technical structure of the system, with individuals or teams assigned to components; hence the technical structure (interfaces and dependencies between components) influences the points where teams collaborate and coordinate. Coordination challenges are especially observed when teams cannot easily and informally communicate, often studied in the context of distributed teams of global corporations [62, 36] and open source ecosystems [14, 87].
More broadly, interdisciplinary collaboration often poses challenges. Differences in team members’ academic and professional backgrounds have been shown to lead to communication, cultural, or methodical challenges when working together. Key insights are that successful interdisciplinary collaboration depends on professional role, structural characteristics, personal characteristics, and a history of collaboration; specifically, structural factors such as unclear mission, insufficient time, excessive workload, and lack of administrative support are barriers to collaboration.
The component interface plays a key role in collaboration as a negotiation and collaboration point. It is where teams (re-)negotiate how to divide work and assign responsibilities. Team members often seek information that may not be captured in interface descriptions, as interfaces are rarely fully specified. In an idealized development process, interfaces are defined early based on what is assumed to remain stable, because later changes to interfaces are expensive and require the involvement of multiple teams. In addition, interfaces reflect key architectural decisions for the system, aimed at achieving desired overall qualities.
In practice, though, the idealized divide-and-conquer approach following top-down planning does not always work without friction. Not all changes can be anticipated, leading to later modifications and renegotiations of interfaces [28, 14]. It may not be possible to identify how to decompose work and design stable interfaces until substantial experimentation has been performed. To manage, negotiate, and communicate changes of interfaces, developers have adopted a wide range of strategies for communication [14, 90, 30], often relying on informal broadcast mechanisms to share planned or performed changes with other teams.
Software lifecycle models also address this tension of when and how to design stable interfaces: traditional top-down models (e.g., waterfall) plan software design after careful requirements analysis; the spiral model pursues a risk-first approach in which developers iterate to prototype risky parts, which then informs future system design iterations; agile approaches deemphasize upfront architectural design for fast iteration on incremental prototypes. The software architecture community has also grappled with the question of how much upfront architectural design is feasible, practical, or desirable [9, 100], showing a tension between the desire for upfront planning on one side and technical risks and unstable requirements on the other. In this context, our research explores how introducing ML into software projects challenges collaboration.
Software Engineering with ML Components
In an ML-enabled system, ML contributes one or multiple components to a larger system with traditional non-ML components. We refer to the whole system that an end user would use as the product. In some systems, the learned model may be a relatively small and isolated addition to a large traditional software system (e.g., audit prediction in tax software); in others it may provide the system’s essential core with only minimal non-ML code around it (e.g., a sales prediction system sending daily predictions by email). In addition to models, an ML-enabled system typically also has components for training and monitoring the model(s) [40, 48]. Much recent attention in practice focuses on building robust ML pipelines for training and deploying models in a scalable fashion, often under names such as “ML engineering,” “SysML,” and “MLOps” [48, 82, 55, 61]. In this work, we focus more broadly on the development of the entire ML-enabled system, including both ML and non-ML components. Compared to traditional software systems, ML-enabled systems require additional expertise in data science to build the models and may place additional emphasis on expertise such as data management, safety, and ethics [47, 4]. In this paper, we primarily focus on the roles of software engineers and data scientists, who typically have different skills and educational backgrounds [46, 104, 76, 47]: data science education tends to focus more on statistics, ML algorithms, and practical training of models from data (typically given a fixed dataset, not deploying the model, not building a system), whereas software engineering education focuses on engineering tradeoffs with competing qualities, limited information, limited budget, and the construction and deployment of systems.
Research shows that software engineers engaging in data science without further education are often naive when building models and that data scientists prefer to focus narrowly on modeling tasks but are frequently faced with engineering work. While there is plenty of work on supporting collaboration among software engineers [30, 23, 77, 107] and more recently on supporting collaboration among data scientists [98, 106], we are not aware of work exploring collaboration challenges between these roles, which we explore in this work.
The software engineering community has recently started to explore software engineering for ML-enabled systems as a research field, with many contributions on bringing software-engineering techniques to ML tasks, such as testing models and ML algorithms [103, 25, 8, 18], deploying models [48, 33, 3, 11, 26], robustness and fairness of models [86, 94, 73], lifecycle for ML models [32, 66, 4, 57], and engineering challenges or best practices for developing ML components [42, 24, 38, 2, 4, 82, 56, 16]. A smaller body of work focuses on the ML-enabled system beyond the model, such as exploring system-level quality attributes, requirements engineering, architectural design, safety mechanisms [15, 75], and user interaction design [22, 6]. In this paper, we adopt this system-wide scope and explore how data scientists and software engineers work together to build the system with ML and non-ML components.
Because there is limited research on collaboration in building ML-enabled systems, we adopt a qualitative research strategy to explore collaboration points and corresponding challenges, primarily with stakeholder interviews. We proceeded in three steps: (1) prepared interviews based on an initial literature review, (2) conducted interviews, and (3) triangulated results with literature findings. We base our research design on Straussian Grounded Theory [92, 91], which derives research questions from literature, analyzes interviews with open and axial coding, and consults literature throughout the process. In particular, we conducted interviews and literature analysis in parallel, with immediate and continuous data analysis, performing constant comparisons, and refining our codebook and interview questions throughout the study.
Step 1: Scoping and interview guide. To scope our research and prepare for interviews, we looked for collaboration problems mentioned in existing literature on software engineering for ML-enabled systems (Sec. Document). In this phase, we selected 15 papers opportunistically through keyword search and our own knowledge of the field. We marked all sections in those papers that potentially relate to collaboration challenges between team members with different skills or educational backgrounds, following a standard open coding process. Even though most papers did not talk about problems in terms of collaboration, we marked discussions that may plausibly relate to collaboration, such as data quality issues between teams. We then analyzed and condensed these codes into nine initial collaboration areas and developed an initial codebook and interview guide (details in the supplementary material on HotCRP).
Step 2: Interviews. We conducted semi-structured interviews, each 30 to 60 minutes long, with 45 participants from 28 organizations. All participants are involved in professional software projects that use ML and are either already deployed or planned to be deployed in production.
We sampled participants purposefully (maximum variation sampling) to cover different roles, types of companies, and countries. We intentionally recruited most participants from organizations outside of big tech companies, as they represent the vast majority of projects that have recently adopted ML and often face substantially different challenges (21% big tech, 39% mid-size tech, 18% startups, 14% non-IT, and 7% consulting). The 45 participants held roles related to machine learning (51%), software engineering (20%), management (11%), and others (details in the supplementary material). Where possible, we separately interviewed multiple participants in different roles within the same organization to get different perspectives. We identified potential participants through personal networks, ML-related networking events, LinkedIn, and recommendations from previous interviewees and local tech leaders. We adapted our recruitment strategy throughout the research based on our findings, at later stages focusing primarily on specific roles and organizations to fill gaps in our understanding, until reaching saturation. For confidentiality, we refer to organizations by number and to participants by PXy, where X refers to the organization number and y distinguishes participants in the same organization.
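The PXy naming scheme can be made concrete with a small parser. The following is a minimal sketch, not part of the study's tooling: the function name `expand_participant_ids` is hypothetical, and the range shorthand it accepts (e.g., P7a-b for P7a and P7b) is the one used in participant lists later in the text.

```python
import re
from typing import List, Tuple

# Hypothetical helper for illustration; the study defines only the naming
# scheme (P + organization number + participant letter), not any code.
_PARTICIPANT_ID = re.compile(r"^P(\d+)([a-z])(?:-([a-z]))?$")


def expand_participant_ids(token: str) -> List[Tuple[int, str]]:
    """Expand 'P7a' or the range shorthand 'P7a-b' into (organization, participant) pairs."""
    match = _PARTICIPANT_ID.match(token.strip())
    if match is None:
        raise ValueError(f"not a participant ID: {token!r}")
    org = int(match.group(1))
    first = match.group(2)
    last = match.group(3) or first  # a single ID is a range of length one
    if last < first:
        raise ValueError(f"descending range in {token!r}")
    return [(org, chr(code)) for code in range(ord(first), ord(last) + 1)]
```

For example, `expand_participant_ids("P22a-c")` yields three participants from organization 22, which makes it straightforward to count distinct participants or organizations behind a list of IDs.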
We transcribed and analyzed all interviews. Then, to map challenges to collaboration points, we created visualizations of organizational structure and responsibilities in each organization (we show two examples in Fig. Document) and mapped collaboration problems mentioned in the interviews to collaboration points within these visualizations. We used these visualizations to further organize our data; in particular, we explored whether collaboration problems associate with certain types of organizational structures.
Step 3: Triangulation with literature. As we gained insights from interviews, we returned to the literature to identify related discussions and possible solutions (even if not originally framed in terms of collaboration) to triangulate our interview results. Relevant literature spans multiple research communities and publication venues, including ML, HCI, software engineering, systems, and various application domains (e.g., healthcare, finance), and does not always include obvious keywords; simply searching for machine-learning research casts far too wide a net. Hence, we decided against a systematic literature review and pursued a best-effort approach that relied on keyword search for topics surfaced in the interviews, as well as backward and forward snowballing. Out of over 300 papers read, we identified 61 as possibly relevant and coded them with the same evolving codebook.
Threats to validity and credibility. Our work exhibits the threats common to and expected for this kind of qualitative research. Generalizations beyond the sampled participant distribution should be made with care; for example, we interviewed few managers, no dedicated data experts, and no clients. In several organizations, we were only able to interview a single person, giving us a one-sided perspective. Observations may be different in organizations in specific domains or geographic regions not well represented in our data. Self-selection of participants may influence results; for example, developers in government-related projects more frequently declined interview requests.
Diversity of Org. Structures
Throughout our interviews, we found that the number and type of teams that participate in ML-enabled system development differ widely, as do their composition and responsibilities, power dynamics, and the formality of their collaborations. To illustrate these differences, we provide simplified descriptions of teams found in two organizations in Fig. Document. We show teams and their members, as well as the artifacts for which they are responsible, such as who develops the model, who builds a repeatable pipeline, who operates the model (inference), who is responsible for or owns the data, and who is responsible for the final product. A team often has multiple responsibilities and interfaces with other teams at multiple collaboration points. Where unambiguous, we refer to teams by their primary responsibility as product team or model team.
Organization 3 (Fig. Document, top) develops an ML-enabled system for a government client. The product (health domain), including an ML model and non-ML components, is developed by a single 8-person team. The team focuses on training a model first, before building a product around it. Software engineering and data science tasks are distributed within the team, where members cluster into groups with different responsibilities and roughly equal negotiation power. A single data scientist is part of this team, though they feel somewhat isolated. Data is sourced from public sources. The relationship between the client and development team is somewhat distant and formal. The product is delivered as a service, but the team only receives feedback when things go wrong.
Organization 7 (Fig. Document, bottom) develops a product for in-house use (quality control for a production process). A small team is developing and using the product, but model development is delegated to an external team (different company) composed of four data scientists, of which two have some software engineering background. The product team interacts with the model team to define and revise model requirements based on product requirements. The product team also provides confidential proprietary data for training. The model team deploys the model and provides a ready-to-use API to the product team. As the relationship between the teams crosses company boundaries, it is rather distant and formal. The product team clearly has the power in negotiations between the teams.
As illustrated by the differences between these two examples, we found no clear global patterns when looking across organizations, but patterns did emerge when focusing on three specific collaboration aspects, as we will discuss in the next sections.
Collaboration Point: Requirements and Planning
In an idealized top-down process, one would first solicit product requirements and then plan and design the product by dividing work into components (ML and non-ML), deriving each component’s requirements/specifications from the product requirements. In this process, there are multiple collaboration points: (1) the product team needs to negotiate product requirements with clients and other stakeholders; (2) the product team needs to plan and design product decomposition, negotiating with component teams the requirements for individual components; and (3) the project manager of the product needs to plan and manage the work across teams in terms of budgeting, effort estimation, milestones, and work assignments.
Common Development Trajectories
Few organizations, if any, follow such an idealized top-down process, and it may not even be desirable, as we will discuss later. While we did not find any global patterns for organizational structures (Sec. Document), there are indeed distinct patterns relating to how organizations elicit requirements and decompose their systems. Most importantly, we see differences in terms of the order in which teams identify product and model requirements:
Model-first trajectory: 14 of the 28 organizations (3, 5, 10, 14–17, 19, 20, 22, 23, 25–27) focus on building the model first, and build a product around the model later. In these organizations, product requirements are usually shaped by model capabilities after the (initial) model has been created, rather than being defined upfront. In organizations with separate model and product teams, the model team typically starts the project and the product team joins later with low negotiating power to build a product around the model.
Product-first trajectory: In 12 organizations (1, 4, 7–9, 11–13, 18, 21, 24, 28), models are built later to support an existing product. In these cases, a product often already exists and product requirements are collected for how to extend the product with new ML-supported functionality. Here, the model requirements are derived from the product requirements and often include constraints on model qualities, such as latency, memory, and explainability.
Parallel trajectory: Two organizations (2, 6) follow no clear temporal order; model and product teams work in parallel.
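The assignment of organizations to trajectories can be tallied directly from the organization numbers listed above. The sketch below is illustrative only; the names `TRAJECTORIES` and `trajectory_of` are hypothetical, and the data is transcribed from the three lists in the text.

```python
# Organization numbers as listed in the text, grouped by trajectory.
TRAJECTORIES = {
    "model-first": [3, 5, 10, 14, 15, 16, 17, 19, 20, 22, 23, 25, 26, 27],
    "product-first": [1, 4, 7, 8, 9, 11, 12, 13, 18, 21, 24, 28],
    "parallel": [2, 6],
}


def trajectory_of(org: int) -> str:
    """Return the trajectory under which an organization number is listed."""
    for name, orgs in TRAJECTORIES.items():
        if org in orgs:
            return name
    raise KeyError(f"organization {org} is not listed")


# Per-trajectory counts and the combined, sorted list of organizations.
counts = {name: len(orgs) for name, orgs in TRAJECTORIES.items()}
all_orgs = sorted(org for orgs in TRAJECTORIES.values() for org in orgs)
```

The counts come out as 14, 12, and 2, and the three lists together cover each of the 28 organizations exactly once.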
Product and Model Requirements
We found a constant tension between product and model requirements in our interviews. Functional and nonfunctional product requirements set expectations for the entire product. Model requirements set goals and constraints for the model team, such as expected accuracy and latency, target domain, and available data.
Product requirements require input from the model team. A common theme in the interviews is that it is difficult to elicit product requirements without a good understanding of ML capabilities, which almost always requires involving the model team and performing some initial modeling when eliciting product requirements. Regardless of whether product requirements or model requirements are elicited first, data scientists often mentioned unrealistic expectations of model capabilities. Participants interacting with clients to negotiate product requirements (which may involve members of the model team) indicate that they need to educate clients about capabilities of ML techniques to set correct expectations (P3a, P6a, P6b, P7b, P9a, P10a, P15c, P19b, P22b, P24a). This need to educate customers about ML capabilities has also been raised in the literature [47, 96, 15, 42, 99, 93]. For many organizations, especially in product-first trajectories, the model team indicates similar challenges when interacting with the product team. If the product team does not involve the model team in negotiating product requirements, the product team may not identify what data is needed for building the model, and may commit to unrealistic requirements. For example, P26a shared “For this project, [the project manager] wanted to claim that we have no false positives and I was like, that’s not gonna work.” Members of the model team report lack of ML literacy in members of the product team and project managers (P1b, P4a, P7a, P12a, P26a, P27a) and a lack of involvement (e.g., P7b: “The [product team] decided what type of data would make sense. I had no say on that.”).
Usually the product team cannot identify product requirements alone; instead, product and model teams need to interact to explore what is achievable. In organizations with a model-first trajectory, members of the model team sometimes engage directly with clients (and also report having to educate them about ML capabilities). However, when requirements elicitation is left to the model team, members tend to focus on requirements relevant for the model, but neglect requirements for the product, such as expectations for usability, e.g., P3c’s customers “were kind of happy with the results, but weren’t happy with the overall look and feel or how the system worked.” O’Leary and Uchida raise similar concerns about model-centric development where product requirements are not obvious at modeling time.
Model development without clear model requirements is common. Participants from model teams frequently explain how they are expected to work independently, but are given sparse model requirements. They try to infer intentions behind them, but are constrained by having limited understanding of the product that the model will eventually support (P3a, P3b, P16b, P17b, P19a). Model teams often start with vague goals, and model requirements evolve over time as product teams or clients refine their expectations in response to provided models (P3b, P7a, P9a, P5b, P19b, P21a). Especially in organizations following the model-first trajectory, model teams may receive some data and a goal to predict something with high accuracy, but no further context, e.g., P3a shared “there isn’t always an actual spec of exactly what data they have, what data they think they’re going to have and what they want the model to do.” Several papers similarly report projects starting with vague model goals [75, 104, 69]. Even in organizations following a product-first trajectory, product requirements are often not translated into clear model requirements.
For example, participant P17b reports how the model team was not clear about the model’s intended target domain, and thus could not decide what data was considered in scope. As a consequence, model teams usually cannot focus just on their component, but have to understand the entire product to identify model requirements in the context of the product (P3a, P10a, P13a, P17a, P17b, P19b, P20b, P23a), requiring interactions with the product team or even bypassing the product team to talk directly to clients. The difficulty of providing clear requirements for an ML model has also been raised in the literature [47, 72, 51, 104, 97, 84], partially arguing that uncertainty makes it difficult to specify model requirements upfront [42, 63, 99]. Ashmore et al. report mapping product requirements to model requirements as an open challenge.
Provided model requirements rarely go beyond accuracy and data security. Requirements given to model teams primarily relate to some notion of accuracy. Beyond accuracy, requirements for data security and privacy are common, typically imposed by the data owner or by legal requirements (P5a, P7a, P9a, P13a, P14a, P18a, P20a-b, P21a-b, P22a, P23a, P24a, P25a, P26a). Literature also frequently discusses how privacy requirements impact and restrict ML work [13, 70, 51, 39, 52, 41]. We rarely heard of any qualities other than accuracy. Some participants report that ignoring qualities such as latency or scalability has resulted in integration and operation problems (P3c, P11a). In a few cases requirements for inference latency were provided (P1a, P6a, P14a) and in one case hardware resources imposed constraints on memory usage (P14a), but no other qualities that could be important for product integration and deployment later, such as training latency, model size, fairness, or explainability, were required. When prompted, very few of our interviewees report considerations for fairness either at the product or the model level.
Only two participants from model teams (P14a, P22a) reported receiving fairness requirements, whereas many others explicitly mentioned that fairness is not a concern for them yet (P4a, P5b, P6b, P11a, P15c, P20a, P21b, P25a, P26a). Similarly, no participant brought up requirements for explainability of models. This is in stark contrast to the emphasis that fairness and explainability receive in the literature, e.g., [106, 37, 53, 13, 84, 81, 102, 22, 6, 38].
Recommendations. Our observations suggest that involving data scientists early when soliciting product requirements is important and that pursuing a model-first trajectory entirely without considering product requirements is problematic. Conversely, model requirements are rarely specific enough to allow data scientists to work in isolation without knowing the broader context of the system, and interaction with the product team should likely be planned as part of the process. Requirements form a key collaboration point between product and model teams, which should be emphasized even in more distant collaboration styles (e.g., outsourced model development). The few organizations that use the parallel trajectory report fewer problems by involving data scientists in negotiating product requirements to discard unrealistic ones early on (P6b). Vogelsang and Borg also provide similar recommendations to consult data scientists from the beginning to help elicit requirements. While many papers place emphasis on clearly defining ML use cases and scope [47, 93, 85], several others mention how collaboration of technical and non-technical stakeholders such as domain experts helps [81, 97, 99].
ML literacy for customers and product teams appears to be important. P22a and P19a suggested conducting technical ML training sessions to educate clients; similar training is also useful for members of product teams. Several papers argue for similar training for non-technical users of ML products [42, 81, 96].
Most organizations elicit requirements only rather informally and rarely have good documentation, especially, but not only, when it comes to model requirements. It seems beneficial to adopt more formal requirements documentation for product and model, as several participants reported that it fosters shared understanding at this collaboration point (P11a, P13a, P19b, P22a, P22c, P24a, P25a, P26a). Checklists could help to cover a broader range of model quality requirements, such as training latency, fairness, and explainability, in such requirements. Formalisms such as model cards could be extended to cover more of these model requirements.
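To illustrate what such a checklist might look like in practice, the following minimal sketch records model requirements and reports which qualities have not been specified. It is an assumption-laden illustration rather than a format proposed by any participant: the class `ModelRequirements`, the list `CHECKLIST`, and the example requirement values are all hypothetical, and the checklist entries are simply the qualities named in this section.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Qualities mentioned in this section; the identifiers are hypothetical.
CHECKLIST = [
    "accuracy",
    "inference_latency",
    "training_latency",
    "memory_usage",
    "model_size",
    "fairness",
    "explainability",
    "data_security_privacy",
]


@dataclass
class ModelRequirements:
    """A written record of model requirements shared by product and model teams."""

    target_domain: str
    qualities: Dict[str, str] = field(default_factory=dict)

    def missing(self) -> List[str]:
        """Return checklist qualities with no (or only a blank) requirement."""
        return [q for q in CHECKLIST if not self.qualities.get(q, "").strip()]


# Made-up example values, for illustration only.
reqs = ModelRequirements(
    target_domain="quality control images from one production line",
    qualities={
        "accuracy": "recall of at least 0.9 on a held-out sample",
        "data_security_privacy": "training data must stay on premises",
    },
)
unspecified = reqs.missing()
```

Here `unspecified` lists the six qualities that nobody has written down yet, which gives product and model teams a concrete agenda for their next requirements discussion.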
ML uncertainty makes effort estimation difficult. Irrespective of trajectory, 19 participants (P3a, P4a, P7a-b, P8a, P14b, P15b-c, P16a, P17a, P18a, P19a-b, P20a, P22a-c, P23a, P25a) mentioned that the uncertainties associated with ML components make it difficult to estimate the timeline for development of ML components and, by extension, the product. Model development is typically seen as a science-like activity, where iterative experimentation and exploration is needed to identify whether and how a problem can be solved, rather than as an engineering activity that follows a somewhat predictable process. This science-like nature makes it difficult for the model team to set expectations or contracts with clients or the product team regarding effort, cost, or accuracy. While data scientists find effort estimation difficult, lack of ML literacy in managers makes it worse (P15b, P16a, P19b, P20a, P22b). Teams report deploying subpar models when running out of time (P3a, P15b, P19a), or postponing or even canceling deployment (P25a). These findings align with literature mentioning difficulties associated with effort estimation for ML tasks due to uncertainty [7, 57, 99] and planning projects in a structured manner with diverse methodologies, with diverse trajectories, and without practical guidance [15, 57, 99]. Generally, participants frequently report that synchronization between teams is challenging because of different team pace, different development processes, and tangled responsibilities (P2a, P11a, P12a, P14-b, P15b-c, P19a; see also Sec. Document).
Recommendations. Participants suggested several mitigation strategies: keeping extra buffer times and adding time-boxes for R&D in initial phases (P8a, P19a, P22b-c, P23a), and continuously involving clients in every phase so that they can understand the progression of the project and be aware of potential missed deadlines (P6b, P7a, P22a, P23a).
From the interviews, we also observe the benefits of managers who understand both software engineering and machine learning and can align product and model teams toward common goals (P2a, P6a, P8a, P28a).
Collaboration Point: Training Data
Data is essential for machine learning, but disagreements and frustrations around training data were the most common collaboration challenges mentioned in our interviews. In most organizations, the team that is responsible for building the model is not the team that collects, owns, and understands the data, making data a key collaboration point between teams in ML-enabled systems development.
Common Organizational Structures
We observed three patterns around data that influence collaboration challenges from the perspective of the model team:
Provided data: The product team has the responsibility of providing data to the model team (org. 6–8, 13, 18, 21, 23). The product team is the initial point of contact for all data-related questions from the model team. The product team may own the data or acquire it from a separate data team (internal or external). Coordination regarding data tends to be distant and formal, and the product team tends to hold more negotiation power.
External data: The product team does not have direct responsibility for providing data, but instead, the model team relies on external data providers. Commonly, the model team (i) uses publicly available resources (e.g., academic datasets, org. 2–4, 6, 19) or (ii) hires a third party for collecting or labeling data (org. 9, 15–17, 22, 23). In the former case, the model team has little to no negotiation power over data; in the latter, it can set expectations.
In-house data: Product, model, and data teams are all part of the same organization and the model team relies on internal data from that organization (org. 1, 5, 9–12, 14, 20, 24–28). In these cases, both product and model teams often find it challenging to negotiate access to internal data due to differing priorities, internal politics, permissions, and security constraints.
Negotiating Data Quality and Quantity
Disagreements and frustrations around training data were the most common collaboration challenges in our interviews. In almost every project, data scientists were unsatisfied with the quality and quantity of data they received at this collaboration point, in line with a recent survey showing data availability and management to be the top-ranked challenge in building ML-enabled systems.
Provided and public data is often inadequate. In organizations where data is provided by the product team (P7a, P8a, P13a, P22a, P22c), the model team commonly states that it is difficult to get sufficient data. The data that they receive is often of low quality, requiring significant investment in data cleaning. Similar to the requirements challenges discussed earlier, they often state that the product team has little knowledge or intuition for the amount and quality of data needed. For example, participant P13a stated that they were given a spreadsheet with only 50 rows to build a model and P7a reported having to spend a lot of time convincing the product team of the importance of data quality. This aligns with past observations that software engineers often have little appreciation for data quality concerns [50, 47, 69] and that training data is often insufficient and incomplete [75, 5, 69, 41, 99, 52, 85]. When the model team uses public data sources, its members also have little influence over data quality and quantity and report significant effort for cleaning low-quality and noisy data (P2a, P3a, P4a, P3c, P6b, P19b, P23a). Papers have similarly questioned the representativeness and trustworthiness of public training data [102, 96, 33] as “nobody gets paid to maintain such data”.
Training-serving skew is a common challenge when training data is provided to the model team: models show promising results, but do not generalize to production data because it differs from training data (P4a, P8a, P13a, P15a, P15c, P21a, P22c, P23a) [7, 20, 93, 70, 51, 102, 69, 71, 52, 108]. Our interviews show that this skew often originates from inadequate training data combined with unclear information about production data, leaving no chance to evaluate whether the training data is representative of production data.
Data understanding and access to domain experts is a bottleneck. Existing data documentation (e.g., data item definitions, semantics, schema) is almost never sufficient for model teams to understand the data (also mentioned in a prior study). In the absence of clear documentation, team members often collect information and keep track of unwritten details in their heads (P5a), known as institutional or tribal knowledge [38, 4]. Data understanding and debugging often involve members from different teams and thus cause challenges at this collaboration point. Model teams receiving their data from the product team report struggling with data understanding and having a difficult time receiving help from the product team (or the data team that the product team works with) (P8a, P7b, P13a). As the model team does not have direct communication with the data team, the data understanding issues often cannot be resolved effectively. For example, P13a reports “Ideally for us it would be so good to spend maybe a week or two with one person continuously trying to understand the data. It’s one of the biggest problems actually, because even if you have the person, if you’re not in contact all the time, then you misinterpreted some things and you build on it.” The low negotiation power of the model team hinders access to domain experts. Model teams using public data similarly struggle with data understanding and receiving help (P3a, P4a, P19a), relying on sparse data documentation or trying to reach any experts on the data. For in-house projects, in several organizations the model team relies on data in shared databases (org. 5, 11, 26, 27, 28), collected by instrumenting a production system, but shared by multiple teams.
Several teams shared problems with evolving and often poorly documented data sources, as participant P5a illustrates “[data rows] can have 4,000 features, 10,000 features. And no one really cares. They just dump features there. […] I just cannot track 10,000 features.” Model teams face challenges in understanding data and identifying a team that can help (P5a, P25a, P20b, P27a), a problem also reported in a prior study on data scientists at Microsoft . Challenges in understanding data and needing domain experts are also frequently mentioned in the literature [47, 69, 44, 11, 39, 38], as is the danger of building models with insufficient understanding of the data [96, 33]. Although we are not aware of literature discussing the challenges of accessing domain experts, papers have shown that even when data scientists have access, effective knowledge transfer is challenging [83, 64]. Ambiguity when hiring a data team (). When the model team hires an external data team for collecting or labelling data (org. 9, 15, 16, 17, 22, 23), the model team has much more negotiation power over setting data quality and quantity expectations (though Kim et al. report that model teams may have difficulty getting buy-in from the product team for hiring a data team in the first place ). Our interviews did not surface the same frustrations as with provided data and public data, but instead participants from these organizations reported communication vagueness and hidden assumptions as key challenges at this collaboration point (P9a, P15a, P15c, P16a, P17b, P22a, P22c, P23a). For example, P9a related how different labelling companies given the same specification widely disagreed on labels, when the specification was not clear enough. We found that expectations between model and data team are often communicated verbally without clear documentation. The data team often does not have sufficient context to understand what data is needed. 
For example, participant P17b states “Data collectors can’t understand the data requirements all the time. Because, when a questionnaire [for data collection] is designed, the overview of the project is not always described to them. Even if we describe it, they can’t always catch it.” Reports about low quality data from hired data teams have been also discussed in the literature [96, 8, 51, 99, 41]. Need to handle evolving data (, ). In most projects, models need to be regularly retrained with more data or adapted to changes in the environment (e.g., data drift) [40, 51], which is a challenge for many model teams (P3a, P3c, P5a, P7a-b, P11a, P15c, P18a, P19b, P22a). When product teams provide the data, they often have a static view and provide only a single snapshot of data rather than preparing for updates, where model teams with their limited negotiation power have a difficult time fostering a more dynamic mindset (P7a-b, P15c, P18a, P22a), as expressed by participant P15c: “People don’t understand that for a machine learning project, data has to be provided constantly.” It can be challenging for a model team to convince the product team to invest in continuous model maintenance and evolution (P7a, P15c) . Conversely, if data is provided continuously (most commonly with public data sources, in-house sources, and own data teams), model teams struggle with ensuring consistency over time. Data sources can suddenly change without announcement (e.g., changes to schema, distributions, semantics), surprising model teams that make but do not check assumptions about the data (P3a, P3c, P19b). For example, participants P5a and P11a report similar challenges with in-house data, where their low negotiation power does not allow them to set quality expectations, and face undesired and unannounced changes in data sources made by other teams. 
Most organizations do not have a monitoring infrastructure to detect changes in data quality or quantity, as we will discuss in Sec Document. In-house priorities and security concerns often obstruct data access (). In in-house projects, we frequently heard about the product or model team struggling to work with another team within the same organization that owns the data. Often, these in-house projects are local initiatives (e.g., logistics optimization) with more or less buy-in from management and without buy-in from other teams that have their own priorities; sometimes other teams explicitly question the business value of the product. The interviewed model teams usually have little negotiation power to request data (especially if it involves collecting additional data) and almost never get an agreement to continuously receive data in a certain format, quality, or quantity (P5a, P10a, P11a, P20a-b, P27a) (also observed in studies at Microsoft and ING [47, 33]). For example, P10a shared “we wanted to ask the data warehouse team to [provide data], and it was really hard to get resources. They wouldn’t do that because it was hard to measure the impact [our in-house project] had on the bottom line of the business.” Model teams in these settings tend to work with whatever data they can get eventually. Security and privacy concerns can also limit access to data (P7a, P7b, P21a-b, P22a, P24a) [69, 44, 51, 52], especially when data is owned by a team in a different organization, causing frustration, lengthy negotiations, and sometimes expensive data-handling restrictions (e.g., no use of cloud resources) for model teams. Recommendations. Data quality and quantity is important to model teams, yet they often find themselves in a position of low negotiation power, leading to frustration and collaboration inefficiencies. Model teams that have the freedom to set expectations and hire their own data teams are noticeably more satisfied. 
When planning the entire product, it seems important to pay special attention to this collaboration point, and budget for data collection, access to domain experts, or even a dedicated data team (). Explicitly planning to provide substantial access to domain experts early in the project was suggested as important (P25a). We found it surprising that despite the importance of this collaboration point there is little written agreement on expectations and often limited documentation (), even when hiring a dedicated data team – in stark contrast to more established contracts for traditional software components. Not all organizations allow the more agile, constant close collaboration between model and data teams that some suggest [69, 71]. With a more formal or distant relationship (e.g., across organizations, teams without buy-in), it seems beneficial to adopt a more formal contract, specifying data quantity and quality expectations, which are well researched in the database literature  and have been repeatedly discussed in the context of ML-enabled systems [83, 47, 44, 41, 52]. This has also been framed as data requirements in the software engineering literature [96, 93, 75]. When working with a dedicated data team, participants suggested to invest in making expectations very clear, for example, by providing precise specifications and guidelines (P9a, P6b, P28a), running training sessions for the data collectors and annotators (P17b, P22c), and measuring inter-rater agreement (P6b). Automated checks are also important as data evolves (). For example, participant P13a mentioned proactively setting up data monitoring to detect problems (e.g., schema violations, distribution shifts) at this collaboration point; a practice suggested also in the literature [93, 81, 69, 71, 52] and supported by recent tooling, e.g., [78, 71, 45]. 
The risks regarding possible unnoticed changes to data make it important to consider data validation and monitoring infrastructure as a key feature of the product early on (, ), as also emphasized by several participants (P5a, P25a, P26a, P28a).
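The automated checks recommended above (quantity expectations, schema violations, distribution shifts) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the tooling any participant described: the names `ColumnExpectation` and `check_batch`, the row-as-dictionary format, and the drift threshold are all hypothetical, and the drift test is a deliberately simple mean-shift heuristic rather than a statistical test.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ColumnExpectation:
    """Hypothetical expectation for one numeric column in a data contract."""
    name: str
    reference_mean: float   # mean observed in the agreed training snapshot
    reference_std: float    # standard deviation in that snapshot (assumed > 0)
    max_shift: float = 0.5  # tolerated mean shift, in reference standard deviations


def check_batch(rows, expectations, min_rows):
    """Return human-readable violations for one data delivery (list of dicts)."""
    problems = []
    # Quantity expectation: did we receive as much data as agreed?
    if len(rows) < min_rows:
        problems.append(f"quantity: got {len(rows)} rows, expected at least {min_rows}")
    for exp in expectations:
        values = [row.get(exp.name) for row in rows]
        # Schema expectation: the column must be present and numeric in every row.
        missing = sum(1 for v in values if not isinstance(v, (int, float)))
        if missing:
            problems.append(f"schema: {exp.name} missing or non-numeric in {missing} rows")
            continue
        if not values:
            continue
        # Distribution expectation: compare the batch mean to the reference mean.
        shift = abs(mean(values) - exp.reference_mean) / exp.reference_std
        if shift > exp.max_shift:
            problems.append(f"drift: {exp.name} mean shifted by {shift:.2f} reference std")
    return problems
```

Running such a check on every data delivery turns verbally communicated expectations into a written artifact that both the model team and the data provider can inspect; in practice, dedicated data validation tooling would likely replace this hand-written version.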
Collaboration Point: Product-Model Integration
As discussed earlier, to build an ML-enabled system both ML components and traditional non-ML components need to be integrated and deployed, requiring data scientists and software engineers to work together, typically across multiple teams. We found many conflicts at this collaboration point, stemming from unclear processes and responsibilities, as well as differing practices and expectations.
Common Organizational Structures
We saw large differences among organizations in how engineering responsibilities were assigned, most visibly for model deployment and operation, which typically involve significant engineering effort for building reproducible pipelines, API design, or cloud deployment, often with MLOps technologies. We found the following patterns:
Shared model code: In some organizations (2, 6, 23, 25), the model team is responsible only for model development and delivers training code (e.g., in a notebook) or model files to the product team; the product team takes responsibility for deployment and operation of the model, possibly rewriting the training code as a pipeline. Here, the model team has little or no engineering responsibilities.
Model as API: In most organizations (18 out of 28), the model team is responsible for developing and deploying the model. Hence, the model team requires substantial engineering skills in addition to data science expertise. Here, some model teams are mostly composed of data scientists with little engineering capabilities (org. 7, 13, 17, 22, 26), some model teams consist mostly of software engineers who have picked up some data science knowledge (org. 4, 15, 16, 18, 19, 21, 24), and others have mixed team members (org. 1, 9, 11, 12, 14, 28). These model teams typically provide an API to the product team, or release individual model predictions (e.g., shared files, email; org. 17, 19, 22) or install models directly on servers (org. 4, 9, 12).
All-in-one: If only a few people work on model and product, sometimes a single team (or even a single person) shares all responsibilities (org. 3, 5, 10, 20, 27). It can be a small team with only data scientists (org. 10, 20, 27) or a mixed team with data scientists and software engineers (org. 3, 5). We also observed two outliers: One startup (org. 8) had a distinct model deployment team, allowing the model team to focus on data science without much engineering responsibility. In one large organization (org. 28), an engineering-focused model team (model as API) was supported by a dedicated research team focused on data-science research with fewer engineering responsibilities.
Responsibility and Culture Clashes
Interdisciplinary collaboration is challenging (cf. Sec. Document). We observed many conflicts between data science and software engineering culture, made worse by unclear responsibilities and boundaries. Team responsibilities often do not match capabilities and preferences. When the model team has responsibilities requiring substantial engineering work, we observed some dissatisfaction when its members were assigned undesired responsibilities. Data scientists preferred engineering support rather than needing to do everything themselves (P7a-b, P13a), but can find it hard to convince management to hire engineers (P10a, P20a, P20b). For example, P10a describes, “I was struggling to change the mindset of the team lead, convincing him to hire an engineer…I just didn’t want this to be my main responsibility.” Especially in small teams, data scientists report struggling with the complexity of the typical ML infrastructure (P7b, P9a, P14a, P26a, P28a). In contrast, when deployment is the responsibility of software engineers in the product team or of dedicated engineers in all-in-one teams, some of those engineers report problems integrating the models due to insufficient knowledge of the model context or domain, and the model code not being packaged well for deployment (P20b, P23a, P27a). In several organizations, we heard about software engineers performing ML tasks without having enough ML understanding (P5a, P15b-c, P16b, P18b, P19b, P20b). Mirroring observations from past research, P5a reports, “there are people who are ML engineers at [company], but they don’t really understand ML. They were actually software engineers… they don’t understand [overfitting, underfitting, …]. They just copy-paste code.” Siloing data scientists fosters integration problems. We observed data scientists often working in isolation (known as siloing) in all types of organizational structures, even within single small teams (see Sec. Document) and within engineering-focused teams. 
In such settings, data scientists often work in isolation with weak requirements (cf. Sec. Document) without understanding the larger context, seriously engaging with others only during integration (P3a, P3c, P6a, P7b, P11a, P13a, P15b, P25a) , where problems may surface. For example, participant P11a reported a problem where product and model teams had different assumptions about the expected inputs and the issue could only be identified after a lot of back and forth between teams at a late stage in the project. Technical jargon challenges communication (). Participants frequently described communication issues arising from differing terminology used by members from different backgrounds (P1a-b, P2a, P3a, P5b, P8a, P12a, P14a-b, P16a, P17a-b, P18a-b, P20a, P22b, P23a), leading to ambiguity, misunderstandings, and inconsistent assumptions (on top of communication challenges with domain experts [44, 68, 97]). P1b reports, “There are a lot of conversations in which disambiguation becomes necessary. We often use different kinds of words that might be ambiguous.” For example, data scientists may refer to prediction accuracy as performance, a term many software engineers associate with response time. These challenges can be observed more frequently between teams, but they even occur within a team with members from different backgrounds (P3a-c, P20a). Code quality, documentation, and versioning expectations differ widely and cause conflicts (, ). Many participants reported conflicts around development practices between data scientists and software engineers during integration and deployment. Participants report poor practices that may also be observed in traditional software projects; but particularly software engineers expressed frustration in interviews that data scientists do not follow the same development practices or have the same quality standards when it comes to writing code. 
Reported problems relate to poor code quality (P1b, P2a, P3b, P5a, P6a-b, P10a, P11a, P14a, P15b-c, P17a, P18a, P19a, P20a-b, P26a) [67, 35, 79, 7, 99, 33, 24], insufficient documentation (P5a-b, P6a-b, P10a, P15c, P26a) [44, 106, 60], and not extending version control to data and models (P3c, P7a, P10a, P14a, P20b). In two shared-model-code organizations, participants report having to rewrite code from the data scientists (P2a, P6a-b). Missing documentation for ML code and models is considered the cause of differing assumptions that lead to incompatibility between ML and non-ML components (P10a) and of losing knowledge and even the model when faced with turnover (P6a-b). Recent papers similarly hold poor documentation responsible for team decisions becoming invisible and inadvertently causing hidden assumptions [41, 33, 68, 38, 106, 44]. Hopkins and Booth described model and data versioning in small companies as desired but “elusive”. Recommendations. Many conflicts relate to boundaries of responsibility (especially for engineering responsibilities) and to different expectations by team members with different backgrounds. Better teams tend to define processes, responsibilities, and boundaries more carefully, document APIs at collaboration points between teams, and recruit dedicated engineering support for model deployment, but also establish a team culture with mutual understanding and exchange. Big tech companies usually have more established processes, with more teams with clear responsibilities, than smaller organizations and startups that often follow ad-hoc processes or figure out responsibilities as they go. The need for engineering skills for ML projects has frequently been discussed [4, 61, 88, 79, 82, 108], but our interviewees differ widely in whether all data scientists should have substantial engineering responsibilities or whether engineers should support data scientists so that they can focus on their core expertise. 
Especially interviewees from big tech emphasized that they expect engineering skills from all data science hires (P28a). Others emphasized that recruiting software engineers and operations staff with basic data-science knowledge can help at many communication and integration tasks, such as converting experimental ML code for deployment (P2a, P3b), fostering communication (P3c, P25a), and monitoring models in production (P5b). Generally, siloing data scientists is widely recognized as problematic and many interviewees suggest practices for improving communication (), such as training sessions for establishing common terminology (P11a, P17a, P22a, P22c, P23a), weekly all-hands meetings to present all tasks and synchronize (P2a, P3c, P6b, P11a), and proactive communication to broadcast upcoming changes in data or infrastructure (P11a, P14a, P14b). This mirrors suggestions to invest in interdisciplinary training [47, 4, 68, 63, 46] and proactive communication .
Quality Assurance for Model and Product
During development and integration, questions of responsibility for quality assurance frequently arise, often requiring coordination and collaboration between multiple teams. This includes evaluating components individually (including the model) as well as their integration and the whole system, often including evaluating and monitoring the system online (in production). Model adequacy goals are difficult to establish. Offline accuracy evaluation of models is almost always performed by the model team responsible for building the model, though its members often have difficulty deciding locally when the model is good enough (P1a, P3a, P5a, P6a, P7a, P15b, P16b, P23a) [42, 33]. As discussed in Sec. Document and Document, model team members often receive little guidance on model adequacy criteria and are unsure about the actual distribution of production data. They also voice concerns about establishing ground truth, for example, needing to support data for different clients, and hence not being able to establish (offline) measures for model quality (P1b, P16b, P18a, P28a). As quality requirements beyond accuracy are rarely provided for models, model teams usually do not feel responsible for testing latency, memory consumption, or fairness (P2a, P3c, P4a, P5a, P6b, P7a, P14a, P15b, P20b). Whereas the literature has discussed challenges in measuring the business impact of a model [47, 8, 12, 41], interviewed data scientists were concerned about this only with regard to convincing clients, managers, or product teams to provide resources (P7a-b, P10a, P26a, P27a). Limited confidence without transparent model evaluation. Participants in several organizations report that model teams do not prioritize model evaluation and have no systematic evaluation strategy (especially if they do not have established adequacy criteria they try to meet), performing occasional “ad-hoc inspections” instead (P2a, P15b, P16b, P18b, P19b, P20b, P21b, P22a, P22b). 
Without transparency about the model team's test processes and test results, other teams voiced reduced confidence in the model, leading to skepticism about adopting it (P7a, P10a, P21b, P22a). Unclear responsibilities for system testing. Teams often struggle with testing the entire product, integrating ML and non-ML components. Model teams frequently explicitly mentioned that they assume no responsibility for product quality (including integration testing and testing in production) and have not been involved in planning for system testing, but that their responsibilities end with delivering a model evaluated for accuracy (P3a, P14a, P15b, P25a, P26a). However, in several organizations, product teams also did not plan for testing the entire system with the model(s) and, at most, conducted system testing in an ad-hoc way (P2a, P6a, P16a, P18a, P22a). Recent literature has reported a similar lack of focus on system testing in product teams [106, 11], mirroring also a focus in academic research on testing models rather than testing the entire system [8, 18]. Interestingly, some established software development organizations delegated testing to an existing separate quality assurance team, which however had no process or experience for testing ML products (P2a, P8a, P16a, P18b, P19a). Planning for online testing and monitoring is rare. Due to possible training-serving skew and data drift, the literature emphasizes the need for online evaluation [40, 48, 20, 45, 96, 82, 3, 12, 8, 80, 11, 42, 79]. With collected telemetry, one can usually approximate both product and model quality, monitor updates, and experiment in production. Online testing usually requires coordination among multiple teams responsible for product, model, and operation. 
We observed that most organizations do not perform monitoring or online testing, as it is considered difficult and is further hindered by a lack of standard processes, automation, or even test awareness (P2a, P3a, P3b, P4a, P6b, P7a, P10a, P15b, P16b, P18b, P19b, P25a, P27a). Only 11 out of 28 organizations collected any telemetry; it is most established in big tech organizations. When to retrain models is often decided based on intuition or manual inspection, though many aspire to more automation (P1a, P3a, P3c, P5a, P10a, P22a, P25a, P27a). Responsibilities around online evaluation are often neither planned nor assigned upfront as part of the project. Most model teams are aware of possible data drift, but many do not have any monitoring infrastructure for detecting and managing drift in production. If telemetry is collected, it is the responsibility of the product or operations team and it is not always accessible to the model team. Four participants report that they rely on manual feedback about problems from the product team (P1a, P3a, P4a, P10a). At the same time, others report that product and operations teams do not necessarily have sufficient data science knowledge to provide meaningful feedback (P3a, P3b, P5b, P18b, P22a). Recommendations. Quality assurance involves multiple teams and benefits from explicit planning and making it a high priority. While the product team should likely take responsibility for product quality and system testing, such testing often involves building monitoring and experimentation infrastructure, which requires planning and coordination with teams responsible for model development, deployment, and operation (if separate) to identify the right measures. Model teams benefit from receiving feedback on their model from production systems, but such support needs to be planned explicitly, with corresponding engineering effort assigned and budgeted, even in organizations following a model-first trajectory. 
We suspect that education about benefits of testing in production and common infrastructure (often under the label DevOps/MLOps ) can increase buy-in from all involved teams (). Organizations that have established monitoring and experimentation infrastructure strongly endorse it (P5a, P25a, P26a, P28a). Defining clear quality requirements for model and product can help all teams to focus their quality assurance activities (cf. Sec. Document; ). Even when it is challenging to define adequacy criteria upfront, teams can together develop a quality assurance plan for model and product. Participants and literature emphasized the importance of human feedback to evaluate model predictions (P11a, P14a) , which requires planning to collect such feedback (). System and usability testing may similarly require planning for user studies with prototypes and shadow deployment [102, 81, 93].
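To illustrate how the retraining decision discussed above could move from intuition toward automation, the following sketch tracks online accuracy over a sliding window of telemetry and flags when it falls below the offline baseline. It assumes that ground-truth feedback (e.g., user corrections) eventually arrives for each prediction; the class name `AccuracyMonitor`, the window size, and the tolerance are hypothetical choices for illustration, not a practice reported by participants.

```python
from collections import deque


class AccuracyMonitor:
    """Hypothetical sliding-window monitor for online model accuracy."""

    def __init__(self, baseline_accuracy, tolerance=0.05, window=100):
        self.baseline = baseline_accuracy     # accuracy measured offline by the model team
        self.tolerance = tolerance            # accepted drop before raising a flag
        self.outcomes = deque(maxlen=window)  # keeps only the most recent outcomes

    def record(self, prediction, feedback_label):
        """Store whether a production prediction matched the later feedback."""
        self.outcomes.append(prediction == feedback_label)

    def online_accuracy(self):
        """Accuracy over the current window, or None if no feedback was recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def should_retrain(self):
        """Flag retraining only once the window is full, to avoid noisy early alarms."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.online_accuracy() < self.baseline - self.tolerance
```

Even a monitor this simple requires the coordination the section calls for: the product or operations team must collect the telemetry and feedback, and the model team must have access to it and agree on the baseline and tolerance.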
Discussion and Conclusions
Through our interviews we identified three central collaboration points where organizations building ML-enabled systems face substantial challenges: (1) requirements and project planning, (2) training data, and (3) product-model integration. Other collaboration points surfaced, but were mentioned far less frequently (e.g., interaction with legal experts and operators), did not relate to problems between multiple disciplines (e.g., data scientists documenting their work for other data scientists), or mirrored conventional collaboration in software projects (e.g., many interviewees wanted to talk about unstable ML libraries and challenges interacting with teams building and maintaining such libraries, though the challenges largely mirrored those of library evolution generally [28, 14]). Data scientists and software engineers are certainly not the first to realize that interdisciplinary collaborations are challenging and fraught with communication and cultural problems , yet it seems that many organizations building ML-enabled systems pay little attention to fostering better interdisciplinary collaboration. Organizations differ widely in their structures and practices, and some organizations have found strategies that work for them (see recommendation sections). Yet, we find that most organizations do not deliberately plan their structures and practices and have little insight into available choices and their tradeoffs. We hope that this work can (1) encourage more deliberation about organization and process at key collaboration points, and (2) serve as a starting point for cataloging and promoting best practices. Beyond the specific challenges discussed throughout this paper, we see four broad themes that benefit from more attention both in engineering practice and in research: Communication: Many issues are rooted in miscommunication between participants with different backgrounds. 
To facilitate interdisciplinary collaboration, education is key, including AI literacy for software engineers and managers (and even customers) but also training software engineers to understand data science concerns. The idea of T-shaped professionals  (deep expertise in one area, broad knowledge of others) can provide guidance for hiring and training. Documentation: Clearly documenting expectations between teams is important. Traditional interface documentation familiar to software engineers may be a starting point, but practices for documenting model requirements (Sec. Document), data expectations (Sec. Document), and assured model qualities (Sec. Document) are not well established. Recent suggestions like model cards  are a good starting point for encouraging better, more standardized documentation of ML components, but capture only select qualities. Given the interdisciplinary nature at these collaboration points, such documentation must be understood by all involved – theories of boundary objects  may help to develop better interface description mechanisms. Engineering: With attention focused on ML innovations, many organizations seem to underestimate the engineering effort required to turn a model into a product to be operated and maintained reliably. Arguably ML increases software complexity [46, 79, 63] and makes engineering practices such as data quality checks, deployment automation, and testing in production even more important. Project managers should ensure that the ML and the non-ML parts of the project have sufficient engineering capabilities and foster product and operations thinking from the start. Process: Finally, ML with its more science-like process challenges traditional software process life cycles. It seems clear that product requirements cannot be established without involving data scientists for model prototyping, and often it may be advisable to adopt a model-first trajectory to reduce risk. 
But while a focus on the product and overall process may be delayed, neglecting it entirely invites the kind of problems reported by our participants. Whether it may look more like the spiral model or agile , more research into integrated process life cycles for ML-enabled systems (covering software engineering and data science) is needed.
Kaestner’s and Nahar’s work was supported in part by NSF awards 1813598 and 2131477. We thank all our interview participants (K M Jawadur Rahman, and others) and the people who helped us connect with them.
-  Akkerman, S.F. and Bakker, A. 2011. Boundary Crossing and Boundary Objects. Review of educational research. 81, 2 (Jun. 2011), 132–169.
-  Akkiraju, R., Sinha, V., Xu, A., Mahmud, J., Gundecha, P., Liu, Z., Liu, X. and Schumacher, J. 2020. Characterizing Machine Learning Processes: A Maturity Framework. Business Process Management (2020), 17–31.
-  Ameisen, E. 2020. Building Machine Learning Powered Applications: Going from Idea to Product. O’Reilly Media, Inc.
-  Amershi, S., Begel, A., Bird, C., DeLine, R., Gall, H., Kamar, E., Nagappan, N., Nushi, B. and Zimmermann, T. 2019. Software Engineering for Machine Learning: A Case Study. In Proc. of 41st Int’l Conf. on Software Engineering: Software Engineering in Practice (ICSE-SEIP) (2019), 291–300.
-  Amershi, S., Chickering, M., Drucker, S.M., Lee, B., Simard, P. and Suh, J. 2015. ModelTracker: Redesigning Performance Analysis Tools for Machine Learning. In Proc. of 33rd Conf. on Human Factors in Computing Systems (2015), 337–346.
-  Amershi, S., Weld, D., Vorvoreanu, M., Fourney, A., Nushi, B., Collisson, P., Suh, J., Iqbal, S., Bennett, P.N., Inkpen, K., Teevan, J., Kikin-Gil, R. and Horvitz, E. 2019. Guidelines for Human-AI Interaction. In Proc. of CHI Conf. on Human Factors in Computing Systems (2019), 1–13.
-  Arpteg, A., Brinne, B., Crnkovic-Friis, L. and Bosch, J. 2018. Software Engineering Challenges of Deep Learning. In Proc. Euromicro Conf. Software Engineering and Advanced Applications (SEAA) (2018), 50–59.
-  Ashmore, R., Calinescu, R. and Paterson, C. 2019. Assuring the Machine Learning Lifecycle: Desiderata, Methods, and Challenges. arXiv 1905.04223.
-  Bass, L., Clements, P. and Kazman, R. 1998. Software Architecture in Practice. Addison-Wesley Longman Publishing Co., Inc.
-  Bass, M., Herbsleb, J.D. and Lescher, C. 2009. A Coordination Risk Analysis Method for Multi-site Projects: Experience Report. In Proc. of Int’l Conf. on Global Software Engineering (2009), 31–40.
-  Baylor, D. et al. 2017. TFX: A TensorFlow-Based Production-Scale Machine Learning Platform. In Proc. Int’l Conf. on Knowledge Discovery and Data Mining (2017).
-  Bernardi, L., Mavridis, T. and Estevez, P. 2019. 150 successful machine learning models. In Proc. of 25th Int’l Conf. on Knowledge Discovery & Data Mining (2019).
-  Bhatt, U., Xiang, A., Sharma, S., Weller, A., Taly, A., Jia, Y., Ghosh, J., Puri, R., Moura, J.M.F. and Eckersley, P. 2020. Explainable machine learning in deployment. In Proc. of Conf. on Fairness, Accountability, and Transparency (2020), 648–657.
-  Bogart, C., Kästner, C., Herbsleb, J. and Thung, F. 2021. When and how to make breaking changes: Policies and practices in 18 open source software ecosystems. ACM Transactions on Software Engineering and Methodology. 30, 4 (2021), 1–56.
-  Borg, M., Englund, C., Wnuk, K., Duran, B., Levandowski, C., Gao, S., Tan, Y., Kaijser, H., Lönn, H. and Törnqvist, J. 2019. Safely Entering the Deep: A Review of Verification and Validation for Machine Learning and a Challenge Elicitation in the Automotive Industry. Journal of Automotive Software Engineering. 1, 1 (2019), 1–9.
-  Bosch, J., Olsson, H.H. and Crnkovic, I. 2021. Engineering AI Systems: A Research Agenda. Artificial Intelligence Paradigms for Smart Cyber-Physical Systems. IGI Global. 1–19.
-  Boujut, J.-F. and Blanco, E. 2003. Intermediary Objects as a Means to Foster Co-operation in Engineering Design. Computer supported cooperative work: CSCW: an international journal. 12, 2 (2003), 205–219.
-  Braiek, H.B. and Khomh, F. 2020. On testing machine learning programs. The Journal of systems and software. 164, (2020), 110542.
-  Brandstädter, S. and Sonntag, K. 2016. Interdisciplinary Collaboration. Advances in Ergonomic Design of Systems, Products and Processes (2016), 395–409.
-  Breck, E., Cai, S., Nielsen, E., Salib, M. and Sculley, D. 2017. The ML test score: A rubric for ML production readiness and technical debt reduction. In Proc. of Int’l Conf. on Big Data (Big Data) (2017), 1123–1132.
-  Brown, G.F.C. 1995. Factors that facilitate or inhibit interdisciplinary collaboration within a professional bureaucracy. University of Arkansas.
-  Cai, C.J., Winter, S., Steiner, D., Wilcox, L. and Terry, M. 2019. “hello AI”: Uncovering the onboarding needs of medical practitioners for human-AI collaborative decision-making. Proceedings of the ACM on human-computer interaction. 3, CSCW (2019), 1–24.
-  Cataldo, M., Wagstrom, P.A., Herbsleb, J.D. and Carley, K.M. 2006. Identification of Coordination Requirements: Implications for the Design of Collaboration and Awareness Tools. In Proc. of Conf. Computer Supported Cooperative Work (CSCW) (2006), 353–362.
-  Chattopadhyay, S., Prasad, I., Henley, A.Z., Sarma, A. and Barik, T. 2020. What’s Wrong with Computational Notebooks? Pain Points, Needs, and Design Opportunities. In Proc. of CHI Conf. on Human Factors in Computing Systems (2020), 1–12.
-  Cheng, D., Cao, C., Xu, C. and Ma, X. 2018. Manifesting Bugs in Machine Learning Code: An Explorative Study with Mutation Testing. In Proc. of Int’l Conf. on Software Quality, Reliability and Security (QRS) (2018), 313–324.
-  Chen, Z., Cao, Y., Liu, Y., Wang, H., Xie, T. and Liu, X. 2020. Understanding Challenges in Deploying Deep Learning Based Software: An Empirical Study. arXiv 2005.00760.
-  Conway, M.E. 1968. How Do Committees Invent? Datamation. 14, 4 (1968), 28–31.
-  Cossette, B.E. and Walker, R.J. 2012. Seeking the Ground Truth: A Retroactive Study on the Evolution and Migration of Software Libraries. In Proc. of Int’l Symposium Foundations of Software Engineering (FSE) (2012), 1–11.
-  Curtis, B., Krasner, H. and Iscoe, N. 1988. A field study of the software design process for large systems. Communications of the ACM. 31, 11 (1988), 1268–1287.
-  Dabbish, L., Stuart, C., Tsay, J. and Herbsleb, J. 2012. Social Coding in GitHub: Transparency and Collaboration in an Open Software Repository. In Proc. of Conf. Computer Supported Cooperative Work (CSCW) (2012), 1277–1286.
-  Gartner Identifies the Top Strategic Technology Trends for 2021: . Accessed: 2021-07-30.
-  Haakman, M., Cruz, L., Huijgens, H. and van Deursen, A. 2021. AI lifecycle models need to be revised. Empirical Software Engineering. 26, 5 (2021), 1–29.
-  Haakman, M., Cruz, L., Huijgens, H. and van Deursen, A. 2020. AI Lifecycle Models Need To Be Revised. An Exploratory Study in Fintech. arXiv 2010.02716.
-  Harsh, S. 2011. Purposeful Sampling in Qualitative Research Synthesis. Qualitative Research Journal. 11, 2 (2011), 63–75.
-  Head, A., Hohman, F., Barik, T., Drucker, S.M. and DeLine, R. 2019. Managing messes in computational notebooks. In Proc. of CHI Conf. on Human Factors in Computing Systems (2019).
-  Herbsleb, J.D. and Grinter, R.E. 1999. Splitting the Organization and Integrating the Code: Conway’s Law Revisited. In Proc. of Int’l Conf. Software Engineering (ICSE) (1999), 85–95.
-  Holstein, K., Wortman Vaughan, J., Daumé, H., Dudik, M. and Wallach, H. 2019. Improving Fairness in Machine Learning Systems: What Do Industry Practitioners Need? In Proc. of CHI Conf. on Human Factors in Computing Systems (2019), 1–16.
-  Hopkins, A. and Booth, S. 2021. Machine learning practices outside big tech: How resource constraints challenge responsible development. In Proc. of AAAI/ACM Conference on AI, Ethics, and Society (2021).
-  Hukkelberg, I. and Rolland, K. 2020. Exploring Machine Learning in a Large Governmental Organization: An Information Infrastructure Perspective. European Conference on Information Systems. (2020).
-  Hulten, G. 2019. Building Intelligent Systems: A Guide to Machine Learning Engineering. Apress.
-  Humbatova, N., Jahangirova, G., Bavota, G., Riccio, V., Stocco, A. and Tonella, P. 2020. Taxonomy of real faults in deep learning systems. In Proc. of 42nd Int’l Conf. on Software Engineering (ICSE) (2020).
-  Ishikawa, F. and Yoshioka, N. 2019. How do engineers perceive difficulties in engineering of machine-learning systems? - questionnaire survey. In Proc. of Joint 7th International Workshop on Conducting Empirical Studies in Industry (CESI) and 6th International Workshop on Software Engineering Research and Industrial Practice (SER&IP) (2019).
-  Islam, M.J., Nguyen, H.A., Pan, R. and Rajan, H. 2019. What Do Developers Ask About ML Libraries? A Large-scale Study Using Stack Overflow. arXiv [cs.SE].
-  Kandel, S., Paepcke, A., Hellerstein, J.M. and Heer, J. 2012. Enterprise data analysis and visualization: An interview study. IEEE transactions on visualization and computer graphics. 18, 12 (2012), 2917–2926.
-  Kang, D., Raghavan, D., Bailis, P. and Zaharia, M. 2020. Model Assertions for Monitoring and Improving ML Models. arXiv 2003.01668.
-  Kästner, C. and Kang, E. 2020. Teaching Software Engineering for AI-Enabled Systems. In Proc. of 42nd Int’l Conf. on Software Engineering: Software Engineering Education and Training (ICSE-SEET) (2020), 45–48.
-  Kim, M., Zimmermann, T., DeLine, R. and Begel, A. 2018. Data Scientists in Software Teams: State of the Art and Challenges. IEEE Transactions on Software Engineering. 44, 11 (2018), 1024–1038.
-  Lakshmanan, V., Robinson, S. and Munn, M. 2020. Machine Learning Design Patterns. O’Reilly Media, Inc.
-  Lewis, G.A., Bellomo, S. and Ozkaya, I. 2021. Characterizing and Detecting Mismatch in Machine-Learning-Enabled Systems. arXiv 2103.14101.
-  Li, P.L., Ko, A.J. and Begel, A. 2017. Cross-Disciplinary Perspectives on Collaborations with Software Engineers. In Proc. of 10th Int’l Workshop on Cooperative and Human Aspects of Software Engineering (CHASE) (2017), 2–8.
-  Lwakatare, L.E., Raj, A., Bosch, J., Olsson, H.H. and Crnkovic, I. 2019. A taxonomy of software engineering challenges for machine learning systems: An empirical investigation. International Conference on Agile Software Development (2019), 227–243.
-  Lwakatare, L.E., Raj, A., Crnkovic, I., Bosch, J. and Olsson, H.H. 2020. Large-scale machine learning systems in real-world industrial settings: A review of challenges and solutions. Information and software technology. 127, 106368 (2020), 106368.
-  Madaio, M.A., Stark, L., Wortman Vaughan, J. and Wallach, H. 2020. Co-Designing Checklists to Understand Organizational Challenges and Opportunities around Fairness in AI. In Proc. of CHI Conf. on Human Factors in Computing Systems (2020), 1–14.
-  Mahanti, R. 2019. Data Quality: Dimensions, Measurement, Strategy, Management, and Governance. Quality Press.
-  Mäkinen, S., Skogström, H., Laaksonen, E. and Mikkonen, T. 2021. Who Needs MLOps: What Data Scientists Seek to Accomplish and How Can MLOps Help? arXiv 2103.08942.
-  Martínez-Fernández, S., Bogner, J., Franch, X., Oriol, M., Siebert, J., Trendowicz, A., Vollmer, A.M. and Wagner, S. 2021. Software Engineering for AI-Based Systems: A Survey. arXiv 2105.01984.
-  Martinez-Plumed, F., Contreras-Ochando, L., Ferri, C., Hernandez Orallo, J., Kull, M., Lachiche, N., Ramirez Quintana, M.J. and Flach, P.A. 2021. CRISP-DM twenty years later: From data mining processes to data science trajectories. IEEE transactions on knowledge and data engineering. 33, 8 (2021), 3048–3061.
-  Meyer, B. 1997. Object-Oriented Software Construction. Prentice-Hall.
-  Mistrík, I., Grundy, J., van der Hoek, A. and Whitehead, J. 2010. Collaborative Software Engineering. Springer Science & Business Media.
-  Mitchell, M., Wu, S., Zaldivar, A., Barnes, P., Vasserman, L., Hutchinson, B., Spitzer, E., Raji, I.D. and Gebru, T. 2019. Model Cards for Model Reporting. In Proc. of Conf. on Fairness, Accountability, and Transparency (2019), 220–229.
-  O’Leary, K. and Uchida, M. 2020. Common problems with creating machine learning pipelines from existing code. In Proc. of 3rd Conf. on Machine Learning and Systems (MLSys) (2020).
-  Ovaska, P., Rossi, M. and Marttiin, P. 2003. Architecture as a coordination tool in multi-site software development. Software Process Improvement and Practice. 8, 4 (2003), 233–247.
-  Ozkaya, I. 2020. What Is Really Different in Engineering AI-Enabled Systems? IEEE Software. 37, 4 (2020), 3–6.
-  Park, S., Wang, A., Kawas, B., Vera Liao, Q., Piorkowski, D. and Danilevsky, M. 2021. Facilitating Knowledge Sharing from Domain Experts to Data Scientists for Building NLP Models. arXiv 2102.00036.
-  Parnas, D.L. 1972. On the Criteria to be used in Decomposing Systems into Modules. Communications of the ACM. 15, 12 (1972), 1053–1058.
-  Patel, K., Fogarty, J., Landay, J.A. and Harrison, B. 2008. Investigating statistical machine learning as a tool for software development. In Proc. of SIGCHI Conf. on Human Factors in Computing Systems (2008), 667–676.
-  Pimentel, J.F., Murta, L., Braganholo, V. and Freire, J. 2019. A large-scale study about quality and reproducibility of jupyter notebooks. In Proc. of 16th Int’l Conf. on Mining Software Repositories (MSR) (2019).
-  Piorkowski, D., Park, S., Wang, A.Y., Wang, D., Muller, M. and Portnoy, F. 2021. How AI Developers Overcome Communication Challenges in a Multidisciplinary Team: A Case Study. arXiv 2101.06098.
-  Polyzotis, N., Roy, S., Whang, S.E. and Zinkevich, M. 2018. Data Lifecycle Challenges in Production Machine Learning: A Survey. SIGMOD Rec. 47, 2 (2018), 17–28.
-  Polyzotis, N., Roy, S., Whang, S.E. and Zinkevich, M. 2017. Data Management Challenges in Production Machine Learning. In Proc. of ACM Int’l Conf. on Management of Data (2017), 1723–1726.
-  Polyzotis, N., Zinkevich, M., Roy, S., Breck, E. and Whang, S. 2019. Data validation for machine learning. In Proc. of Machine Learning and Systems (2019), 334–347.
-  Rahimi, M., Guo, J.L.C., Kokaly, S. and Chechik, M. 2019. Toward Requirements Specification for Machine-Learned Components. In Proc. of 27th Int’l Requirements Engineering Conference Workshops (REW) (2019), 241–244.
-  Rakova, B., Yang, J., Cramer, H. and Chowdhury, R. 2021. Where Responsible AI meets Reality: Practitioner Perspectives on Enablers for Shifting Organizational Practices. Proc. ACM Hum.-Comput. Interact. 5, CSCW1 (2021), 1–23.
-  Ré, C., Niu, F., Gudipati, P. and Srisuwananukorn, C. 2019. Overton: A data system for monitoring and improving machine-learned products. arXiv [cs.LG].
-  Salay, R., Queiroz, R. and Czarnecki, K. 2017. An Analysis of ISO 26262: Using Machine Learning Safely in Automotive Software. arXiv 1709.02435.
-  Sambasivan, N., Kapania, S., Highfill, H., Akrong, D., Paritosh, P. and Aroyo, L.M. 2021. “Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI. In Proc. of CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery. 1–15.
-  Sarma, A., Redmiles, D.F. and van der Hoek, A. 2012. Palantir: Early Detection of Development Conflicts Arising from Parallel Code Changes. IEEE Transactions on Software Engineering. 38, 4 (2012), 889–908.
-  Schelter, S., Lange, D., Schmidt, P., Celikel, M., Biessmann, F. and Grafberger, A. 2018. Automating Large-scale Data Quality Verification. Proceedings of the VLDB Endowment International Conference on Very Large Data Bases. 11, 12 (2018), 1781–1794.
-  Sculley, D., Holt, G., Golovin, D., Davydov, E., Phillips, T., Ebner, D., Chaudhary, V., Young, M., Crespo, J.-F. and Dennison, D. 2015. Hidden Technical Debt in Machine Learning Systems. Advances in Neural Information Processing Systems 28. Curran Associates, Inc. 2503–2511.
-  Sculley, D., Otey, M.E., Pohl, M., Spitznagel, B., Hainsworth, J. and Zhou, Y. 2011. Detecting adversarial advertisements in the wild. In Proc. of 17th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data Mining (2011), 274–282.
-  Sendak, M.P. et al. 2020. Real-World Integration of a Sepsis Deep Learning Technology Into Routine Clinical Care: Implementation Study. JMIR medical informatics. 8, 7 (2020), e15182.
-  Serban, A., van der Blom, K., Hoos, H. and Visser, J. 2020. Adoption and Effects of Software Engineering Best Practices in Machine Learning. In Proc. of 14th Int’l Symposium on Empirical Software Engineering and Measurement (ESEM) (2020), 1–12.
-  Seymoens, T., Ongenae, F. and Jacobs, A. 2018. A methodology to involve domain experts and machine learning techniques in the design of human-centered algorithms. In Proc. IFIP Working Conf. on Human Work Interaction Design (2018).
-  Shneiderman, B. 2020. Bridging the gap between ethics and practice. ACM transactions on interactive intelligent systems. 10, 4 (Dec. 2020), 1–31.
-  Siebert, J., Joeckel, L., Heidrich, J., Nakamichi, K., Ohashi, K., Namba, I., Yamamoto, R. and Aoyama, M. 2020. Towards Guidelines for Assessing Qualities of Machine Learning Systems. In Proc. of Int’l Conf. on the Quality of Information and Communications Technology (2020), 17–31.
-  Singh, G., Gehr, T., Püschel, M. and Vechev, M. 2019. An abstract domain for certifying neural networks. Proc. ACM Program. Lang. 3, POPL (2019), 1–30.
-  Smith, D., Alshaikh, A., Bojan, R., Kak, A. and Manesh, M.M.G. 2014. Overcoming barriers to collaboration in an open source ecosystem. Technology Innovation Management Review. 4, 1 (2014).
-  d. S. Nascimento, E., Ahmed, I., Oliveira, E., Palheta, M.P., Steinmacher, I. and Conte, T. 2019. Understanding Development Process of Machine Learning Systems: Challenges and Solutions. In Proc. Int’l Symposium on Empirical Software Engineering and Measurement (ESEM) (2019), 1–6.
-  Braude, E.J. and Bernstein, M.E. 2011. Software Engineering: Modern Approaches, 2nd Edition. Wiley. ISBN-13: 978-0471692089.
-  de Souza, C.R.B. and Redmiles, D.F. 2008. An Empirical Study of Software Developers’ Management of Dependencies and Changes. In Proc. Int’l Conf. Software Engineering (ICSE) (2008), 241–250.
-  Strauss, A. and Corbin, J. 1994. Grounded theory methodology: An overview. Handbook of qualitative research. N.K. Denzin, ed. 273–285.
-  Strauss, A. and Corbin, J.M. 1990. Basics of Qualitative Research: Grounded Theory Procedures and Techniques. SAGE Publications.
-  Studer, S., Bui, T.B., Drescher, C., Hanuschkin, A., Winkler, L., Peters, S. and Mueller, K.-R. 2020. Towards CRISP-ML(Q): A Machine Learning Process Model with Quality Assurance Methodology. arXiv 2003.05155.
-  Tramèr, F., Atlidakis, V., Geambasu, R., Hsu, D., Hubaux, J., Humbert, M., Juels, A. and Lin, H. 2017. FairTest: Discovering Unwarranted Associations in Data-Driven Applications. In Proc. European Symposium on Security and Privacy (EuroS&P) (2017), 401–416.
-  Tranquillo, J. 2017. The T-Shaped Engineer. Journal of Engineering Education Transformations. 30, 4 (Apr. 2017), 12–24.
-  Vogelsang, A. and Borg, M. 2019. Requirements Engineering for Machine Learning: Perspectives from Data Scientists. In Proc. of 27th Int’l Requirements Engineering Conference Workshops (REW) (2019), 245–251.
-  Wagstaff, K. 2012. Machine Learning that Matters. arXiv 1206.4656.
-  Wang, A.Y., Mittal, A., Brooks, C. and Oney, S. 2019. How Data Scientists Use Computational Notebooks for Real-Time Collaboration. Proceedings of the ACM on Human-Computer Interaction. 3, CSCW (2019), 39.
-  Wan, Z., Xia, X., Lo, D. and Murphy, G.C. 2019. How does Machine Learning Change Software Development Practices? IEEE Transactions on Software Engineering. (2019), 1–1.
-  Waterman, M., Noble, J. and Allan, G. 2015. How Much Up-Front? A Grounded Theory of Agile Architecture. In Proc. of 37th Int’l Conf. on Software Engineering (2015), 347–357.
-  Why do 87% of data science projects never make it into production? 2019. . Accessed: 2021-07-30.
-  Wiens, J., Saria, S., Sendak, M., Ghassemi, M., Liu, V.X., Doshi-Velez, F., Jung, K., Heller, K., Kale, D., Saeed, M., Ossorio, P.N., Thadaney-Israni, S. and Goldenberg, A. 2019. Do no harm: a roadmap for responsible machine learning for health care. Nature medicine. 25, 9 (2019), 1337–1340.
-  Xie, X., Ho, J.W.K., Murphy, C., Kaiser, G., Xu, B. and Chen, T.Y. 2011. Testing and Validating Machine Learning Classifiers by Metamorphic Testing. The Journal of systems and software. 84, 4 (2011), 544–558.
-  Yang, Q., Suh, J., Chen, N.-C. and Ramos, G. 2018. Grounding Interactive Machine Learning Tool Design in How Non-Experts Actually Build Models. In Proc. of Conf. on Designing Interactive Systems (2018), 573–584.
-  Yokoyama, H. 2019. Machine Learning System Architectural Pattern for Improving Operational Stability. In Proc. of Int’l Conf. on Software Architecture Companion (ICSA-C) (2019), 267–274.
-  Zhang, A.X., Muller, M. and Wang, D. 2020. How do data science workers collaborate? Roles, workflows, and tools. Proceedings of the ACM on human-computer interaction. 4, CSCW1 (2020), 1–23.
-  Zhou, S., Vasilescu, B. and Kästner, C. 2020. How Has Forking Changed in the Last 20 Years? A Study of Hard Forks on GitHub. In Proc. of 42nd Int’l Conf. on Software Engineering (ICSE) (2020), 445–456.
-  Zinkevich, M. 2017. Rules of machine learning: Best practices for ML engineering. URL: https://developers.google.com/machine-learning/guides/rules-of-ml. (2017).