The development of machine learning (ML) models is often narrow- and near-sighted, only considering the publication target or minimum viable product requirements.
A main concern is that models are typically trained and tested on only a handful of curated datasets, without measures and safeguards for future scenarios.
Code is typically of subpar quality and poorly documented, a technical debt that is exacerbated because future users are rarely the original researchers or developers. Models and algorithms for deployment are integrated in a software stack that is robust and documented, but without regard for the inherent stochasticity and failure modes of the hidden ML components. (Consider the massive effect random seeds have on deep reinforcement learning model performance, shown by Henderson et al. (2018).)
Other domains of engineering, such as civil and spacecraft engineering, follow well-defined processes and testing standards to streamline development for high-quality, reliable results. Technology Readiness Level (TRL) is a systems engineering protocol for deep tech and scientific endeavors at scale, ideal for integrating many interdependent components and cross-functional teams of people. Unsurprisingly, TRL is standard process and parlance at NASA and DARPA (NASA, 2003).
For a spaceflight project there are several defined phases, from pre-concept to prototyping to deployed operations, each with a series of development cycles and reviews. This is in stark contrast to machine learning and software workflows, which promote quick iteration, rapid deployment, and simple linear progressions. Yet the NASA technology readiness process is overkill. We aim to bring systems engineering to machine learning by defining and putting into action a lean Technology Readiness Levels for ML (TRL4ML) framework. We draw on decades of AI development, from research through production, across domains and applications: for example, computer vision in medical diagnostics and factory robotics, NLP in commerce and social media, streaming time-series in predictive maintenance and finance.
In this paper we define our proven framework for developing and deploying robust ML systems, with a real example of advancing a novel algorithm from R&D through productization and deployment within a massive system. Our aim is to standardize TRL4ML to enable ML and SWE teams to develop principled, robust AI technologies. Ultimately, TRL4ML gets people across the organization speaking the same language.
TRL4ML defines technology readiness levels (TRLs) (NASA, 2003) to guide and communicate machine learning development and deployment. A TRL represents the maturity of a model or algorithm, data pipes, software module, or composition thereof. (We use “model” and “algorithm” somewhat interchangeably when referring to the technology under development; the same TRL4ML process applies to, e.g., a machine translation model or an algorithm for A/B testing.) A typical ML system consists of many interconnected subsystems and components, and the TRL of the system is the lowest level of its constituent parts. The levels are briefly defined as follows, and elucidated with an example project in Fig. 1:
Level 0 - Brainstorming
A stage for greenfield research.
The outcome is a set of concrete ideas with sound maths, to pursue through low-level experimentation in the next stage. To graduate, the basic principles, hypotheses, and research plans need to be stated, referencing relevant papers. The reviewer here is solely the research team lead.
Level 1 - Goal-Oriented Research
Moving from basic principles to practical use.
Here we design and run low-level experiments to analyze model/algorithm properties, which need to pass a peer-review process before graduating to level 2 – the review panel includes additional members of the research team.
Level 2 - Proof of Principle (PoP) Development
Active R&D is initiated.
The models run in testbeds: simulated environments and/or surrogate data that closely matches the conditions and data of real scenarios – note these are not product-driven. An important deliverable at this stage is the formal requirements document (with well-specified verification and validation steps). The culmination of this stage is often a bifurcation: some work moves to applied AI, while some circles back for more research.
Level 3 - System Development
Sound software engineering.
Here we have checkpoints that push code development towards interoperability, reliability, maintainability, extensibility, and scalability. In TRL4ML we develop with the mindset that research code will be thrown away when the project development calls for more legitimate software engineering. The level 3 review includes teammates who focus more on applied AI and engineering.
Level 4 - Proof of Concept (PoC) Development
Demonstration in a real scenario.
This stage is the seed of application-driven development; for many organizations this is the first touch-point with product managers and stakeholders beyond the R&D group. In review, we demonstrate the utility towards one or more practical applications, taking care to communicate assumptions and limitations. Ideally the organization has an AI ethics review process, which would be appropriate at this stage (as the AI capabilities and datasets are known).
Level 5 - Machine Learning “Capability”
The R&D to product handoff.
An interdisciplinary working group is defined, as we start developing the tech in the context of a larger real-world process – i.e., transitioning the model or algorithm from an isolated solution to a module of a larger application. Graduation from level 5 should be difficult, as it signifies the dedication of resources to push this ML technology through to productization.
Level 6 - Application Development
Robustification of ML modules, specifically towards one or more use-cases.
The main work here is significant software engineering to bring the code up to product caliber, as well as defining product-specific requirements and data pipeline specifications.
Level 7 - Integrations
ML infrastructure, product platform, data pipes, security protocols.
For integrating the technology into existing production systems, we recommend the working group has a balance of infrastructure engineers and applied AI engineers – we find this stage of development is vulnerable to latent model assumptions and failure modes. The review should focus on the data pipelines and test suites; a scorecard like the ML Testing Rubric is useful (Breck et al., 2016). We stress the need for tests that run use-case specific critical scenarios and data-slices – a proper risk-quantification table will highlight these.
Level 8 - Flight-ready
The end of system development.
The technology is demonstrated to work in its final form and under expected conditions. There should be additional tests implemented at this stage covering deployment aspects: A/B tests, blue/green deployment tests, shadow testing, canary testing, and others. The review panel is representative of the full slate of stakeholders. We diligently walk through every technical and product requirement, and the corresponding validations.
Level 9 - Deployment
Monitoring the current version, improving the next.
Maintenance engineering (i.e., monitoring and update methods) takes over; CI/CD should regularly stress test the system, and regression tests on ML components send logs to relevant applied and research engineers. There is a defined communication path for user feedback, without roadblocks to R&D; we encourage real-world feedback all the way to research, providing valuable problem constraints and perspectives.
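The rule stated before the level definitions, that the TRL of a system is the lowest level of its constituent parts, can be sketched in a few lines. This is a minimal illustration, not part of the TRL4ML specification: the `Component` class, its field names, and the example system are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    """A model, data pipe, software module, or a composition of them (hypothetical type)."""
    name: str
    trl: int = 0  # level assigned at review; only used for leaf components
    parts: List["Component"] = field(default_factory=list)

    def __post_init__(self) -> None:
        if not 0 <= self.trl <= 9:
            raise ValueError(f"TRL must be between 0 and 9, got {self.trl}")

    def readiness(self) -> int:
        """Own level for a leaf; otherwise the lowest level among the parts."""
        if not self.parts:
            return self.trl
        return min(part.readiness() for part in self.parts)


# Made-up system used only to exercise the rule.
system = Component("example ML system", parts=[
    Component("forecasting model", trl=6),
    Component("data pipes", parts=[
        Component("ingestion", trl=8),
        Component("feature store", trl=4),
    ]),
    Component("serving module", trl=7),
])

print(system.readiness())  # 4, set by the least mature part
```

Taking the minimum recursively means a single immature part, however deeply nested, caps the readiness reported for the whole system.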
2.1 Key components in the process
At the end of each stage is a dedicated review period: present the tech developments and their validations, make key decisions on path(s) forward (or backward), and debrief the process. (TRL4ML should include regular debriefs and meta-evaluations so that process improvements can be made in a data-driven, efficient way, rather than through an annual meta-review. TRL4ML is a high-level framework that each organization should operationalize in a way that suits its specific capabilities and resources.) The designated reviewers will “graduate” the technology to the next level, or provide a list of specific tasks that are still needed (ideally with quantitative remarks). After graduation at each level, the working group does a brief post-mortem; we find that a quick day or two pays dividends in cutting away technical debt and improving team processes.
In Fig. 2 we succinctly showcase a key deliverable: TRL cards. The model cards proposed by Google (Mitchell et al., 2019) are a useful development for external user-readiness with ML. On the other hand, our TRL cards are more like “report cards” that grow and improve upon graduating levels, and provide a means of inter-team and cross-functional communication. The content of a TRL card is roughly in two categories: project info, and implicit knowledge. The former clearly states info such as project owners and reviewers, development status, and semantic versioning (for code, models, and data). In the latter category are specific insights that are typically siloed in the ML development team but should be communicated to other stakeholders: modeling assumptions, dataset biases, corner cases, etc.
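As an illustration of the two content categories, a TRL card could be represented roughly as follows. This is a sketch under our own assumptions: the `TRLCard` class, its field names, and the `graduate` method are hypothetical and are not taken from Fig. 2.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Fields that hold implicit knowledge (hypothetical names).
KNOWLEDGE_FIELDS = ("modeling_assumptions", "dataset_biases", "corner_cases")


@dataclass
class TRLCard:
    # Project info
    project: str
    owners: List[str]
    reviewers: List[str]
    level: int                  # current TRL, 0-9
    versions: Dict[str, str]    # semantic versions for code, models, and data
    # Implicit knowledge, usually siloed in the ML development team
    modeling_assumptions: List[str] = field(default_factory=list)
    dataset_biases: List[str] = field(default_factory=list)
    corner_cases: List[str] = field(default_factory=list)

    def graduate(self, **notes: List[str]) -> None:
        """Advance one level and append the knowledge gained during the stage."""
        unknown = set(notes) - set(KNOWLEDGE_FIELDS)
        if unknown:
            raise KeyError(f"unknown knowledge fields: {sorted(unknown)}")
        if self.level >= 9:
            raise ValueError("card is already at level 9")
        for name, items in notes.items():
            getattr(self, name).extend(items)
        self.level += 1


# Made-up card contents, for illustration only.
card = TRLCard(
    project="demand forecaster",
    owners=["research lead"],
    reviewers=["applied AI engineer"],
    level=3,
    versions={"code": "0.3.0", "model": "0.2.1", "data": "1.0.0"},
)
card.graduate(corner_cases=["series with fewer than 30 observations"])
print(card.level, card.corner_cases)
```

Appending to the card at each graduation, rather than rewriting it, is one way to reflect the idea that the card grows as the technology matures.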
Identifying and addressing risks in a software project is not a new practice. However, akin to the TRL4ML roots in spacecraft engineering, risk is a “first-class citizen” here. In the definition of technical and product requirements, each entry has a calculation of the form , where the value of a component is an integer . Being diligent about quantifying risks across the technical requirements is a useful mechanism for flagging ML-related vulnerabilities that can sometimes be hidden by layers of other software. TRL4ML also specifies that risk quantification and testing strategies are required for sim-to-real development, because there is nearly always a non-trivial gap in transferring a model or algorithm from a simulation testbed to the real world. Requiring explicit sim-to-real testing steps in the workflow helps mitigate unforeseen (and often hazardous) failures.
Non-linear, non-monotonic paths
We observe that many projects benefit from cyclic paths, dialing components of a technology back to a lower level following the stage review. Our framework not only uses cycles, but actively discourages the straight-path approach that is typically assumed in ML projects. It’s also important to note that most projects do not start at level 0; very few ML companies engage in this low-level theoretical research. For example, a team looking to use an off-the-shelf object recognition model would start that technology at level 4. However, no technology can skip levels after the TRL4ML process has been initiated.
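These path rules can be checked mechanically. The sketch below encodes one reading of them: a technology may enter at any level, may be dialed back to any lower level (or held at its current level) after a review, and may advance by at most one level at a time. The function name and the sequence-of-levels representation are assumptions made for illustration.

```python
from typing import Sequence


def is_valid_path(levels: Sequence[int]) -> bool:
    """Check a sequence of TRLs recorded at successive stage reviews.

    Assumed rules: any starting level is allowed; moving back to a lower
    level is allowed; moving forward is limited to one level per review.
    """
    if not levels:
        return False
    if any(not 0 <= level <= 9 for level in levels):
        return False
    return all(nxt <= cur + 1 for cur, nxt in zip(levels, levels[1:]))


print(is_valid_path([4, 5, 6, 4, 5, 6, 7]))  # True: starts at 4, cycles back once
print(is_valid_path([4, 6]))                 # False: skips level 5
```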
By defining technology maturity in a quantitative way, TRL4ML enables teams to accurately and consistently define their ML progress metrics; OKRs and KPIs can be defined as achieving certain levels in a given period of time. Even more, meta-review of TRL4ML progress over multiple projects can provide useful insights at the organization level. For example, analysis of the time-per-level and the most frequent development paths/cycles can bring to light operational bottlenecks. Compared to conventional SWE metrics based on sprint stories and tickets, or time-tracking tools, TRL4ML provides a more accurate analysis of ML workflows.
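The time-per-level analysis mentioned above could be computed from a simple log of stage reviews. The following is a minimal sketch: the log format, project names, and dates are invented for illustration, and the function is a hypothetical helper, not a tool described in this paper.

```python
from collections import defaultdict
from datetime import date

# Invented review log: (project, level entered, date entered).
log = [
    ("proj-a", 2, date(2020, 1, 6)),
    ("proj-a", 3, date(2020, 2, 3)),
    ("proj-a", 4, date(2020, 3, 30)),
    ("proj-b", 2, date(2020, 1, 13)),
    ("proj-b", 3, date(2020, 1, 27)),
    ("proj-b", 4, date(2020, 4, 6)),
]


def mean_days_per_level(entries):
    """Average number of days spent at each level, across projects."""
    by_project = defaultdict(list)
    for project, level, entered in entries:
        by_project[project].append((entered, level))

    durations = defaultdict(list)
    for events in by_project.values():
        events.sort()  # chronological order within a project
        # Time at a level is the gap until the next recorded level change.
        for (start, level), (end, _) in zip(events, events[1:]):
            durations[level].append((end - start).days)

    return {level: sum(d) / len(d) for level, d in sorted(durations.items())}


print(mean_days_per_level(log))  # {2: 21.0, 3: 63.0}
```

In this invented data, level 3 takes three times as long as level 2 on average, which is the kind of signal that would prompt a closer look at that stage. Because durations are keyed by the level being left, a project that cycles back simply contributes more than one duration to the same level.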
There are several key areas where machine learning (ML) development differs from software engineering (SWE). For instance, the behavior of ML systems is learned from data, not specified directly in code. The data requirements around ML (i.e., data discovery, management, and monitoring) add significant complexity not seen in other types of SWE. There is also an array of ML-specific failure modes; for example, models can become miscalibrated due to subtle distributional shifts in the deployment data, making them more confident in their predictions than they should be. These are a couple of instances of broader themes we’ve observed, where ML systems depart from the rest of SWE. A recent case study from Microsoft Research (Amershi et al., 2019) similarly identifies a few themes. Also related to our work, Google teams have proposed ML testing recommendations (Breck et al., 2016) and methods for validating the data fed into ML systems (Breck et al., 2019). These analyses provide useful insights, but they do not provide a holistic, regimented process for the full ML lifecycle.
We’ve introduced TRL4ML, a proven systems engineering process for machine learning. We hope the framework is adopted broadly in AI/ML organizations, and that “technology readiness levels” becomes common nomenclature across stakeholders – from researchers and engineers to salespeople and CEOs.
- Amershi et al. (2019). Software engineering for machine learning: a case study. 2019 IEEE/ACM 41st International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP).
- Breck et al. (2016). What’s your ML test score? A rubric for ML production systems.
- Breck et al. (2019). Data validation for machine learning.
- Henderson et al. (2018). Deep reinforcement learning that matters. In AAAI.
- Mitchell et al. (2019). Model cards for model reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency.
- NASA (2003). NASA systems engineering handbook.
- Shahriari et al. (2016). Taking the human out of the loop: a review of Bayesian optimization. Proceedings of the IEEE 104, pp. 148–175.