Recent surveys run by many organizations, research advisory companies, government entities and media outlets have been pointing at the importance and potential of AI technologies to transform the way healthcare is delivered. In (McGrail, 2020), authors noted that 91% of healthcare stakeholders believe that adoption of AI will lead to improved patient access to care. Despite such an optimistic outlook about the impact of AI in healthcare, many fear that several barriers need to be overcome to fulfill this potential. These barriers are driven by several factors, including the need for more standardized and interoperable ways to access, manage and maintain data and AI models, the need to provide trust in AI modeling through complete transparency, explainability, data and model provenance, and the need for enhanced security and privacy around the secondary reuse of patient data (Donovan, 2020). Even though partial solutions for addressing each of these individual issues are available, a system that brings all of them together into a cohesive architecture for healthcare is novel and is our main contribution.
To this end, we propose a canonical architecture for the complete management of predictive healthcare AI applications throughout all phases of their life cycle, such as data ingestion, model building, and model promotion into production environments. The architecture is designed to accommodate trust and reproducibility as an inherent part of the AI life cycle and support the needs for a deployed AI system in healthcare. In what follows, we start with a crisp articulation of challenges that we have identified to derive the requirements for this architecture. We then follow with a description of this architecture before providing qualitative evidence of its capabilities in real world settings.
While AI offers powerful tools for building useful complex prediction systems quickly, it is common to incur massive ongoing maintenance costs in real-world AI systems. Many production AI systems are inherently brittle, for reasons ranging from underutilized data dependencies to a lack of code reuse between training and inference pipelines (Sculley et al., 2015). Systems such as Facebook’s FBLearner (Dunn, 2016), Uber’s Michelangelo (Hermann and Del Balso, 2017) and Databricks’ MLflow (Zaharia et al., 2018) have developed approaches and platforms to manage machine learning workflows for general use cases. However, healthcare workflows pose additional challenges when incorporating AI. They necessitate new approaches for each step, from data collection and model development to validation, deployment and monitoring (Yun Liu, 2019; Oleg S. Pianykh, [n.d.]). Some of the domain-specific challenges are given below.
Integration of data from multiple sources such as insurance claims, clinical data from EHRs, provider profiles, population statistics, social and community data, ontologies and other curated resources is essential for generating useful machine learning features.
Inference processes are often complex, involving multiple models to support explainable, actionable, and bias mitigated predictions.
Computing predictions at low latency in the context of long-term historical event data is important.
A high degree of accuracy is required of predictions used for critical decision-making, which in turn requires continuous monitoring and tuning of model performance (Yun Liu, 2019).
The system should provide generated results in a transparent manner to drive trust; it should also be able to provide explanations to end users on how these results were obtained.
The system must adhere to security and privacy regulations such as HIPAA and protect the AI models from various attacks.
While these challenges are present in other domains, addressing their aggregation is imperative to stand up production AI systems in healthcare.
3. Desiderata for Architecture
To meet the above challenges we identify the following as desirable characteristics of a solution architecture.
Modularity: The system should be able to incorporate new features through extension rather than modification. Building an AI solution involves collaboration between cross-functional teams with the diverse set of skills needed to handle data, code and model development. To enable such collaborative research and development, the architecture should be highly modular with a well-defined AI life-cycle management process.
Trust & Transparency: Some desired properties are:
The reliability of the predictive behavior of models varies when they are confronted with real-world data in deployment. Thus, there is a need for continuous monitoring of model performance (Breck et al., 2016) and for a systematic approach to addressing model staleness: continuous model evaluation, detection of “training/inference skew”, and detection of model drift to trigger retraining.
Maintaining provenance of data, models, and software is crucial for reproducibility and monitoring model performance in production.
The system should have a mechanism to reliably consume and incorporate user feedback to improve models.
The system should support state-of-the-art explainable AI modeling technologies.
Ease of Integration: The system should integrate seamlessly with existing clinician workflows, within the tools and applications that are already familiar to clinicians (Yun Liu, 2019). The system should also support the integration of diverse data sources.
Healthcare Interoperability: Support for open standards for healthcare such as FHIR.
Security & Privacy: The system must adhere to security and privacy regulations such as HIPAA.
Performance: The architecture should ensure that non-functional requirements for performance such as throughput, latency, or memory usage are met in addition to the requirements above (Qayyum et al., 2020).
In the next section, we describe a system that builds upon the ideas from (Fowler, 2019) to meet the architectural goals mentioned above.
Our proposed architecture consists of four subsystems: 1. Inference Framework, 2. Model Management Subsystem, 3. Feature Repository and 4. Model Development Toolkit.
4.1. Inference Framework
The Inference Framework is responsible for generating insights for end-user consumption. It runs inside the Inference Service and organizes the various software components into a cohesive set of modules through contracts. These modules are best described through their interactions with each other during the request-processing control flow.
Upon receiving a request for insight through the API Gateway, the framework calls the Model Registry to fetch the model specification (ModelSpec).
The returned ModelSpec consists of (a) a handle to the micro-service corresponding to the deployed model in the Model Serving runtime environment, (b) a list of the machine learning feature generation components (Feature Generators) that were used during training, (c) a reference to the list of Model Metadata components that can generate metadata associated with the prediction, such as explainability, actionability and robustness, and (d) provenance information that details the model algorithm, training inputs, parameters and metrics.
The framework then executes a sequence of steps based on the ModelSpec and the incoming request parameters to generate a prediction and prediction metadata.
A response is composed and returned to the client. The inference framework also logs the request-response pair in a repository for later use in model re-evaluation.
The system also captures user feedback. When feedback on a previous prediction is received through the API Gateway, it is logged to a Feedback Repository along with all associated metadata (such as the state within the clinical workflow when the feedback was submitted).
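The request-processing flow above can be sketched in a few lines of Python. This is a minimal illustration rather than the system's actual implementation: the `ModelSpec` field names, the `handle_request` function, the dictionary standing in for the Model Registry, and the toy generators and model are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

Features = Dict[str, float]


@dataclass
class ModelSpec:
    """Hypothetical shape for the four ModelSpec parts (a)-(d) described above."""
    predict: Callable[[Features], float]                                    # (a) handle to the served model
    feature_generators: List[Callable[[Dict[str, Any]], Features]]          # (b) generators used in training
    metadata_generators: List[Callable[[Features, float], Dict[str, Any]]]  # (c) prediction metadata
    provenance: Dict[str, Any] = field(default_factory=dict)                # (d) algorithm, inputs, metrics


def handle_request(request, model_registry, request_log):
    """Fetch the ModelSpec, build features, predict, attach metadata, log the pair."""
    spec = model_registry[request["model_name"]]
    features: Features = {}
    for generate in spec.feature_generators:
        features.update(generate(request["payload"]))
    prediction = spec.predict(features)
    metadata: Dict[str, Any] = {}
    for generate in spec.metadata_generators:
        metadata.update(generate(features, prediction))
    response = {"prediction": prediction, "metadata": metadata,
                "provenance": spec.provenance}
    request_log.append((request, response))  # kept for later model re-evaluation
    return response


# Toy components, invented purely for illustration.
def age_flag(payload):
    return {"age_over_65": 1.0 if payload["age"] > 65 else 0.0}


def admissions(payload):
    return {"prior_admissions": float(payload["prior_admissions"])}


def toy_risk_model(features):
    score = 0.1 + 0.3 * features["age_over_65"] + 0.2 * features["prior_admissions"]
    return min(1.0, score)


def top_feature(features, prediction):
    # Stand-in for an explainability component: report the largest feature value.
    return {"top_feature": max(features, key=features.get)}


TOY_REGISTRY = {
    "readmission": ModelSpec(
        predict=toy_risk_model,
        feature_generators=[age_flag, admissions],
        metadata_generators=[top_feature],
        provenance={"algorithm": "toy linear score"},
    )
}
```

Reusing the same Feature Generators at inference time that were recorded during training is what keeps the two pipelines from diverging.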
4.2. Model Management Subsystem
The Model Management Subsystem manages many elements of the model life-cycle. It is used to register models after training, to retrieve model specifications and provenance, and to execute models at runtime. It is built on top of MLflow (Zaharia et al., 2018) and consists of the following components: (1) the Model Registry, where the model, its specifications, metrics and provenance are registered. The training pipeline uses the Model Registry to log the model specifications, feature generators, prediction metadata generators, metrics, and provenance, and to register the model. The Inference Framework also uses it to retrieve the best model for the machine learning task. (2) Model Serving, which turns models into micro-services; inference runs within the service, usually via a request-response paradigm.
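To make the registry's two roles concrete (registration by the training pipeline, best-model lookup by the Inference Framework), the following is a minimal in-memory sketch. It is not the MLflow API and not the system's implementation; the class, method, and field names are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class RegisteredModel:
    task: str
    version: int
    spec: Dict[str, Any]        # e.g. feature generators, metadata generators
    metrics: Dict[str, float]   # evaluation metrics logged by the training pipeline
    provenance: Dict[str, Any]  # e.g. data version, code revision, hyper-parameters


class InMemoryModelRegistry:
    """Hypothetical stand-in for a Model Registry, kept in a Python list."""

    def __init__(self) -> None:
        self._models: List[RegisteredModel] = []

    def register(self, task, spec, metrics, provenance) -> RegisteredModel:
        # Versions are numbered per task, starting at 1.
        version = 1 + sum(1 for m in self._models if m.task == task)
        entry = RegisteredModel(task, version, spec, metrics, provenance)
        self._models.append(entry)
        return entry

    def best_model(self, task: str, metric: str,
                   higher_is_better: bool = True) -> Optional[RegisteredModel]:
        # Only models that actually logged the requested metric are candidates.
        candidates = [m for m in self._models
                      if m.task == task and metric in m.metrics]
        if not candidates:
            return None
        pick = max if higher_is_better else min
        return pick(candidates, key=lambda m: m.metrics[metric])
```

Which metric defines the "best" model is a policy decision; here it is passed in explicitly so the choice stays visible to the caller.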
4.3. Feature Repository
Feature generation, the process of transforming raw input data into features in the formats expected by the machine learning algorithm, is needed both during training and during real-time inference. In most large-scale machine learning projects, feature generation is done by a diverse team that utilizes a variety of methods, tools, and implementation approaches. How features are generated, maintained, and made available has an impact on the complexity of the system. Improperly managed feature generation can affect feature discovery, reuse, and the overall reliability of predictions (Miao and Deshpande, 2018), resulting in technical debt (Sculley et al., 2015). The concept of a Feature Repository was introduced by Uber (Hermann and Del Balso, 2017) and has since gained prominence. Implementations are typically based on a NoSQL database and service APIs to retrieve, add, and update feature data (Perez, 2019). The Feature Repository provides several benefits:
Provides provenance: The Feature Repository supports storage and retrieval of historical, versioned feature values.
Aids modularity: It enables standardization of definition, storage and access to feature data, promoting reuse and less duplication.
Accelerates innovation: Easy discovery of feature sets can jump-start machine learning models, increase learning efficiency and lower model development costs. This component provides interfaces and visualization tools for data exploration, error analysis, and model tuning by data scientists.
Improves run-time performance: It reduces latency of prediction by using pre-computed features that can be queried against a database instead of through a complex operation involving aggregations of large amounts of historical data. This also improves the throughput of data ingestion by capturing and exploiting the inter-dependencies between features to trigger re-computation or incremental updates. Feature Generators can also be grouped together for efficient execution.
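The provenance and run-time benefits listed above both rest on storing versioned, pre-computed feature values that can be read back as of a given date. A minimal sketch of that idea follows, using an in-memory structure instead of a NoSQL database; the class and method names are hypothetical.

```python
import bisect
from collections import defaultdict
from datetime import date
from typing import Any, DefaultDict, List, Optional, Tuple

Key = Tuple[str, str]  # (entity id, feature name)


class InMemoryFeatureRepository:
    """Hypothetical versioned feature store: one sorted history per key."""

    def __init__(self) -> None:
        self._dates: DefaultDict[Key, List[date]] = defaultdict(list)
        self._values: DefaultDict[Key, List[Any]] = defaultdict(list)

    def put(self, entity_id: str, feature: str, as_of: date, value: Any) -> None:
        # Insert while keeping each history sorted by date, so writes may
        # arrive out of order.
        key = (entity_id, feature)
        i = bisect.bisect_right(self._dates[key], as_of)
        self._dates[key].insert(i, as_of)
        self._values[key].insert(i, value)

    def get(self, entity_id: str, feature: str, as_of: date) -> Optional[Any]:
        # Return the latest value recorded on or before `as_of`, or None.
        key = (entity_id, feature)
        i = bisect.bisect_right(self._dates.get(key, []), as_of)
        return self._values[key][i - 1] if i else None
```

Reading a pre-computed value is a single lookup, whereas recomputing it at request time would require aggregating the patient's historical events; the `as_of` argument also lets a training run retrieve the feature values that were valid at a past point in time.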
4.4. Model Development Toolkit
Many teams build custom tools and use ad hoc ways to address requirements in the AI life-cycle, resulting in massive technical debt (Sculley et al., 2015). To address this, our solution incorporates a comprehensive toolkit for model development with integrated AI life-cycle management. It accelerates development and reduces maintenance effort by promoting repeatable workflows. It promotes reuse while providing the flexibility data scientists need to innovate. The model development process has several stages:
Data Acquisition: This step provides an interface for ingesting data from a variety of sources, such as EMR, claims, and Social and Behavioral Determinants of Health (SBDoH), for periodic retraining of the models as necessary. This also transforms and normalizes the data into a common data model for downstream tasks.
Cohort Construction: The first step in model building is to construct a cohort. Cohort construction interfaces with the raw data and the Feature Repository. Domain features that can be extracted directly from the data, such as demographics and admission diagnosis, can be added to the Feature Repository. In a production setting, cohorts may be extended in two ways: (1) adding new types of data (e.g., new types of claims, or claims from different time periods), and (2) defining new target events (e.g., unplanned admission, opiate use disorder, maternal morbidity, etc.). The output of this process provides the data for the training pipeline.
Data Exploration: Data scientists and model developers can search the Feature Repository to visualize and extract feature sets to explore and discover features to be applied to cohort construction.
Feature Generation: As new features are defined, metadata is added to the catalog and Feature Generators that produce the features are associated with the feature metadata. Feature definition includes specifying dependencies and grouping of features that should be treated as a unit.
Embeddings Generation: Embeddings may be generated from any subset of the available data, including EHRs, which provide demographics data, encounter data, and notes (Choi et al., 2016). These embeddings become available as features and are registered and stored in the Feature Repository.
Training Pipeline: The training pipeline consists of multiple models that predict the target event, perform model calibration, bias removal, calculate uncertainty metrics, expose feature importance, and perform explainability post-hoc analysis.
The modeling process starts with the creation of a model definition in the Model Registry. The model definition contains the complete specification and provenance, including which train and test data were used, the model algorithm, hyper-parameters, feature generators, metrics and thresholds. This definition is continuously updated through the Model Registry Client API as the pipeline is executed.
Multiple Feature Generators, as part of the training pipeline, are scaled out and run in parallel. Inter-feature dependencies are taken into consideration in specifying the sequence and parallelization. Feature Generators subscribe to feature data in the Feature Repository for incremental update.
The Model Registry tracks the metrics of the training experiments run by data scientists in order to identify the algorithms and parameters that result in the best model performance. At this stage, based on a release management process, the model is promoted to the production environment.
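The scheduling of Feature Generators described above, where inter-feature dependencies determine both the sequence and what can run in parallel, can be sketched with a topological sort. This is an illustrative simplification, not the system's scheduler: the function name, the generator signature, and the wave-by-wave execution are assumptions, and a thread pool stands in for whatever scale-out mechanism is actually used.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_feature_generators(generators, dependencies, raw):
    """Run generators in dependency order, parallelizing independent ones.

    generators:   name -> fn(raw, computed) returning the feature value
    dependencies: name -> set of generator names that must finish first
    Returns (computed features, list of waves that ran in parallel).
    """
    unknown = {d for deps in dependencies.values() for d in deps} - set(generators)
    if unknown:
        raise ValueError(f"dependencies without a generator: {sorted(unknown)}")

    sorter = TopologicalSorter(
        {name: dependencies.get(name, set()) for name in generators})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies

    computed, waves = {}, []
    with ThreadPoolExecutor() as pool:
        while sorter.is_active():
            ready = sorted(sorter.get_ready())  # all prerequisites are done
            waves.append(ready)
            # Finish the whole wave before publishing results, so generators
            # only ever read features from earlier waves.
            results = list(pool.map(lambda n: generators[n](raw, computed), ready))
            for name, value in zip(ready, results):
                computed[name] = value
                sorter.done(name)
    return computed, waves


# Toy generators over a list of claim costs, invented for illustration.
TOY_GENERATORS = {
    "n_claims": lambda raw, done: len(raw["claims"]),
    "total_cost": lambda raw, done: sum(raw["claims"]),
    "avg_cost": lambda raw, done: done["total_cost"] / done["n_claims"],
}
TOY_DEPENDENCIES = {"avg_cost": {"n_claims", "total_cost"}}
```

Executing in waves is simpler than a fully event-driven scheduler but can leave workers idle while the slowest generator in a wave finishes.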
Integrated AI Life-cycle Management: The architecture facilitates complete or partial automation of key AI life-cycle management activities:
Model Monitoring: The data from the Feedback Repository is used by monitoring components for retrospective testing of model accuracy and for sensing model drift and skew. Anomalies and significant changes in accuracy trigger a notification for a data scientist to analyze further and decide whether to retrain the model.
AutoML: AutoML allows AI researchers to automate many of the complicated and time-consuming tasks of feature engineering, model selection and hyper-parameter tuning, to optimize model performance end-to-end.
Provenance: Tracking the provenance of all aspects of model building is essential for reproducibility. This includes the training data, the software implementation of the feature generation logic (Feature Generators), and the machine learning algorithm and hyper-parameters used for training. We use the DVC library to version the data files; it uses the same identifiers as the Git source code manager (Community, 2020), enabling unique combined versioning of code and data. Finally, we employ the Model Registry to track the model hyper-parameters, metrics and model binaries.
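The idea of tying data, code, and training configuration into one reproducible identity can be illustrated with plain content hashing. The sketch below does not use DVC or Git; it only shows, under the assumption that the code revision is available as a string, how a single fingerprint can cover all of the provenance elements named above. The function and field names are hypothetical.

```python
import hashlib
import json
from typing import Any, Dict


def content_hash(data: bytes) -> str:
    """SHA-256 of the raw bytes of a data file."""
    return hashlib.sha256(data).hexdigest()


def provenance_record(data_bytes: bytes, code_revision: str,
                      algorithm: str,
                      hyperparameters: Dict[str, Any]) -> Dict[str, Any]:
    """Build a provenance record plus one fingerprint over all of its parts."""
    record: Dict[str, Any] = {
        "data_sha256": content_hash(data_bytes),
        "code_revision": code_revision,      # e.g. a source-control commit id
        "algorithm": algorithm,
        "hyperparameters": hyperparameters,
    }
    # Canonical JSON (sorted keys, fixed separators) makes the fingerprint
    # independent of dictionary ordering.
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    record["fingerprint"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return record
```

Two training runs share a fingerprint only if their data, code revision, algorithm, and hyper-parameters all match, which is the property needed to trace a trained model back to its exact inputs.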
We note that, despite the automation, an interdisciplinary team with deep domain knowledge and diverse skills such as machine learning, data analysis and cloud engineering is required to operate and support the production system.
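As an illustration of the Model Monitoring activity described above, the following sketch shows two simple checks that could raise a notification for a data scientist: a drop in retrospective accuracy computed from feedback, and a population stability index (PSI) comparing a feature's training and serving distributions. PSI is a commonly used skew measure chosen here as an example; the paper does not state which statistics the system uses, and the function names and thresholds below are assumptions.

```python
import bisect
import math
from typing import List, Sequence, Tuple


def accuracy_dropped(feedback: Sequence[Tuple[int, int]],
                     baseline_accuracy: float,
                     max_drop: float = 0.05) -> bool:
    """feedback holds (predicted label, label confirmed by user feedback)."""
    if not feedback:
        return False
    correct = sum(1 for predicted, actual in feedback if predicted == actual)
    return baseline_accuracy - correct / len(feedback) > max_drop


def population_stability_index(expected: Sequence[float],
                               actual: Sequence[float],
                               bin_edges: List[float],
                               eps: float = 1e-6) -> float:
    """PSI between training-time (expected) and serving-time (actual) values."""

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            counts[bisect.bisect_right(bin_edges, v)] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_notify(feedback, baseline_accuracy, train_values, serve_values,
                  bin_edges, psi_threshold: float = 0.2) -> bool:
    # 0.2 is a rule-of-thumb PSI threshold, used here as an assumption.
    psi = population_stability_index(train_values, serve_values, bin_edges)
    return accuracy_dropped(feedback, baseline_accuracy) or psi > psi_threshold
```

Keeping the decision to retrain with a human, as the text describes, means these checks only need to be sensitive enough to flag candidates for review.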
5. Results and Conclusions
Currently we have implemented many aspects of the end-to-end system, including the Model Development Toolkit and Model Management subsystem for the prediction of various health outcomes from medical insurance claims data. We are in the process of operationalizing the Inference Service and Feature Repository.
The system was used to analyze a longitudinal medical claim database spanning patient lives covering medical claims. We used a wide variety of features including demographic, diagnosis history, and procedure history of the patient’s medical claims to build models for two endpoints. The endpoints were selected with hospital administrators and clinicians as our end users. The scalable nature of the system allowed for semi-concurrent training of more than
architectures, spanning both classical and deep-learning models, over multiple variations/subsets of patient data, covering more thanraw features, culminating in more than experiments over a month period. All of these experiments were traceable from the input data, to the feature generation, and all the way to the model training traces and the final trained model. The system enabled easy access to various metrics on standardized comparison scenarios to assist in the final promotion of such models for deployment. The system was used collaboratively in a decentralized manner by a team of members, including data scientists, ML/AI researchers, and ML/AI engineers, working across various time zones. Our early experiences in using this system have been quite positive and have led to more effective collaboration between different stakeholders, letting them focus on their sub-problems while staying connected to the overall analysis.
Our architecture is designed to address the desired characteristics of an AI/ML system for healthcare (see Section 2) and has empowered us to conduct large-scale experiments in a repeatable and reproducible manner. We intend to carry out formal user studies to further investigate the benefits and gaps in our implementation. Our current efforts also include applying the system to many other problems and improving it over time by proactively incorporating feedback from our users.
- Qayyum et al. (2020) Adnan Qayyum, Junaid Qadir, Muhammad Bilal, and Ala Al-Fuqaha. 2020. Secure and Robust Machine Learning for Healthcare: A Survey. arXiv preprint arXiv:2001.08103v1.
- Breck et al. (2016) Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, and D Sculley. 2016. What’s your ML Test Score? A rubric for ML production systems. (2016).
- Choi et al. (2016) Edward Choi, Mohammad Taha Bahadori, Elizabeth Searles, Catherine Coffey, Michael Thompson, James Bost, Javier Tejedor-Sojo, and Jimeng Sun. 2016. Multi-layer representation learning for medical concepts. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1495–1504.
- Community (2020) The Git Open Source Community. 2020. Git. https://git-scm.com/.
- Donovan (2020) Fred Donovan. 06/25/2019. White House Wants Transparency in Healthcare Artificial Intelligence. https://hitinfrastructure.com/news/white-house-wants-transparency-in-healthcare-artificial-intelligence
- Dunn (2016) Jeffrey Dunn. 2016. Introducing FBLearner flow: Facebook’s AI backbone. Facebook Code 9 (2016), 2016.
- Fowler (2019) Martin Fowler. 2019. https://martinfowler.com/articles/cd4ml.html.
- Hermann and Del Balso (2017) Jeremy Hermann and Mike Del Balso. 2017. Meet Michelangelo: Uber’s machine learning platform. https://eng.uber.com/michelangelo.
- McGrail (2020) Samantha McGrail. 02/14/2020 (accessed 06/14/2020). Challenges of Artificial Intelligence Adoption in Healthcare. https://hitinfrastructure.com/news/challenges-of-artificial-intelligence-adoption-in-healthcare
- Miao and Deshpande (2018) Hui Miao and Amol Deshpande. 2018. ProvDB: Provenance-enabled Lifecycle Management of Collaborative Data Analysis Workflows. IEEE Data Eng. Bull. 41, 4 (2018), 26–38.
- Oleg S. Pianykh ([n.d.]) Oleg S. Pianykh, Steven Guitron, Darren Parke, Chengzhao Zhang, Pari Pandharipande, James Brink, and Daniel Rosenthal. [n.d.]. Improving healthcare operations management with machine learning. Nat Mach Intell 2, 266–273 (2020). https://doi.org/10.1038/s42256-020-0176-3
- Perez (2019) Oscar Perez. 2019. Accelerating Machine Learning with the Feature Store Service. https://technology.condenast.com/story/accelerating-machine-learning-with-the-feature-store-service/
- Sculley et al. (2015) David Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison. 2015. Hidden technical debt in machine learning systems. In Advances in neural information processing systems. 2503–2511.
- Yun Liu (2019) Yun Liu and Po-Hsuan Cameron Chen. 2019. https://ai.googleblog.com/2019/12/lessons-learned-from-developing-ml-for.html.
- Zaharia et al. (2018) Matei Zaharia, Andrew Chen, Aaron Davidson, Ali Ghodsi, Sue Ann Hong, Andy Konwinski, Siddharth Murching, Tomas Nykodym, Paul Ogilvie, Mani Parkhe, et al. 2018. Accelerating the Machine Learning Lifecycle with MLflow. IEEE Data Eng. Bull. 41, 4 (2018), 39–45.