Picture archiving and communication systems (PACS) (huang1988picture) and other clinical systems are designed for day-to-day clinical use. They hence lack the computational capabilities to efficiently execute machine learning (ML) pipelines on radiology images (cho2015medical). Radiology departments in hospitals consist of several PACS that receive images from various scanners. However, these PACS are clinical systems that have limited processing and memory resources to run ML pipelines. Biomedical informatics (BMI) research clusters and cloud environments, in contrast, are mainly designed to support heavy computing workloads for machine learning. Several factors should be satisfied to efficiently run the computational pipelines in real-time on such BMI clusters. First, there should be a fast and secure continuous data transfer from the healthcare PACS to the research clusters. Second, an efficient processing framework must be built to process the received radiology images, facilitating the execution of ML pipelines on the acquired images. Despite the advancements in ML frameworks in medicine, such a real-time big data processing framework, spanning healthcare PACS to a BMI research cluster, is still lacking.
Rapid progress in the last decade in computer vision and natural language processing has ignited imaginations that Artificial Intelligence (AI) will lead to lower costs, fewer errors, more efficiency, and better health care (noorbakhsh2019artificial). Patient wait times for examination and diagnosis significantly impact the effectiveness of cancer patient care. As radiologists are often a scarce resource, their availability affects the wait times. ML pipelines have been proposed to reliably prognosticate and predict cancer, mitigating the workload of diagnostic radiologists. However, most published research about AI in health care cannot be implemented in real clinical settings because the proposed systems lack the robustness required of production-ready systems. AI models may become more robust by training on heterogeneous, multi-institutional, carefully curated datasets. Despite technical validation, the lack of clinical integration renders the value or utility of these AI systems minimal and hinders their adoption into real-life clinical care to improve patient outcomes.
A medical professional requires a certain amount of time to access and process the images that are produced in real-time by the scanners in the hospital networks. The growth of the research potential of deep learning in radiology highlights the potential for real-time processing of images from scanners with minimal latency. With the help of image processing and ML pipelines, such diagnosis can be automated and accelerated. However, the clinical systems do not offer access and the ability to deploy software systems to perform complex computations on data and metadata. Furthermore, they lack resources and capabilities to run ML pipelines or extract and process data and metadata from the healthcare images in their PACS.
Enabling secured transfer of, and access to, the large volumes of data to be processed by the algorithms is mandatory for such executions. Digital Imaging and Communications in Medicine (DICOM) (pianykh2009digital) standardizes the format in which healthcare images and communications are stored and transferred across the network. The DICOM network protocol facilitates the reliable transfer of imaging data and Structured Reports (SR) (clunie2000dicom) between the PACS in the radiology departments and computing servers such as data centers, clouds, and BMI research clusters. However, running the workflows on scalable public clouds maintained by third parties comes with privacy concerns for sensitive healthcare data. Therefore, executing the ML pipelines on BMI clusters is often the only viable and secure option for healthcare images with protected health information (PHI). We hypothesize that ML pipelines can execute in real-time on the cancer images and structured reports in the secure and scalable clusters of the BMI research departments, given an efficient and reliable data transfer between the radiology and research departments.
In this paper, we designed Niffler, an ML framework that can receive a continuous DICOM messaging stream from healthcare PACS using DICOM network listeners. Niffler extracts and processes metadata from the DICOM images acquired at the BMI research clusters. It then executes ML pipelines and runs real-time analytics pipelines on radiology images, together with the textual metadata in their headers. The data transfer includes a push-based real-time data transfer as well as a query-driven data pull, to analyze both real-time and past examinations. We demonstrate the capability and stability of Niffler in receiving data in real-time securely, by running it continuously for over 18 months on the BMI research clusters to receive images from two healthcare PACS. The Niffler prototype deployment receives real-time data from one PACS while also retrieving query-based data from an enterprise archive PACS that stores historical data.
By leveraging medical image curation, knowledge discovery, information infrastructures, and natural language processing, Niffler extends a research imaging pipeline to include clinical data from the Real-Time Analytics (RTA) (trinks2017real) platform. Thus, Niffler provides an inference pipeline that allows for the clinical validation of AI models. We propose and prototype two AI use cases. The first case is an inferior vena cava (IVC) filter detection framework that will understand the context of care for patients with IVC filters, including whether the patients are anticoagulated, their anticoagulation profile, and whether they have an upcoming appointment for filter retrieval. We applied IVC filter detection algorithms on radiology images of modality XR, DX, CR, DR, and DX CR, for the body parts chest, spine, abdomen, small intestines, gallbladder, and thorax. We used the RetinaNet object detection model (lin2017focal) as our IVC filter detection model on the images. We observe high accuracy and efficiency in using Niffler to run ML pipelines on real-time images obtained from the PACS. The second use case is to validate end-to-end testing of a radiology orders vetting model. Through end-to-end real-time testing of AI models using this inference pipeline, we can understand how AI works and when it fails, without affecting patient care in production systems. Thus, we aim to increase the participation of radiologists in clinical validation. We envision radiologists, machine learning engineers, and hospital administrators who wish to validate AI tools on institutional data as the primary beneficiaries.
In the upcoming sections, we present our approach to run machine learning pipelines against real-time image streams from the PACS. We discuss the background and motivation in Section 2. We elaborate our solution architecture of Niffler in Section 3. We further present the evaluations of our prototype with the IVC filter use case in Section 4. Section 5 finally concludes the paper with a summary of findings and potential future work.
2 Background and Motivation
The DICOM standard (parisot1995dicom) enables interoperability in storing and transferring data between healthcare and research environments. In this paper, we present a framework for the efficient transfer of DICOM images from the healthcare PACS to the BMI research computer clusters and for executing the ML algorithms on the BMI servers. In this section, we look more closely into the state-of-the-art and the rationale behind proposing Niffler as a framework for ML pipelines.
) and TensorFlow (abadi2016tensorflow), as well as the release of pre-trained models through various model zoos, have catalyzed the maturation of tools for AI training. Multiple systems are now available for testing AI models in real-life settings. A review of the exhibits at the annual radiology meeting, RSNA 2019, shows that these tools are developed as marketplaces for FDA-approved algorithms or as systems that are embedded into existing PACS and imaging equipment such as chest X-ray units. The American College of Radiology has developed an open-source system called the ACR AI-LAB (acr) to democratize AI among radiologists by supporting on-premise federated learning.
burns2020just develops a real-time decision support pipeline for radiologists to support triage of emergent studies and reduce the rate of addendums. It also supports direct calls from the referring doctor to the correct person – for example, calls to technologists when a study is ordered and not performed, and to the reading room when a study has been dictated. They deployed an IVC detector running on imaging data filtered at the Study level, with the corresponding output of a JPG image with a bounding box around the IVC filter. epad provides a web-based framework for quantitative imaging informatics. It offers radiology image metadata with the Annotation and Image Markup (AIM) standard, and implements AIM as a web service with semantic image annotation. It is extensible but is limited in its capabilities. For example, a de-identification pipeline cannot be attached to it. Rather, DICOM images must be manually de-identified to provide anonymization before loading them to the web service.
Despite advances in Artificial Intelligence (AI) in medicine, the rate of validation of medical imaging AI studies is low: only 6% (31/516) of the studies published in 2018 performed external validation (i.e., diagnostic cohort, multi-institution data, prospective) (kim2019design). Even the FDA’s streamlined AI approval process often only requires a few test samples, with further testing viewed as post-marketing surveillance (fda1; fda2). Telling users to “try before you buy” is insufficient to determine which algorithms apply to specific clinical practices or to separate the clinically relevant “wheat” from the marketing “chaff.” Moreover, beyond infrastructure for training algorithms, there is still a critical gap in providing tooling to help inference evaluation of AI systems in local systems before large-scale deployment. Real-life testing of ML algorithms can take months to set up after the initial training, with time invested in developing imaging streams that are not in production and in meeting security requirements for on-premise deployment, all with minimal disruption to the workflow. Among 20 ML articles in Nature Medicine in 2018/2019, only one had a graphical user interface for model interaction, while ten provided code that requires significant time to set up (abid2020online). Visual analytic tools and interactive interfaces can close this accessibility gap and increase collaboration between engineers and end-users.
AI models are brittle, and they do not generalize. Dataset shift refers to changes in data characteristics relative to the original training data, which cause declines in AI performance over time (subbaswamy2020development). It requires continuous monitoring and recalibration. Differences in radiology equipment within and across institutions affect generalizability, and a model can learn and fine-tune itself based on equipment-specific details, affecting performance and clinical utility (zech2018variable). Clinical features like chest tubes for pneumothorax undermine model performance, as the models detect the tubes rather than the pneumothorax (taylor2018automated).
There is a gap between the engineering metrics used to evaluate algorithms and what is clinically useful. Model results are typically presented as confusion matrices or ROC (Receiver Operating Characteristic) curves, but they do not translate to clinical use. Tools to investigate what features drive predictions remain rudimentary (e.g., heat maps) and are not clinically useful, as they indicate where models derive the highest probability, but not why. Identifying sampling biases usually requires manual review and domain expertise, and such biases may not be apparent during model testing before actual images are reviewed. Most significantly, radiologists must participate in AI development: collecting test cases, establishing ground truth, choosing appropriate metrics and performance thresholds, and evaluating test cases while continuously monitoring outputs. Our vision for Niffler is to develop and test an AI inference pipeline that combines both clinical and imaging data with novel visual analytic (VA) tools that do not require a high level of engineering expertise and mirror the clinical workflow of radiologists. Thus, we aim to facilitate radiologists’ participation in AI clinical validation.
3 The Niffler Framework
The Niffler framework securely receives DICOM imaging data and radiology reports from the radiology department’s PACS to a research computing cluster. Niffler uses DICOM networking protocols to acquire real-time data and also pulls query-based data from the archival storage to our primary storage server. A continuously executing script extracts the metadata and stores it, in near real-time, in a scalable indexed data store deployed in a cluster. The data acquired in real-time is deleted nightly. The archival data retrieved with a query is stored for further processing. The processing server executes queries on the metadata and the data of the storage servers.
In this section, we elaborate on its design principle and describe the implementation strategy in the following section. Niffler combines clinical data from the Real-Time Analytics (RTA) and a live imaging pipeline deployed in the BMI research servers to develop an AI inference pipeline that supports the testing of AI models in real-time. The imaging pipeline system receives imaging studies continuously from the PACS production system, extracts and indexes the DICOM metadata, temporarily stores the images until an inference is performed by running the algorithm on the image, and after that, the image is purged from the pipeline.
Figure 1 depicts the overall deployment architecture of Niffler. By extracting and analyzing the metadata at the research clusters, Niffler enables the creation of DICOM image subsets that can be further analyzed or shared with others. Niffler deletes the data periodically after the extraction and execution of the ML pipelines. Relevant images are analyzed and shared with diagnostic radiologists for further examinations through the secured Application Layer interfaces of Niffler.
The radiology departments consist of several PACS, each receiving radiology images from scanners of various modalities. We deploy Niffler in a server in the BMI research cluster. We configure the DICOM Listener in Niffler to receive DICOM images from the PACS. The DICOM Listener consists of a Store SCP to store the images received in real-time as well as the images retrieved by running a query on historical data stored in enterprise archives. The Metadata Extractor executes its extraction query on all the acquired images, extracts the relevant metadata from the DICOM headers, and stores the metadata in a Metadata Store. A Processing Engine is configured on the metadata store to run queries on the metadata. The ML algorithms run directly on the images stored in the binary data storage. An application layer provides access to the ML algorithms and enables the sharing of images and results. Subsets of images that are relevant for a study can be shared with other researchers after de-identifying them, converting them into PNG images, or running ML pipelines on them to alter the images.
The Metadata Store is a NoSQL (han2011survey) database, with several collections that can be dynamically updated. The choice of a NoSQL database is due to their scalability and support for data in JSON format, supporting hierarchical entries. The metadata is used to filter cohorts and sub cohorts that allow for dataset creation for model inference. For example, to test whether the IVC filter model performance drops with the change of equipment, cohorts of data filtered by modality, and manufacturer are easily created at the metadata level.
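As an illustration of cohort creation at the metadata level, the following is a minimal sketch rather than Niffler's actual code. It assumes that each metadata document uses the standard DICOM keywords Modality, BodyPartExamined, and Manufacturer as keys; the function names are hypothetical. The filter document uses the MongoDB `$in` operator, and the matching is re-implemented in plain Python so that the sketch runs without a database.

```python
from collections import defaultdict


def build_cohort_filter(modalities, body_parts):
    """Build a MongoDB-style filter document for a cohort.

    Assumption: the metadata store keys its documents by DICOM keywords.
    """
    return {
        "Modality": {"$in": sorted(modalities)},
        "BodyPartExamined": {"$in": sorted(body_parts)},
    }


def matches(document, cohort_filter):
    """Evaluate the `$in`-only filter above against one metadata document."""
    return all(
        document.get(field) in condition["$in"]
        for field, condition in cohort_filter.items()
    )


def split_by_manufacturer(documents, cohort_filter):
    """Group the matching documents into sub-cohorts keyed by Manufacturer."""
    subcohorts = defaultdict(list)
    for document in documents:
        if matches(document, cohort_filter):
            subcohorts[document.get("Manufacturer", "UNKNOWN")].append(document)
    return dict(subcohorts)
```

The same filter dictionary could be passed to a MongoDB client query; grouping by manufacturer then yields the per-equipment sub-cohorts needed to check whether model performance changes with the equipment.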
Users will access both resources using a unified data explorer that will allow them to determine the cohort components required for model inference. For example, an end-user will access the metadata (without specific clinical information) and filter with a query like “I want all Abdomen X-rays for studies between 2012 and 2019 with sub cohorts of manufacturers and their anticoagulation medication and problem list”. Since the imaging pipeline is a prospectively populated system with an option for a query to extract images meeting a specific criterion, this limits the amount of duplicated information stored in the research clusters. Without the integrated pipeline, a researcher would have to submit multiple queries to the PACS research team and clinical data warehouse, work on anonymizing the data collected, merge the data, and then run the model inference. Thus, Niffler supports prospective dynamic cohort and subcohort creation, eliminating the need for duplicate data storage and aggregation, with anonymized model output. Through its cloud-native architecture that natively supports the execution of algorithms as containers, Niffler provides an infrastructure-agnostic execution with seamless scaling and migration.
3.2 Niffler Execution
Niffler autostarts at login as a service and starts the DCM4CHEE StoreSCP tool as a separate process to receive images in real-time from the PACS or based on queries on historical data stored in the enterprise archives. The DICOM images are stored in the local file system by default. The Niffler Metadata Extractor opens a Pickle file that holds two sets: the series that are already processed but not yet deleted, and the series that are already processed and deleted. If the Pickle file does not exist, the sets are initialized as empty sets, and the Pickle file is created.
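The start-up behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Niffler source: the function name `load_state`, the dictionary keys `processed` and `deleted`, and the on-disk layout of the Pickle file are all hypothetical.

```python
import pickle
from pathlib import Path


def load_state(pickle_path):
    """Return (processed_not_deleted, processed_and_deleted) sets of series.

    If the Pickle file is missing, both sets start empty and the file is
    created, mirroring the start-up behavior described in the text.
    """
    path = Path(pickle_path)
    if path.exists():
        with path.open("rb") as handle:
            state = pickle.load(handle)
        return set(state["processed"]), set(state["deleted"])
    processed, deleted = set(), set()
    with path.open("wb") as handle:
        pickle.dump({"processed": processed, "deleted": deleted}, handle)
    return processed, deleted
```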
Then Niffler reads the folder (specified by the user) consisting of the profiles stored as text files. The administrator creates various profiles – each consisting of elements that should be extracted from the DICOM headers of the images. For each profile, the Metadata Extractor initially would create a separate collection, if one does not exist already, in the same database. Each experiment can use existing profiles or create and deploy a new profile at run time without halting the execution. Hence, each experiment can have its own collection, with the elements specified by the user in the features files.
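A minimal sketch of profile loading is given below. The paper does not specify the profile file format, so this sketch assumes one DICOM attribute keyword per line in a `.txt` file and names each collection after the file stem; the function names are hypothetical. Re-scanning the folder on every extraction iteration is one simple way to pick up new profiles without halting execution.

```python
from pathlib import Path


def load_profiles(profile_dir):
    """Map each profile name (file stem) to its list of attribute keywords."""
    profiles = {}
    for profile_file in sorted(Path(profile_dir).glob("*.txt")):
        lines = profile_file.read_text(encoding="utf-8").splitlines()
        # Blank lines are ignored; surrounding whitespace is stripped.
        profiles[profile_file.stem] = [line.strip() for line in lines if line.strip()]
    return profiles


def newly_added(profiles, known_collections):
    """Return the profiles that do not yet have a collection."""
    return sorted(set(profiles) - set(known_collections))
```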
An extract_metadata process is the core of the Metadata Extractor. It runs periodically (by default, every 10 minutes) as a thread. However, only one instance of the extract_metadata thread runs at any given time. It goes through the DICOM file tree received and stored by the StoreSCP in the hierarchy patient-folder/study-folder/series-folder/instance.dcm. The process checks all the series available. As the data is, by default, stored in the file system, the Metadata Extractor uses an operating system command, run as a sub-process, for this check. In each iteration, the Metadata Extractor extracts metadata from the first image of each series that is not extracted yet. For performance reasons, Niffler extracts metadata from only one image per series. However, we can configure it to extract more than one (such as the first, last, and a middle instance in any given series) or all the images of each series. The sets are updated as depicted in (1) following an iteration of extraction.
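One iteration of such an extraction could look like the sketch below. It is an illustration under assumptions, not the Niffler implementation: it walks the tree with `pathlib` instead of an operating system command, treats the lexicographically first `.dcm` file as the "first image" of a series, and uses hypothetical function names. The header read relies on pydicom's `dcmread` with `stop_before_pixels=True`, imported lazily so that the directory scan itself needs only the standard library.

```python
from pathlib import Path


def first_instances(root, processed):
    """Map each not-yet-processed series folder to its first instance file."""
    pending = {}
    # Layout assumed from the text: patient-folder/study-folder/series-folder.
    for series_dir in sorted(Path(root).glob("*/*/*")):
        if not series_dir.is_dir() or str(series_dir) in processed:
            continue
        instances = sorted(series_dir.glob("*.dcm"))
        if instances:
            pending[str(series_dir)] = instances[0]
    return pending


def read_header(path, keywords):
    """Read the selected header elements with pydicom, skipping pixel data."""
    import pydicom  # third-party dependency, only needed for real DICOM files

    dataset = pydicom.dcmread(path, stop_before_pixels=True)
    return {keyword: str(dataset.get(keyword, "")) for keyword in keywords}


def extract_iteration(root, processed, keywords, reader=read_header):
    """Extract one record per new series and mark the series as processed."""
    records = []
    for series_dir, instance in first_instances(root, processed).items():
        record = reader(instance, keywords)
        record["series_path"] = series_dir
        records.append(record)
        processed.add(series_dir)
    return records
```

Passing the reader as a parameter keeps the traversal testable without DICOM files; the returned records are what would be inserted into the profile's collection.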
Niffler has a periodic clear_storage process that deletes the stored DICOM images every night. We have currently configured this deletion process to run at 23:59 each night. The extraction runs in real-time continuously. Specific images of interest and the images that are currently analyzed by an ML pipeline are prevented from being deleted. The sets are updated as depicted in (2) following an iteration of deletion.
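The nightly deletion can be sketched as below. This is a hedged illustration, not Niffler's code: the function names and the `protected` set (standing in for images of interest and images under analysis) are hypothetical, and the sketch assumes that only already-processed series are eligible for deletion.

```python
import shutil
from datetime import datetime, timedelta


def seconds_until(now, hour=23, minute=59):
    """Seconds from `now` until the next occurrence of hour:minute."""
    target = now.replace(hour=hour, minute=minute, second=0, microsecond=0)
    if target <= now:
        target += timedelta(days=1)
    return (target - now).total_seconds()


def clear_storage(processed, deleted, protected):
    """Delete processed series folders unless they are protected.

    Each deleted series moves from the processed-but-not-deleted set
    to the processed-and-deleted set.
    """
    for series_dir in sorted(processed - protected):
        shutil.rmtree(series_dir, ignore_errors=True)
        processed.discard(series_dir)
        deleted.add(series_dir)
```

A scheduler thread could sleep for `seconds_until(datetime.now())` and then call `clear_storage`, repeating every night.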
An update_pickle process runs periodically (by default, every 20 minutes) to ensure that both sets are written to the Pickle file. Thus, if Niffler terminates its execution in the middle, it resumes without losing data or the tracking of the progress. This state-awareness allows seamless updates to Niffler without losing track of the extraction and deletion processes. Therefore, even when Niffler halts involuntarily or is stopped manually, it does not lose track of what is extracted and deleted, and thus, when it restarts, it resumes execution where it left off.
We developed the entire Niffler prototype, including the IVC filter container, with Python3. The Pydicom library is used to extract metadata and process the DICOM images. The source code of Niffler is maintained in a GitHub repository (the access information to the GitHub repository is omitted at the time of the submission due to double-blind requirements). The DCM4CHEE (warnock2007benefits) StoreSCP tool is configured to receive all images in real-time, whereas a C-MOVE is configured to retrieve images based on specific queries.
Niffler uses MongoDB as its Metadata Store. A replicated MongoDB cluster is used to support the scaling and redundancy of the data. Mongo replica sets can be added to the MongoDB cluster without reconfiguring the database. We use Apache Spark as our processing engine. We run our queries directly on the metadata store using MongoDB’s client interface, as well as Apache Spark. Currently, Niffler stores the DICOM images in the local file system, each DICOM instance in a folder hierarchy of patients/studies/series. The folders are identified by their unique IDs such as PatientID, StudyInstanceUID, and SeriesInstanceUID, and are thus indexed and easily identifiable from the metadata.
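A periodic state writer in the spirit of update_pickle might look like the following sketch. The names and the dictionary layout are hypothetical, and writing to a temporary file followed by `os.replace` is a design choice of this sketch (so that a crash during a write cannot leave a truncated Pickle file), not something the paper states.

```python
import os
import pickle
import tempfile
import threading


def write_state(pickle_path, processed, deleted):
    """Write both sets to the Pickle file via an atomic rename."""
    directory = os.path.dirname(os.path.abspath(pickle_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "wb") as handle:
        pickle.dump({"processed": set(processed), "deleted": set(deleted)}, handle)
    os.replace(tmp_path, pickle_path)


def run_update_pickle(pickle_path, processed, deleted, stop_event,
                      interval_seconds=20 * 60):
    """Persist the sets every `interval_seconds` until `stop_event` is set."""
    # Event.wait returns False on timeout, so each timeout triggers a write.
    while not stop_event.wait(interval_seconds):
        write_state(pickle_path, processed, deleted)
    write_state(pickle_path, processed, deleted)  # final flush on shutdown
```

The loop would typically run in its own `threading.Thread`, with `stop_event.set()` called on shutdown.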
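Because the folder names are the DICOM identifiers, a metadata document is enough to locate the images of a series, and a path is enough to recover the identifiers. The sketch below illustrates this mapping; the function names and the storage root are hypothetical.

```python
from pathlib import Path


def series_folder(storage_root, metadata):
    """Locate a series folder from the identifiers in a metadata document."""
    return (
        Path(storage_root)
        / metadata["PatientID"]
        / metadata["StudyInstanceUID"]
        / metadata["SeriesInstanceUID"]
    )


def ids_from_series_folder(series_dir):
    """Recover the three identifiers from a patients/studies/series path."""
    patient_id, study_uid, series_uid = Path(series_dir).parts[-3:]
    return {
        "PatientID": patient_id,
        "StudyInstanceUID": study_uid,
        "SeriesInstanceUID": series_uid,
    }
```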
Niffler supports running ML algorithms as Docker (merkel2014docker) containers on the images and metadata that it stores. By using Docker container instances, we minimize the complicated configuration steps while automating the end-to-end process. Container executions are shared as Docker scripts, rather than complex bundles. The sample IVC Filter algorithm is built into a Docker container. Therefore, we can seamlessly deploy it to run on our server consisting of the DICOM images with minimal repeated configurations. It uses Keras RetinaNet as its base and uses pre-trained models to predict the existence of the IVC filter in the identified subcategories of the images. Given below is a sample (anonymized) entry in the metadata store for a DICOM image, from one of the three profiles that we have currently deployed.
We deployed Niffler in a server that is secured by strict firewall rules and configured the MongoDB instances with authentication. For data transfer efficiency, Niffler supports receiving data in a secure compressed DICOM data stream. The images are in JPEG lossless compressed form. Niffler uses GDCM (developers2010grass) to export the compressed DICOM images to a PNG format, which is further consumed by the ML pipelines. The metadata extraction works efficiently on the compressed DICOM images.
We designed Niffler in a way that ML models can be used as a plug-in and executed with minimal tuning of the infrastructure. The Niffler prototype provides fast processing of the model in real-time with high efficiency, by running the filtering at the metadata level and the ML pipelines using CPU only on the identified images. Niffler extracted data from 715 scanners, reaching up to 350 GB each day.
To measure the performance efficiency and viability of Niffler, we built and integrated an IVC filter detection algorithm on top of Niffler. We used the Keras implementation of RetinaNet object detection to identify whether an IVC filter is detected in the radiology images of the studies. We trained the algorithm on 827 abdominal, thoracoabdominal, and lumbar radiographs from various projection views. The radiographs included 348 images with the presence of an IVCF and 480 images without IVCF from the same anatomical regions. All the images were labeled and reviewed by two interventional radiologists. A bounding box was manually drawn to localize the filter on all the positive images. The dataset was randomly divided into three subsets: training (503 images), validation (127 images), and test (200 images). An object detection CNN based on the RetinaNet architecture was trained for 15 epochs, with a batch size of 1 and a learning rate of 0.00001. The backbone encoder CNN was based on the ResNet-50 architecture (he2016deep) pre-trained on the COCO object detection dataset (lin2014microsoft).
As we receive the images in real-time in the BMI clusters, the Metadata Extractor applies the filters on modality and body parts to create a subset of data. The object detection algorithm executes, taking the identified images as its input. The IVC filter container first converts the DICOM images into PNG images before running its inference on the PNG images, including chest X-rays, abdominal radiographs, and spine X-rays from the BMI imaging pipeline. The algorithm then draws a bounding box around the filter and outputs a PNG image with the detection box, as shown in Figure 2. We observe that the RetinaNet model classified the test images with a high accuracy of 96.0%.
At this point, clinical validation and deployment are incomplete, as we do not know whether the patient is anticoagulated and can have this filter removed, has a contraindication to filter removal, or already has an upcoming scheduled appointment for filter retrieval. Therefore, the most logical step to support end-to-end clinical validation of this algorithm would involve the consumption of Electronic Medical Record (EMR) data from the RTA on the laboratory information (INR, anticoagulation profile), medications (whether the patient is on any anticoagulant), problem list (for example, if a patient has a history of GI bleed and hence cannot be anticoagulated), and the upcoming clinical appointments where a patient can be seen in the clinic. Linkage to an ADT message would allow just-in-time clinical review of the patients in same-day appointments or education to providers on the benefits of IVC filter removal when the filter is no longer required.
Thus, linkage to EMR is essential to make this testing complete. The output of the IVC filter model is not actionable due to missing clinical data, including the determination of whether the patient is anticoagulated or if they have an upcoming appointment for filter retrieval. Therefore, by merging the imaging pipeline with the RTA stream, we can create a live AI inference pipeline that accelerates the development of clinically useful algorithms. To our knowledge, this will be the first AI inference pipeline that combines real-time image and clinical data information during AI validation. We propose to integrate the RTA clinical data pipeline with the imaging pipeline to provide tooling for data curation for model inference and training. Specifically, we will receive an HL7 feed from the RTA system, which will be normalized into Fast Healthcare Interoperability Resources (FHIR) resource groups. For this pilot, we will limit the scope of integration to cover the following resource groups – Patient, Organization, Appointment and Schedule, Medication, and Observation. Our choice for these resource groups is driven by our specific use case of an IVC filter detector that will be evaluated for this study. The FHIR resource groups will be added to our existing DICOM metadata and stored as document collections.
In this paper, we presented Niffler, a framework that supports the seamless transfer of data from the healthcare PACS to the BMI research clusters and enables efficient execution of ML pipelines on the images, reports, and the extracted metadata. With the efficient data transfer architecture between the healthcare and research departments and processing capabilities from the research clusters, we demonstrated the potential for seamless execution of ML pipelines in real-time from scanners to BMI research clusters.
As future work, we propose to add de-identification pipelines from The Cancer Imaging Archive (TCIA) (clark2013cancer), allowing human curation and a centralized process for IRB, thus ensuring that the data pipelines produce high-quality data in a secure and scalable manner. Our evaluations and prolonged execution of the Niffler prototype highlight its support for efficient distributed processing of data. Thus, Niffler enables the development of models against real-time data streams and also helps in gathering large-scale prospective data in a centralized store to facilitate imaging research.
We also propose to use standardized instruments of the unified theory of acceptance and use of technology (UTAUT) (venkatesh2016unified) to measure the perceived usefulness and satisfaction of the inference pipeline for AI validation. We will use the applied cognitive task analysis (ACTA) (militello1998applied) previously validated for extracting task information from subject matter experts and adapt it for clinical validation. This approach will entail a task interview with “think aloud” observations as radiologists validate AI models on the inference pipeline, followed by a knowledge audit of the decision making for determining clinical performance and utility of the AI algorithms.