SCALPEL extraction library to fetch concepts from flattened SNDS
This article introduces SCALPEL3, a scalable open-source framework for studies involving Large Observational Databases (LODs). Its design eases medical observational studies thanks to abstractions allowing concept extraction, high-level cohort manipulation, and production of data formats compatible with machine learning libraries. SCALPEL3 has successfully been used on the SNDS database (see Tuppin et al. (2017)), a huge healthcare claims database that handles the reimbursement of almost all French citizens. SCALPEL3 focuses on scalability, easy interactive analysis and helpers for data flow analysis to accelerate studies performed on LODs. It consists of three open-source libraries based on Apache Spark. SCALPEL-Flattening allows denormalization of the LOD (only SNDS for now) by joining tables sequentially into a big table. SCALPEL-Extraction provides fast concept extraction from a big table such as the one produced by SCALPEL-Flattening. Finally, SCALPEL-Analysis allows interactive cohort manipulations, monitoring statistics of cohort flows and building datasets to be used with machine learning libraries. The first two provide a Scala API while the last one provides a Python API that can be used in an interactive environment. Our code is available on GitHub. SCALPEL3 successfully extracted complex concepts for studies such as Morel et al. (2017), and for studies with 14.5 million patients observed over three years (corresponding to more than 15 billion healthcare events and roughly 15 terabytes of data) in less than 49 minutes on a small 15-node HDFS cluster. SCALPEL3 provides a sharp interactive control of data processing through legible code, which helps to build studies with full reproducibility, leading to improved maintainability and audit of studies performed on LODs.
In recent years, healthcare data volume and accessibility rose quickly. In France, the SNDS database covered 86% of the population in 2010 and reached 98.8% of the French population in 2015. SNDS is now considered to be one of the world’s largest health Large Observational Databases (LODs) [35, 7]. This database is a claims database initially built for accounting purposes. It mainly gathers data about French outpatients’ healthcare consumption reimbursement, linked with private and public hospital data. The data contains demographic characteristics of the patients and daily time-stamped data recording their interactions with the French healthcare system. This database is an extremely rich source of information, since it is almost exhaustive population-wise, leading to high statistical power and less sensitivity to population selection biases.
The use of LODs such as SNDS has proven useful for public health research. Compared to randomized controlled trial (RCT) data, it has the advantage of giving a much broader (but cruder) representation of patients at a fraction of the RCTs’ cost [24, 32, 33]. SNDS has been used to study patient care pathways, assess the prevalence of diseases, and monitor adverse drug reactions, among many other uses (see  or the Supplementary Material of  for more examples).
This abundance of data comes at a cost: SNDS is a very complex database, with data spread across hundreds of tables and columns, while its scale makes data manipulation non-trivial on the existing SAS-Oracle Exadata infrastructure. More importantly, using this data requires a tremendous amount of knowledge from SNDS experts. Many coding or data recording subtleties, such as data duplication caused by administrative complexity, might bewilder inexperienced users. As a result, deriving proper health events definitions and extracting them accurately is a non-trivial task, having important consequences on the result of the derived studies [35, 10]. These problems are of course not unique to SNDS but shared by many LODs .
In this paper, we describe SCALPEL3 (SCALable Pipeline for hEaLth data), a new open-source framework aiming at lowering such entry barriers to LODs. It is more than an Extract-Transform-Load (ETL) library since it also provides abstractions for interactive use, dataflow monitoring, and descriptive statistics. It is made of three open-source inter-operating libraries, developed and used on SNDS, named SCALPEL-Flattening, SCALPEL-Extraction and SCALPEL-Analysis, each of them open-sourced in a distinct GitHub repository. The libraries can be used independently of one another, and can be used with other LODs after some adaptation of SCALPEL-Flattening and SCALPEL-Extraction.
SCALPEL-Flattening converts raw SNDS data into a denormalized version stored in Parquet. SCALPEL-Extraction allows the extraction of high-level concepts from the denormalized data, such as medical acts, drug exposures, and hospitalizations. SCALPEL-Analysis eases cohort data manipulation and provides data quality checks and data flow monitoring and analysis. The manipulation of patient cohorts is facilitated by expressive and intelligible algebraic operations (such as union, difference, and intersection) that do not sacrifice the granular control of the data. A set of automated statistics and charts can be used to compare cohorts easily, such as flowcharts helping to monitor selection biases related to data manipulations. This complex data can easily be converted to various “machine learning ready” formats (such as TensorFlow or PyTorch tensors, or NumPy arrays).
SCALPEL3 is based on Apache Spark  to ensure scalability. SCALPEL-Flattening and SCALPEL-Extraction are implemented in Scala  to ease code maintainability and testing, while SCALPEL-Analysis uses Spark’s Python API for interactive use, visualization and interoperability with Python machine learning libraries.
SCALPEL3 allows performing fast iterations when working on cohort designs or developing new machine learning algorithms with reproducibility and data-flow audit in mind. Moreover, users do not need to be SNDS specialists, nor FHIR or CDM literate, to reach good productivity regarding data manipulation (converting from one standard to another [12, 13] is very difficult, since concepts can be hard to translate from one representation to another; SCALPEL3 does not provide a way to load data from FHIR or CDM, but connectors will be developed once vocabulary conversion tables are available, and work in this direction is ongoing). SCALPEL3 provides a way to work on SNDS data with at most a few days of training, instead of the weeks or months required to grasp either SNDS native data or existing standards [35, 6].
SNDS is a large claims database, containing pseudonymized data on 98.8% of the French population (66 million patients in 2015) [35, 7]. It contains time-stamped information about medical events which led to reimbursement (see Table 1 in  for an exhaustive list of available data) in the last 3 years (which can be extended up to 20 years under some restrictions). It contains more than 20 billion health events per year, representing roughly 70TB of data.
SNDS is composed of multiple “sub-databases”, each one with a star schema. The central table records events leading to cash flows that need to be joined to many other tables to access medical information (we work with two main sub-databases containing data relevant for public-health research; when working on drug safety studies, each of these two databases contains 8 relevant tables, representing approximately 5 billion lines a year when restricted to y.o. subjects). In this form, retrieving patient information for statistical studies is very costly in terms of computation and expert knowledge, since targeted data can be spread across multiple databases, tens of tables and hundreds of columns, and since performing such manipulations requires a deep administrative knowledge of the French healthcare reimbursement mechanisms. Mitigating these issues is precisely the motivation of the SCALPEL3 framework.
SCALPEL3 is based on Apache Spark, a robust and widely adopted distributed computation framework providing high-level data operations that can be coupled with the Hadoop Distributed File System (HDFS). As illustrated in Figure 1, SCALPEL3 is an open-source framework organized in the following three components.
SCALPEL-Flattening  denormalizes the data “once and for all” to avoid joining many tables each time the data of a patient is accessed. Its input is a set of CSV files extracted from the original SNDS database.
SCALPEL-Extraction  defines concepts extractors designed to process the denormalized data. For example, extractors can fetch all drug dispenses or medical acts.
SCALPEL-Analysis implements powerful and scalable abstractions that can be used for data analysis, such as easy ways to investigate data quality issues. Its output can be exported to formats commonly used for machine learning, such as NumPy arrays or TensorFlow or PyTorch tensors.
SCALPEL-Flattening and SCALPEL-Extraction are implemented in Scala with Spark, to access Spark’s low-level API and to use rigorous automated testing (94% of our Scala code is covered by tests). Both can be configured through textual configuration files. SCALPEL-Analysis is implemented in Python/PySpark for maximum interactivity (it can be used in a Jupyter notebook  for instance).
As mentioned earlier, using SNDS to perform data analysis on patients’ health requires many joins and can be consequently extremely slow. To circumvent this problem, we de-normalize the data by joining the tables sequentially to obtain a big table in which each line corresponds to a patient identifier and a wide representation of an event.
As expected, flattening a star-schema database results in a really big table due to values replications. To circumvent storage and computation issues, we use Parquet , an open-source columnar storage format implementing Google’s Dremel  data model, which is widely used in the Spark ecosystem . SCALPEL-Flattening maps SNDS tables to Parquet files that are combined into a single Parquet file for each SNDS sub-database, as summarized below.
SCALPEL-Flattening: DataBase[CSV] → DataBase[Parquet] → FlatDatabase[Parquet] → List[Source].
The Source abstraction is used to encapsulate the output Parquet files and can be seen as a set of Apache Spark Row objects providing two main tools: a SourceReader handling specific reading procedures for each Source and a SourceSanitizer getting rid of some data, such as corrupted rows or duplicated data.
SCALPEL-Flattening is designed to ensure maximum scalability and allows easy incorporation of new tables and sub-databases. Scalability to an arbitrary number of tables is achieved through a multiple join strategy using temporal slices. The size of the temporal slice, schema and joining keys can be tuned by the end-user through a configuration file, which defaults to the denormalization of tables containing only medical data (as opposed to econometric and administrative data). A set of automated statistics is available for monitoring the denormalization process.
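The slice-by-slice join strategy described above can be illustrated with a minimal, pure-Python sketch. It is not the SCALPEL-Flattening implementation (which runs on Spark and Parquet); the table contents, the column names (`flow_id`, `drug_code`) and the choice of one calendar month as the temporal slice are assumptions made only for illustration.

```python
from datetime import date

# Hypothetical miniature star schema: a central table of reimbursement
# events and one dimension table holding medical details.
central = [
    {"flow_id": 1, "patient_id": "A", "date": date(2015, 1, 10)},
    {"flow_id": 2, "patient_id": "B", "date": date(2015, 2, 3)},
    {"flow_id": 3, "patient_id": "A", "date": date(2015, 2, 20)},
]
drugs = [
    {"flow_id": 1, "drug_code": "N05BA"},
    {"flow_id": 3, "drug_code": "C07AB"},
]


def month_slices(rows):
    """Group central rows by (year, month), the temporal slice assumed here."""
    slices = {}
    for row in rows:
        slices.setdefault((row["date"].year, row["date"].month), []).append(row)
    return slices


def left_join(left_rows, right_rows, key):
    """Left join two lists of dicts on `key`, replicating left values."""
    index = {}
    for right in right_rows:
        index.setdefault(right[key], []).append(right)
    joined = []
    for left in left_rows:
        # Rows without a match are kept unchanged (left join semantics).
        for right in index.get(left[key], [{}]):
            joined.append({**left, **right})
    return joined


def flatten(central_rows, dimension_tables, key="flow_id"):
    """Join each temporal slice sequentially with every dimension table."""
    flat = []
    for _, slice_rows in sorted(month_slices(central_rows).items()):
        for table in dimension_tables:
            slice_rows = left_join(slice_rows, table, key)
        flat.extend(slice_rows)
    return flat


flat_table = flatten(central, [drugs])
```

Processing one slice at a time bounds the size of each join, which is the property the temporal slicing is meant to provide; the slice size would be a configuration parameter in practice.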
SCALPEL-Extraction provides fast medical concept extractions from the flat tables produced by SCALPEL-Flattening. It encapsulates SNDS technical knowledge but keeps medical data as raw as possible, so that end-users have access to fine-grained data which is often critical when designing observational studies [37, 11]. Extracted concepts are organized around two abstractions: Patient and Event.
The Patient abstraction has a unique patientID, a gender, a birthDate and possibly a deathDate.
The Event abstraction represents any event associated with a patient. It can be punctual (e.g., medical act) or continuous (e.g., hospitalization).
All concepts are automatically extracted into Patient or Event objects by a set of Extractors and Transformers, designed to fetch the data in the relevant tables and columns of the SNDS Sources.
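As a rough illustration of these two abstractions, the sketch below models them as Python dataclasses. The actual library defines them in Scala; the snake_case field names mirror the attributes listed above, while `category`, `value`, `is_punctual` and the example codes are hypothetical names introduced here.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Patient:
    patient_id: str
    gender: str
    birth_date: date
    death_date: Optional[date] = None  # absent when no death is recorded


@dataclass(frozen=True)
class Event:
    patient_id: str
    category: str               # e.g. "medical_act", "hospitalization"
    value: str                  # code describing the event
    start: date
    end: Optional[date] = None  # None for punctual events

    @property
    def is_punctual(self) -> bool:
        """A punctual event has a start date but no end date."""
        return self.end is None


patient = Patient("A", "F", date(1950, 1, 1))
act = Event("A", "medical_act", "hypothetical-act-code", date(2015, 3, 1))
stay = Event("A", "hospitalization", "hypothetical-stay-code",
             date(2015, 3, 2), date(2015, 3, 9))
```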
The Extractor abstraction maps a Row of a Source to 0 or many Events:
Extractor: Row → List[Event].
Extractors select the Columns required by a medical concept, then filter out the Rows of the Sources which do not match the conditions related to the concept, and finally output a list of corresponding Events. Extractors can be controlled by textual configuration files. Many Extractors are available to fetch medical acts, diagnoses, and hospital stays, among others. One example is the drug dispense Extractor, which extracts events related to specific drug subsets and outputs them at multiple levels of granularity (drug, molecule, ATC class, custom classes) as defined in a configuration file. This simple architecture makes it easy to add new Extractors to meet new extraction needs.
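The select, filter, and output steps of an Extractor can be sketched in a few lines of Python. This is only an illustration of the Row → List[Event] mapping, not the Scala API of SCALPEL-Extraction: the function names, the row columns and the `atc_prefixes` and `level` parameters are assumptions standing in for what a configuration file would provide.

```python
from datetime import date
from typing import NamedTuple


class Event(NamedTuple):
    patient_id: str
    category: str
    value: str
    start: date


# Hypothetical flat rows; column names are illustrative only.
rows = [
    {"patient_id": "A", "date": date(2015, 1, 10), "atc_code": "N05BA01"},
    {"patient_id": "B", "date": date(2015, 2, 3), "atc_code": None},
    {"patient_id": "A", "date": date(2015, 2, 20), "atc_code": "C07AB02"},
]


def drug_dispense_extractor(row, atc_prefixes, level="molecule"):
    """Map one row to zero or more drug dispense Events."""
    code = row.get("atc_code")
    # Filter: drop rows that do not match the concept's conditions.
    if code is None or not code.startswith(tuple(atc_prefixes)):
        return []
    # Granularity: full ATC code, or its first 5 characters (ATC subgroup).
    value = code if level == "molecule" else code[:5]
    return [Event(row["patient_id"], "drug_dispense", value, row["date"])]


def extract(rows, extractor, **config):
    """Apply an extractor to every row and concatenate the results."""
    return [event for row in rows for event in extractor(row, **config)]


events = extract(rows, drug_dispense_extractor,
                 atc_prefixes=["N05"], level="class")
all_molecules = extract(rows, drug_dispense_extractor,
                        atc_prefixes=["N05", "C07"])
```

Because each row is mapped independently, such a function distributes naturally over partitions of a large table.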
The Transformer abstraction transforms a collection of Events related to a unique Patient into a list of more complex Events (complex diseases, drug exposures, …):
Transformer: List[Event] → List[Event].
A transformer is based on algorithms requiring multidisciplinary knowledge from epidemiologists, statisticians, clinicians, physicians, and SNDS experts . Each Transformer can be controlled using a textual configuration file. We provide many Transformers that were already used in several studies such as [20, 21]. SCALPEL-Extraction outputs a list of information that SCALPEL-Analysis uses as an input to build Cohorts, as explained below.
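To make the List[Event] → List[Event] signature concrete, here is a minimal Python sketch of a Transformer that derives drug exposures from the dispenses of one patient. The rule used (each dispense covers a fixed number of days, and overlapping periods of the same drug are merged) is an assumption chosen for illustration; it is not one of the exposure definitions shipped with SCALPEL-Extraction, and `exposure_transformer` and `duration_days` are hypothetical names.

```python
from datetime import date, timedelta
from typing import NamedTuple


class Event(NamedTuple):
    patient_id: str
    category: str
    value: str
    start: date
    end: date


def exposure_transformer(dispenses, duration_days=30):
    """Turn one patient's dispenses into merged exposure periods.

    Assumed rule (illustrative): each dispense covers `duration_days`,
    and periods of the same drug that overlap or touch are merged.
    """
    exposures = []
    for d in sorted(dispenses, key=lambda e: (e.value, e.start)):
        end = d.start + timedelta(days=duration_days)
        last = exposures[-1] if exposures else None
        if last is not None and last.value == d.value and d.start <= last.end:
            # Extend the current exposure instead of opening a new one.
            exposures[-1] = last._replace(end=max(last.end, end))
        else:
            exposures.append(
                Event(d.patient_id, "exposure", d.value, d.start, end))
    return exposures


dispenses = [
    Event("A", "drug_dispense", "N05BA", date(2015, 1, 1), date(2015, 1, 1)),
    Event("A", "drug_dispense", "N05BA", date(2015, 1, 20), date(2015, 1, 20)),
    Event("A", "drug_dispense", "N05BA", date(2015, 6, 1), date(2015, 6, 1)),
]
exposures = exposure_transformer(dispenses)
```

In this example the first two dispenses merge into a single exposure and the third one opens a new exposure, so three input Events yield two output Events.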
While SCALPEL-Flattening and SCALPEL-Extraction are implemented in Scala/Spark for performance and maintainability, SCALPEL-Analysis is implemented in Python/PySpark  since it is designed for interactive environments, such as Jupyter notebooks . SCALPEL-Analysis aims to ease cohort data manipulation and analysis. It is based on the following abstractions:
The Cohort abstraction is a set of Patients and their associated Events in a [startDate, endDate] time-window. Basic operations such as union, intersection, and difference can be performed between Cohorts, while a human-readable description is automatically updated in the results. More granular control is kept available through accesses to the underlying Spark DataFrames (using Spark DataFrame API). This combination allows for easy data engineering and fine-grained yet reproducible experiments.
The CohortCollection abstraction is a collection of Cohorts on which operations can be jointly performed. The CohortCollection has metadata that keeps information about each Cohort, such as the successive operations performed on it, the Parquet files they are stored in and a git commit hash of the code producing the extraction from the Source.
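The algebra on Cohorts and the automatically updated description can be illustrated with plain Python sets. This sketch is an assumption-laden stand-in: the real Cohort wraps Spark DataFrames of Patients and Events, whereas here a cohort only holds patient identifiers, and the method names and description wording are chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cohort:
    """A named set of patient identifiers with a readable description."""
    name: str
    description: str
    patients: frozenset

    def _combine(self, other, word, patients):
        # The description is rebuilt at each operation, which keeps a
        # human-readable trace of how the cohort was obtained.
        return Cohort(
            name=f"{self.name} {word} {other.name}",
            description=f"{self.description} {word} {other.description}",
            patients=frozenset(patients),
        )

    def union(self, other):
        return self._combine(other, "or", self.patients | other.patients)

    def intersection(self, other):
        return self._combine(other, "with", self.patients & other.patients)

    def difference(self, other):
        return self._combine(other, "without", self.patients - other.patients)


base = Cohort("base", "all subjects", frozenset({"A", "B", "C", "D"}))
exposed = Cohort("exposed", "exposed subjects", frozenset({"A", "B"}))
fractured = Cohort("fractures", "fractured subjects", frozenset({"B", "C"}))

final = base.intersection(exposed).difference(fractured)
```

A collection of such cohorts could then be kept in a dictionary keyed by name, alongside metadata such as storage paths and a commit hash.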
In Figure 2, data is easily loaded as a CohortCollection from a JSON file containing metadata. The Cohorts it contains can be combined to build a custom cohort in a few lines of code, and a description of the building process is easily obtained, see Figure 3, where we observe also that a few seconds is enough to manipulate a cohort with 5 million patients, thanks to PySpark.
International guidelines  regarding studies based on LODs insist on the explanation of cohort construction and the biases it might introduce to the studied population. This motivates the following CohortFlow abstraction.
The CohortFlow abstraction is an ordered CohortCollection, where each Cohort is included in the previous one. It is meant to track the stages leading to a final Cohort, where each intermediate Cohort is stored along with textual information about the filtering rules used to go from each stage to the next one. The whole CohortFlow can be stored as or loaded from a JSON file.
The scalpel.stats module produces descriptive statistics on a Cohort and their associated plots. For now, it contains more than 25 Patient-centric or Event-centric statistics, and adding a custom one is straightforward. Among other things, this module provides automatic reporting as text or graphical displays, with performance optimization through data caching. It can be combined with CohortFlow to compute various statistics at each analysis stage, to assess the biases induced by successive population filterings. An example is provided in Figure 4 below, where we observe that the stages leading to a cohort of patients with or without fractures induce strong age and gender biases. As illustrated in Figure 5 below, flowcharts can also be easily produced to track how many subjects were removed at each stage.
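The idea of tracking nested cohorts and computing a statistic at each stage can be sketched as follows. The functions `cohort_flow`, `flowchart` and `share_of_women` are hypothetical helpers written for this illustration; they are not part of scalpel.stats, and the toy data is invented.

```python
def cohort_flow(stages):
    """Build a flow from (label, patient_ids) pairs.

    Each stage is intersected with the previous one, so that every
    cohort is included in its predecessor.
    """
    flow, current = [], None
    for label, patients in stages:
        current = set(patients) if current is None else current & set(patients)
        flow.append((label, current))
    return flow


def flowchart(flow):
    """Return (label, remaining, removed) for each stage of the flow."""
    rows, previous = [], None
    for label, patients in flow:
        removed = 0 if previous is None else previous - len(patients)
        rows.append((label, len(patients), removed))
        previous = len(patients)
    return rows


def share_of_women(patients, gender):
    """A Patient-centric statistic computed on one stage."""
    return sum(gender[p] == "F" for p in patients) / len(patients)


gender = {"A": "F", "B": "M", "C": "F", "D": "M", "E": "F"}
flow = cohort_flow([
    ("base population", {"A", "B", "C", "D", "E"}),
    ("exposed to the drug", {"A", "B", "C", "E"}),
    ("with a fracture", {"A", "C", "Z"}),
])
chart = flowchart(flow)
bias = [(label, share_of_women(p, gender)) for label, p in flow]
```

In this toy example the share of women rises from 0.6 to 1.0 across the three stages, which is the kind of drift such per-stage statistics are meant to expose.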
SCALPEL-Analysis also provides tools producing datasets in formats compatible with popular machine learning libraries. At the core of these tools is the FeatureDriver abstraction.
The FeatureDriver abstraction is used to transform Cohorts into data formats suitable for machine learning, such as numpy.ndarray, tensorflow.Tensor and torch.Tensor. It is mainly a transformation of a Spark DataFrame representation into a tensor-based format. A FeatureDriver is lazily evaluated: it launches computations only when the result is needed. It performs several sanity checks, such as time-zone consistency and event date consistency, and can be easily extended by end-users, thanks to the PySpark API.
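Lazy evaluation of a feature transformation can be sketched with the standard library alone. The class below is a hypothetical stand-in for a FeatureDriver, not the SCALPEL-Analysis API: it counts events per patient and per code, builds the matrix only on first access, and returns nested lists instead of a tensor to stay dependency-free.

```python
from functools import cached_property


class CountFeatureDriver:
    """Lazily turn (patient, code) events into a patient x code count matrix."""

    def __init__(self, events):
        self._events = list(events)
        self.computations = 0  # counts how often the matrix is built

    @cached_property
    def matrix(self):
        self.computations += 1
        patients = sorted({p for p, _ in self._events})
        codes = sorted({c for _, c in self._events})
        row = {p: i for i, p in enumerate(patients)}
        col = {c: j for j, c in enumerate(codes)}
        counts = [[0] * len(codes) for _ in patients]
        for p, c in self._events:
            counts[row[p]][col[c]] += 1
        return patients, codes, counts


driver = CountFeatureDriver([("A", "N05BA"), ("A", "N05BA"), ("B", "C07AB")])
built_before_access = driver.computations  # nothing computed yet
patients, codes, counts = driver.matrix    # first access triggers the work
_ = driver.matrix                          # second access reuses the cache
```

The nested list `counts` could then be handed to `numpy.asarray` to obtain an array, with `patients` and `codes` giving the row and column labels.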
Our experiments with SCALPEL3 are performed using a cluster of commodity hardware, with 240 physical cores at 2.4 GHz, 1.8 TB of RAM and 480 TB of storage distributed over 15 worker nodes driven by three master nodes. This is to be compared with the current SNDS framework with an Oracle SQL database, hosted on Oracle Exadata servers, connected to SAS Enterprise Guide for analytics. In terms of cost, the hardware used by our framework is commodity hardware, which is not only much cheaper than Oracle Exadata servers but also cheaper to scale if the data volume increases: a Spark cluster easily scales “horizontally” by adding more nodes.
SCALPEL3 was first successfully tested on a cohort of 3.5 million patients by extracting complex concepts defined in  and used in . It is used today for research on a larger dataset of 14.5 million subjects, followed up to three years with a total of about 15 billion events (mainly drug dispenses, drug exposures with varying hypotheses regarding the exposure definition, medical procedures, diagnoses, and hospital stays).
SCALPEL-Flattening on this data takes about 6 hours, which is very satisfying since this operation is done once and for all and can be performed incrementally when new data are fed into the cluster (typically a few times a year). Note that the current SNDS framework was not designed to perform such a flattening so that there is no element of comparison for SCALPEL-Flattening.
However, we propose below a benchmark comparing SCALPEL-Extraction (its input is the output of SCALPEL-Flattening) with an extraction performed on the current SNDS framework, see Table 1 below. We consider 7 extraction tasks (a)–(g) which correspond to extractions required for an epidemiology study relating fractures to drug use.
|Extraction task||# lines extracted||Runtime (seconds), SCALPEL-Extraction||Runtime (seconds), current SNDS framework (SQL-SAS)|
|(a) Patients demographics||15 484 594||463||126|
|(b) Drug dispenses||489 837 809||178||11 385|
|(c) Prevalent drug users||5 961 189||5||132|
|(d) Drug exposures||276 856 114||97||543|
|(e) Medical acts||486 852||779||749|
|(f) Diagnoses||1 196 197||1 339||91|
|(g) Fractures||882 622||40||165|
|(h) All tasks|| ||2 901||13 190|
SCALPEL-Extraction extracts all the events 4.5 times faster than the current SQL-SAS based SNDS framework. It appears in Table 1 that SCALPEL-Extraction is faster on tasks involving large data volumes such as tasks (b) and (c), or tasks involving complex operations such as (d) and (g). Task (b), which corresponds to the extraction of drug dispenses, is performed 64 times faster with SCALPEL-Extraction: this is where distributed computing starts to shine, when computation and memory requirements are too large to be satisfied by a single large server. On the other hand, tasks involving small tables and mainly table lookups (tasks (a), (e) and (f)) are slower with SCALPEL-Extraction.
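The ratios quoted above can be recomputed from the runtimes of Table 1. The short script below assumes that, in each row, the first runtime is that of SCALPEL-Extraction and the second that of the current SNDS framework, which is the reading consistent with the overall factor of 4.5.

```python
# Runtimes in seconds copied from Table 1:
# task -> (SCALPEL-Extraction, current SNDS framework)
runtimes = {
    "a": (463, 126),
    "b": (178, 11385),
    "c": (5, 132),
    "d": (97, 543),
    "e": (779, 749),
    "f": (1339, 91),
    "g": (40, 165),
}
total = (2901, 13190)  # row (h), all tasks

# Speedup > 1 means SCALPEL-Extraction is faster on that task.
speedup = {task: snds / scalpel for task, (scalpel, snds) in runtimes.items()}
overall = total[1] / total[0]
slower_tasks = sorted(task for task, s in speedup.items() if s < 1)

for task, s in sorted(speedup.items()):
    print(f"({task}) speedup: {s:.2f}")
print(f"overall speedup: {overall:.2f}")
```

The largest ratio is obtained on task (b), and the three tasks with a ratio below 1 are (a), (e) and (f), in line with the discussion above.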
Beyond such performance considerations, our framework greatly improves the maintainability, audit, and reproducibility of studies using SNDS. Firstly, continuous integration of code updates and large code coverage (94%) with unit testing is a big improvement in terms of maintainability over copy-pasted SQL snippets. Secondly, the SNDS expertise encapsulated for event extraction is fully tested and maintained in SCALPEL3, which eases the reuse of extraction algorithms across studies and lowers the entry barrier to SNDS. Of course, judging the relevance of the extracted data to a given domain question still requires some SNDS knowledge and remains the responsibility of the user.
The combination of expert knowledge encapsulation (SCALPEL-Extraction) and interactive cohort manipulation (SCALPEL-Analysis) results in smaller and more readable user-code, leading to easily shared and reproducible studies, supported by data tracking and automated audit reports. Finally, SCALPEL3 allows producing datasets compatible with several Python machine learning libraries formats, enabling methodological research on SNDS data, which was not possible with the proprietary software that is currently used. We do our best to anticipate the development of vocabulary mapping tables in France, to ease the integration of data standards such as OMOP-CDM  or FHIR  to our codebase soon.
SNDS data usage is hard for research due to its scale and its conceptual complexity.
SCALPEL-Flattening and SCALPEL-Extraction ease medical event extraction by abstracting the algorithms required for such a task. SCALPEL-Extraction allows concept extraction and works as a growing library of medical events, designed by physicians, public health researchers, and engineers.
SCALPEL-Analysis allows manipulating data obtained after concept extraction very easily, for the design of cohorts for instance.
Data flow monitoring is automated and helps to control for selection biases and possible data manipulation mistakes.
Our codebase is carefully tested and monitored to avoid producing artifacts in the data.
Code legibility is improved thanks to powerful abstractions, which fosters the reproducibility of studies.
Manuscript preparation: MM, EB, SG, DPN, YS, DS.
Concept and design of the data pipeline: YS, DS, MM.
Concept and design of the cohort manipulation library: YS, MM, DS, DPN.
Benchmarking: YS, FL, DS, MM.
Data sharing and critical review: DPN, FL.
We thank the engineers who worked on this project at some point: Firas Ben Sassi, Prosper Burq, Philip Deegan, Daniel De Paula e Silva, Xristos Giatsidis and Sathiya Prabhu Kumar.
We also thank the people from CNAM or Polytechnique who were or are currently involved in the Polytechnique-CNAM partnership, namely, for CNAM : Muhammad Abdallah, Aurélie Bannay, Hélène Caillol, Medhi Gabbas, Claude Gissot, Moussa Laanani, Anke Neumann, Cédric Pulrulczyk, Jérémie Rudant, Kévin Vu Saintonge, Alain Weill, for Polytechnique: Qing Chen, Agathe Guilloux, Anastasiia Nitavskyi, Yiyang Yu.