SCALPEL3: a scalable open-source library for healthcare claims databases

10/15/2019 ∙ by Emmanuel Bacry, et al. ∙ 0

This article introduces SCALPEL3, a scalable open-source framework for studies involving Large Observational Databases (LODs). Its design eases medical observational studies thanks to abstractions allowing concept extraction, high-level cohort manipulation, and production of data formats compatible with machine learning libraries. SCALPEL3 has successfully been used on the SNDS database (see Tuppin et al. (2017)), a huge healthcare claims database that handles the reimbursement of almost all French citizens. SCALPEL3 focuses on scalability, easy interactive analysis and helpers for data flow analysis to accelerate studies performed on LODs. It consists of three open-source libraries based on Apache Spark. SCALPEL-Flattening denormalizes the LOD (only SNDS for now) by joining tables sequentially into one big table. SCALPEL-Extraction provides fast concept extraction from a big table such as the one produced by SCALPEL-Flattening. Finally, SCALPEL-Analysis allows interactive cohort manipulations, monitoring statistics of cohort flows and building datasets to be used with machine learning libraries. The first two provide a Scala API while the last one provides a Python API that can be used in an interactive environment. Our code is available on GitHub. SCALPEL3 made it possible to successfully extract complex concepts for studies such as Morel et al. (2017), or for studies with 14.5 million patients observed over three years (corresponding to more than 15 billion healthcare events and roughly 15 terabytes of data), in less than 49 minutes on a small 15-node HDFS cluster. SCALPEL3 provides sharp interactive control of data processing through legible code, which helps to build studies with full reproducibility, leading to improved maintainability and audit of studies performed on LODs.




1 Introduction

In recent years, healthcare data volume and accessibility have risen quickly. In France, the SNDS database covered 86% of the population in 2010 [34] and reached 98.8% of the French population in 2015 [35]. SNDS is now considered to be one of the world’s largest health Large Observational Databases (LODs) [35, 7]. This database is a claims database initially built for accounting purposes. It mainly gathers data about French outpatients’ healthcare consumption reimbursement, linked with private and public hospital data. The data contains demographic characteristics of the patients and daily time-stamped data recording their interactions with the French healthcare system. This database is an extremely rich source of information, since it is almost exhaustive population-wise, leading to high statistical power and less sensitivity to population selection biases [35].

The use of LODs such as SNDS has proven useful for public health research. Compared to randomized controlled trial (RCT) data, it has the advantage of giving a much broader (but cruder) representation of patients at a fraction of the RCTs’ cost [24, 32, 33]. SNDS has been used to study patient care pathways [36], assess the prevalence of diseases [2], and monitor adverse drug reactions [21], among many other uses (see [7] or the Supplementary Material of [35] for more examples).

This abundance of data comes at a cost: SNDS is a very complex database, with data spread across hundreds of tables and columns, while its scale makes data manipulation non-trivial on the existing SAS-Oracle Exadata infrastructure. More importantly, using this data requires a tremendous amount of knowledge from SNDS experts. Many coding or data recording subtleties, such as data duplication caused by administrative complexity, might bewilder inexperienced users. As a result, deriving proper health events definitions and extracting them accurately is a non-trivial task, having important consequences on the result of the derived studies [35, 10]. These problems are of course not unique to SNDS but shared by many LODs [17].

In this paper, we describe SCALPEL3, a new open-source framework (SCALable Pipeline for hEaLth data), aiming at lowering such entry barriers to LODs. It is more than an Extract-Transform-Load (ETL) library since it also provides abstractions for interactive use, dataflow monitoring, and descriptive statistics. It is made of three open-source inter-operating libraries, developed and used on SNDS, named SCALPEL-Flattening [16], SCALPEL-Extraction [26] and SCALPEL-Analysis [30], each of them open-sourced in a distinct GitHub repository. Each library can be used independently of the others, and can be used with other LODs after some adaptation work on SCALPEL-Flattening and SCALPEL-Extraction.

SCALPEL-Flattening converts raw SNDS data into a denormalized version stored in Parquet [19]. SCALPEL-Extraction extracts high-level concepts from the denormalized data, such as medical acts, drug exposures, and hospitalizations. SCALPEL-Analysis eases cohort data manipulation and provides data quality checks and data flow monitoring and analysis. The manipulation of patient cohorts is facilitated by expressive and intelligible algebraic operations (such as union, difference, and intersection) that do not sacrifice the granular control of the data. A set of automated statistics and charts can be used to compare cohorts easily, such as flowcharts helping to monitor selection biases related to data manipulations. This complex data can easily be converted to various “machine learning ready” formats (such as TensorFlow or PyTorch tensors, or NumPy arrays).

SCALPEL3 is based on Apache Spark [38] to ensure scalability. SCALPEL-Flattening and SCALPEL-Extraction are implemented in Scala [22] to ease code maintainability and testing, while SCALPEL-Analysis uses Spark’s Python API for interactive use, visualization and interoperability with Python machine learning libraries.

SCALPEL3 allows performing fast iterations when working on cohort designs or developing new machine learning algorithms with reproducibility and data-flow audit in mind. Moreover, users do not need to be SNDS specialists nor FHIR [6] or CDM [9] literate (see footnote 1) to reach good productivity regarding data manipulation. SCALPEL3 provides a way to work on SNDS data with at most a few days of training, instead of the weeks or months required to grasp either SNDS native data or existing standards [35, 6].

Footnote 1: The conversion from one standard to another [12, 13] is very difficult [18], since concepts can be hard to translate from one representation to another [28]. SCALPEL3 does not provide a way to load data from FHIR or CDM, but connectors will be developed once vocabulary conversion tables are available (there is ongoing work in this direction [8]).

2 Material and Methods

2.1 The SNDS database

SNDS is a large claims database, containing pseudonymized data on 98.8% of the French population (66 million patients in 2015) [35, 7]. It contains time-stamped information about medical events which led to reimbursement (see Table 1 in [35] for an exhaustive list of available data) in the last 3 years (see footnote 2). It contains more than 20 billion health events per year, representing roughly 70 TB of data.

Footnote 2: This period can be extended up to 20 years under some restrictions.

SNDS is composed of multiple “sub-databases”, each one with a star schema. The central table records events leading to cash flows that need to be joined to many other tables to access medical information (see footnote 3). In this form, retrieving patient information for statistical studies is very costly in terms of computation and expert knowledge, since targeted data can be spread across multiple databases, tens of tables and hundreds of columns, and since performing such manipulations requires a deep administrative knowledge of the French healthcare reimbursement mechanisms. Mitigating these issues is precisely the motivation of the SCALPEL3 framework.

Footnote 3: We work with two main sub-databases containing data relevant for public-health research. When working on drug safety studies, each of these two databases contains 8 relevant tables, representing approximately 5 billion lines a year when restricted to y.o. subjects.

2.2 SCALPEL3: a SCAlable Pipeline for hEaLth data

SCALPEL3 is based on Apache Spark [38], a robust and widely adopted distributed computation framework providing high-level data operations that can be coupled with the Hadoop File System (HDFS) [31]. As illustrated in Figure 1, SCALPEL3 is an open-source framework organized in the following three components.

SCALPEL-Flattening [16] denormalizes the data “once and for all” to avoid joining many tables each time the data of a patient is accessed. Its input is a set of CSV files extracted from the original SNDS database.

SCALPEL-Extraction [26] defines concepts extractors designed to process the denormalized data. For example, extractors can fetch all drug dispenses or medical acts.

SCALPEL-Analysis [30] implements powerful and scalable abstractions that can be used for data analysis, such as easy ways to investigate data quality issues. Its output can be commonly used formats for machine learning, such as NumPy arrays or TensorFlow or PyTorch tensors.

SCALPEL-Flattening and SCALPEL-Extraction are implemented in Scala with Spark, to access Spark’s low-level API and to use rigorous automated testing (94% of our Scala code is covered by tests). Both can be configured through textual configuration files. SCALPEL-Analysis is implemented in Python/PySpark for maximum interactivity (it can be used in a Jupyter notebook [15] for instance).

Figure 1: Architecture of the SCALPEL3 framework as described in Section 2.2. SCALPEL3 is made of three independent open-source libraries plugged one after another, namely SCALPEL-Flattening (implemented in Scala/Spark), SCALPEL-Extraction (implemented in Scala/Spark) and SCALPEL-Analysis (implemented in Python/PySpark).

2.3 SCALPEL-Flattening: denormalization of the data

As mentioned earlier, using SNDS to perform data analysis on patients’ health requires many joins and can be consequently extremely slow. To circumvent this problem, we de-normalize the data by joining the tables sequentially to obtain a big table in which each line corresponds to a patient identifier and a wide representation of an event.

As expected, flattening a star-schema database results in a very large table due to value replication. To circumvent storage and computation issues, we use Parquet [3], an open-source columnar storage format implementing Google’s Dremel [19] data model, which is widely used in the Spark ecosystem [4]. SCALPEL-Flattening maps SNDS tables to Parquet files that are combined into a single Parquet file for each SNDS sub-database, as summarized below.

SCALPEL-Flattening: DataBase[CSV] → DataBase[Parquet] → FlatDatabase[Parquet] → List[Source].

The Source abstraction is used to encapsulate the output Parquet files and can be seen as a set of Apache Spark Row objects providing two main tools: a SourceReader handling specific reading procedures for each Source and a SourceSanitizer getting rid of some data, such as corrupted rows or duplicated data.

SCALPEL-Flattening is designed to ensure maximum scalability and allows easy incorporation of new tables and sub-databases. Scalability to an arbitrary number of tables is achieved through a multiple join strategy using temporal slices. The size of the temporal slice, schema and joining keys can be tuned by the end-user through a configuration file, which defaults to the denormalization of tables containing only medical data (as opposed to econometric and administrative data). A set of automated statistics is available for monitoring the denormalization process.
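To make the slice-wise join strategy concrete, here is a minimal pure-Python sketch. It is not the SCALPEL-Flattening implementation (which is written in Scala/Spark): the table contents, the column names (flow_id, drug_code) and the monthly slice size are all assumptions chosen for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical miniature star schema: one central table, one dimension table.
central = [
    {"flow_id": 1, "patient_id": "A", "date": date(2015, 1, 10)},
    {"flow_id": 2, "patient_id": "B", "date": date(2015, 2, 3)},
    {"flow_id": 3, "patient_id": "A", "date": date(2015, 2, 20)},
]
drugs = [
    {"flow_id": 1, "drug_code": "D01"},
    {"flow_id": 3, "drug_code": "D02"},
    {"flow_id": 3, "drug_code": "D03"},
]


def left_join(left, right, key):
    """Left join two lists of dict rows on `key`; left values are replicated."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    joined = []
    for row in left:
        # Rows without a match are kept as they are (left join semantics).
        for match in index.get(row[key]) or [{}]:
            joined.append({**row, **match})
    return joined


def flatten_by_slices(central, dimensions, key, slice_of):
    """Join the central table with each dimension, one temporal slice at a time."""
    slices = defaultdict(list)
    for row in central:
        slices[slice_of(row["date"])].append(row)
    flat = []
    for slice_id in sorted(slices):
        part = slices[slice_id]
        for dimension in dimensions:  # sequential joins within the slice
            part = left_join(part, dimension, key)
        flat.extend(part)
    return flat


# Assumed slice size: one calendar month.
flat = flatten_by_slices(central, [drugs], "flow_id", lambda d: (d.year, d.month))
```

Each slice is joined independently, so the amount of data held by a single join is bounded by the slice size rather than by the size of the whole central table; the replicated central-table values in the output illustrate why a columnar format such as Parquet helps with storage.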

2.4 SCALPEL-Extraction: concepts extractions

SCALPEL-Extraction provides fast medical concept extractions from the flat tables produced by SCALPEL-Flattening. It encapsulates SNDS technical knowledge but keeps medical data as raw as possible, so that end-users have access to fine-grained data which is often critical when designing observational studies [37, 11]. Extracted concepts are organized around two abstractions: Patient and Event.

The Patient abstraction has a unique patientID, a gender, a birthDate and possibly a deathDate.

The Event abstraction can represent any event associated with a patient. It can be punctual (e.g., medical act) or continuous (e.g., hospitalization).

All concepts are automatically extracted into Patient or Event objects by a set of Extractors and Transformers, designed to fetch the data in the relevant tables and columns of the SNDS Sources.

The Extractor abstraction maps a Row of a Source to 0 or many Events:

Extractor: Row → List[Event].

Extractors select the Columns required by a medical concept, then filter out the Rows of the Sources which do not match the conditions related to the concept, and finally output a list of corresponding Events. Extractors can be controlled by textual configuration files. Many extractors are available to fetch medical acts, diagnoses, and hospital stays, among others. One example is the drug dispense Extractor, which extracts events related to specific drug subsets and outputs the events at multiple levels of granularity (drug, molecule, ATC class, custom classes) as defined in a configuration file. This simple architecture makes it easy to add new Extractors and answer any extraction need.
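The three steps of an Extractor (select columns, filter rows, emit Events) can be sketched as follows. This is an illustrative Python miniature, not the Scala API of SCALPEL-Extraction: the Event fields, the DrugDispenseExtractor class and the column names are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Event:
    """Hypothetical Event: `end` stays None for punctual events."""
    patient_id: str
    category: str
    value: str
    start: date
    end: Optional[date] = None


class DrugDispenseExtractor:
    """Hypothetical extractor mapping one flat row to zero or more Events."""

    columns = ("patient_id", "drug_code", "dispense_date")

    def __init__(self, drug_codes):
        self.drug_codes = frozenset(drug_codes)

    def extract(self, row: Dict[str, object]) -> List[Event]:
        # 1. Keep only the columns required by the concept.
        selected = {name: row.get(name) for name in self.columns}
        # 2. Drop rows that do not match the concept's conditions.
        if selected["dispense_date"] is None:
            return []
        if selected["drug_code"] not in self.drug_codes:
            return []
        # 3. Emit the corresponding Event.
        return [Event(selected["patient_id"], "drug_dispense",
                      selected["drug_code"], selected["dispense_date"])]


rows = [
    {"patient_id": "A", "drug_code": "D01", "dispense_date": date(2015, 1, 10), "amount": 12.5},
    {"patient_id": "A", "drug_code": "X99", "dispense_date": date(2015, 1, 12), "amount": 3.0},
    {"patient_id": "B", "drug_code": "D02", "dispense_date": None, "amount": 7.0},
    {"patient_id": "B", "drug_code": "D02", "dispense_date": date(2015, 3, 1), "amount": 7.0},
]

# The set of drug codes plays the role of the configuration file.
extractor = DrugDispenseExtractor(drug_codes={"D01", "D02"})
events = [event for row in rows for event in extractor.extract(row)]
```

Because each row is processed independently, this kind of mapping distributes naturally over the partitions of a large table.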

The Transformer abstraction transforms a collection of Events related to a unique Patient into a list of more complex Events (complex diseases, drug exposures, …):

Transformer: List[Event] → List[Event].

A transformer is based on algorithms requiring multidisciplinary knowledge from epidemiologists, statisticians, clinicians, physicians, and SNDS experts [35]. Each Transformer can be controlled using a textual configuration file. We provide many Transformers that were already used in several studies such as [20, 21]. SCALPEL-Extraction outputs a list of information that SCALPEL-Analysis uses as an input to build Cohorts, as explained below.
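As an illustration of the List[Event] → List[Event] signature, the sketch below turns the drug dispenses of a single patient into exposure periods. The rule used here (each dispense covers 30 days, overlapping periods of the same drug are merged) is an assumption made for the example, not one of the exposure definitions implemented in SCALPEL-Extraction, and the Event fields are hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass(frozen=True)
class Event:
    """Hypothetical Event with an explicit start and end date."""
    patient_id: str
    category: str
    value: str
    start: date
    end: date


def dispenses_to_exposures(events: List[Event], duration_days: int = 30) -> List[Event]:
    """Transform the dispenses of ONE patient into merged exposure periods."""
    dispenses = sorted(
        (e for e in events if e.category == "drug_dispense"),
        key=lambda e: (e.value, e.start),
    )
    exposures: List[Event] = []
    for dispense in dispenses:
        covered_until = dispense.start + timedelta(days=duration_days)
        last = exposures[-1] if exposures else None
        if last is not None and last.value == dispense.value and dispense.start <= last.end:
            # The new dispense starts inside the current exposure: extend it.
            exposures[-1] = Event(last.patient_id, "exposure", last.value,
                                  last.start, max(last.end, covered_until))
        else:
            exposures.append(Event(dispense.patient_id, "exposure", dispense.value,
                                   dispense.start, covered_until))
    return exposures


patient_events = [
    Event("A", "drug_dispense", "D01", date(2015, 1, 1), date(2015, 1, 1)),
    Event("A", "drug_dispense", "D01", date(2015, 1, 20), date(2015, 1, 20)),
    Event("A", "drug_dispense", "D01", date(2015, 4, 1), date(2015, 4, 1)),
    Event("A", "medical_act", "M42", date(2015, 2, 2), date(2015, 2, 2)),
]
exposures = dispenses_to_exposures(patient_events)
```

Exposing duration_days as a parameter mirrors the idea of controlling a Transformer through configuration, so that several exposure hypotheses can be compared without changing code.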

2.5 SCALPEL-Analysis: interactive manipulation and analysis of cohorts

While SCALPEL-Flattening and SCALPEL-Extraction are implemented in Scala/Spark for performance and maintainability, SCALPEL-Analysis is implemented in Python/PySpark [38] since it is designed for interactive environments, such as Jupyter notebooks [15]. SCALPEL-Analysis aims to ease cohort data manipulation and analysis. It is based on the following abstractions:

The Cohort abstraction is a set of Patients and their associated Events in a [startDate, endDate] time-window. Basic operations such as union, intersection, and difference can be performed between Cohorts, while a human-readable description is automatically updated in the results. More granular control is kept available through accesses to the underlying Spark DataFrames (using Spark DataFrame API). This combination allows for easy data engineering and fine-grained yet reproducible experiments.

The CohortCollection abstraction is a collection of Cohorts on which operations can be jointly performed. The CohortCollection has metadata that keeps information about each Cohort, such as the successive operations performed on it, the Parquet files they are stored in and a git commit hash of the code producing the extraction from the Source.
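The idea of set-like operations that keep a human-readable description up to date can be sketched in a few lines. The class below is a deliberately reduced stand-in (patients only, no Events, no Spark DataFrames); its name, fields and description wording are assumptions and do not reproduce the SCALPEL-Analysis API.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Cohort:
    """Hypothetical miniature cohort: a named, described set of patient IDs."""
    name: str
    description: str
    patients: FrozenSet[str]

    def union(self, other: "Cohort") -> "Cohort":
        return Cohort(f"{self.name} or {other.name}",
                      f"{self.description} or {other.description}",
                      self.patients | other.patients)

    def intersection(self, other: "Cohort") -> "Cohort":
        return Cohort(f"{self.name} with {other.name}",
                      f"{self.description} with {other.description}",
                      self.patients & other.patients)

    def difference(self, other: "Cohort") -> "Cohort":
        return Cohort(f"{self.name} without {other.name}",
                      f"{self.description} without {other.description}",
                      self.patients - other.patients)


exposed = Cohort("exposed", "patients exposed to drug D01", frozenset({"A", "B", "C"}))
fractured = Cohort("fractured", "patients with a fracture", frozenset({"B", "C", "D"}))

exposed_without_fracture = exposed.difference(fractured)
print(exposed_without_fracture.description)
```

Since every operation returns a new Cohort carrying its own description, the provenance of a derived cohort can be read directly from the result.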

In Figure 2, data is easily loaded as a CohortCollection from a JSON file containing metadata. The Cohorts it contains can be combined to build a custom cohort in a few lines of code, and a description of the building process is easily obtained (see Figure 3). Figure 3 also shows that a few seconds are enough to manipulate a cohort with 5 million patients, thanks to PySpark.

Figure 2: Using SCALPEL-Analysis to load a CohortCollection from a metadata file using the Python API.
Figure 3: Using SCALPEL-Analysis for manipulations on a Cohort from the CohortCollection of Figure 2.

International guidelines [5] regarding studies based on LODs insist on the explanation of cohort construction and the biases it might introduce to the studied population. This motivates the following CohortFlow abstraction.

The CohortFlow abstraction is an ordered CohortCollection, where each Cohort is included in the previous one. It is meant to track the stages leading to a final Cohort, where each intermediate Cohort is stored along with textual information about the filtering rules used to go from each stage to the next one. The whole CohortFlow can be stored as or loaded from a JSON file.
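A minimal sketch of this idea, assuming that each stage is obtained by intersecting a cohort with the previous stage (so that inclusion holds by construction), is given below. The function name, the stage descriptions and the JSON layout are hypothetical and only meant to show how kept and removed counts can be tracked and serialized.

```python
import json


def cohort_flow(stages):
    """Build ordered stages from (description, set of patient IDs) pairs.

    Each stage is intersected with the previous one, so every stage is
    included in its predecessor.
    """
    steps = []
    current = None
    for description, patients in stages:
        kept = set(patients) if current is None else current & set(patients)
        removed = 0 if current is None else len(current) - len(kept)
        steps.append({"stage": description, "kept": len(kept), "removed": removed})
        current = kept
    return steps


steps = cohort_flow([
    ("all patients", {"A", "B", "C", "D", "E"}),
    ("exposed to drug D01", {"A", "B", "C", "X"}),
    ("with a fracture", {"B", "C", "D"}),
])

# The flow summary can be stored as JSON and reloaded later for auditing.
serialized = json.dumps(steps, indent=2)
print(serialized)
```

The "kept" and "removed" counts are exactly the numbers a flowchart of the selection process would display on its nodes and edges.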

The scalpel.stats module produces descriptive statistics on a Cohort and their associated plots. For now, it contains more than 25 Patient-centric or Event-centric statistics, adding a custom one being very easy. Among other things, this module provides automatic reporting as text or graphical displays, with performance optimization through data caching. It can be combined with CohortFlow to compute various statistics at each analysis stage, to assess the biases induced along successive population filterings. An example is provided in Figure 4 below, where we observe that the stages leading to a cohort of patients with or without fractures induce strong age and gender biases. As illustrated in Figure 5 below, flowcharts can also be easily produced to track how many subjects were removed at each stage.

Figure 4: Using SCALPEL-Analysis to define a CohortFlow from a JSON file, then using scalpel.stats to obtain statistics about the distributions of gender and age along the stages. Top-left and bottom-left: excluding patients with a fracture does not introduce much change in the gender and age distributions. Top-right and bottom-right: keeping only patients with fractures leads to an older population, with an important change in the age distribution of women (a well-known phenomenon related to osteoporosis).
Figure 5: SCALPEL-Analysis can automatically generate flowcharts in order to track how many subjects are removed at each stage of a CohortFlow. Each node counts how many patients are kept in the stage, while each edge counts how many patients are removed from one stage to the next.

SCALPEL-Analysis also provides tools producing datasets in formats compatible with popular machine learning libraries. At the core of these tools is the FeatureDriver abstraction.

The FeatureDriver abstraction is used to transform Cohorts into data formats suitable for machine learning, such as numpy.ndarray [14], tensorflow.tensor [1] and pytorch.tensor [25]. It is mainly a transformation of a Spark dataframe representation into a tensor-based format. A FeatureDriver is lazily evaluated, namely, it launches computations only when the result is needed. It performs several sanity checks, such as time-zone consistency and event dates consistency, and can be easily extended by end-users, thanks to the PySpark API.
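The combination of lazy evaluation, sanity checks and tensor-like output can be illustrated with the following pure-Python sketch. It uses nested lists instead of NumPy arrays or Spark DataFrames to stay self-contained; the class name, the (patient, feature, date) event layout and the 30-day bucketing are assumptions, not the FeatureDriver API.

```python
from datetime import date


class LazyFeatureDriver:
    """Hypothetical driver producing a patient x time-bucket x feature count array."""

    def __init__(self, events, start, end, bucket_days):
        self.events = list(events)  # (patient_id, feature, date) triples
        self.start, self.end, self.bucket_days = start, end, bucket_days
        self._result = None         # nothing is computed at construction time
        self.n_computations = 0

    def _check_dates(self):
        # Sanity check: every event must fall inside the study window.
        for patient_id, feature, day in self.events:
            if not (self.start <= day < self.end):
                raise ValueError(f"event of {patient_id} on {day} is outside the window")

    def _compute(self):
        self._check_dates()
        self.n_computations += 1
        patients = sorted({p for p, _, _ in self.events})
        features = sorted({f for _, f, _ in self.events})
        total_days = (self.end - self.start).days
        n_buckets = (total_days + self.bucket_days - 1) // self.bucket_days
        counts = [[[0] * len(features) for _ in range(n_buckets)] for _ in patients]
        for p, f, day in self.events:
            bucket = (day - self.start).days // self.bucket_days
            counts[patients.index(p)][bucket][features.index(f)] += 1
        return patients, features, counts

    def to_nested_lists(self):
        # Lazy evaluation: compute on first access, then reuse the cached result.
        if self._result is None:
            self._result = self._compute()
        return self._result


driver = LazyFeatureDriver(
    events=[("A", "D01", date(2015, 1, 10)),
            ("A", "D01", date(2015, 2, 15)),
            ("B", "fracture", date(2015, 1, 31))],
    start=date(2015, 1, 1), end=date(2015, 3, 2), bucket_days=30,
)
patients, features, counts = driver.to_nested_lists()
```

In a NumPy-based setting, the nested list would typically be wrapped into an array of shape (patients, buckets, features) as a final step.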

3 Experimental results and discussion

Our experiments with SCALPEL3 are performed using a cluster of commodity hardware, with 240 physical cores at 2.4 GHz, 1.8 TB of RAM and 480 TB of storage distributed over 15 worker nodes driven by three master nodes. This is to be compared with the current SNDS framework with an Oracle SQL database, hosted on Oracle Exadata servers [23], connected to SAS Enterprise Guide for analytics [29]. In terms of cost, the hardware used by our framework is commodity hardware, which is not only much cheaper than Oracle Exadata servers but also cheaper to scale if the data volume is increased: a Spark cluster easily scales “horizontally” by adding more nodes.

SCALPEL3 was first successfully tested on a cohort of 3.5 million patients by extracting complex concepts defined in [21] and used in [20]. It is used today for research on a larger dataset of 14.5 million subjects, followed up to three years with a total of about 15 billion events (mainly drug dispenses, drug exposures with varying hypotheses regarding the exposure definition, medical procedures, diagnoses, and hospital stays).

SCALPEL-Flattening on this data takes about 6 hours, which is very satisfying since this operation is done once and for all and can be performed incrementally when new data are fed into the cluster (typically a few times a year). Note that the current SNDS framework was not designed to perform such a flattening so that there is no element of comparison for SCALPEL-Flattening.

However, we propose below a benchmark comparing SCALPEL-Extraction (its input is the output of SCALPEL-Flattening) with an extraction performed on the current SNDS framework, see Table 1 below. We consider 7 extraction tasks (a)–(g) which correspond to extractions required for an epidemiology study relating fractures to drug use.

Extraction task | # lines extracted | Runtime, SCALPEL-Extraction (seconds) | Runtime, SQL-SAS (seconds)
(a) Patients demographics | 15 484 594 | 463 | 126
(b) Drug dispenses | 489 837 809 | 178 | 11 385
(c) Prevalent drug users | 5 961 189 | 5 | 132
(d) Drug exposures | 276 856 114 | 97 | 543
(e) Medical acts | 486 852 | 779 | 749
(f) Diagnoses | 1 196 197 | 1 339 | 91
(g) Fractures | 882 622 | 40 | 165
(h) All tasks | | 2 901 | 13 190
Table 1: Benchmarks for SCALPEL-Extraction (running on a 15-node cluster) versus the current SNDS framework based on SQL-SAS (running on Exadata computers). The 7 extraction tasks (a)–(g) considered are required for an epidemiology study trying to identify drugs that increase the risk of a fracture; row (h) gives the totals. On these tasks, SCALPEL3 is 4.5 times faster in total running time than the framework currently used.

SCALPEL-Extraction extracts all the events 4.5 times faster than the current SQL-SAS based SNDS framework. It appears in Table 1 that SCALPEL-Extraction is faster on tasks involving large data volumes such as tasks (b) and (c), or tasks involving complex operations such as (d) and (g). Task (b), which corresponds to drug dispenses extractions, is performed 64 times faster with SCALPEL-Extraction: this is where distributed computing starts to shine, when computation and memory requirements are too large to be satisfied by a single large server. On the other hand, tasks involving small tables and mainly table lookups (tasks (a), (e) and (f)) are slower with SCALPEL-Extraction.
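The speedups quoted above follow directly from the runtimes of Table 1, as the short computation below shows. The assignment of the first runtime to SCALPEL-Extraction and the second to SQL-SAS is read from the table and is consistent with the ratios stated in the text; the variable names are ours.

```python
# Runtimes in seconds from Table 1: (SCALPEL-Extraction, SQL-SAS).
runtimes = {
    "(a) Patients demographics": (463, 126),
    "(b) Drug dispenses": (178, 11385),
    "(c) Prevalent drug users": (5, 132),
    "(d) Drug exposures": (97, 543),
    "(e) Medical acts": (779, 749),
    "(f) Diagnoses": (1339, 91),
    "(g) Fractures": (40, 165),
}

# Per-task speedup: ratio above 1 means SCALPEL-Extraction is faster.
speedups = {task: sql_sas / scalpel for task, (scalpel, sql_sas) in runtimes.items()}

total_scalpel = sum(scalpel for scalpel, _ in runtimes.values())
total_sql_sas = sum(sql_sas for _, sql_sas in runtimes.values())
overall_speedup = total_sql_sas / total_scalpel

slower_tasks = [task for task, ratio in speedups.items() if ratio < 1]

for task, ratio in speedups.items():
    print(f"{task}: x{ratio:.2f}")
print(f"All tasks: {total_scalpel} s vs {total_sql_sas} s, x{overall_speedup:.2f}")
```

The SCALPEL-Extraction total of 2 901 seconds is about 48 minutes, in line with the "less than 49 minutes" quoted in the abstract. The per-task SQL-SAS runtimes sum to 13 191 seconds, one second more than the total reported in row (h), presumably because of rounding.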

Beyond such performance considerations, our framework greatly improves the maintainability, audit, and reproducibility of studies using SNDS. Firstly, continuous integration of code updates and large code coverage (94%) with unit testing is a big improvement in terms of maintainability over copy-pasted SQL snippets. Secondly, SNDS expertise encapsulation for events extraction is fully tested and maintained in SCALPEL3, so it eases the reuse of extraction algorithms across studies and lowers the entry barrier to SNDS. Of course, assessing the relevance of the extracted data for a given domain question still requires some SNDS knowledge and remains the responsibility of the user.

The combination of expert knowledge encapsulation (SCALPEL-Extraction) and interactive cohort manipulation (SCALPEL-Analysis) results in smaller and more readable user code, leading to easily shared and reproducible studies, supported by data tracking and automated audit reports. Finally, SCALPEL3 allows producing datasets in the formats of several Python machine learning libraries, enabling methodological research on SNDS data, which was not possible with the proprietary software that is currently used. We do our best to anticipate the development of vocabulary mapping tables in France, to ease the integration of data standards such as OMOP-CDM [27] or FHIR [6] into our codebase soon.

4 Summary Table

  • SNDS data usage is hard for research due to its scale and its conceptual complexity.

  • SCALPEL-Flattening and SCALPEL-Extraction ease medical event extraction by abstracting the algorithms required for such a task. SCALPEL-Extraction allows concept extraction and works as a growing library of medical events, designed by physicians, public health researchers, and engineers.

  • SCALPEL-Analysis allows manipulating data obtained after concept extraction very easily, for the design of cohorts for instance.

  • Data flow monitoring is automated and helps to control for selection biases and possible data manipulation mistakes.

  • Our codebase is carefully tested and monitored to avoid producing artifacts in the data.

  • Code legibility is improved thanks to powerful abstractions, which will foster the reproducibility of studies.

5 Conflicts of interest


6 Authors’ contribution

Manuscript preparation: MM, EB, SG, DPN, YS, DS.

Concept and design of the data pipeline: YS, DS, MM.

Concept and design of the cohort manipulation library: YS, MM, DS, DPN.

Benchmarking: YS, FL, DS, MM.

Data sharing and critical review: DPN, FL.

7 Acknowledgments

We thank the engineers who worked on this project at some point: Firas Ben Sassi, Prosper Burq, Philip Deegan, Daniel De Paula e Silva, Xristos Giatsidis and Sathiya Prabhu Kumar.

We also thank the people from CNAM or Polytechnique who were or are currently involved in the Polytechnique-CNAM partnership, namely, for CNAM: Muhammad Abdallah, Aurélie Bannay, Hélène Caillol, Medhi Gabbas, Claude Gissot, Moussa Laanani, Anke Neumann, Cédric Pulrulczyk, Jérémie Rudant, Kévin Vu Saintonge, Alain Weill; and for Polytechnique: Qing Chen, Agathe Guilloux, Anastasiia Nitavskyi, Yiyang Yu.


  • [1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from
  • [2] Hubert Allemand, Brigitte Seradour, Alain Weill, and Philippe Ricordeau. Decline in breast cancer incidence in 2005 and 2006 in france: a paradoxical trend. Bulletin du cancer, 95(1):11–15, 2008.
  • [3] Apache Parquet, 2015.
  • [4] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. Spark SQL: Relational data processing in spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD ’15, pages 1383–1394, New York, NY, USA, 2015. ACM.
  • [5] Eric I Benchimol, Liam Smeeth, Astrid Guttmann, Katie Harron, David Moher, Irene Petersen, Henrik T Sørensen, Erik von Elm, Sinéad M Langan, RECORD Working Committee, et al. The reporting of studies conducted using observational routinely-collected health data (RECORD) statement. PLoS medicine, 12(10):e1001885, 2015.
  • [6] Duane Bender and Kamran Sartipi. HL7 FHIR: An agile and RESTful approach to healthcare information exchange. In Proceedings of CBMS 2013 - 26th IEEE International Symposium on Computer-Based Medical Systems, pages 326–331. IEEE, jun 2013.
  • [7] Julien Bezin, Mai Duong, Régis Lassalle, Cécile Droz, Antoine Pariente, Patrick Blin, and Nicholas Moore. The national healthcare system claims databases in france, SNIIRAM and EGB: Powerful tools for pharmacoepidemiology. Pharmacoepidemiology and Drug Safety, 26(8):954–962, aug 2017.
  • [8] Marc Cuggia, Dominique Polton, Gilles Wainrib, and Stéphanie Combes. Health Data Hub: mission de préfiguration. Technical report, Ministère des Solidarités et de la Santé, 10 2018. In French.
  • [9] F. FitzHenry, F S Resnic, S L Robbins, J. Denton, L. Nookala, D. Meeker, L. Ohno-Machado, and M E Matheny. Creating a common data model for comparative effectiveness with the observational medical outcomes partnership. Applied clinical informatics, 6(3):536–47, dec 2015.
  • [10] Richard A. Hansen, Michael D. Gray, Brent I. Fox, Joshua C. Hollingsworth, Juan Gao, and Peng Zeng. How well do various health outcome definitions identify appropriate cases in observational studies. Drug Safety, 36(SUPPL.1):27–32, oct 2013.
  • [11] Na Hong, Ning Zhang, Huawei Wu, Shanshan Lu, Yue Yu, Li Hou, Yinying Lu, Hongfang Liu, and Guoqian Jiang. Preliminary exploration of survival analysis using the OHDSI common data model: a case study of intrahepatic cholangiocarcinoma. BMC Medical Informatics and Decision Making, 18(S5):116, dec 2018.
  • [12] Guoqian Jiang, Richard Kiefer, Eric Prud’hommeaux, and Harold R Solbrig. Building interoperable FHIR-Based vocabulary mapping services: A case study of OHDSI vocabularies and mappings. Studies in health technology and informatics, 245:1327, 2017.
  • [13] Guoqian Jiang, Richard C Kiefer, Deepak K Sharma, Eric Prud’hommeaux, and Harold R Solbrig. A consensus-based approach for harmonizing the OHDSI common data model with HL7 FHIR. In Studies in Health Technology and Informatics, volume 245, pages 887–891. NIH Public Access, 2017.
  • [14] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001–.
  • [15] Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Pérez, Brian Granger, Matthias Bussonnier, Jonathan Frederic, Kyle Kelley, Jessica Hamrick, Jason Grout, Sylvain Corlay, Paul Ivanov, Damián Avila, Safia Abdalla, and Carol Willing. Jupyter notebooks – a publishing format for reproducible computational workflows. In F. Loizides and B. Schmidt, editors, Positioning and Power in Academic Publishing: Players, Agents and Agendas, pages 87 – 90. IOS Press, 2016.
  • [16] Sathiya P. Kumar, Youcef Sebiat, Firas Ben Sassi, Dian Sun, Daniel Paula e Silva, and Prosper Burq. SCALPEL-Flattening, 2019.
  • [17] David Madigan, Paul E. Stang, Jesse A. Berlin, Martijn Schuemie, J. Marc Overhage, Marc A. Suchard, Bill Dumouchel, Abraham G. Hartzema, and Patrick B. Ryan. A systematic statistical approach to evaluating evidence from observational studies. Annual Review of Statistics and Its Application, 1(1):11–39, 2014.
  • [18] J. Marc Overhage, Patrick B Ryan, Christian G Reich, Abraham G Hartzema, and Paul E Stang. Validation of a common data model for active safety surveillance research. Journal of the American Medical Informatics Association, 19(1):54–60, jan 2012.
  • [19] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. Dremel: Interactive analysis of web-scale datasets. Proc. VLDB Endow., 3(1-2):330–339, September 2010.
  • [20] Maryan Morel, Emmanuel Bacry, Stéphane Gaïffas, Agathe Guilloux, and Fanny Leroy. ConvSCCS: convolutional self-controlled case series model for lagged adverse event detection. Biostatistics, 2019.
  • [21] A Neumann, A Weill, P Ricordeau, JP Fagot, F Alla, and H Allemand. Pioglitazone and risk of bladder cancer among diabetic patients in france: a population-based cohort study. Diabetologia, 55(7):1953–1962, 2012.
  • [22] Martin Odersky, Philippe Altherr, Vincent Cremet, Burak Emir, Sebastian Maneth, Stéphane Micheloud, Nikolay Mihaylov, Michel Schinz, Erik Stenman, and Matthias Zenger. An overview of the Scala programming language. Technical report, École Polytechnique Fédérale de Lausanne, 2004.
  • [23] Exadata Database Machine — Oracle, 2008.
  • [24] J. Marc Overhage and Lauren M Overhage. Sensible use of observational clinical data. Statistical Methods in Medical Research, 22(1):7–13, feb 2013.
  • [25] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch, 2017.
  • [26] Daniel Paula e Silva, Youcef Sebiat, Sathiya Prabhu Kumar, Firas Ben Sassi, Prosper Burq, Dian Sun, Maryan Morel, Kevin Vu Saintonge, and Philip Deegan. SCALPEL-Extraction, 2019.
  • [27] Stephanie J Reisinger, Patrick B Ryan, Donald J O’Hara, Gregory E Powell, Jeffery L Painter, Edward N Pattishall, and Jonathan A Morris. Development and evaluation of a common data model enabling active drug safety surveillance using disparate healthcare databases. Journal of the American Medical Informatics Association, 17(6):652–662, 2010.
  • [28] Peter R. Rijnbeek. Converting to a common data model: What is lost in translation? Drug Safety, 37(11):893–896, Nov 2014.
  • [29] SAS Enterprise Guide — SAS Support, 1976.
  • [30] Youcef Sebiat, Maryan Morel, Dian Sun, and Dinh Phong Nguyen. SCALPEL-Analysis, 2019.
  • [31] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The Hadoop distributed file system. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), MSST ’10, pages 1–10, Washington, DC, USA, 2010. IEEE Computer Society.
  • [32] Stuart L. Silverman. From Randomized Controlled Trials to Observational Studies. The American Journal of Medicine, 122(2):114–120, feb 2009.
  • [33] Ravi Thadhani. Formal trials versus observational studies. Oxford PharmaGenesis, 2006.
  • [34] P. Tuppin, L. de Roquefeuil, A. Weill, P. Ricordeau, and Y. Merlière. French national health insurance information system and the permanent beneficiaries sample. Revue d’Épidémiologie et de Santé Publique, 58(4):286 – 290, 2010.
  • [35] P. Tuppin, J. Rudant, P. Constantinou, C. Gastaldi-Ménager, A. Rachas, L. de Roquefeuil, G. Maura, H. Caillol, A. Tajahmady, J. Coste, C. Gissot, A. Weill, and A. Fagot-Campagna. Value of a national administrative database to guide public decisions: From the système national d’information interrégimes de l’assurance maladie (SNIIRAM) to the système national des données de santé (SNDS) in france. Revue d’Épidémiologie et de Santé Publique, 65:S149 – S167, 2017. Réseau REDSIAM.
  • [36] P Tuppin, S Samson, A Fagot-Campagna, and F Woimant. Care pathways and healthcare use of stroke survivors six months after admission to an acute-care hospital in france in 2012. Revue neurologique, 172(4-5):295–306, 2016.
  • [37] S. V. Wang, P Verpillat, J. A. Rassen, A Patrick, E. M. Garry, and D. B. Bartels. Transparency and reproducibility of observational cohort studies using large healthcare databases. Clinical Pharmacology and Therapeutics, 99(3):325–332, mar 2016.
  • [38] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. Apache Spark: A unified engine for big data processing. Commun. ACM, 59(11):56–65, October 2016.