The research accounted for in this paper was funded by the Université Lumière Lyon 2 and the Auvergne-Rhône-Alpes Region through the COREL and AURA-PMI projects, respectively. The authors also sincerely thank the anonymous reviewers of this paper for their constructive comments and suggestions.
The 21st century is marked by an exponential growth of the amount of data produced in the world. This is notably induced by the fast development of the Internet of Things (IoT) and social media. Yet, while big data represent a tremendous opportunity for various organizations, they come in such volume, speed, heterogeneous sources and structures that they exceed the capabilities of traditional management systems for their collection, storage and processing in a reasonable time Miloslavskaya2016. A time-tested solution for big data management and processing is data warehousing. A data warehouse is indeed an integrated and historical storage system that is specifically designed to analyze data. However, while data warehouses are still relevant and very powerful for structured data, semi-structured and unstructured data induce great challenges for data warehouses. Yet, the majority of big data is unstructured Miloslavskaya2016. Thus, the concept of data lake was introduced to address big data issues, especially those induced by data variety.
A data lake is a very large data storage, management and analysis system that handles any data format. It is currently quite popular in both industry and academia. Yet, the concept of data lake is not straightforward for everybody. A survey conducted in 2016 indeed revealed that 35% of the respondents considered data lakes as a simple marketing label for a preexisting technology, i.e., Apache Hadoop Grosser2016. Knowledge about the concept of the data lake has since evolved, but some misconceptions still exist, presumably because most data lake design approaches are abstract sketches from the industry that provide few theoretical or implementation details Quix2018. Therefore, a survey can be useful to give researchers and practitioners a better comprehension of the data lake concept and its design alternatives.
To the best of our knowledge, the only literature reviews about data lakes are all quite brief and/or focused on a specific topic, e.g., data lake concepts and definitions Couto2019; Madera2016, the technologies used for implementing data lakes Mathis2017 or issues inherent to data lakes Giebler2019; Quix2018. Admittedly, the report proposed in Russom2017 is quite extensive, but it adopts a purely industrial view. Thus, we adopt in this paper a wider scope to propose a more comprehensive state of the art of the different approaches to design and exploit a data lake. We particularly focus on data lake architectures and metadata management, which lie at the base of any data lake project and are the most commonly cited issues in the literature (Figure 1).
More precisely, we first review data lake definitions and complement the best existing one. Then, we investigate the architectures and technologies used for the implementation of data lakes, and propose a new typology of data lake architectures. Our second main focus is metadata management, which is a crucial issue to avoid turning a data lake into an inoperable, so-called data swamp. We notably classify data lake metadata and introduce the features that are necessary to achieve a full metadata system. We also discuss the pros and cons of data lakes.
Finally, note that we do not review other important topics, such as data ingestion, data governance and security in data lakes, because they are currently seldom addressed in the literature, but could still presumably be the subject of another full survey.
The remainder of this paper is organized as follows. In Section 2, we define the data lake concept. In Section 3, we review data lake architectures and technologies to help users choose the right approach and tools. In Section 4, we extensively review and discuss metadata management. Finally, we recapitulate the pros and cons of data lakes in Section 5 and conclude the paper in Section 6 with a mind map of the key concepts we introduce, as well as current open research issues.
2 Data Lake Definitions
2.1 Definitions from the Literature
The concept of data lake was introduced by Dixon as a solution to perceived shortcomings of datamarts, which are business-specific subdivisions of data warehouses that allow only subsets of questions to be answered Dixon2010. In the literature, data lakes are also referred to as data reservoirs Chessell2014 and data hubs Ganore2015; Laskowski2016, although the term data lake is the most frequent. Dixon envisions a data lake as a large storage system for raw, heterogeneous data, fed by multiple data sources, and that allows users to explore, extract and analyze the data.
Subsequently, part of the literature considered data lakes as an equivalent to the Hadoop technology Fang2015; Ganore2015; OLeary2014. According to this point of view, the concept of data lake refers to a methodology for using free or low-cost technologies, typically Hadoop, for storing, processing and exploring raw data within a company Fang2015. The systematic association of data lakes with low-cost technologies is becoming a minority view in the literature, as the data lake concept is now also associated with proprietary cloud solutions such as Azure or IBM Madera2016; Sirosh2016 and various data management systems such as NoSQL solutions and multistores. However, it can still be viewed as a data-driven design pattern for data management Russom2017.
More consensually, a data lake may be viewed as a central repository where data of all formats are stored without a strict schema, for future analyses Couto2019; Khine2017; Mathis2017. This definition is based on two key characteristics of data lakes: data variety and the schema-on-read approach, also known as late binding Fang2015, which implies that schema and data requirements are not fixed until data querying Khine2017; Maccioni2018; Stein2014. This is the opposite of the schema-on-write approach used in data warehouses.
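The contrast between schema-on-read and schema-on-write can be illustrated with a minimal Python sketch. The record layout, the field names and the `read_with_schema` helper are hypothetical and serve only as an illustration; they are not taken from the cited works.

```python
import json

# Raw records are stored as-is: nothing is validated at write time.
RAW_STORE = [
    '{"sensor": "s1", "temp": "21.5", "unit": "C"}',
    '{"sensor": "s2", "temp": "19.0"}',
    '{"user": "alice", "comment": "no temperature here"}',
]


def read_with_schema(raw_lines, schema):
    """Apply a schema at query time (schema-on-read).

    `schema` maps a field name to a casting function. Records that lack a
    requested field are skipped at read time instead of being rejected
    at ingestion time.
    """
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in schema):
            yield {field: cast(record[field]) for field, cast in schema.items()}


# Two analyses impose two different schemas on the same raw data.
temperatures = list(read_with_schema(RAW_STORE, {"sensor": str, "temp": float}))
comments = list(read_with_schema(RAW_STORE, {"user": str, "comment": str}))
```

With schema-on-write, the third record would have to be rejected or reshaped before storage; here it simply remains available for any later analysis that knows how to read it.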
However, the variety/schema-on-read definition may be considered fuzzy because it gives little detail about the characteristics of a data lake. Thus, Madera and Laurent introduce a more complete definition where a data lake is a logical view of all data sources and datasets in their raw format, accessible by data scientists or statisticians for knowledge extraction Madera2016.
More interestingly, this definition is complemented by a set of features that a data lake should include:
data quality is provided by a set of metadata;
the lake is controlled by data governance policy tools;
usage of the lake is limited to statisticians and data scientists;
the lake integrates data of all types and formats;
the data lake has a logical and physical organization.
2.2 Discussion and New Definition
Madera and Laurent’s definition of data lakes is presumably the most precise, as it defines the requirements that a data lake must meet (Section 2.1). However, some points in this definition are debatable. The authors indeed restrict the use of the lake to data specialists and, as a consequence, exclude business experts for security reasons. Yet, in our opinion, it is entirely possible to allow controlled access to this type of user through a navigation or analysis software layer.
Moreover, we do not share the vision of the data lake as a logical view over data sources, since some data sources may be external to an organization, and therefore to the data lake. Since Dixon explicitly states that lake data come from data sources Dixon2010, including data sources into the lake may therefore be considered contrary to the spirit of data lakes.
Finally, although quite complete, Madera and Laurent’s definition omits an essential property of data lakes: scalability Khine2017; Miloslavskaya2016. Since a data lake is intended for big data storage and processing, it is indeed essential to address this issue. Thence, we amend Madera and Laurent’s definition to bring it in line with our vision and introduce scalability Sawadogo2019B.
A data lake is a scalable storage and analysis system for data of any type, retained in their native format and used mainly by data specialists (statisticians, data scientists or analysts) for knowledge extraction. Its characteristics include:
a metadata catalog that enforces data quality;
data governance policies and tools;
accessibility to various kinds of users;
integration of any type of data;
a logical and physical organization;
scalability in terms of storage and processing.
3 Data Lake Architectures and Technologies
Existing reviews on data lake architectures commonly distinguish pond and zone architectures Giebler2019; Ravat2019B. However, this categorization may sometimes be fuzzy. Thus, we introduce in Section 3.1 a new way to classify data lake architectures that we call Functional Maturity. In addition, we present in Section 3.2 a list of possible technologies to implement a data lake. Finally, we investigate in Section 3.3 how a data lake system can be associated with a data warehouse in an enterprise data architecture.
3.1 Data Lake Architectures
3.1.1 Zone Architectures
Inmon designs a data lake as a set of data ponds Inmon2016. A data pond can be viewed as a subdivision of a data lake dealing with data of a specific type. According to Inmon’s specifications, each data pond is associated with a specialized storage system, some specific data processing and conditioning (i.e., data transformation/preparation) and a relevant analysis service. More precisely, Inmon identifies five data ponds (Figure 2).
The raw data pond deals with newly ingested, raw data. It is actually a transit zone, since data are then conditioned and transferred into another data pond, i.e., either the analog, application or textual data pond. The raw data pond, unlike the other ponds, is not associated with any metadata system.
Data stored in the analog data pond are characterized by a very high frequency of measurements, i.e., they come in with high velocity. Typically, semi-structured data from the IoT are processed in the analog data pond.
Data ingested in the application data pond come from software applications, and are thus generally structured data from relational Database Management Systems (DBMSs). Such data are integrated, transformed and prepared for analysis; and Inmon actually considers that the application data pond is a data warehouse.
The textual data pond manages unstructured, textual data. It features a textual disambiguation process to ease textual data analysis.
The purpose of the archival data pond is to save the data that are not actively used, but might still be needed in the future. Archived data may originate from the analog, application and textual data ponds.
So-called zone architectures assign data to a zone according to their degree of refinement Giebler2019. For instance, Zaloni’s data lake Laplante2016 adopts a six-zone architecture (Figure 3).
The transient loading zone deals with data under ingestion. Here, basic data quality checks are performed.
The raw data zone handles data in near raw format coming from the transient zone.
The trusted zone is where data are transferred once standardized and cleansed.
From the trusted area, data move into the discovery sandbox where they can be accessed by data scientists through data wrangling or data discovery operations.
On top of the discovery sandbox, the consumption zone allows business users to run “what if” scenarios through dashboard tools.
The governance zone finally makes it possible to manage, monitor and govern metadata, data quality, a data catalog and security.
However, this is but one of several variants of zone architectures. Such architectures indeed generally differ in the number and characteristics of zones Giebler2019, e.g., some architectures include a transient zone Laplante2016; Tharrington2017; Zikopoulos2015 while others do not Hai2016; Ravat2019B.
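As an illustration of how data move through zones according to their degree of refinement, the following Python sketch models a simplified four-zone lake. The `ZonedLake` class, its method names and the choice of four zones are assumptions made for this example only; they do not reproduce Zaloni’s or any other cited implementation.

```python
# Simplified zone sequence, ordered by increasing data refinement (assumed).
ZONES = ["transient", "raw", "trusted", "refined"]


class ZonedLake:
    """Toy zone architecture: each promotion copies data to the next zone."""

    def __init__(self):
        self.zones = {zone: {} for zone in ZONES}
        self.lineage = []  # (object name, source zone, target zone)

    def land(self, name, data):
        """Ingest an object into the first zone."""
        self.zones["transient"][name] = data

    def promote(self, name, transform=lambda data: data):
        """Copy an object from its most refined zone to the next one."""
        current = next(
            (zone for zone in reversed(ZONES) if name in self.zones[zone]), None
        )
        if current is None:
            raise KeyError(f"{name} is not in the lake")
        position = ZONES.index(current)
        if position == len(ZONES) - 1:
            raise ValueError(f"{name} is already in the last zone")
        target = ZONES[position + 1]
        self.zones[target][name] = transform(self.zones[current][name])
        self.lineage.append((name, current, target))
        return target


lake = ZonedLake()
lake.land("readings.csv", " 1, 2 ")
lake.promote("readings.csv")                       # transient -> raw, unchanged
lake.promote("readings.csv", transform=str.strip)  # raw -> trusted, cleansed
```

In this sketch every promotion leaves the earlier copy in place, so the raw version stays available, at the price of keeping one copy of the object per zone it has crossed.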
A particular zone architecture often mentioned in the data lake literature is the lambda architecture John2017; Mathis2017. It indeed stands out since it includes two data processing zones: a batch processing zone for bulk data and a real-time processing zone for fast data from the IoT John2017. These two zones help handle fast data as well as bulk data in an adapted and specialized way.
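The interplay of the two processing zones can be sketched in a few lines of Python. This is a single-process toy model under strong simplifying assumptions (event counting only, in-memory storage); the `LambdaSketch` class and its method names are hypothetical.

```python
from collections import Counter


class LambdaSketch:
    """Toy lambda architecture: a batch view plus a real-time view."""

    def __init__(self):
        self.master = []             # append-only store of raw events
        self.batch_view = Counter()  # recomputed from scratch at each batch run
        self.speed_view = Counter()  # updated incrementally for each event

    def ingest(self, key):
        self.master.append(key)
        self.speed_view[key] += 1    # real-time path

    def run_batch(self):
        self.batch_view = Counter(self.master)  # bulk path over all raw events
        self.speed_view.clear()      # these events are now covered by the batch view

    def query(self, key):
        # Serving merges the batch view with the not-yet-batched events.
        return self.batch_view[key] + self.speed_view[key]


lake = LambdaSketch()
for event in ["a", "b", "a"]:
    lake.ingest(event)
lake.run_batch()
lake.ingest("a")  # arrives after the batch run, visible through the speed view
```

Even in this toy version, the same counting logic is written twice, once per path, which hints at why handling two processing paradigms complicates analyses.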
In both pond and zone architectures, data are pre-processed. Thus, analyses are quick and easy. However, this comes at the cost of data loss in the pond architecture, since raw data are deleted when transferred to other ponds. The drawbacks of the many zone architectures depend on each variant. For example, in Zaloni’s architecture Laplante2016, data flow across six areas, which may lead to multiple copies of the data and, therefore, difficulties in controlling data lineage. In the lambda architecture John2017, speed and batch processing components follow different paradigms. Thus, data scientists must handle two distinct logics for cross analyses Mathis2017, which makes data analysis harder, overall.
Moreover, the distinction of data lake architectures into pond and zone approaches is not so crisp in our opinion. The pond architecture may indeed be considered as a variant of zone architecture, since data location depends on the refinement level of data, as in zone architectures. In addition, some zone architectures include a global storage zone where raw and cleansed data are stored altogether John2017; Quix2018, which contradicts the definition of zone architectures, i.e., components depend on the degree of data refinement.
3.1.2 Functional Maturity Architectures
To overcome the contradictions of the pond/zone categorization, we propose an alternative way to group data lake architectures regarding the type of criteria used to define components. As a result, we distinguish functional architectures, data maturity-based architectures and hybrid architectures (Figure 4).
Functional architectures
follow some basic functions to define a lake’s components. Data lake basic functions typically include Laplante2016:
a data ingestion function to connect with data sources;
a data storage function to persist raw as well as refined data;
a data processing function;
a data access function to allow raw and refined data querying.
Quix and Hai, as well as Mehmood et al., base their data lake architectures on these functions Mehmood2019; Quix2018. Similarly, John and Misra’s lambda architecture John2017 may be considered as a functional architecture, since its components represent data lake functions such as storage, processing and serving.
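To make the four basic functions concrete, the following Python sketch maps each of them to one method of a toy class. The `FunctionalLake` class, its in-memory dictionary and the callables passed as sources and transformations are all hypothetical; an actual lake would delegate each function to a dedicated technology.

```python
class FunctionalLake:
    """Toy functional architecture: one method per basic function."""

    def __init__(self):
        # Storage function: persists raw as well as refined data.
        self._store = {}

    def ingest(self, name, source):
        """Ingestion function: pull raw content from a source callable."""
        self._store[name] = {"raw": source(), "refined": None}

    def process(self, name, transform):
        """Processing function: derive refined data while keeping raw data."""
        entry = self._store[name]
        entry["refined"] = transform(entry["raw"])

    def access(self, name, level="refined"):
        """Access function: query either raw or refined data."""
        return self._store[name][level]


lake = FunctionalLake()
lake.ingest("readings", source=lambda: "3;1;2")
lake.process("readings", lambda raw: sorted(int(x) for x in raw.split(";")))
```

Keeping the raw version next to the refined one reflects the requirement that the access function serves both raw and refined data.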
Data maturity-based architectures
are data lake architectures where components are defined regarding data refinement level. In other words, this category comprises most zone architectures. A good representative is Zaloni’s data lake architecture Laplante2016, where common basic zones are a transient zone, a raw data zone, a trusted data zone and a refined data zone Laplante2016; Tharrington2017; Zikopoulos2015.
Hybrid architectures
are data lake architectures where the identified components depend on both data lake functions and data refinement. Inmon’s pond architecture is actually a hybrid architecture Inmon2016. On one hand, it is a data maturity-based architecture, since raw data are managed in a special component, i.e., the raw data pond, while refined data are managed in other ponds, i.e., the textual, analog and application data ponds. But on the other hand, the pond architecture is also functional because Inmon’s specifications consider some storage and process components distributed across data ponds (Figure 2). Ravat and Zhao also propose such a hybrid data lake architecture (Figure 5 Ravat2019B).
Functional architectures have the advantage of clearly highlighting the functions to implement for a given data lake, which helps match easily with the required technologies. By contrast, data maturity-based architectures are useful to plan and organize the data lifecycle. Both approaches are thus limited, since they only focus on a unique point of view, while it is important in our opinion to take both functionality and data maturity into account when designing a data lake.
Consequently, we advocate for hybrid approaches. However, existing hybrid architectures can still be improved. For instance, in Inmon’s data pond approach, raw data are deleted once they are refined. This process may induce some data loss, which is contrary to the spirit of data lakes. In Ravat and Zhao’s proposal, data access seems only possible for refined data. Such limitations hint that a more complete hybrid data lake architecture is still needed.
3.2 Technologies for Data Lakes
Most data lake implementations are based on the Apache Hadoop ecosystem Couto2019; Khine2017. Hadoop has indeed the advantage of providing both storage with the Hadoop Distributed File System (HDFS) and data processing tools via MapReduce or Spark. However, Hadoop is not the only suitable technology to implement a data lake. In this section, we go beyond Hadoop to review usable tools to implement data lake basic functions.
3.2.1 Data Ingestion
Ingestion technologies help physically transfer data from data sources into a data lake. A first category of tools includes software that iteratively collects data through pre-designed and industrialized jobs. Most such tools are proposed by the Apache Foundation, and can also serve to aggregate, convert and clean data before ingestion. They include Flink and Samza (distributed stream processing frameworks), Flume (a Hadoop log transfer service), Kafka (a framework providing real time data pipelines and stream processing applications) and Sqoop (a framework for data integration from SQL and NoSQL DBMSs into Hadoop) John2017; Mathis2017; Suriarachchi2016.
A second category of data ingestion technologies is made of common data transfer tools and protocols (wget, rsync, FTP, HTTP, etc.), which are used by the data lake manager within data ingestion scripts. They have the key advantage of being readily available and widely understood Terrizzano2015. In a similar way, some Application Programming Interfaces (APIs) are available for data retrieval and transfer from the Web into a data lake. For instance, CKAN and Socrata provide APIs to access a catalogue of open data and associated metadata Terrizzano2015.
3.2.2 Data Storage
We distinguish two main approaches to store data in data lakes. The first way consists in using classic databases for storage. Some data lakes indeed use relational DBMSs such as MySQL, PostgreSQL or Oracle to store structured data Beheshti2017; Khine2017. However, relational DBMSs are ill-adapted to semi-structured, and even more so to unstructured data. Thus, NoSQL (Not only SQL) DBMSs are usually used instead Beheshti2017; Giebler2019; Khine2017. Moreover, assuming that data variety is the norm in data lakes, a multi-paradigm storage system is particularly relevant Nogueira2018. Such so-called multistore systems manage multiple DBMSs, each matching a specific storage need.
The second and most widely used way to store data is HDFS (in about 75% of data lakes Russom2017). HDFS is a distributed storage system that offers very high scalability and handles all types of data John2017. Thus, it is well suited to the schema-free, bulk storage that unstructured data require. Another advantage of this technology is data distribution, which provides high fault tolerance. However, HDFS alone is not sufficient to handle all data formats, especially structured data. Thus, it should ideally be combined with relational and/or NoSQL DBMSs.
3.2.3 Data Processing
In data lakes, data processing is very often performed with MapReduce Couto2019; John2017; Khine2017; Mathis2017; Stein2014; Suriarachchi2016, a parallel data processing paradigm provided by Apache Hadoop. MapReduce is well-suited to very large data, but is less efficient for fast data because it works on disk Tiao2018. Thus, alternative processing frameworks are used, of which the best known is Apache Spark. Spark works like MapReduce, but adopts a full in-memory approach instead of using the file system for storing intermediate results. Hence, Spark is particularly suitable for real-time processing. Similarly, Apache Flink and Apache Storm are also suitable for real-time data processing John2017; Khine2017; Mathis2017; Suriarachchi2016; Tiao2018. Nevertheless, these two approaches can be simultaneously implemented in a data lake, with MapReduce being dedicated to voluminous data and stream-processing engines to velocious data John2017; Suriarachchi2016.
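The MapReduce paradigm itself can be illustrated with a word count written in plain Python. This sketch runs in a single process and in memory, so it only shows the map, shuffle and reduce phases; it does not use the Hadoop API, and the function names are chosen for this example.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit one (word, 1) pair per word occurrence."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: aggregate the values of each key."""
    return {key: sum(values) for key, values in groups.items()}


documents = ["data lake data", "lake metadata"]
pairs = [pair for document in documents for pair in map_phase(document)]
word_counts = reduce_phase(shuffle_phase(pairs))
```

In a distributed setting, the map and reduce calls would run in parallel on different nodes, and the grouped intermediate pairs are what Hadoop MapReduce writes to disk between the two phases.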
3.2.4 Data Access
In data lakes, data may be accessed through classical query languages such as SQL for relational DBMSs, JSONiq for MongoDB, XQuery for XML DBMSs or SPARQL for RDF resources Farid2016; Fauduet2010; Hai2016; Laskowski2016; Pathirana2015. However, this does not allow simultaneously querying across heterogeneous databases, while data lakes do store heterogeneous data, and thus typically require heterogeneous storage systems.
One solution to this issue is to adopt the query techniques from multistores (Section 3.2.2) Nogueira2018. For example, Spark SQL and SQL++ may be used to query both relational DBMSs and semi-structured data in JSON format. In addition, the Scalable Query Rewriting Engine (SQRE) handles graph databases Hai2018. Finally, CloudMdsQL also helps simultaneously query multiple relational and NoSQL DBMSs Leclercq2018. Quite similarly, Apache Phoenix can be used to automatically convert SQL queries into a NoSQL query language, for example. Apache Drill allows joining data from multiple storage systems Beheshti2017. Data stored in HDFS can also be accessed using Apache Pig John2017.
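The idea of querying several stores at once can be sketched with the Python standard library alone. In this illustration, an in-memory SQLite database stands for a relational DBMS and a list of JSON strings stands for a document store; the `total_per_customer` mediator function and all table and field names are hypothetical, and the sketch does not reproduce the behavior of Spark SQL, CloudMdsQL or any other cited tool.

```python
import json
import sqlite3

# Relational side: customers in a SQL table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ada"), (2, "Bob")])

# Semi-structured side: orders as JSON documents.
orders_json = [
    '{"customer_id": 1, "total": 30.0}',
    '{"customer_id": 1, "total": 12.5}',
    '{"customer_id": 2, "total": 8.0}',
]


def total_per_customer(conn, json_docs):
    """Join SQL rows with JSON documents on the customer identifier."""
    totals = {}
    for doc in map(json.loads, json_docs):
        key = doc["customer_id"]
        totals[key] = totals.get(key, 0.0) + doc["total"]
    rows = conn.execute("SELECT id, name FROM customer ORDER BY id").fetchall()
    return [(name, totals.get(customer_id, 0.0)) for customer_id, name in rows]


result = total_per_customer(conn, orders_json)
```

A multistore query engine spares users from writing such mediation code by hand, since the join is expressed once in a single query language.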
Finally, business users, who require interactive and user-friendly tools for data reporting and visualization tasks, widely use dashboard services such as Microsoft Power BI and Tableau over data lakes Couto2019; Russom2017.
3.3 Combining Data Lakes and Data Warehouses
There are in the literature two main approaches to combine a data lake and a data warehouse in a global data management system. The first approach pertains to using a data lake as the data source of a data warehouse (Section 3.3.1). The second considers data warehouses as components of data lakes (Section 3.3.2).
3.3.1 Data Lake Sourcing a Data Warehouse
This approach aims to take advantage of the specific characteristics of both data lakes and data warehouses. Since data lakes allow easier and cheaper storage of large amounts of raw data, they can be considered as staging areas or Operational Data Stores (ODSs) Fang2015; Russom2017, i.e., intermediary data stores ahead of data warehouses that gather operational data from several sources before the ETL process takes place.
With a data lake sourcing a data warehouse, possibly with semi-structured data, industrialized OLAP analyses are possible over the lake’s data, while on-demand, ad-hoc analyses are still possible directly from the data lake (Figure 6).
3.3.2 Data Warehouse within a Data Lake
As detailed in Section 3.1.1, Inmon proposes an architecture based on a subdivision of data lakes into so-called data ponds Inmon2016. For Inmon, structured data ponds sourced from operational applications are, plain and simple, data warehouses. Thus, this approach relies on a conception of data lakes as extensions of data warehouses.
When a data lake sources a data warehouse (Section 3.3.1), there is a clear functional separation, as data warehouses and data lakes are specialized in industrialized and on-demand analyses, respectively. However, this comes with a data siloing issue.
By contrast, the data siloing syndrome can be reduced in Inmon’s approach (Section 3.3.2), as all data are managed and processed in a unique global platform. Hence, diverse data can easily be combined through cross-reference analyses, which would be impossible if data were managed separately. In addition, building a data warehouse inside a global data lake may improve data lifecycle control. That is, it should be easier to track, and thus to reproduce processes applied to the data that are ingested in the data warehouse, via the data lake’s tracking system.
4 Metadata Management in Data Lakes
Data ingested in data lakes bear no explicit schema Miloslavskaya2016, which can easily turn a data lake into a data swamp in the absence of an efficient metadata system Suriarachchi2016. Thence, metadata management plays an essential role in data lakes Laskowski2016; Khine2017. In this section, we detail the metadata management techniques used in data lakes. First, we identify the metadata that are relevant to data lakes. Then, we review how metadata can be organized. We also investigate metadata extraction tools and techniques. Finally, we provide an inventory of desirable features in metadata systems.
4.1 Metadata Categories
We identify in the literature two main typologies of metadata dedicated to data lakes. The first one distinguishes functional metadata, while the second classifies metadata with respect to structural metadata types.
4.1.1 Functional Metadata
Oram introduces a metadata classification in three categories, with respect to the way they are gathered Oram2015.
Business metadata are defined as the set of descriptions that make the data more understandable and define business rules. More concretely, these are typically data field names and integrity constraints. Such metadata are usually defined by business users at the data ingestion stage.
Operational metadata are information automatically generated during data processing. They include descriptions of the source and target data, e.g., data location, file size, number of records, etc., as well as process information.
Technical metadata express how data are represented, including data format (e.g., raw text, JPEG image, JSON document, etc.), structure or schema. The data structure consists in characteristics such as names, types, lengths, etc. They are commonly obtained from a DBMS for structured data, or via custom techniques during the data maturation stage.
Diamantini et al. enhance this typology with a generic metadata model Diamantini2018 and show that business, operational and technical metadata sometimes intersect. For instance, data fields relate both to business and technical metadata, since they are defined in data schemas by business users. Similarly, data formats may be considered as both technical and operational metadata, and so on (Figure 7).
4.1.2 Structural Metadata
In this classification, Sawadogo et al. categorize metadata with respect to the “objects” they relate to Sawadogo2019B. The notion of object may be viewed as a generalization of the dataset concept Maccioni2018, i.e., an object may be a relational or spreadsheet table in a structured data context, or a simple document (e.g., XML document, image file, video file, textual document, etc.) in a semi-structured or unstructured data context. Thence, we use the term “object” in the remainder of this paper.
Intra-object metadata
are a set of characteristics associated with single objects in the lake. They are subdivided into four main subcategories.
Properties provide an object’s general description. They are generally retrieved from the filesystem as key-value pairs, e.g., file name and size, location, date of last modification, etc.
Previsualization and summary metadata aim to provide an overview of the content or structure of an object. For instance, metadata can be extracted data schemas for structured and semi-structured data, or wordclouds for textual data.
Version and representation metadata are made of altered data. When a new data object $o'$ is generated from an existing object $o$ in the data lake, $o'$ may be considered as metadata for $o$. Version metadata are obtained through data updates, while representation metadata come from data refining operations. For instance, a refining operation may consist of vectorizing a textual document into a bag-of-words for further automatic processing.
Semantic metadata involve annotations that describe the meaning of data in an object. They include such information as title, description, categorization, descriptive tags, etc. They often allow data linking. Semantic metadata can be either generated using semantic resources such as ontologies, or manually added by business users Hai2016; Quix2016.
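As an illustration of the first two subcategories, the following Python sketch retrieves properties from the filesystem as key-value pairs and builds a crude summary of a JSON object. The helper names and the summary format (top-level field names with Python type names) are assumptions made for this example, not a method from the cited works.

```python
import json
import os
import tempfile


def extract_properties(path):
    """Properties: general description retrieved from the filesystem."""
    stat = os.stat(path)
    return {
        "name": os.path.basename(path),
        "location": os.path.dirname(os.path.abspath(path)),
        "size_bytes": stat.st_size,
        "last_modified": stat.st_mtime,
    }


def summarize_json(path):
    """Summary metadata: top-level field names and their Python type names."""
    with open(path, encoding="utf-8") as handle:
        document = json.load(handle)
    return {key: type(value).__name__ for key, value in document.items()}


# A temporary directory stands for the lake's storage in this example.
with tempfile.TemporaryDirectory() as lake_dir:
    path = os.path.join(lake_dir, "reading.json")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump({"sensor": "s1", "temp": 21.5}, handle)
    properties = extract_properties(path)
    summary = summarize_json(path)
```

Semantic metadata, by contrast, cannot be read off the file in this way and would have to be supplied by users or derived from semantic resources.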
Inter-object metadata
represent links between two or more objects. They are subdivided into three categories.
Object groupings organize objects into collections. Any object may be associated with several collections. Such links can be automatically deduced from some intra-object metadata such as tags, data format, language, owner, etc.
Similarity links express the strength of likeness between objects. They are obtained via common or custom similarity measures. For instance, Maccioni2018 define the affinity and joinability measures to express the similarity between semi-structured objects.
Parenthood links aim to save data lineage, i.e., when a new object is created from the combination of several others, these metadata record the process. Parenthood links are thus automatically generated during data joins.
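The three kinds of links can be sketched in Python over a tiny, made-up catalog. Note that the similarity measure used here is the Jaccard index over tags, chosen for simplicity; it is not the affinity or joinability measure of Maccioni2018, and the object names, tags and helper functions are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical catalog: object name -> a few intra-object metadata.
objects = {
    "sales.csv": {"format": "csv", "tags": {"sales", "2019", "europe"}},
    "stores.csv": {"format": "csv", "tags": {"stores", "europe"}},
    "review.txt": {"format": "text", "tags": {"sales", "reviews"}},
}


def group_by(objects, key):
    """Object groupings: collections deduced from an intra-object metadata."""
    groups = defaultdict(set)
    for name, metadata in objects.items():
        groups[metadata[key]].add(name)
    return dict(groups)


def jaccard(left, right):
    """Jaccard index between two tag sets (assumed similarity measure)."""
    union = left | right
    return len(left & right) / len(union) if union else 0.0


def similarity_links(objects, threshold=0.2):
    """Similarity links: weighted pairs whose score reaches the threshold."""
    links = {}
    for left, right in combinations(sorted(objects), 2):
        score = jaccard(objects[left]["tags"], objects[right]["tags"])
        if score >= threshold:
            links[(left, right)] = score
    return links


def record_join(parenthood, child, parents):
    """Parenthood links: remember which objects a new object derives from."""
    parenthood[child] = tuple(parents)


groups = group_by(objects, "format")
links = similarity_links(objects)
parenthood = {}
record_join(parenthood, "sales_by_store.csv", ["sales.csv", "stores.csv"])
```

Here the groupings and similarity links are derived automatically from existing metadata, whereas the parenthood link is recorded at the moment the join is performed.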
Global metadata
are data structures that provide a context layer to make data processing and analysis easier. Global metadata are not directly associated with any specific object, but potentially concern the entire lake. There are three subcategories of global metadata.
Semantic resources are knowledge bases such as ontologies, taxonomies, thesauri, etc., which notably help enhance analyses. For instance, an ontology can be used to automatically extend a term-based query with equivalent terms. Semantic resources are generally obtained from the Internet or manually built.
Indexes (including inverted indexes) enhance term-based or pattern-based data retrieval. They are automatically built and enriched by an indexing system.
Logs track user interactions with the data lake, which can be simple, e.g., user connection or disconnection, or more complex, e.g., a job running.
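The following Python sketch shows how the three kinds of global metadata may cooperate in a term-based search: an inverted index locates documents, a small thesaurus extends the query with equivalent terms, and a log records the interaction. The documents, the thesaurus content and the `search` function are hypothetical examples.

```python
from collections import defaultdict

documents = {
    "doc1": "customer purchase history",
    "doc2": "client complaints",
    "doc3": "sensor readings",
}


def build_inverted_index(documents):
    """Index: map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


# Semantic resource: a tiny, hand-built thesaurus of equivalent terms.
thesaurus = {"customer": {"client"}, "client": {"customer"}}

# Log: one entry per user interaction with the lake.
access_log = []


def search(index, term, user):
    """Term-based retrieval, extended with equivalent terms and logged."""
    terms = {term} | thesaurus.get(term, set())
    access_log.append({"user": user, "action": "search", "term": term})
    hits = set()
    for candidate in terms:
        hits |= index.get(candidate, set())
    return hits


index = build_inverted_index(documents)
hits = search(index, "customer", user="alice")
```

Without the thesaurus, the query would miss the second document, which only mentions the equivalent term.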
Oram’s metadata classification is the most cited, especially in the industrial literature Diamantini2018; Laplante2016; Ravat2019A; Russom2017, presumably because it is inspired by metadata categories from data warehouses Ravat2019B. Thus, its adoption seems easier and more natural for practitioners who are already working with it.
Yet, we favor the second metadata classification, because it includes most of the features defined by Oram’s. Business metadata are indeed comparable to semantic metadata. Operational metadata may be considered as logs and technical metadata are equivalent to previsualization metadata. Hence, the structural metadata categorization can be considered as an extension, as well as a generalization, of the functional metadata classification.
Moreover, Oram’s classification is quite fuzzy when applied in the context of data lakes. Diamantini et al. indeed show that functional metadata intersect (Section 4.1.1) Diamantini2018. Therefore, practitioners who do not know this typology may be confused when using it to identify and organize metadata in a data lake.
Table 1 summarizes commonalities and differences between the two metadata categorizations presented above. The comparison addresses the type of information both inventories provide.
|Type of information||Functional metadata||Structural metadata|
|Basic characteristics of data (size, format, etc.)||✓||✓|
|(tags, descriptions, etc.)|
4.2 Metadata Modeling
There are in the literature two main approaches to represent a data lake’s metadata system. The first and most common approach adopts a graph view, while the second exploits data vault modeling.
4.2.1 Graph Models
Most models that manage data lake metadata systems are based on a graph approach. We identify three main subcategories of graph-based metadata models with respect to the main features they target.
Data provenance-centered graph models mostly manage metadata tracing, i.e., the information about activities, data objects and users who interact with a specific object Suriarachchi2016. In other words, they track the pedigree of data objects Halevy2016B. Provenance representations are usually built using a directed acyclic graph (DAG) where nodes represent entities such as users, roles or objects Beheshti2017; Hellerstein2017. Edges are used to express and describe interactions between entities, e.g., through a simple timestamp, activity type (read, create, modify) Beheshti2017, system status (CPU, RAM, bandwidth) Suriarachchi2016 or even the script used Hellerstein2017. For instance, Figure 8-a shows a basic provenance model with nodes representing data objects and edges symbolizing operations. Data provenance tracking helps ensure the traceability and repeatability of processes in data lakes. Thus, provenance metadata can be used to understand, explain and repair inconsistencies in the data Beheshti2017. They may also serve to protect sensitive data, by detecting intrusions Suriarachchi2016.
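The provenance DAG described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the model of any cited system: the `ProvenanceGraph` and `Operation` names, the edge fields and the file names are all hypothetical. Nodes are data objects and each edge carries the operation that derived one object from another.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    """Edge annotation (hypothetical fields): what happened between two objects."""
    activity: str   # e.g., "create", "read" or "modify"
    user: str
    timestamp: str  # ISO 8601 string


class ProvenanceGraph:
    """Directed acyclic graph: an edge source -> target means target derives from source."""

    def __init__(self):
        # Maps each object to the list of (parent object, operation) pairs.
        self._parents = defaultdict(list)

    def ancestors(self, obj):
        """Return every object that `obj` was directly or indirectly derived from."""
        seen = set()
        stack = [obj]
        while stack:
            current = stack.pop()
            for parent, _ in self._parents.get(current, []):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen

    def record(self, source, target, operation):
        """Add an edge, refusing any edge that would break acyclicity."""
        if source == target or target in self.ancestors(source):
            raise ValueError("edge would create a cycle")
        self._parents[target].append((source, operation))


graph = ProvenanceGraph()
graph.record("raw.csv", "clean.csv", Operation("modify", "alice", "2019-01-01T10:00:00"))
graph.record("clean.csv", "report.pdf", Operation("create", "bob", "2019-01-02T09:30:00"))
lineage = graph.ancestors("report.pdf")
```

Walking the ancestors of an object yields its pedigree, which is the basis for explaining an inconsistency or replaying a process.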
Similarity-centered graph models describe the metadata system as an undirected graph where nodes are data objects and edges express a similarity between objects. Such a similarity can be specified either through a weighted or unweighted edge. Weighted edges show the similarity strength, when a formal similarity measure is used, e.g., affinity and joinability Maccioni2018 (Figure 8-b). In contrast, unweighted edges serve to simply detect whether two objects are connected Farrugia2016. Such a graph design allows network analyses over a data lake Farrugia2016, e.g., discovering communities or calculating the centrality of nodes, and thus their importance in the lake. Another use of data similarity may be to automatically recommend to lake users some data related to the data they currently observe Maccioni2018.
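As a minimal sketch of a weighted similarity graph, the following Python code links data objects whose attribute names overlap. The Jaccard coefficient is used here purely as a stand-in measure; it is an assumption of this example and not the affinity or joinability measure of the cited work. The dataset names and the `recommend` helper are likewise hypothetical.

```python
from itertools import combinations


def jaccard(left, right):
    """Jaccard coefficient between two collections of attribute names."""
    left, right = set(left), set(right)
    union = left | right
    return len(left & right) / len(union) if union else 0.0


def build_similarity_graph(objects, threshold=0.0):
    """Return undirected weighted edges as {frozenset({u, v}): weight}.

    Only pairs whose similarity is strictly above `threshold` are kept.
    """
    edges = {}
    for u, v in combinations(sorted(objects), 2):
        weight = jaccard(objects[u], objects[v])
        if weight > threshold:
            edges[frozenset((u, v))] = weight
    return edges


def recommend(edges, name, k=3):
    """Return up to k neighbors of `name`, strongest similarity first."""
    neighbors = []
    for pair, weight in edges.items():
        if name in pair:
            (other,) = pair - {name}
            neighbors.append((other, weight))
    return sorted(neighbors, key=lambda item: (-item[1], item[0]))[:k]


objects = {
    "customers": ["id", "name", "city"],
    "orders": ["id", "customer_id", "amount"],
    "cities": ["city", "country"],
}
edges = build_similarity_graph(objects)
suggestions = recommend(edges, "customers")
```

Dropping the weights and keeping only the edge keys gives the unweighted variant, which is enough for connectivity or community analyses.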
Composition-centered graph models help decompose each data object into several inherent elements. The lake is viewed as a DAG where nodes represent objects or attributes, e.g., columns, tags, etc., and edges from one node to another express a composition constraint between them Diamantini2018; Halevy2016B; Nargesian2018. This organization helps users navigate through the data Nargesian2018. It can also be used as a basis to detect connections between objects. For instance, Diamantini2018 used a simple string measure to detect links between heterogeneous objects by comparing their respective tags.
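The tag-comparison idea can be illustrated with a short Python sketch. The string measure below is the ratio computed by the standard library's `difflib.SequenceMatcher`, chosen only for convenience; the measure actually used in the cited work is not specified here. The object names, tags and threshold are hypothetical.

```python
from difflib import SequenceMatcher

# Composition edges: each data object points to the tags it is composed of.
composition = {
    "sales.csv": ["customer", "amount", "date"],
    "clients.json": ["customers", "address"],
}


def tag_similarity(left, right):
    """String similarity in [0, 1] between two tags, ignoring case."""
    return SequenceMatcher(None, left.lower(), right.lower()).ratio()


def linked_objects(composition, threshold=0.8):
    """Return (object, object, tag, tag) tuples whose tags are similar enough."""
    links = []
    names = sorted(composition)
    for index, left in enumerate(names):
        for right in names[index + 1:]:
            for left_tag in composition[left]:
                for right_tag in composition[right]:
                    if tag_similarity(left_tag, right_tag) >= threshold:
                        links.append((left, right, left_tag, right_tag))
    return links


links = linked_objects(composition)
```

Here the two heterogeneous objects are connected only through the near-identical tags "customers" and "customer"; every other tag pair falls below the threshold.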
4.2.2 Data Vault
A data lake aims at ingesting new data possibly bearing various structures. Thus, its metadata system needs to be flexible to easily tackle new data schemas. Nogueira et al. propose the use of a data vault to address this issue Nogueira2018. Data vaults are indeed alternative logical models to data warehouse star schemas that, unlike star schemas, allow easy schema evolution Linstedt2011. Data vault modeling involves three types of entities Hultgren2016.
A hub represents a business concept, e.g., customer, vendor, sale or product in a business decision system.
A link represents a relationship between two or more hubs.
Satellites contain descriptive information associated with a hub or a link. Each satellite is attached to a unique hub or link. In contrast, links or hubs may be associated with any number of satellites.
In Nogueira et al.’s proposal, metadata common to all objects, e.g., title, category, date and location, are stored in hubs; while metadata specific to some objects only, e.g., language for textual documents or publisher for books, are stored in satellites (Figure 9). Moreover, any new type of object would have its specific metadata stored in a new satellite.
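To make the hub and satellite split concrete, here is a minimal sketch using Python's built-in `sqlite3` module. It follows the description above, with common metadata in a hub and type-specific metadata in satellites. All table and column names are assumptions made for this example, not the schema of the cited proposal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hub: metadata common to all objects.
CREATE TABLE hub_object (
    object_id   INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    category    TEXT,
    object_date TEXT,
    location    TEXT
);
-- Satellites: metadata specific to some object types only.
CREATE TABLE sat_text_document (
    object_id INTEGER NOT NULL REFERENCES hub_object(object_id),
    language  TEXT
);
CREATE TABLE sat_book (
    object_id INTEGER NOT NULL REFERENCES hub_object(object_id),
    publisher TEXT
);
""")
conn.execute(
    "INSERT INTO hub_object VALUES (1, 'Annual report', 'report', '2019-01-15', 'Lyon')"
)
conn.execute("INSERT INTO sat_text_document VALUES (1, 'fr')")

# A new type of object only requires a new satellite; existing tables are untouched.
conn.execute(
    "CREATE TABLE sat_image ("
    "object_id INTEGER NOT NULL REFERENCES hub_object(object_id), resolution TEXT)"
)
conn.execute(
    "INSERT INTO hub_object VALUES (2, 'Site photo', 'image', '2019-02-01', 'Lyon')"
)
conn.execute("INSERT INTO sat_image VALUES (2, '1920x1080')")

rows = conn.execute("""
    SELECT h.title, t.language, i.resolution
    FROM hub_object AS h
    LEFT JOIN sat_text_document AS t ON t.object_id = h.object_id
    LEFT JOIN sat_image AS i ON i.object_id = h.object_id
    ORDER BY h.object_id
""").fetchall()
```

The schema evolves by adding tables rather than altering existing ones, which is the flexibility that motivates data vault modeling here.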
Data vault modeling is seldom associated with data lakes in the literature, presumably because it is primarily associated with data warehouses. Yet, this approach ensures metadata schema evolvability, which is required to build an efficient data lake. Another advantage of data vault modeling is that, unlike graph models, it can be intuitively implemented in a relational DBMS. However, several adaptations are still needed for this model to deal with data linkage as in graph models.
Graph models, though requiring more specific storage systems such as RDF or graph DBMSs, are still advantageous because they allow automatically enriching the lake with information that facilitates and enhances future analyses. Nevertheless, the three subcategories of graph models all need to be integrated together for this purpose. This remains an open issue because at most two of these graph approaches are simultaneously implemented in metadata systems from the literature Diamantini2018; Halevy2016B. The MEDAL metadata model does include all three subcategories of graph models Sawadogo2019B, but is not implemented yet.
4.3 Metadata Generation
Most of the data ingestion tools from Section 3.2.1 can also serve to extract metadata. For instance, Suriarachchi and Plale use Apache Flume to retrieve data provenance in a data lake Suriarachchi2016. Similarly, properties and semantic metadata can be obtained through specialized protocols such as the Comprehensive Knowledge Archive Network (CKAN), an open data storage management system Terrizzano2015.
A second kind of technology is more specific to metadata generation. For instance, Apache Tika helps detect the MIME type and language of objects Quix2016. Other tools such as Open Calais and IBM’s Alchemy API can also enrich data through inherent entity identification, relationship inference and event detection Farid2016.
Ad-hoc algorithms can also generate metadata. For example, Singh et al. show that Bayesian models allow detecting links between data attributes Singh2016. Similarly, several authors propose algorithms to discover schemas or constraints in semi-structured data Beheshti2017; Klettke2017; Quix2016.
Last but not least, Apache Atlas ApacheAtlas, a widely used metadata framework Russom2017, features advanced metadata generation methods through so-called hooks, which are native or custom scripts that rely on logs to generate metadata. Hooks notably help Atlas automatically extract lineage metadata and propagate tags on all derivations of tagged data.
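The idea of propagating tags to all derivations of tagged data can be sketched independently of any framework. The Python code below does not use or imitate the Apache Atlas API; it is a plain illustration of the concept, and the `propagate_tags` function, dataset names and tag are hypothetical.

```python
from collections import deque


def propagate_tags(derivations, tags):
    """Push every tag of an object down to all objects derived from it.

    derivations: dict mapping a source object to the objects derived from it.
    tags: dict mapping an object to its initial set of tags.
    Returns a new dict; the inputs are left unchanged.
    """
    result = {obj: set(obj_tags) for obj, obj_tags in tags.items()}
    queue = deque(tags)
    while queue:
        current = queue.popleft()
        for child in derivations.get(current, []):
            child_tags = result.setdefault(child, set())
            missing = result[current] - child_tags
            if missing:
                # The child gained tags, so its own derivations must be revisited.
                child_tags |= missing
                queue.append(child)
    return result


derivations = {
    "raw_customers": ["clean_customers"],
    "clean_customers": ["customer_report"],
}
tags = {"raw_customers": {"PII"}}
propagated = propagate_tags(derivations, tags)
```

A sensitive-data tag set once on a source thus follows the lineage down to every derived object.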
4.4 Features of Data Lake Metadata Systems
A data lake made inoperable by lack of proper metadata management is called a data swamp Khine2017, data dump Inmon2016; Suriarachchi2016 or one-way data lake Inmon2016, with data swamp being the most common term. In such a flawed data lake, data are ingested, but can never be extracted. Thus, a data swamp is unable to ensure any analysis. Yet, to the best of our knowledge, there is no objective way to measure or compare the efficiency of data lake metadata systems. Therefore, we first introduce in this section a list of expected features for a metadata system. Then, we present a comparison of eighteen data lake metadata systems with respect to these features.
4.4.1 Feature Identification
Sawadogo et al. identify six features that a data lake metadata system should ideally implement to be considered comprehensive Sawadogo2019B.
Semantic Enrichment (SE) is also known as semantic annotation Hai2016 or semantic profiling Ansari2018. It involves adding information such as title, tags, description and more to make the data comprehensible Terrizzano2015. This is commonly done using knowledge bases such as ontologies Ansari2018. Semantic annotation plays a vital role in data lakes, since it makes the data meaningful by providing informative summaries Ansari2018. In addition, semantic metadata could be the basis of link generation between data Quix2016. For instance, data objects with the same tags could be considered linked.
Data Indexing (DI) is commonly used in the information retrieval and database domains to quickly find a data object. Data indexing is done by building and enriching some data structure that enables efficient data retrieval from the lake. Indexing can serve for both simple keyword-based retrieval and more complex querying using patterns. All data, whether structured, semi-structured or unstructured, benefit from indexing Singh2016.
Link Generation (LG) consists in identifying and integrating links between lake data. This can be done either by ingesting pre-existing links from data sources or by detecting new links. Link generation allows additional analyses. For instance, similarity links can serve to recommend to lake users data close to the data they currently use Maccioni2018. In the same line, data links can be used to automatically detect clusters of strongly linked data Farrugia2016.
Data Polymorphism (DP) is the simultaneous management of several data representations in the lake. A data representation of, e.g., a textual document, may be a tag cloud or a vector of term frequencies. Semi-structured and unstructured data need to be at least partially transformed to be automatically processed Diamantini2018. Thus, data polymorphism is relevant as it allows storing and reusing transformed data. This makes analyses easier and faster by avoiding the repetition of certain processes Stefanowski2017.
Data Versioning (DV) expresses a metadata system’s ability to manage update operations, while retaining the previous data states. It is very relevant to data lakes, since it ensures process reproducibility and the detection and correction of inconsistencies Bhattacherjee2018. Moreover, data versioning allows branching and concurrent data evolution Hellerstein2017.
Usage Tracking (UT) consists in managing information about user interactions with the lake. Such interactions are commonly creation, read and update operations. Tracking them makes the evolution of data objects transparent. In addition, usage tracking can serve for data security, either by explaining data inconsistencies or through intrusion detection. Usage tracking and data versioning are related, since update interactions often induce new data versions. However, they are distinct features as they can be implemented independently Beheshti2017; Suriarachchi2016.
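Two of the features above, data versioning and usage tracking, can be illustrated together in a small Python sketch. The `MetadataCatalog` class and its methods are hypothetical and only show how the two features relate while remaining separable: versions live in one structure, the usage log in another.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class UsageEvent:
    user: str
    action: str        # "create", "update" or "read"
    object_name: str
    version: int
    at: datetime


@dataclass
class MetadataCatalog:
    versions: dict = field(default_factory=dict)  # object name -> list of payloads
    log: list = field(default_factory=list)       # chronological usage events

    def _track(self, user, action, name, version):
        self.log.append(UsageEvent(user, action, name, version, datetime.now(timezone.utc)))

    def write(self, user, name, payload):
        """Store a new version without overwriting earlier ones; return its number."""
        history = self.versions.setdefault(name, [])
        action = "update" if history else "create"
        history.append(payload)
        self._track(user, action, name, len(history))
        return len(history)

    def read(self, user, name, version=None):
        """Read a given version (1-based), or the latest one when version is None."""
        history = self.versions[name]
        version = len(history) if version is None else version
        if not 1 <= version <= len(history):
            raise KeyError(f"{name} has no version {version}")
        self._track(user, "read", name, version)
        return history[version - 1]


catalog = MetadataCatalog()
catalog.write("alice", "sales", "first load")
catalog.write("bob", "sales", "corrected load")
old_state = catalog.read("carol", "sales", version=1)
latest_state = catalog.read("carol", "sales")
```

Removing the `_track` calls would leave versioning intact, and vice versa, which reflects the point that the two features can be implemented independently.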
4.4.2 Metadata System Comparison
We present in Table 2 a comparison of eighteen state-of-the-art metadata systems and models with respect to the features they implement Sawadogo2019B. We distinguish metadata models from implementations. Models are indeed quite theoretical and describe the conceptual organization of metadata. In contrast, implementations follow a more operational approach, but usually provide little detail, mainly focusing on a description of the resulting system instead of the applied methodology. This comparison also considers metadata systems that are not explicitly associated with the concept of data lake by their authors, but whose characteristics allow them to be considered as such, e.g., the Ground metadata model Hellerstein2017.
The comparison shows that the most comprehensive metadata system with respect to the features we propose is MEDAL, with all features covered. However, it is not implemented yet. The next best systems are GOODS and CoreKG, with five out of six features implemented. However, they are black box metadata systems, with few details on metadata conceptual organization. Thus, the Ground metadata model may be preferred, since it is much more detailed and almost as complete (four out of six features).
Finally, two of the six features defined in Section 4.4.1 may be considered advanced. Data polymorphism and data versioning are indeed mainly found in the most complete systems such as GOODS, CoreKG and Ground. Their absence from most metadata systems can thus be attributed to implementation complexity.
[Table 2: comparison of metadata systems, distinguishing implementations from metadata models and marking models or implementations akin to a data lake]
5 Pros and Cons of Data Lakes
In this section, we account for the benefits of using a data lake instead of more traditional data management systems, but also identify the pitfalls that may correspond to these expected benefits.
An important motivating feature in data lakes is cheap storage. Data lakes are ten to one hundred times less expensive to deploy than traditional decision-oriented databases. This can be attributed to the usage of open-source technologies such as HDFS Khine2017; Stein2014. Another reason is that the cloud storage often used to build data lakes reduces the cost of storage technologies. That is, the data lake owner pays only for actually used resources. However, the use of HDFS may still fuel misconceptions, with the concept of data lake remaining ambiguous for many potential users. It is indeed often considered either as a synonym or a marketing label closely related to the HDFS technology Alrehamy2015; Grosser2016.
Another feature that lies at the core of the data lake concept is data fidelity. Unlike in traditional decision-oriented databases, original data are indeed preserved in a data lake to avoid any data loss that could occur from data preprocessing and transformation operations Ganore2015; Stein2014. Yet, data fidelity induces a high risk of data inconsistency in data lakes, due to data integration from multiple, disparate sources without any transformation OLeary2014.
One of the main benefits of data lakes is that they allow exploiting and analyzing unstructured data Ganore2015; Laskowski2016; Stein2014. This is a significant advantage when dealing with big data, which are predominantly unstructured Miloslavskaya2016. Moreover, due to the schema-on-read approach, data lakes can comply with any data type and format Cha2018; Ganore2015; Khine2017; Madera2016. Hence, data lakes enable a wider range of analyses than traditional decision-oriented databases, i.e., data warehouses and datamarts, and thus show better flexibility and agility. However, although the concept of data lake dates back to 2010, it has only been put into practice since the mid-2010s. Thus, implementations vary and are still maturing, and there is a lack of methodological and technical standards, which sustains confusion about data lakes. Finally, due to the absence of an explicit schema, data access services and APIs are essential to enable knowledge extraction in a data lake. In other words, a data access service is a must to successfully build a data lake Alrehamy2015; Inmon2016, while such a service is not always present.
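The schema-on-read approach mentioned above can be illustrated with a minimal Python sketch: records are kept exactly as ingested, and a schema is applied only when a given analysis reads them. The record contents and the `read_with_schema` helper are hypothetical.

```python
import json

# Raw records are stored "as is"; no schema is enforced at ingestion time.
RAW_RECORDS = [
    '{"id": "1", "amount": "19.90", "city": "Lyon"}',
    '{"id": "2", "amount": "5", "comment": "no city recorded"}',
]


def read_with_schema(raw_lines, schema):
    """Apply `schema` (field name -> cast function) at read time.

    Fields absent from a record are returned as None; fields outside the
    schema are ignored for this read but remain in the raw data.
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            name: (cast(record[name]) if name in record else None)
            for name, cast in schema.items()
        }


# Two analyses read the same raw data through two different schemas.
sales_view = list(read_with_schema(RAW_RECORDS, {"id": int, "amount": float}))
geo_view = list(read_with_schema(RAW_RECORDS, {"id": int, "city": str}))
```

Because no structure is imposed up front, each analysis must bring its own schema, which is why a data access layer and good metadata matter so much.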
Next, an acclaimed advantage of data lakes over data warehouses is real-time data ingestion. Data are indeed ingested in data lakes without any transformation, which avoids any time lag between data extraction from sources and their ingestion in the data lake Ganore2015; Laskowski2016. But as a consequence, a data lake requires an efficient metadata system for ensuring data access. However, the problem lies in the “how”, i.e., the use of inappropriate methods or technologies to build the metadata system can easily turn the data lake into an inoperable data swamp Alrehamy2015.
More technically, data lakes and related analyses are typically implemented using distributed technologies, e.g., HDFS, MapReduce, Apache Spark, Elasticsearch, etc. Such technologies usually provide a high scalability Fang2015; Miloslavskaya2016. Furthermore, most technologies used in data lakes have replication mechanisms, e.g., Elasticsearch, HDFS, etc. Such technologies allow a high resilience to both hardware and software failure and enforce fault tolerance John2017.
Finally, data lakes are often viewed as sandboxes where analysts can “play”, i.e., access and prepare data so as to perform various, specific, on-the-fly analyses Russom2017; Stein2014. However, such a scenario requires expertise. Data lake users are indeed typically data scientists Khine2017; Madera2016, which contrasts with traditional decision systems, where business users are able to operate the system. Thus, a data lake induces a greater need for specific, and therefore more expensive, profiles. Data scientists must indeed master a wide range of knowledge and technologies.
Moreover, with the integration in data lakes of structured, semi-structured and unstructured data, expert data scientists can discover links and correlations between heterogeneous data Ganore2015. Data lakes also allow easily integrating data “as is” from external sources, e.g., the Web or social media. Such external data can then be associated with proprietary data to generate new knowledge through cross-analyses Laskowski2016. However, several statistical and Artificial Intelligence (AI) approaches are not originally designed for parallel operations, nor for streaming data, e.g., K-means or K-Nearest Neighbors. Therefore, it is necessary to readjust classical statistical and AI approaches to match the distributed environments often used in data lakes OLeary2014, which sometimes proves difficult.
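As an illustration of what such a readjustment can look like, the sketch below restructures one K-means iteration into a map step computed per data partition and a reduce step that merges partial statistics. This is one common way to adapt the algorithm, not the method of any cited work, and the partitions are simulated in a single Python process. All function names and data points are hypothetical.

```python
def assign_partial(partition, centroids):
    """Map step: per-centroid coordinate sums and point counts for one partition."""
    dims = len(centroids[0])
    sums = [[0.0] * dims for _ in centroids]
    counts = [0] * len(centroids)
    for point in partition:
        nearest = min(
            range(len(centroids)),
            key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centroids[k])),
        )
        counts[nearest] += 1
        for d, value in enumerate(point):
            sums[nearest][d] += value
    return sums, counts


def merge_partials(partials, centroids):
    """Reduce step: combine partial sums and counts into updated centroids."""
    updated = []
    for k, old in enumerate(centroids):
        total = sum(counts[k] for _, counts in partials)
        if total == 0:
            # No point was assigned to this centroid; keep it where it is.
            updated.append(tuple(old))
            continue
        updated.append(
            tuple(sum(sums[k][d] for sums, _ in partials) / total for d in range(len(old)))
        )
    return updated


def kmeans_step(partitions, centroids):
    """One iteration: each partition is processed independently, then merged."""
    partials = [assign_partial(partition, centroids) for partition in partitions]
    return merge_partials(partials, centroids)


partitions = [
    [(0.0, 0.0), (0.0, 2.0)],      # data held by a first (simulated) node
    [(10.0, 10.0), (10.0, 12.0)],  # data held by a second (simulated) node
]
new_centroids = kmeans_step(partitions, [(1.0, 1.0), (9.0, 9.0)])
```

Only the small per-partition sums and counts need to be exchanged between nodes, never the points themselves, which is what makes this formulation fit a distributed environment.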
In this survey paper, we establish a comprehensive state of the art of the different approaches to design and conceptually build a data lake. First, we state the definitions of the data lake concept and complement the best existing one. Second, we investigate alternative architectures and technologies for data lakes, and propose a new typology of data lake architectures. Third, we review and discuss the metadata management techniques used in data lakes. We notably classify metadata and introduce the features that are necessary to achieve a full metadata system. Fourth, we discuss the pros and cons of data lakes. Fifth, we summarize in a mind map the key concepts introduced in this paper (Figure 10).
Finally, echoing the topics we chose not to address in this paper (Section 1), we would like to open the discussion on important current research issues in the field of data lakes.
Data integration and transformation have long been recurring issues. Though delayed, they are still present in data lakes and made even more challenging by big data volume, variety, velocity and lack of veracity. Moreover, when transforming such data, User-Defined Functions (UDFs) must very often be used (MapReduce tasks, typically). In ETL and ELT processes, UDFs are much more difficult to optimize than classical queries, an issue that is not addressed yet by the literature Stefanowski2017.
With data storage solutions currently going beyond HDFS in data lakes, data interrogation through metadata is still a challenge. Multistores and polystores indeed provide unified solutions for structured and semi-structured data, but do not address unstructured data, which are currently queried separately through index stores. Moreover, when considering data gravity Madera2016, virtual data integration becomes a relevant solution. Yet, mediation approaches are likely to require new, big data-tailored query optimization and caching approaches Quix2018; Stefanowski2017.
Unstructured datasets, although unanimously acknowledged as ubiquitous and sources of crucial information, are seldom specifically addressed in data lake-related literature. Index storage and text mining are usually mentioned, but there is no in-depth reflection on global querying or analysis solutions. Moreover, exploiting unstructured data other than text, e.g., images, sounds and videos, is not even envisaged as of today.
Again, although all actors in the data lake domain stress the importance of data governance to avoid a data lake turning into a data swamp, data quality, security, life cycle management and metadata lineage are viewed as risks rather than issues to address a priori in data lakes Madera2016. Data governance principles are indeed currently seldom turned into actual solutions.
Finally, data security is currently addressed from a technical point of view in data lakes, i.e., through access and privilege control, network isolation, e.g., with Docker tools Cha2018, data encryption and secure search engines Maroto2018. However, beyond these issues and those already addressed by data governance (integrity, consistency, availability) and/or related to the European General Data Protection Regulation (GDPR), by storing and cross-analyzing large volumes of various data, data lakes allow mashups that potentially induce serious breaches of data privacy Joss2016. Such issues are still researched as of today.