Data Quality (DQ) is a broad research area, involving many different aspects, problems and challenges. The growing need to discover and integrate reliable information from heterogeneous data sources, distributed over the Web, Social Networks, Cloud platforms or Data Lakes, makes DQ an unavoidable topic, particularly active in recent years (see e.g., (chu_data_2016; sadiq_data_2018; pena_discovery_2019; bertossi_database_2019)). In addition, DQ has enormous relevance for industry, due to its great impact on information systems in all application domains.
Advances in information and communication technologies have led organizations to manage increasingly large amounts of data. This generates interesting opportunities, both for daily operations and for decision-making and strategic planning. However, these opportunities may be limited by the data quality issues arising in such scenarios (Batini2016eGov). In addition, such applications involve many services, processes and tasks, using varied data and metadata, for many purposes, and involving different users and roles. Thus, DQ perceptions and expectations may differ greatly from one context to another.
Early approaches and surveys defined DQ as fitness for use (wang-strong) and showed the influence of context on DQ (strong_data_1997). For example, for DQ assessment, a single set of DQ metrics is not enough, and task-dependent DQ metrics, developed for specific application contexts, should be included (pipino2002). Enriching the definition of DQ metrics with context management would provide: (i) flexibility, since metrics could adapt to context variations; (ii) generality, since they could cover many particular context-dependent cases; and (iii) richness, since more aspects could be included in the metric.
According to the literature, there is no single definition of context, rather it seems that context can be made up of several elements. For instance, authors of (pipino2002) include the organization’s business rules, company and government regulations, and constraints provided by the database administrator, while others include the users (Dey01), or the characteristics of the decision task and the decision-maker (shankaranarayanan2009).
Despite its recognized importance for DQ management (DQM), context is typically treated as an abstract concept, neither clearly defined nor formalized. To document this gap, we analyzed the state of the art on context definition for DQ management, with special attention to modeling and formalization aspects.
This paper presents a Systematic Literature Review (SLR) investigating how, when and where context is taken into account in recent proposals for DQM. A SLR is a methodology for identifying, evaluating and interpreting all available research literature that is relevant to a particular research question or topic area. It is not a conventional literature review: while the latter provides only a high-level summary of the literature in connected fields, a SLR provides a more complete, rigorous and reproducible framework for studying the literature (Kitchenham04).
Specifically, we identify and analyze works dealing with context for DQM, and we classify them according to several analysis criteria. Besides, we identify contextual aspects of DQ tasks included at the stages of a DQM process. In addition, we describe our findings about context definition and formalization as well as the identification of context components and their possible representations.
The paper is organized as follows: Section 2 provides background on the important notions used throughout this paper. Section 3 presents the planning and execution of the SLR and describes its results. Section 4 analyzes context proposals in terms of level of formalization, components and representation. Section 5 details the proposals for the formalization of context, and Section 6 answers the research questions of the SLR. Finally, Section 7 concludes and presents open research questions.
The bibliography referenced in this work is classified into two groups: (i) the general bibliography, and (ii) the works selected by the SLR.
2. Background
This section presents background concepts about Data Quality, Context and Systematic Literature Review.
2.1. Data Quality
DQ is defined as fitness for use and is widely recognized to be multidimensional (wang-strong). DQ dimensions express the characteristics that data must have, such as its correctness, completeness and consistency. In the literature, there is no agreement on the set of the dimensions characterizing DQ nor on their meanings (scannapieco2002data). This subsection lists the most representative ones, to be included in our review.
In a large industrial survey, Wang and Strong inventoried 175 quality attributes used by data consumers (wang-strong) and organized them into 15 quality dimensions (strong_data_1997), namely: accuracy, objectivity, believability, reputation, accessibility, access security, relevancy, value-added, timeliness, completeness, amount of data, interpretability, ease of understanding, concise representation, consistent representation. Such dimensions were further grouped into 4 DQ categories: Intrinsic DQ, Accessibility DQ, Contextual DQ and Representational DQ. The ISO/IEC 25012 Standard (25012-Stand) also proposes 15 quality dimensions (called characteristics), namely: accuracy, completeness, consistency, credibility, currentness, accessibility, compliance, confidentiality, efficiency, precision, traceability, understandability, availability, portability, recoverability. These dimensions are reused by Merino et al. in the context of big data (PS3). In addition, the term data veracity is widely used in Big Data applications, referring ambiguously to several DQ dimensions.
While DQ dimensions are used to express the characteristics that data must satisfy, DQ metrics provide quantitative means for assessing to what extent these characteristics are satisfied. Pipino et al. (pipino2002) argue that it is difficult to define which aspects of a DQ dimension are related to a specific application in a company; however, once this task is done, proposing DQ metrics associated with these dimensions is easier. Their research addresses DQ assessment based on task-dependent and task-independent DQ metrics. According to the authors, companies must consider both the subjective perceptions of the users and the objective measurements based on the data in use.
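To make the distinction concrete, the sketch below (our own illustration, not taken from (pipino2002)) shows a task-independent completeness metric alongside a task-dependent variant that only counts the fields a given task requires; the record and field names are hypothetical.

```python
def completeness(records, fields):
    """Task-independent metric: fraction of non-null values over all fields."""
    total = len(records) * len(fields)
    filled = sum(1 for r in records for f in fields if r.get(f) is not None)
    return filled / total if total else 1.0

def task_completeness(records, required_fields):
    """Task-dependent variant: only the fields required by the task count."""
    return completeness(records, required_fields)

# Hypothetical data: two patient records with some missing values.
patients = [
    {"name": "Ana", "age": 34, "phone": None},
    {"name": "Luis", "age": None, "phone": "555-1234"},
]

overall = completeness(patients, ["name", "age", "phone"])  # 4/6 filled
billing = task_completeness(patients, ["name", "phone"])    # 3/4 = 0.75
```

The same dataset thus receives different quality scores depending on the task at hand, which is precisely why a single set of DQ metrics is not enough.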
Also, the term DQ attribute is often used in DQ literature, but ambiguously, as it may refer to DQ dimensions, to specific aspects of DQ dimensions, or even to DQ metrics.
Concerning DQ tasks, in particular measurement, evaluation and improvement, carrying them out in a company is not trivial, since each of these tasks generally depends on the complexity of the application domain. Typically, these tasks are organized and carried out following a process that is part of a methodology. A DQ methodology is defined as a set of guidelines and techniques whose starting point is information related to a certain area of interest (PS16). Several DQ methodologies have been proposed in the literature; in particular, (PS16) compares 13 of them, tailored for DQ assessment and improvement.
Finally, in the literature the terms Data Quality and Information Quality are used interchangeably, as for example in (pipino2002). However, Batini and Scannapieco (batini2016Book) use DQ when referring to dimensions, models, and techniques strictly related to structured data, while in all other cases they use information quality. In DQ literature, there is no consensus on whether DQ and information quality differ; the distinction, when made, is established by the researchers of each proposal.
2.2. Context
In an attempt to understand context, Bazire and Brézillon (DBLP:conf/context/BazireB05) analyzed a corpus of 150 definitions in different domains of cognitive sciences and close disciplines, and built a model of context representing the components of a situation and the different relations between those components. As pointed out by Dey (Dey01), context is a poorly used source of information in computing environments, resulting in an impoverished understanding of what context is and how it can be used. Dey provided operational definitions of context and context-aware computing: context is a general term used to capture any information that can be used to characterize the situations of an entity, and a system is context-aware if it uses context to provide relevant information and/or services to the user, where relevancy depends on the user’s task.
Other concepts are used as synonyms of context. For instance, preferences and data tailoring appear as related concepts (see (BolchiniCQST07; SerraM17) for a literature review). Moreover, in some cases context-dependent preferences are proposed (AbbarMokrane; KostasPitoura; ciaccia2011). Regardless of whether they represent the same thing, these concepts are strongly related. Profile is another concept that differs slightly from context; nevertheless, when the two are used jointly, the relationship between them often remains unclear (AbbarMokrane).
In research on ubiquitous computing (Abowd2000), the authors explore context-aware computing. They point out that context is more than position and identity, and consider that although a complete definition of context is elusive, it is necessary to define a minimum set of context components, answering the who, what, where, when and why questions. Beyond definition, a closely related question is how to represent context.
2.3. Context for Data Quality
DQ assessment is a challenging task that requires data context and demands great domain knowledge (mylavarapu2020). The importance and influence of data context on DQ was stated many years ago (wang-strong) and is widely accepted. Early works only state whether a DQ dimension depends on context or not (strong_data_1997). In 2006, Shankaranarayanan and Cai (shankaranarayanan2006) pointed out that contextual factors had not been explicitly examined in the data quality literature, and recognized the contextual or subjective nature of data quality evaluation. In addition, in 2009, Watts et al. (shankaranarayanan2009) argued that contextual assessments can be as important as objective quality indicators, because they can affect the information used for decision-making tasks. More recently, in 2019, Catania et al. suggested that context information can help in interpreting users’ needs in Linked Data, while in 2020, Davoudian and Liu (PS45) claimed that traditional DQ tasks do not take Big Data characteristics into account, and that the context of use is relevant for Big Data quality analysis.
The need to consider context for DQ assessment persists over time and across application domains. Nevertheless, even today the literature lacks a concise and globally accepted definition of context in DQ (PS25). Therefore, the present work investigates whether and how context is used in recent proposals for DQ management.
2.4. Systematic Literature Review
SLR (Kitchenham04) is a methodology aiming at identifying, evaluating and interpreting all available research relevant to a particular research question or topic area. It starts by defining a review protocol that specifies the research questions being addressed and the methods that will be used to perform the review. These research questions determine the criteria for assessing potentially relevant data to answer them. Relevant data is extracted from the scientific works, called primary studies (PS for short), found with the SLR. According to Kitchenham (Kitchenham04), the SLR methodology and the documents it produces (protocol, questions, etc.) ensure the completeness, rigor and reproducibility of the review.
As laid down in (Kitchenham04), a SLR is performed in three stages: (i) planning, (ii) conducting and (iii) reporting. At the first stage, the review objectives and the research questions are defined. Based on the research questions, the next stage consists in deriving search strings, selecting digital libraries and defining inclusion and exclusion criteria. The execution of search strings in search engines of digital libraries results in a set of PS. The last stage reports SLR results.
3. SLR Application
A SLR is executed to obtain an overview of existing research on the use of context in Data Quality. In this section, we describe the planning and conducting of the SLR methodology applied to our research problem, and we present quantitative and qualitative results. More detailed findings, including the description of some proposals, are presented in Section 4.
3.1. Planning the review
At this stage, the objectives of the investigation, and the research questions that arise from these objectives, are defined.
We apply a SLR to gather current evidence on the relationship between the Context and Data Quality areas. In short, we want to identify any gap at the intersection of these areas in order to detect new research lines.
To define the research questions, the most important DQ concepts (DQ model, DQ dimension and DQ metric) are taken into account. In particular, it is important to know if DQ models have been defined considering context and how context is represented in such models. In turn, we search whether context participates in the definition of DQ dimensions and their DQ metrics, and more generally, how context relates to DQ concepts at different stages of DQ management. For all this, the following research questions are raised:
RQ1: How is context used in data quality models?
RQ2: How is context used within quality metrics for the main data quality dimensions?
RQ3: How is context used within data quality concepts?
Note that RQ3 is intentionally more inclusive, in order to capture works dealing with DQ and context but not mentioning specific DQ models, DQ dimensions or DQ metrics.
3.2. Conducting the review
The SLR execution is carried out following a series of steps: definition of search strings, selection of digital libraries, definition of inclusion and exclusion criteria, and selection of primary studies. All of them are described below.
To create the search strings we first extract the keywords of the research questions, namely: context, data quality model, quality metric, data quality dimension, quality concept and data quality. The latter two are the decomposition of a more complex keyword, data quality concept, appearing in RQ3. Then, to perform a thorough search, some keywords are complemented or detailed with alternative terms as follows (each of them is discussed and argued in Section 2):
context is complemented with the alternative terms data tailoring and preference.
data quality dimension is refined by the DQ dimensions. Note that, contrarily to dimensions, metrics are not refined. Indeed, unlike dimensions, metrics are usually not managed with names in DQ literature, since they are defined specifically for each measurement process.
quality concept is refined by the most important DQ concepts: quality dimension, quality metric and quality attribute.
data quality is complemented with the alternative term information quality.
Table 1. Keywords and their alternative terms.

| Keyword | Alternative terms |
|---|---|
| context | preference, “data tailoring” |
| data quality model | |
| data quality dimensions | “data believability”, “data accuracy”, “data objectivity”, “data reputation”, “data value-added”, “data relevancy”, “data timeliness”, “data completeness”, “appropriate amount of data”, “data interpretability”, “data ease”, “data understanding”, “data representational consistency”, “data concise representation”, “data credibility”, “data currentness”, “data veracity”, “data accessibility”, “data compliance”, “data confidentiality”, “data efficiency”, “data precision”, “data traceability”, “data understandability”, “data availability”, “data portability”, “data recoverability” |
| quality concept | “quality dimension”, “quality metric”, “quality attribute” |
Then, for each research question, a search string is built as a conjunction of keywords (AND connector). The alternative terms are joined to the keywords by means of disjunctions (OR connector). The alternative terms of each keyword are shown in Table 1. Since RQ2 is very comprehensive, the associated search string, which should list 27 DQ dimensions, is too long to be supported by the search engines of some digital libraries. To solve this problem, we split it into 7 smaller search strings, each containing a subset of the DQ dimensions. The resulting search strings are:
SS1: (context OR preference OR “data tailoring”) AND (“data quality model”)
SS2: (context OR preference OR “data tailoring”) AND (“quality metric”) AND (“data believability” OR “data accuracy” OR “data objectivity” OR “data reputation”)
SS3: (Context OR preference OR “data tailoring”) AND (“quality metric”) AND (“data value-added” OR “data relevancy” OR “data timeliness” OR “data completeness” OR “appropriate amount of data”)
SS4: (Context OR preference OR “data tailoring”) AND (“quality metric”) AND (“data interpretability” OR “data ease” OR “data understanding” OR “data representational consistency” OR “data concise representation”)
SS5: (Context OR preference OR “data tailoring”) AND (“quality metric”) AND (“data credibility” OR “data currentness” OR “data veracity”)
SS6: (Context OR preference OR “data tailoring”) AND (“quality metric”) AND (“data accessibility” OR “data compliance” OR “data confidentiality” OR “data efficiency”)
SS7: (Context OR preference OR “data tailoring”) AND (“quality metric”) AND (“data precision” OR “data traceability” OR “data understandability”)
SS8: (Context OR preference OR “data tailoring”) AND (“quality metric”) AND (“data availability” OR “data portability” OR “data recoverability”)
SS9: (Context OR preference OR “data tailoring”) AND (“data quality” OR “information quality” ) AND (“quality metric” OR “quality dimension” OR “quality attributes”)
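As a minimal sketch of the construction just described (our own illustration, not tooling used in the review), each keyword and its alternative terms form an OR-group, and the groups are conjoined with AND; SS9, for instance, is assembled as follows.

```python
def or_group(terms):
    # Quote multi-word terms and join a keyword with its alternatives by OR.
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"

def search_string(*groups):
    # Conjoin the OR-groups with AND.
    return " AND ".join(or_group(g) for g in groups)

ss9 = search_string(
    ["context", "preference", "data tailoring"],
    ["data quality", "information quality"],
    ["quality metric", "quality dimension", "quality attributes"],
)
# ss9 reproduces SS9:
# (context OR preference OR "data tailoring") AND ("data quality" OR
# "information quality") AND ("quality metric" OR "quality dimension" OR
# "quality attributes")
```

Splitting SS2–SS8 then amounts to calling `search_string` with successive subsets of the DQ-dimension terms as the last group.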
We selected digital libraries trying to cover the most important venues in the database domain, while having the least possible overlap among them, thus reducing the number of duplicated PS. In addition, the selected digital libraries should allow searching in both titles and abstracts. We initially avoided the search engines of DBLP and Google Scholar: most of the PS they return are already returned by the other digital libraries, Google Scholar indexes many other types of documents (e.g. technical reports or slides), producing noise within search results, and DBLP does not support search inside abstracts, losing many relevant PS. However, we are aware that some important venues have online proceedings, in particular VLDB, EDBT and ICDT. Therefore, we complete the SLR with a specific search in Google Scholar, restricted to VLDB, EDBT and ICDT.
Inclusion and Exclusion Criteria:
To restrict the PS returned by each digital library, we apply a set of inclusion and exclusion criteria. The former are automatically applied by the digital libraries, while the latter are applied later, when reading abstracts or full texts.
The considered inclusion criteria are: (i) PS are published in the period 2010-2020 inclusive. This time interval is large enough to ensure getting recent PS. (ii) PS are written in English and (iii) are published in PDF format. (iv) The set of PS only includes articles and book chapters.
Using the exclusion criteria, we discarded PS if: (i) the full text is written in another language (even if the abstract is in English), (ii) they are published in non-peer-reviewed venues, (iii) data quality is not addressed or is addressed only superficially, (iv) context is not addressed. These criteria allow discarding PS that address other subjects even though they contain the right keywords.
Selection of Primary Studies:
The selection process has 5 steps. They are listed below, highlighting the number of PS output by each step, which are summarized in Table 2.
Step 1: Execution of search strings. 2898 PS were returned by the 5 digital libraries in response to search strings and inclusion criteria.
Step 2: Duplicate elimination. Some PS were returned by several search strings within the same digital library, and many PS from Google Scholar were also returned by other libraries. Duplicate elimination resulted in 1955 PS.
Step 3: Selection by relevance. 279 PS were selected by relevance, reading the title and abstract of each PS.
Step 4: Selection by full text. A final selection, by a careful reading of full text and application of exclusion criteria, resulted in 54 PS.
Step 5: Selection by references. Finally, the references of the 54 selected articles were reviewed. Applying the inclusion and exclusion criteria, 4 additional articles were selected, for a total of 58 PS.
| Digital Library | Step 1 | Step 2 | Step 3 | Step 4 | Step 5 |
|---|---|---|---|---|---|
| Google Scholar (VLDB, EDBT, ICDT) | 67 | 1 | 1 | | |
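The selection funnel can be summarized numerically; the counts below are taken from Steps 1–5 above, and the retention rate is a derived figure of our own.

```python
# (selection step, PS remaining after the step) — counts from Steps 1-5.
funnel = [
    ("Step 1: search strings + inclusion criteria", 2898),
    ("Step 2: duplicate elimination", 1955),
    ("Step 3: relevance (title/abstract)", 279),
    ("Step 4: full text + exclusion criteria", 54),
    ("Step 5: references of selected articles", 58),
]

for label, count in funnel:
    print(f"{label:45s} {count:5d}")

# From initial hits to the full-text selection, retention is roughly 1.9%,
# which illustrates how aggressively the manual steps filter the results.
retention_pct = round(54 / 2898 * 100, 1)
```

Note that Step 5 increases the count, since reviewing references adds 4 PS rather than filtering.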
According to (Kitchenham04), this methodology is reproducible. However, there are exceptions, as the methodology depends on external factors. In particular, we experienced problems with the ACM library, which changed its search engine in the middle of the SLR process. Indeed, our first search returned very few PS, while a second search some months later returned hundreds of PS. This evidences that the reproducibility of the methodology, as described in (Kitchenham04), is impacted by changes in the digital libraries. The complete selection process is shown in Figure 1.
3.3. Review results
In this section we present and quantify the main results of the review by analyzing the selected PS according to different criteria. From now on, the term PS is used to refer to the selected PS.
Firstly, complementing Table 2, Fig. 2 shows the distribution of selected PS by (a) digital library, (b) search string, (c) year of publication and (d) venue quality. We remark that more than half of the selected PS were retrieved with SS9, while none with SS4, and that most of the selected PS come from Springer and ACM, while none come from IEEE. Interestingly, the number of published papers dealing with the use of context for DQ increased from 2016 onward, except for 2017. Selected PS are classified according to the quality of their venues, in accordance with the rankings and metrics of Scopus (https://www.scopus.com/sources, accessed December 2020) for journals and Core (http://portal.core.edu.au/conf-ranks/, accessed December 2020) for conferences. We use the quality labels (A*, A, B, C) proposed by Core, which correspond to Scopus relative rankings of [94%-100%], [84%-94%), [54%-84%) and [13%-43%), respectively. NR refers to not-ranked PS and CH to book chapters. We remark that only 13 PS (22%) are ranked A* or A, and interestingly, 11 of them were published in or after 2016. In addition, 23 PS are ranked B (including 11 from the Int. J. on Data and Information Quality) and 14 PS are not ranked. Concerning venue type, 31 PS come from conference proceedings, 21 from journals and 6 from book chapters.
The selected PS are classified according to the following analysis axes:
Type of work. We classify the PS according to the type of proposal. We find several approaches: framework, model, methodology, analysis, case study, metrics and architecture. Many PS follow more than one approach; this is particularly the case of PS that propose a framework, which generally includes a model and/or a methodology, among others. We focus on the proposals that contribute the most to our research.
Research domain. We consider the research domain addressed in the PS. Firstly, we highlight PS that only address data or information quality, which we label only DQ. These PS address, implicitly or explicitly, contextual aspects of DQ. On the other hand, other PS include contextual DQ in their research, but their main research domain is one of the following: Big Data, Decision Making, Linked Data, Internet of Things, Data Mining, e-Government, Open Data, Machine Learning or Data Integration. These domains either make use of DQ or are exploited to improve DQ.
DQ process stages. We analyze which stages of the DQ process are addressed in the PS. We also identify the most important DQ process stages, according to the bibliography.
DQ dimensions. Some authors affirm that there are contextual DQ dimensions. We classify the PS according to this claim, identifying how many PS address the contextual DQ dimensions approach.
DQ metrics. This classification is similar to the previous one: we identify the PS that define, use or analyze contextual DQ metrics.
Data model. We also classify the selected PS according to the data model they use. In this way, we find PS with structured, semi-structured, unstructured and mixed data models. In some PS the data are not specified, or the PS is general, i.e., not restricted to a data type or model. In both cases, PS are classified as N/A.
Case study. The PS are also classified by the type of data used in the case study presented to evaluate the proposal. Case studies can use real or artificial data. In some PS a case study is not specified (N/A).
The next paragraphs present the classification results for each analysis axis. Firstly, each paper is introduced according to its type of work. Then, the salient facts are commented on for the other analysis axes.
Type of work
In this classification we present the different types of works found in the selected PS. When PS have several contributions, we focus on the main proposals in order to indicate the leading types of work.
The most frequent type of work, as evidenced in Fig. 3(a), is modeling, since several PS coincide in proposing a DQ model (PS2; PS3; PS13; PS18; PS20; PS27; PS28; PS33; PS34; PS59). In particular, (PS59) presents domain-specific characteristics of DQ, while (PS33) extends this work by proposing DQ models. In addition, in (PS15), the authors discuss models proposed for utility, finding that DQ dimensions and DQ metrics are deeply influenced by utility; this proposal considers the usage of data and the relevance of the processes that adopt it in the measurement. In turn, (PS32) presents a model to identify opportunities for increasing monetary and non-monetary benefits through improved DQ. We also identify other kinds of models, all for DQ assessment and motivated by the idea that DQ is context-dependent. For instance, the authors of (PS52) present a decision model to facilitate the description of business rules, and in (PS11; PS41; PS57) a model of context is developed. As well, some PS address the impact of poor DQ and propose improvement models: specifically, (PS37) presents a machine learning model and (PS46) a neural networks model.
On the other hand, (PS7) presents a model of the asset management data infrastructure and (PS12) a Bayesian Network model that shows how DQ can be improved while satisfying budget constraints. In the case of (PS19), a formal multidimensional model is presented, on which a rule-based approach to DQ assessment is applied, and (PS24) provides a cost and value model for issues related to DQ. Finally, (PS39) models context for source selection in Linked Data, and (PS44) proposes a conceptual model representing the relationships among quality characteristics of e-government websites and users’ perceptions.
Another category concerns frameworks for assessing data or information quality (PS8; PS13; PS14; PS19; PS23; PS24; PS25; PS31; PS34; PS39; PS47; PS49; PS55; PS56; PS58). For instance, in (PS39) the authors use a framework called Luzzu (Luzzu) for linked DQ assessment. Additionally, (PS19) provides a framework for context-aware Data Warehouse quality assessment. Other proposals are task-specific, such as (PS12), which presents a framework for discovering dependencies between DQ dimensions, or (PS54), which proposes a framework that encapsulates data cleaning procedures. Also, in (PS55) we find an ontology-based quality framework for data integration. On the other hand, in some frameworks DQ is not the goal, but a relevant element of the proposal. For example, in (PS44) a framework of citizens’ adoption of e-government services is presented. Furthermore, an alternative angle is presented in (PS24), whose authors develop a framework, in the form of a guideline, to manage DQ costs. They define DQ cost as the financial impacts caused by DQ problems and the resources required to achieve a specified level of DQ. Finally, the framework in (PS31) describes the implementation of a system that acquires tweets in real time and applies formally defined DQ metrics to measure the quality of such tweets.
Among the works classified as analysis, we find reviews and surveys. We consider them together because both share the same goal: analyzing the state of the art. For instance, (PS1) identifies the core topics and themes that define the identity of DQ research, whereas (PS26) analyzes the main concepts and activities of DQ. In the case of (PS6), the authors investigate DQ in the context of the Internet of Things. In (PS9), besides presenting a DQ methodology, the authors review DQ dimensions and other methodologies to assess DQ. In addition, some PS relate DQ to a particular domain, for example, Big Data (PS17; PS45), Data Warehouses (PS25) and Open Data (PS30). Moreover, (PS48) studies the mapping between metadata and DQ issues, in particular the connection between metadata (such as count of rows, count of nulls, count of values and count of value patterns) and data errors. Furthermore, (PS21; PS53) apply a systematic literature review: the former to find works that consider DQ requirements during the process of developing an information system, the latter presenting our preliminary review results on DQ and context. Finally, in (PS16) the authors compare methodologies proposed in the literature for information quality assessment and improvement.
On the other hand, we find PS focused on methodologies. Most of these works focus on methodologies for data or information quality, specifically considering evaluation and improvement tasks (PS4; PS9; PS10; PS13; PS16; PS23; PS35; PS36; PS38; PS51). As seen before, (PS13) describes a DQ framework, but it also includes a DQ methodology. (PS23) employs a particular methodology, called six-sigma (linderman2003six), to define factors critical to quality, measure the current quality level, analyze deficiencies in information and identify the root causes of poor information. With the same purpose, (PS4) considers a methodology to build a DQ adapter module, which selects the best configuration for DQ assessment. In turn, a data-focused methodology, based on DQ actions, is used in (PS36) to get smart data and obtain more value from the datasets. Additionally, with another focus, a methodology for selecting sources in the Linked Data domain is developed in (PS50); in this process, context-dependent DQ is taken into account, according to different DQ dimensions.
We also find works that present case studies focusing on the analysis of a particular problem. Firstly, (PS7) evaluates how data governance supports data-driven decision-making in asset management organizations, and also investigates the impact of data governance on DQ. In turn, the goal of (PS29; PS30) is to examine the quality of Open Data in the public sector, while (PS31) dives into the problem of re-defining traditional DQ dimensions and DQ metrics in the Big Data domain.
In the metrics category, (PS22) investigates how metrics for the DQ dimensions completeness, validity and currency can be aggregated to derive an indicator for the accuracy dimension. In (PS42), a set of requirements for DQ metrics is presented. Lastly, (PS43) proposes a visual analytics approach that enables data analysts to utilize and customize quality metrics, in order to facilitate DQ assessment of their specific datasets.
Regarding works that focus on architectures, we have two proposals. (PS5) presents a system architecture that includes mechanisms for DQ assessment and security, while (PS35) provides a DQ assessment architecture that manages streaming data in a context-aware manner at different levels of granularity.
Fig. 3(b) shows the distribution of the main research domains of the selected PS. Most of them concern only data/information quality (only DQ), and their proposals are not domain-specific. However, many PS combine DQ with other research domains. For example, in the Big Data domain, the management of large volumes of data presents a clear need to incorporate DQ tasks (PS1; PS17; PS35) and to study context within those DQ tasks (PS3; PS4; PS18; PS28; PS31; PS45).
DQ process stages
We also analyze the PS according to the stages of the DQ process addressed in each proposal. As mentioned in Section 2, we consider a DQ process with 7 stages: ST1-Characterize Scenario, ST2-Analyze Objective Data, ST3-Define Strategy, ST4-Define Data Quality Model, ST5-Measure and Evaluate Data Quality, ST6-Determine Causes of Quality Problems, and finally, ST7-Define, Execute and Evaluate Action Plan. For this classification, we observe that some PS address more than one stage. Fig. 4 shows these results. Measurement and evaluation of DQ, and the definition of a DQ model, are the DQ process stages most addressed by the PS. The remaining stages receive similar levels of attention; in particular, only 1 PS does not focus on any of these stages.
Several authors affirm that there is a set of DQ dimensions that are contextual, i.e., context-dependent DQ dimensions. However, there is no agreement on what ensures the contextual character of these DQ dimensions. Most authors rely on the bibliography to assert that certain DQ dimensions are contextual (PS1; PS2; PS6; PS10; PS13; PS20; PS26; PS31; PS45; PS46; PS47; PS49); for example, they draw on reference works of the DQ domain, such as (wang-strong; Ge2007ARO). Furthermore, different aspects of the user (data profile, location, requirements, etc.) are also taken into account for contextualizing DQ dimensions (PS4; PS8; PS9; PS23; PS27; PS32; PS34; PS35; PS55; PS56; PS58). In turn, data in use or information are also used to give context to DQ dimensions (PS12; PS14; PS15; PS19; PS38; PS39), and according to some PS, DQ dimensions can be influenced by rules and constraints of the application domain (PS16; PS25; PS43; PS51). On the other hand, some authors affirm that DQ requirements condition DQ dimensions (PS5; PS42; PS52). We can also find PS where the contextual aspects of DQ dimensions are based on the ISO/IEC 25012 standard (https://iso25000.com/index.php/en/iso-25000-standards/iso-25012) (PS3; PS17; PS21). To a lesser extent, authors claim that the contextual aspects of DQ dimensions depend on metadata (PS22; PS48) and on the task at hand (PS18). From another perspective, as we saw before in (PS15), Batini and Scannapieco analyze models for utility-driven quality assessment; they underline that utility can help define better measurements for DQ dimensions (e.g., for completeness and accuracy) that reflect DQ assessment in context. Fig. 5 shows the total number of PS that consider contextual DQ dimensions, while Fig. 6 shows the number of PS for each of the approaches. Finally, it is not possible to identify a single set of contextual DQ dimensions, since they vary among the different proposals.
However, among the PS the DQ dimensions most commonly considered contextual are completeness, accuracy, consistency, relevance and timeliness.
In this classification, we consider PS that define, use or analyze contextual DQ metrics. Each contextual DQ metric takes into account different contextual aspects, since it strongly depends on the proposal. This means that DQ metrics are influenced, among other factors, by the type of data being measured and how they are measured. According to Heinrich et al. (Heinrich09), DQ metrics need to be adaptable to the context of a particular application. In Fig. 5, we also show the number of PS that consider contextual DQ metrics.
According to Fig. 3(c), most of the PS use real data in their case studies. At the same time, some PS are marked as not applicable (N/A) in this classification, mainly because they correspond to reviews and surveys.
Fig. 3(d) shows the distribution of the data models used in the selected PS. As we can see, most of the works are based on structured data, especially using the relational model. A high number of PS also consider unstructured data, such as CSV files or sensor data. On the other hand, a considerable number of PS are general, i.e., they are not restricted to a data type or model; these are classified as N/A.
In the next section, we will present the proposed contexts. In particular, we will focus on presenting new classifications for the selected PS. This analysis will be based on the characteristics of the proposed contexts, such as the level of formalization of the context definition, the context components and the different context representations.
4. Analysis of the proposed contexts
Many works affirm that DQ depends on the context, but what is the context? There are many proposals. For example, in (PS35; PS23) the authors claim that DQ assessment depends on the user, i.e., the user gives context to DQ assessment. In both articles, the same need arises. However, in (PS35) the elements that make up the context are explicitly mentioned, while in (PS23) they are never named. This means that, since in many cases the context is not defined, it must be deduced, which implies determining the components that make up the context.
On the other hand, the representation of the context is linked to its components, because representing the context means representing each of its components. To this end, we identify different types of representations. For example, in (PS20) data filtering needs and DQ requirements are the components of the context; in turn, these components are represented through semantic and syntactic rules. Therefore, in this section we address the level of formalization of the context definition, the various components composing the context, and the representation of those components.
4.1. Level of formalization of context definition
Although all PS deal with the usage of context for data quality, many of them just mention its importance without providing a proper definition of context. In addition, when a definition is given, it is usually informal or even vague.
Therefore, as a first step toward understanding the underlying notion of context, we classify the PS according to the existence of a proper definition of context and its level of formalization, in three categories: (i) formal definition, when the context is defined formally; (ii) informal definition, when the context is defined, but not formally, for example, in natural language; (iii) no definition, when the context is not defined (not even in natural language), but is used implicitly. The latter occurs, for example, when the authors stress the importance of data context, but do not define what the context is. Quantitatively, 50% of the selected PS provide no context definition, while only 10% of the works present a formal context definition, as illustrated in Fig. 7.
In the next paragraphs, we report the level of formalization of the context definition by research domain, highlighting the domains with more formal proposals.
Level of formalization of the context by research domain
Fig. 8 shows the number of PS at each level of formalization, for each research domain. A first remark is that the level of maturity of the context definition is disparate, which allows a preliminary classification of research domains.
Firstly, the Only DQ domain, the one with the most PS (as discussed in Subsection 3.3), is the only one presenting all three levels of formalization. Moreover, most PS proposing formal definitions (5 PS out of 6) belong to this domain, which is overrepresented w.r.t. the overall distribution shown in Fig. 7. These results make sense, as many theoretical proposals for DQ modeling are cross-domain.
On the other hand, the authors of (PS1), a survey from 2017 focused on the evolution of DQ, highlight that organizations view Big Data, social media data, data-driven decision-making, and analytics as critical; according to them, DQ has therefore never been more important. Furthermore, they add that DQ research is reaching the threshold of significant growth, moving from a focus on measuring and assessing DQ toward a focus on usage and context. This is reflected in several works (11 PS) that address DQ in Big Data. These PS point out the importance of DQ; however, none of them defines the context formally, since the proposals give informal definitions or no proper definition at all. With the same needs, but to a lesser extent, we find similar results regarding the type of formalization for the Decision Making (4 PS) and Internet of Things (2 PS) domains.
The Linked Data (3 PS) and Data Mining (2 PS) research domains are a special case: although these domains present very few PS, in all cases a context definition is given. In particular, 1 PS in the Linked Data domain presents a formal context definition. In this kind of domain it is essential to find the best data sources, and data context plays a very important role when selecting them (PS39), since it might help interpret the user needs. In turn, the Data Mining domain is applied to exploit DQ; according to (PS12), inference rules are context-free, while coherent reasoning under uncertainty is sensitive to the data context.
Finally, we identify the research domains that only address context implicitly, providing no proper definition: e-Government (2 PS), Open Data (2 PS), Machine Learning (1 PS) and Data Integration (1 PS). Although there are not many PS with these characteristics, we observe that in these domains, when DQ is addressed, the authors emphasize the need to identify the context of the data; however, any characterization of context they give remains in natural language.
We highlight that only 6 PS present a formal context definition, and they are proposed in the Only DQ and Linked Data domains. Additionally, out of a total of 10 research domains, 6 present some kind of definition for the context when managing DQ. These results not only show the important role of context in assessing DQ in many domains, but also the lack of formalization in the bibliography. Moreover, half of the PS present no context definition at all, which magnifies this lack.
Regardless of the level of formalization of the context definition, most authors point out that the environment that influences data is defined by several components. These components are determined by how data are used, who uses data, when and where data are used, among others. They are addressed in the next subsection.
4.2. Context Components
The state of the art reveals that although there are general operational definitions of context and context-aware computing (Dey01), context representation is neglected in data quality management (DQM). In particular, we highlight the conclusions of Bertossi et al. (PS11), who report that the literature only deals with obvious contextual aspects of data, like geographic and temporal dimensions. As suggested by Bolchini et al. (BolchiniCOQRST09), other contextual aspects should be specified, e.g. users (person, application, device, etc.), presentations (system capabilities to adapt content presentation to different channels/devices), communities (sets of relevant variables shared by a group of peers), and data tailoring (selecting only relevant data, functionalities and services). Preferences, document content, DQ requirements and domain rules also emerge as important components (a preliminary review can be found in (SerraM17)).
We consider that context does not fit a single perspective, but could be defined by elements taken from different perspectives (user, application domain, data tailoring, etc.). Therefore, we reviewed the selected PS: only 9% do not propose any component for the context. In the remaining 91% of PS, we identify or deduce (when the authors do not define the context, but suggest that DQ depends on certain elements) the components of the context suggested in each proposal. Indeed, we started by eliciting the proposed components and grouped those expressing similar concepts, arriving at the ten categories of components listed below. For each one, we highlight some of the PS that propose it:
DQ requirements
Many PS suggest that DQ requirements must be considered for efficient DQ management. For instance, according to (PS55), a DQ framework for a data integration environment needs to be capable of representing the variety of user quality requirements (e.g. the level of precision or the rate of syntactic errors) and of providing mechanisms to detect and resolve possible inconsistencies between them.
Data filtering needs
According to (PS26), these are requirements and expectations on data that are stated, generally implied or obligatory. They typically express concrete data needs for a specific task, for example, filtering data about patients with a certain health profile.
System requirements
The authors of (PS45), focused on Big Data systems, point out that requirements on data in this kind of systems involve several axes: data capability (network and storage requirements, e.g. the system needs to support PostgreSQL and MongoDB), data source (different characteristics of data sources, e.g. the system must collect data from sensors), data transformation (data processing and analysis, e.g. the system must support batch processing), data consumer (visualization, e.g. the system must present processed results as text) and data lifecycle (data lifecycle management functionality, e.g. the system must support DQ).
Business rules
In (PS3), business rules are simply constraints defined over data, while in (PS26) they are policies that govern business actions and result in constraints on data relationships and values. They typically express conditions that data must satisfy in order to ensure the consistency of the dataset. In contrast to data filtering needs, business rules are, in general, independent of the task at hand.
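As an illustration of a business rule acting as a constraint over data (in the spirit of PS3), the following sketch checks a hypothetical rule against a small dataset. The rule, the record layout and all names are invented for illustration, not taken from any PS:

```python
from datetime import date

# Hypothetical business rule: an order's ship_date must not precede its
# order_date. The rule constrains data values, independently of any task.
def violates_ship_rule(record):
    """Return True when the record is inconsistent w.r.t. the rule."""
    return record["ship_date"] < record["order_date"]

orders = [
    {"id": 1, "order_date": date(2023, 1, 5), "ship_date": date(2023, 1, 7)},
    {"id": 2, "order_date": date(2023, 1, 9), "ship_date": date(2023, 1, 2)},
]
# Records violating the rule signal consistency problems in the dataset.
violations = [o["id"] for o in orders if violates_ship_rule(o)]
```

Flagging (rather than silently discarding) violating records keeps the decision on how to repair them with the DQ process.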
Application domain
The authors of (PS43) mention the importance of developing and tailoring quality checks to extend the effectiveness of DQ metrics in detecting “dirty data”, contextualizing domain characteristics. Many PS indicate that data are conditioned by the application domain. For instance, many works addressing DQ assessment consider particular aspects of their application domains in the definition of DQ metrics. In particular, in (PS42) the authors highlight that it is possible to define general metrics, but these are often not sufficient to determine DQ problems specific to a given domain. For this reason, they suggest defining domain-specific DQ metrics.
Task at hand
The task performed by the user plays an important role when defining the context. The proposal in (PS18) indicates that DQ management for Big Data should prioritize those DQ dimensions really addressing the DQ requirements for the task at hand. In particular, Wang and Strong in (wang-strong) underline that DQ must be considered within the context of the task at hand.
User characteristics
In most PS, DQ depends on the user, and several characteristics of the user provide the context. Among them, the user profile comprises general aspects of the user, such as geographical location, language, etc. User preferences are also related to what the user likes. The proposal in (PS44) shows the relationships between perceived information quality (among others) and the perception, satisfaction, trust and demographic characteristics of the users (such as identification, gender, age, education, internet experience, etc.) in an e-government environment.
Metadata
In this category, the PS consider metadata to determine data context. In (PS48), the connection between metadata and DQ problems is investigated: metadata such as count of rows, count of nulls, count of values, and count of value patterns are used to generate DQ rules. The authors of this work also mention that DQ management requires categorizing metadata in order to improve DQ.
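A minimal sketch of how such profiling metadata could be collected; the tabular representation (rows as dicts) and the value-pattern encoding are our assumptions, not the method of PS48:

```python
import re
from collections import Counter

def profile_column(rows, column):
    """Collect the profiling metadata mentioned above for one column:
    count of rows, count of nulls, count of distinct values, and
    count of value patterns."""
    values = [r.get(column) for r in rows]
    non_null = [v for v in values if v not in (None, "")]
    # Abstract each value into a pattern: digits become '9', letters 'A',
    # so "11300" and "12345" share the pattern "99999".
    patterns = Counter(
        re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", str(v))) for v in non_null
    )
    return {
        "count_rows": len(values),
        "count_nulls": len(values) - len(non_null),
        "count_values": len(set(non_null)),
        "count_value_patterns": len(patterns),
    }

# A column whose values should all follow one pattern; a second pattern
# (or a high null count) could trigger a DQ rule.
rows = [{"zip": "11300"}, {"zip": "11A00"}, {"zip": None}]
meta = profile_column(rows, "zip")
```

A DQ rule generated from this metadata might, for example, flag the column whenever `count_value_patterns > 1`.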
DQ values
This category suggests using DQ values to give context to other DQ measures. Although these are also metadata, we consider it important to have a separate category for them, since they are a special type of metadata. For instance, the authors of (PS22) investigate how metrics for the DQ dimensions completeness, validity, and currency can be aggregated to derive an indicator for accuracy. They highlight that although it is well known that there are interdependencies between particular DQ dimensions, their measurements are usually discussed independently from each other.
Other data
In this case, the quality of a dataset is evaluated based on other data, distinct from the contextualized data. For instance, in a relational database, data from one table could give context to other tables. Examples of this case can be seen in (PS11; PS19).
In Figure 9, we present the number of PS that propose each of the context components; some PS propose several of them. According to the results, data are mostly influenced by DQ requirements, data filtering needs, the application domain and other data (i.e., data other than those being contextualized). To a lesser extent, data are also influenced by business rules, which makes sense since business rules are strongly tied to the application domain.
Additionally, there are 5 PS that highlight the importance of considering the context for managing DQ, but do not mention which components make up such context. In Fig. 9 these PS are classified as “none”.
So far we have seen that there is not a single context, since it depends on the components that make it up. The components of the context vary according to the elements that have the greatest influence on the data. For instance, sometimes the context is only conditioned by the characteristics of the user, since the data depend on the user's geographical location, age, expertise, etc. On other occasions, what matters is the application domain of the data, regardless of the user who uses such data. Therefore, to identify the components of the data context, it is first necessary to identify the elements that condition the use of the data.
On the other hand, these elements could vary throughout the entire data life cycle, in particular through the different stages of the DQM process. For this reason, below we identify at which stages of the DQ process each of the proposed context components appears.
Context components by DQ process stages
We now investigate which context components are considered at each DQM process stage. In Table 3, we classify each PS according to the context components proposed. Besides, we quantify the results and summarize them in Figure 10, which shows the context components that participate at each of the stages.
Based on this classification, it appears to make sense that some context components are more important than others at certain stages of the DQM process. In fact, there are components that are not taken into account at some stages. For instance, DQ requirements, data filtering needs, application domain, metadata, and other data are suggested as context components at all stages of the DQM process. However, this does not happen for all the context components analyzed. Next, we will analyze the suggested components at each of the stages of the DQM process.
ST1 - Scenario characterization
At this stage, the task at hand and DQ values are not taken into account as context components in any of the PS. For the former, this is an unexpected result, since the task at hand is part of the work scenario. For the latter, the result makes sense, because at this stage DQ values are not yet known. On the other hand, only 1 PS takes the application domain into account for the characterization of the work scenario; as the domain defines the work scenario, it could be a natural context component of this stage.
ST2 - Objective data analysis
User characteristics are the only component not considered as a context component at this stage. They do not play an important role here, because the data profile is analyzed as objectively as possible.
ST3 - Strategy definition
At this stage, where the DQ management strategy is defined and DQ requirements are prioritized, neither business rules, user characteristics nor DQ values are relevant, according to the analyzed PS. However, intuitively we could expect the definition of the DQ management strategy to be strongly based on business rules and user characteristics. Besides, it seems to make sense that DQ values do not participate in the DQ management strategy; nevertheless, the first estimation of DQ values obtained at stage ST2, from the data profile, could be useful for the prioritization of DQ requirements.
ST4 - DQ model definition
At this stage, all the PS agree that all the proposed context components influence the DQ model, where DQ dimensions and DQ metrics are defined. Based on this, this stage appears to be the most context-dependent. This is in line with the concept of DQ as data that are fit for use by data consumers (wang-strong). Moreover, for researchers in the area, DQ is a multi-faceted concept, represented by DQ dimensions that address different aspects of data (PS25), and according to Wang and Strong in (wang-strong), some of these aspects are context-dependent.
ST5 - DQ measurement and evaluation
Something similar happens at this stage: except for the task at hand, all other suggested components also influence it. For this reason, it is also one of the most context-dependent stages. Perhaps this result is associated with the fact that DQ community researchers have paid more attention to the stages of DQ model definition (ST4) and DQ measurement (ST5) than to the other stages.
ST6 - Causes of DQ problems determination
The selected PS suggest that business rules, the task at hand, user characteristics and DQ values have no influence at this stage, where the causes of DQ problems are determined. This is surprising, since the analysis of DQ problems should be conditioned by all the components that make up the context, because DQ problems can arise from several factors. In fact, DQ values should be indicators of the importance of DQ problems.
ST7 - Definition, execution and evaluation of the plan
Finally, when the action plan for DQ is defined, executed and evaluated, the task at hand, user characteristics and DQ values are not considered to form the context. It is also striking that no PS takes DQ values into account when defining an action plan, since these indicators could condition the actions to be taken on the data, processes, users, etc. of the work scenario.
DQ requirements participate at all stages of the DQM process. Besides, according to Table 3, they are proposed as the most relevant context component at each stage. This is a natural result, since in DQ management, DQ requirements are the starting point for measurement and evaluation. In turn, it is important not to ignore the fact that DQ requirements may change during data usage (PS26), and this can be reflected in the DQM process. Regarding data filtering needs, although they are suggested at all stages of a DQM process, they appear most prominently at stages ST4 and ST5. However, they also play a very important role at stages ST1, ST3 and ST7, since the scenario characterization, the prioritization of DQ requirements and the DQ improvement plan strongly depend on data filtering needs.
In other matters, business rules, user characteristics, application domain and DQ values are especially suggested at stages ST4 and ST5. The first three context components are strongly related, so it is not surprising that they appear together to form the context, especially at the stage of DQ model definition, where DQ dimensions and DQ metrics, which can be strongly subjective (PS15), are defined. On the other hand, the task at hand appears as a context component especially at stage ST4. It is probably one of the first context components identified, since Wang and Strong (wang-strong) emphasized in 1996 that DQ must be considered within the context of the task at hand. Finally, metadata and other data condition stages ST2, ST4 and ST5; at stage ST2, in the objective data analysis, metadata are naturally considered.
So far we have analyzed the level of formalization of the context and the components that compose it. We also saw that context can change throughout the entire DQM process, which could mean that the importance of each context component varies at the different stages of the DQM process. On the other hand, we have not yet said anything about how each of the suggested components is represented. Therefore, below we focus on analyzing the different ways of representing the identified context components.
Table 3. Context components (rows) proposed at each stage of the DQ process, ST1–ST7 (columns); rows spanning several lines list additional PS for the same component.
|Context component||ST1||ST2||ST3||ST4||ST5||ST6||ST7|
|none||(PS1; PS53)||(PS1; PS53)||(PS1; PS53)||(PS1; PS53)||(PS1; PS37; PS53)||(PS1; PS29; PS30; PS53)||(PS1; PS53)|
|DQ requirements||(PS7; PS21; PS26; PS45)||(PS27; PS28; PS45; PS46)||(PS13; PS21; PS23; PS45)||(PS3; PS5; PS6; PS13)||(PS3; PS4; PS5; PS7)||(PS6; PS11; PS24; PS26)||(PS13; PS16; PS20; PS23)|
|(PS10; PS50; PS54; PS55)||(PS54; PS55)||(PS10; PS50; PS54; PS55)||(PS17; PS18; PS20; PS23)||(PS11; PS13; PS16; PS20)||(PS34; PS56)||(PS10; PS34; PS45; PS50)|
|(PS28; PS34; PS35; PS45)||(PS23; PS27; PS28; PS32)|
|(PS10; PS55; PS56; PS58)||(PS33; PS34; PS35; PS39)|
|(PS41; PS42; PS56; PS57)|
|(PS10; PS58; PS59)|
|data filtering needs||(PS10; PS15; PS26; PS36)||(PS9; PS36; PS43)||(PS10; PS13; PS50)||(PS9; PS10; PS13; PS14)||(PS10; PS13; PS34; PS36)||(PS24; PS26; PS34; PS43)||(PS10; PS13; PS34; PS36)|
|(PS50)||(PS34; PS43)||(PS39; PS43)||(PS50)|
|system requirements||(PS45)||(PS45; PS49)||(PS45; PS49)||(PS45)||(PS45)|
|business rules||(PS36; PS52)||(PS36; PS51)||(PS3; PS14; PS51; PS52)||(PS3; PS25; PS36; PS41)||(PS12; PS36; PS51)|
|application domain||(PS54)||(PS9; PS43; PS54)||(PS54)||(PS5; PS6; PS9; PS34)||(PS4; PS5; PS25; PS34)||(PS6; PS34; PS43)||(PS34)|
|(PS38; PS43)||(PS38; PS42; PS43)|
|task at hand||(PS46; PS49)||(PS49)||(PS14; PS17; PS18; PS47)|
|user characteristics||(PS44)||(PS38; PS47)||(PS19; PS25; PS38)|
|metadata||(PS54)||(PS9; PS28; PS48; PS54)||(PS48; PS54)||(PS8; PS9; PS28; PS34)||(PS28; PS34; PS35)||(PS34)||(PS34)|
|DQ value||(PS2)||(PS22; PS31)||(PS22; PS31)|
|other data||(PS36)||(PS2; PS28; PS36)||(PS23)||(PS23; PS28; PS35)||(PS4; PS11; PS19; PS23)||(PS11)||(PS23; PS36)|
|(PS28; PS35; PS36; PS41)|
4.3. Context Components Representation
In this section, we describe the different forms proposed to represent each component of the context, beyond the level of formalization used to define it. Of the PS that propose components to form the context, only 55% suggest or present a representation of these components. In Fig. 11 we show the distribution of the selected PS according to the different representations proposed, while all the suggested representations for each context component are shown in Table 4. Next, we present the context component representations and the PS that propose them:
Rules in natural language
These rules are proposed to represent DQ requirements (PS10; PS20; PS23; PS27; PS42; PS45), data filtering needs (PS10; PS36), business rules (PS36; PS52), system requirements (PS45) and other data (PS23; PS36). Although rules in natural language are the most used by the PS, they are especially used to represent DQ requirements. For instance, in (PS23) DQ requirements of customers, expressed as natural language rules, are linked to DQ dimensions, while in (PS42) they are used to develop DQ metrics.
Logical rules
This kind of rules is also widely used to represent context components. In this case, they are proposed to represent DQ requirements (PS11; PS41; PS55; PS57), business rules (PS25; PS41; PS57), application domain (PS25), user characteristics (PS25) and other data (PS11; PS19; PS41; PS57). Although logical rules are not the most used, they cover the widest variety of context components (see Table 4). For instance, in (PS25) the Datalog language is used to represent context components through a set of logical rules.
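As a rough illustration (not the actual rules of PS25), a Datalog rule such as `in_context(P) :- patient(P, W), intensive_care(W)` can be read as a join over relations. A minimal Python analogue, with invented predicates:

```python
# Facts as relations (sets of tuples); all predicate names are invented.
patient = {("ana", "icu"), ("ben", "oncology"), ("carla", "icu")}
intensive_care = {("icu",)}

# Rule: in_context(P) :- patient(P, W), intensive_care(W).
# Evaluated as a join of the two relations on the shared variable W.
in_context = {(p,) for (p, w) in patient if (w,) in intensive_care}
```

Here the logical rule encodes a data filtering need (patients in intensive care) declaratively, leaving evaluation to the rule engine.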
Syntactic or semantic rules
To a lesser extent, with respect to the other types of rules, syntactic or semantic rules are used for representing data filtering needs (PS43), business rules (PS12) and application domain (PS43). As an example, in (PS43) characteristics of the application domain and data filtering needs, arising from constraints and dependencies, are represented through semantic and syntactic rules. In turn, in (PS12) business rules are applied as semantic rules over data items, where the items are tuples of relational tables.
SQL statements
The proposals in (PS33; PS59) represent DQ requirements using SQL statements. In (PS33), an extension of (PS59), it is mentioned that SQL statements enable defining data objects using “select” clauses, and defining DQ requirements using “where” conditions. On the other hand, (PS51) also uses SQL statements, but in this case to represent functional dependencies that arise from business rules. According to the authors, conditional functional dependencies have a syntax that can clearly represent context rules to document dependencies between attributes. This syntax also makes it easier to translate rules into SQL queries that look for records violating the rules.
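To make this concrete, a conditional rule of the illustrative form "if country = 'UY' then phone_prefix = '598'" (our invented example, not one from PS51) translates into a SQL query whose result set contains exactly the violating records:

```python
import sqlite3

# In-memory table with invented data; record 2 violates the rule.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, country TEXT, phone_prefix TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [(1, "UY", "598"), (2, "UY", "549"), (3, "AR", "549")],
)
# The rule's condition (country = 'UY') delimits its scope; the negated
# consequent (phone_prefix <> '598') selects the violators.
violations = conn.execute(
    "SELECT id FROM customer WHERE country = 'UY' AND phone_prefix <> '598'"
).fetchall()
```

An empty result set means the dataset satisfies the rule; non-empty results pinpoint the records to inspect or repair.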
Numerical indicators as thresholds
Another way to specify context components is through numerical indicators; in this case, they are thresholds that represent DQ requirements (PS27; PS39; PS50; PS56), application domain (PS42), and other data (PS4). For example, the authors of (PS39) use numerical quality indicators for better linked data source selection: the user indicates thresholds for DQ values, and then a ranked list of relevant sources is returned according to the quality thresholds. Quality thresholds represent DQ requirements that characterize the DQ dimensions considered relevant for the use case at hand.
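A sketch of this threshold-based selection, with invented source names and DQ values; the actual ranking criterion of PS39 may differ (here we simply rank by average DQ value):

```python
# Hypothetical DQ values per source; thresholds encode user DQ requirements.
sources = {
    "src_a": {"completeness": 0.95, "timeliness": 0.60},
    "src_b": {"completeness": 0.80, "timeliness": 0.90},
    "src_c": {"completeness": 0.99, "timeliness": 0.85},
}
thresholds = {"completeness": 0.85, "timeliness": 0.70}

def meets_thresholds(dq):
    """A source is relevant only if it satisfies every threshold."""
    return all(dq[dim] >= t for dim, t in thresholds.items())

# Relevant sources, ranked best-first by average DQ value.
ranked = sorted(
    (name for name, dq in sources.items() if meets_thresholds(dq)),
    key=lambda n: -sum(sources[n].values()) / len(sources[n]),
)
```

With these values only `src_c` clears both thresholds, so the ranked list contains a single source.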
Numerical indicators applying functions
In this case, numerical indicators are calculated by applying functions. The context components represented in this way are metadata (PS48) and DQ values (PS22; PS31). In (PS48), an indicator function maps the set of metadata to the binary values 0 or 1, where 1 denotes that the respective dataset value is incorrect and 0 means “clean data”. On the other hand, the indicator function in (PS22) is designed as a product of the results of DQ metrics for completeness, validity and currency. The indicator function in (PS31) is also designed as a product of the results of DQ metrics, but in this case the DQ metrics are for completeness, readability and usefulness; in addition, in this proposal each DQ metric has an associated weight, a scalar value between 0 and 1.
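The product-style indicator described above can be sketched as follows; the dimension names follow PS22, but the specific scores and weights are invented:

```python
def aggregated_indicator(metrics, weights):
    """Combine per-dimension DQ scores in [0, 1] into a single indicator,
    computed as a product of weighted metric results."""
    result = 1.0
    for dim, score in metrics.items():
        result *= weights[dim] * score
    return result

# Invented scores for the three dimensions aggregated in PS22.
metrics = {"completeness": 0.9, "validity": 1.0, "currency": 0.8}
weights = {"completeness": 1.0, "validity": 1.0, "currency": 1.0}  # unweighted
indicator = aggregated_indicator(metrics, weights)
```

With all weights set to 1 this reduces to the plain product of (PS22); non-uniform weights in [0, 1] correspond to the weighted variant of (PS31).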
Numerical indicators as attributes
Numerical indicators as attributes are used to represent data filtering needs (PS9), the application domain (PS9) and metadata (PS8; PS9). In (PS8), the authors identify attributes for information quality and assign them to DQ dimensions. For instance, they associate the degree of closeness of a value v to some value v’ with accuracy. In turn, in (PS9), the authors suggest asking “how can data be characterized?” instead of “what causes DQ deficiencies?”, and they explain that their methodology consists of analyzing numerical indicators at the data profiling stage, applying different tests that depend not only on the data, but also on the application domain and on data filtering needs.
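A hedged illustration of profiling-style numerical indicators attached to a single attribute; the indicator names (null_ratio, distinct_ratio) are ours, chosen only to show the idea of characterizing data rather than diagnosing the causes of DQ deficiencies:

```python
# Compute simple per-attribute indicators over a column of values.
def profile_attribute(values):
    n = len(values)
    non_null = [v for v in values if v is not None]
    return {
        "null_ratio": 1 - len(non_null) / n,       # share of missing values
        "distinct_ratio": len(set(non_null)) / n,  # share of distinct values
    }
```

Such indicators can then be tested against domain-dependent expectations, e.g. a filtering need stating that the null ratio must stay below some bound.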
Context components represented using entities are data filtering needs (PS39; PS50), the application domain (PS38) and user characteristics (PS38). The proposal in (PS38) uses a set of entities to represent the context. This PS relies on the context definition of Dey (Dey01), and although the authors present illustrative examples of their proposal, these examples never include the suggested representation of context components. On the other hand, in (PS39; PS50), concepts, i.e. entity categories identified by URIs, are used to represent a single context component: data filtering needs. Notably, (PS39) is an extension of (PS50). The proposal uses SKOS (Simple Knowledge Organization System, a W3C recommendation; https://www.w3.org/TR/skos-reference/) for defining, publishing, and sharing concepts over the Web. SKOS concepts represent the different data filtering needs that users may have when selecting linked data sources.
(Table: context components vs. the representations used for them. Among its surviving rows, data filtering needs is covered by four representation types, while the task at hand has no representation at all.)
The most common way to represent context components among the analyzed PS is through rules, especially rules in natural language and logical rules. Numerical quality indicators come second, particularly those that represent thresholds. Eleven PS consider this way of representing context components, especially when it is necessary to represent metadata, and in particular functions applied to DQ values. Regarding the representation of context components through SQL statements, the arguments of those who propose it make this kind of representation seem natural. However, based on the bibliography consulted (only 3 PS), there is not enough evidence of its usefulness. In turn, for business rules, only SQL statements and rules are suggested as representations, the latter in all the proposed forms.
In the case of representation through entities, we consider that the identified cases are task-specific, since they address particular needs in the linked data domain. In (PS38), such a representation is suggested, but it is never used in the proposed examples. On the other hand, as mentioned in the previous section, although the task at hand is probably one of the first context components identified in the bibliography, it is the only context component for which the analyzed bibliography does not suggest any representation.
5. Description of proposals formalizing context
All selected PS underline the importance of context in DQ assessment. However, as seen in Figure 7, only 6 PS present a formal context definition. In this section, we describe in detail the proposals of each PS that gives a formal definition of context.
5.1. A Methodology to Evaluate Important Dimensions of Information Quality in Systems. Proposals of Todoran, Lecornu, Khenchaf & Le Caillec 2015 (PS38)
In this PS, a methodology is proposed for assessing the quality of the information provided by an information system (IS). The IS is decomposed into modules in order to define information quality locally in each module. According to the authors, local DQ assessment allows an IS analyst to check the performance of each module depending on the application context. They also add that data are context independent, but information is context dependent, i.e. data are put in context and transformed into information; in this way, a simplified IS makes it easier to pass from data to information. Hence, they underline that information consists of organized data having a meaning in the application context. Based on this, they use the context definition of (Dey01), which says that context can be defined as any knowledge that can be used to characterize the situation of an entity.
The proposal considers three major components: data, an IS and information. These components can be seen in Fig. 12. The IS is decomposed into N modules. At the input of each module there is a dataset (in Fig. 12, D1 for module 1) that will be processed within the module to generate output information (Inf1), together with a set of DQ measures (in Fig. 12, Q1 in for module 1). The input DQ measures are evaluated against different quality criteria (such as accuracy, relevancy, completeness or currency) to generate the output DQ measures (in Fig. 12, Q1 out for module 1). The information generated in each module (Inf 1, …, Inf N) is merged at the stage called information fusion. Additionally, the DQ measures (Q1 out, …, QN out) received at this stage are aggregated taking into account the merged information and the user application context. Finally, the resulting DQ measures (Q1, …, QN) and the merged information (Inf) are delivered to the user. These DQ measures describe the quality of the merged information.
This proposal takes the definition of context given in (Dey01), where context is defined as any knowledge that can be used to characterize the situation of an entity (anything relevant to the interaction between a user and an application, including the user and the application themselves). Based on this, the authors define, for a given application A, its context environment as a set of n entities E1, E2, …, En. Each of these entities is characterized by a context domain, dom(Ei), which is a countably infinite set of values.
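A hedged sketch of this formalization: the context environment is a set of entities, each with a declared domain. Domains are countably infinite in the PS; we use small finite sets here only for illustration, and the entity names are invented:

```python
# Context environment for an application A: entities E1..En with their
# context domains dom(Ei). Entity names and values are illustrative.
context_environment = {
    "user_role": {"analyst", "manager"},   # dom(E1)
    "location": {"office", "field"},       # dom(E2)
}

def valid_context(assignment, environment):
    """Check that every entity value belongs to its declared domain."""
    return all(
        entity in environment and value in environment[entity]
        for entity, value in assignment.items()
    )
```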
The authors underline that turning data into information requires some knowledge, which is given by the context. The context definition is taken from the bibliography (Dey01), and although they propose a formal representation of data/information quality that considers the context when applying DQ dimensions (called quality criteria by the authors), they do not apply the proposed formalization in their examples, where the context is only used implicitly.
5.2. Exploiting Context and Quality for Linked Data Source Selection. Proposals of Catania, Guerrini & Yaman 2019 (PS39)
This work is an extension of (PS50) and addresses the problem of source selection for Linked Data, relying on context and user DQ requirements. The authors claim that, due to the semantic heterogeneity of the web of data, it is not always easy to assess the relevancy of data sources or the quality of their data. They consider that context information can help in interpreting users’ needs. In their proposal, the sources are organized in a data summaries structure that supports context- and quality-aware source selection.
In Fig. 13 we present a simplification of the proposal of this PS, focusing mainly on the elements that are of interest for our research. The user submits a SPARQL query to the system, accompanied by a set of quality thresholds ⟨v1, …, vn⟩ and a reference context c. Each vi is a value between 0 and 1, and it specifies that the user is interested in quality values greater than vi. According to the authors, quality thresholds characterize the DQ dimensions considered relevant for the use case at hand.
During a search, the system first looks for sources that satisfy the quality thresholds specified by the user through ⟨v1, …, vn⟩ (i.e. the user's DQ requirements). If sources are found, they are ranked according to the context distance, i.e. the distance between the context associated with each selected source and the reference context c of the user query. Finally, the user receives the most relevant sources according to his/her reference context, whose data fit his/her DQ requirements.
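The two-step selection can be sketched as follows. This is our simplification, not the authors' algorithm: sources failing any user threshold are discarded, and the survivors are ranked by a context distance to the reference context; the Jaccard-style distance over concept sets is illustrative only:

```python
# Threshold filtering followed by context-distance ranking (sketch).
def select_sources(sources, thresholds, reference_ctx):
    def passes(src):
        # User DQ requirements: every quality value must exceed its threshold.
        return all(q > t for q, t in zip(src["quality"], thresholds))

    def ctx_distance(src):
        # Toy distance between sets of SKOS-like concepts (illustrative).
        a, b = set(src["context"]), set(reference_ctx)
        return 1 - len(a & b) / len(a | b) if (a | b) else 0.0

    return sorted((s for s in sources if passes(s)), key=ctx_distance)

sources = [
    {"name": "s1", "quality": (0.9, 0.8), "context": {"health"}},
    {"name": "s2", "quality": (0.6, 0.9), "context": {"health", "genomics"}},
    {"name": "s3", "quality": (0.95, 0.9), "context": {"music"}},
]
ranked = select_sources(sources, (0.7, 0.7), {"health", "genomics"})
```

Here s2 is discarded for failing the first threshold, and s1 ranks above s3 because its context is closer to the reference context.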
In this proposal the context is modelled through SKOS (Simple Knowledge Organization System; https://www.w3.org/TR/skos-reference/) concepts, which are identified through URI references. According to the authors, SKOS is a W3C recommendation for defining, publishing, and sharing concepts over the Web. They add that although SKOS does not provide semantic and reasoning capabilities as powerful as those of ontologies, concepts enhance the potential of search capabilities. They also mention that SKOS provides mappings between concepts from different concept schemes and in diverse data sources, as well as a hierarchical organization of concepts.
In this PS, the context of the use case at hand and the DQ requirements of the user are considered for selecting data sources for Linked Data. Contexts are modelled through SKOS concepts, and user DQ requirements are expressed as quality thresholds that characterize the DQ dimensions of interest to the user. The data sources that satisfy the quality thresholds and most closely approximate the reference context are considered the most relevant ones and those with the highest quality.
5.3. Rule-Based Multidimensional Data Quality Assessment Using Contexts. Proposal of Marotta & Vaisman 2016 (PS19)
This proposal uses logic rules to assess the quality of measures in a Data Warehouse (DW), taking into account the context in which these DW measures are considered.
A DW is a database supporting decision making, whose data are represented and manipulated through the Multidimensional Model (MD). Real-world facts (e.g. sales) are represented by a set of dimensions (describing different analysis axes) and a set of measures (numerical values associated with facts, such as quantity sold). Data are organized in star schemas, containing a central table called the fact table (Sales in Fig. 14), which stores facts and their measures (e.g. quantity in Fig. 14), and dimension tables describing the dimensions (Branches of a supermarket, Products, and Time in Fig. 14).
As can be seen in Fig. 14, the components of the context proposed in this PS are the DW dimensions (Branches, Products, and Time) and data external to the DW (database E). The authors propose to use logic rules to assess the quality of the DW measure (quantity), taking into account the context. It is worth noting that in a more complex situation, the authors also include, in the context, final users of the DW system.
DQ metrics usually return aggregate values (e.g. accuracy measured as the ratio of measure values that are accurate) or Boolean values (e.g. whether a specific value is accurate or not). In this proposal, however, a DQ metric for the accuracy dimension has the following form: a DW measure is likely to be inaccurate if the quantity sold of a product in the chocolate family, in January, in Uruguay, is greater than 50. The rule representing this DQ metric is then executed to select, from the fact table, the facts that satisfy the rule.
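The example rule can be rendered as a simple filter over the fact table. This is our hedged translation into Python rather than the authors' Datalog; the fact rows are invented:

```python
# Toy fact table: each row joins a Sales fact with its contextual
# dimension attributes (product family, month, country).
facts = [
    {"family": "chocolate", "month": "January", "country": "Uruguay", "quantity": 80},
    {"family": "chocolate", "month": "January", "country": "Uruguay", "quantity": 12},
    {"family": "dairy", "month": "January", "country": "Uruguay", "quantity": 90},
]

def likely_inaccurate(fact):
    # The DQ rule: chocolate family, January, Uruguay, quantity > 50.
    return (fact["family"] == "chocolate"
            and fact["month"] == "January"
            and fact["country"] == "Uruguay"
            and fact["quantity"] > 50)

suspicious = [f for f in facts if likely_inaccurate(f)]
```

Only the first fact is flagged: the second fails the quantity bound and the third fails the product-family condition.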
The formalization of context and DQ metrics is given as logic rules. Also, the MD data model is represented as a collection of axioms and logic rules, based on the idea proposed by (Minuto02). The authors mention that they limited themselves to non-recursive Datalog without negation. They underline that Datalog could also express more complex queries, but consider this beyond the objective of the proposal, whose main goal was to show the applicability of the approach. Additionally, the researchers consider that even the most basic Datalog version allows them to express context for multidimensional data, using not only DW dimensions but external data too.
According to the authors, (PS19) proposes a multidimensional model which uses Datalog rules to represent facts, dimensions, and aggregate data. In addition, they apply DQ metrics, depicted by DQ rules that are defined taking into account the context of the DW measures, for obtaining a set of facts that satisfy such rules.
5.4. Data Quality Is Context Dependent. Bertossi, Rizzolo & Jiang 2011 (PS11)
In this work, the authors propose DQ assessment and DQ query answering as context-dependent activities. They mention that DQ is usually related to the discrepancy between the stored values and the real values that were supposed or expected to be stored. However, this proposal focuses on another type of discrepancy: the semantic discrepancy that occurs when the senses or meanings attributed by different agents to the actual values in the database do not coincide. The context formalization is given as an integrated system of data and metadata, called a contextual system.
The authors propose a contextual system for modeling the context of an instance D of a relational schema, i.e. D is under quality assessment with respect to such a contextual system. Fig. 15 presents a simplification of the framework for contextual DQ. The contextual system consists of a set of contextual relational schemas C = C1, …, Cn; a set of contextual quality predicates defined over C; and, in a more complex scenario, also a set of schemas from external sources. Furthermore, DQ requirements represent the conditions that data must verify to satisfy the information needs of the users. Contextual quality predicates capture these DQ requirements.
The proposal considers a relational schema S, with relational predicates R1, …, Rn ∈ S, in first-order logic. The authors consider an instance D of S, where database instances are conceived as finite sets of ground atoms. For each R ∈ S, the instances D are those under quality assessment with respect to the contextual system. Besides, the contextual relational schema C may include a set of predicates C1, …, Cn. Each contextual quality predicate is defined as a conjunctive view in terms of elements of C, and each database predicate R ∈ S is also defined as a conjunctive view. The authors mention that they consider only monotone queries and views (e.g. conjunctive queries), which they write in non-recursive Datalog with built-ins.
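A hedged relational sketch of these definitions: the quality version of a relation can be obtained as a conjunctive view that keeps only the tuples joining with contextual metadata satisfying the quality predicate. The relation and predicate names below are invented for illustration:

```python
# Relation under assessment: Measurement(patient, date, value).
measurements = [("p1", "2024-09-01", 38.5), ("p2", "2024-09-01", 37.0)]

# Contextual relation: TakenWith(patient, date) -> instrument.
taken_with = {("p1", "2024-09-01"): "oral_thermometer",
              ("p2", "2024-09-01"): "tympanal_thermometer"}

def quality_view(required_instrument):
    # Conjunctive view: Measurement(p, d, v) AND TakenWith(p, d, required).
    return [(p, d, v) for (p, d, v) in measurements
            if taken_with.get((p, d)) == required_instrument]
```

A DQ requirement such as “only measurements taken with an oral thermometer are acceptable” is captured by instantiating the predicate accordingly.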
The authors propose a framework for assessing a database instance in terms of quality properties. These properties are captured by alternative instances, obtained by interaction with additional contextual data or metadata. In turn, contextual quality predicates capture DQ requirements. On the other hand, although the proposal is motivated by supporting DQ assessment, it only deals with the selection of appropriate data for query answering. The authors do not address DQ assessment issues, nor specific DQ dimensions or DQ metrics. In fact, they mention that the contextual schema and data are not used to enforce the quality of a given instance; rather, the quality of the data in the instance is evaluated. Finally, the contextual schema characterizes the quality answers to queries.
5.5. Ontological Multidimensional Data Models and Contextual Data Quality. Bertossi & Milani 2018 (PS41), extension of Milani, Bertossi & Ariyan 2014 (PS57)
The framework presented in (PS11) (and previously described) is extended in the works (PS57; PS41). Specifically, these extensions present a formal model of context for context-based DQ assessment and quality data extraction. For that, the authors propose ontological contexts including multidimensional (MD) data models in them. In this work, DQ refers to the degree to which data fits a form of usage, and it relates DQ concerns to the production and the use of data.
The authors underline that the instance of a given database does not contain all the elements necessary to assess DQ or to extract quality data. For this reason, they consider that the context provides additional information about the origin and intended use of data. Therefore, the authors use MD ontologies to represent the context, and they extend these ontologies with rules for DQ specification. An ontology contains definitions of quality predicates that are used to produce quality versions of the original tables. As seen in Fig. 16, data in the given database instance D are put in context to obtain a quality version of D. This means that data obtained from D are processed through the contextual ontology to obtain quality data.
According to the researchers, the contextual ontology contains a multidimensional core ontology, i.e. the ontological multidimensional data model represented by categorical relations and dimensions, and possibly dimensional rules and constraints. It additionally contains a quality-oriented sub-ontology with quality predicates (rules and/or constraints), and possibly an extra contextual database. The latter can be used at the contextual level in combination with the data associated with the multidimensional ontology. Fig. 16 shows a simplification of the proposed contextual ontology and how these elements are related.
On the other hand, Fig. 17 shows how dimensional data are proposed together with categorical relations. Each level of the hierarchical dimension corresponds to a relation that contains more information for each member of the level. Data in D are then processed based on contextual information, called guidelines. In the example of Fig. 17, a guideline could be “Sellers, in the computing department, update customer data at the time of the sale”, and it conditions the navigation of the dimensional hierarchy (selecting the level and the level members, according to the contextual information) and the selection of data in the categorical relations. Finally, quality predicates, and possibly extra contextual data, are taken into account to obtain a quality version of D.
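A toy sketch of guideline-conditioned navigation, under our own simplifying assumptions (the level, members and attribute names are invented): the guideline selects a dimension level and the level members whose categorical relation satisfies it, so that only data linked to those members are kept in the quality version of D:

```python
# Categorical relation for the "department" level of a dimension:
# each member carries extra attributes used to evaluate guidelines.
department_relation = {
    "computing": {"updates_customer_data_at_sale": True},
    "furniture": {"updates_customer_data_at_sale": False},
}

def members_satisfying(relation, guideline):
    """Return the level members whose categorical attributes satisfy the guideline."""
    return [member for member, attrs in relation.items() if guideline(attrs)]

# Guideline: "sellers in this department update customer data at the sale".
trusted = members_satisfying(
    department_relation,
    lambda attrs: attrs["updates_customer_data_at_sale"],
)
```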
The authors extend the formal context model presented in (PS11) for context-based DQ assessment, DQ extraction, and data cleaning on a relational database. In this case, they focus on an ontological context. Regarding the representation of the contextual ontology, they consider that it must be written in a logical language and propose Datalog for this task, alongside a declarative query language for relational databases. According to the authors, this provides declarative extensions of relational data through expressive rules and semantic constraints.
The work in (HMmodel), which uses Datalog to represent the multidimensional data model, is applied in this proposal. Bertossi et al. mention that certain classes of Datalog programs have non-trivial expressive power and good computational properties, and that some of these programs allow them to build a logic-based, relational reconstruction and extension of the proposal in (HMmodel).
This work proposes ontological contexts embedding multidimensional data models. The notion of quality presented in this proposal relates to the DQ dimension called Relevancy (wang-strong). The PS highlights that this contextual approach can be used to address inconsistency, redundancy and incompleteness of data. Nevertheless, the authors do not delve into these DQ dimensions and do not define DQ metrics for them, since the proposal is specifically focused on query answering. In fact, the authors mention that, independently of the DQ dimension, they can consider DQ assessment and data cleaning as context-dependent activities.
5.6. General results
As a summary, we want to highlight that in two cases (PS38; PS41) the need to put data in context appears: in the former to obtain information, and in the latter to obtain a quality database. Although all works focus on data context, such data are considered at different levels of granularity: a single value, a relation, a database, etc. For instance, in (PS19) the dimensions of a Data Warehouse (DW) and data external to the DW give context to DW measures, while in (PS11) data in relations, DQ requirements and external data sources give context to other relations. On the other hand, the proposals in (PS41; PS57) model and represent a multidimensional contextual ontology; the context, in these cases, is defined by hierarchical data, DQ requirements and external data sources, all of which give context to relational databases. The authors in (PS39) propose a framework where the context (represented by SKOS concepts) and the DQ requirements of users (expressed as quality thresholds) are used for selecting Linked Data sources.
In turn, (PS38) presents an information quality methodology that considers the context definition given in (Dey01). This definition is represented through a context environment (a set of entities) and context domains (defining the domain of each entity). In this case, the researchers assume that although the given context definition is too general for the DQ domain, it allows them to enumerate the context elements for a given application scenario. Next, in Table 5 we summarize this information.
|PS||CTX components||CTX representation||Contextualized object|
|(PS11)||DQ req., other data||relational schema||relational DB|
|(PS41)||DQ req., business rules, other data||ontology||relational DB|
|(PS57)||DQ req., business rules, other data||ontology||relational DB|
|(PS38)||user, app. domain||set of entities||information|
|(PS39)||DQ req., data fil. needs||SKOS concept||data|
|(PS19)||other data, user||logic rules||DW measures|
Secondly, we are interested in identifying the most important DQ concepts addressed in these PS. For instance, DQ requirements are addressed by all PS except (PS38). (PS19; PS38; PS39) consider contextual DQ dimensions, while in (PS41) DQ dimensions are non-contextual and in (PS11; PS57) DQ dimensions are not mentioned. Regarding DQ metrics, they appear in (PS19; PS38; PS39), and in all of them they are contextual, i.e. their definition includes context components or is influenced by the context. In the case of DQ tasks, cleaning (PS11; PS41; PS57), measurement (PS19) and assessment (PS38; PS39) are the only tasks tackled in these PS. We summarize this information in Table 6.
|PS||DQ req.||DQ dimensions||DQ metrics||DQ tasks|
|(PS11)||yes||not mentioned||not mentioned||cleaning|
|(PS19)||yes||contextual||contextual||measurement|
|(PS38)||no||contextual||contextual||assessment|
|(PS39)||yes||contextual||contextual||assessment|
|(PS41)||yes||non-contextual||not mentioned||cleaning|
|(PS57)||yes||not mentioned||not mentioned||cleaning|
Finally, we present the most important characteristics of these PS. Regarding the research domain, (PS19; PS39) address context definitions for Data Warehouse Systems and Linked Data Source Selection, respectively. On the other hand, (PS38; PS11; PS41; PS57) are specifically focused on DQ; the last three tackle cleaning and DQ query answering. Furthermore, the most common type of work is the definition of a model (PS11; PS19; PS39; PS41; PS57). (PS41; PS57) also present a contextual ontology, (PS19; PS39) also pose a framework, and (PS38) only presents a DQ methodology. Regarding venue quality, (PS38; PS39; PS41) are ranked B and (PS11; PS19; PS57) are not ranked. Regarding the publication date, the oldest PS corresponds to the year 2011 and the newest to 2019. This information is shown in Table 7.
|PS||Research area||Work type||Venue quality||Pub. year|
|(PS11)||Cleaning||Model||NR||2011|
|(PS57)||Cleaning||Model / Ontology||NR||2014|
|(PS41)||Cleaning||Model / Ontology||B||2018|
|(PS38)||Data Quality||DQ Methodology||B||2015|
|(PS19)||Data Warehouse Systems||Model / Framework||NR||2016|
|(PS39)||Linked Data Source Selection||Model / Framework||B||2019|
6. Answers to Research Questions
This section presents the answers to the three research questions, derived from the analysis of the selected PS.
6.1. RQ1: How is context used in data quality models?
In the PS that address DQ models (PS2; PS3; PS13; PS18; PS20; PS27; PS28; PS33; PS34; PS59), we mainly identify that they include context by: i) distinguishing contextual DQ dimensions from non-contextual ones, ii) defining contextual DQ metrics, which include context components or are delimited by DQ thresholds, and iii) considering the specific context through DQ requirements, which can come from business rules, users’ data or user preferences. Next, we analyze each of the proposals.
To begin, we consider the works in (PS3; PS18), which propose quality-in-use models (3As and 3Cs, respectively). They claim that the ISO/IEC 25012 DQ model (25012-Stand), devised for classical environments, is not suitable for Big Data projects, and present Data Quality-in-use models. They propose DQ dimensions grouped into categories called contextual adequacy and contextual consistency, among others, evidencing the need for explicit context consideration. Regarding contextual DQ metrics, in (PS3) the authors also mention that, to measure DQ in use in a Big Data project, DQ requirements must be established. In addition, (PS18) mentions that the DQ dimensions that address the DQ requirements of the task at hand should be prioritized.
In the proposal of (PS2), the authors reuse the DQ framework of Wang & Strong (wang-strong) to highlight contextual characteristics of DQ dimensions such as completeness, timeliness and relevance, among others. Besides, they underline that contextual DQ increases the retrieval of valuable information from data. Also in (PS13), the contextual DQ dimensions included in the proposed DQ model are taken from the bibliography, in this case from the ISO/IEC 25012 standard (25012-Stand). Moreover, the authors claim that DQ requirements play an important role in defining a DQ model, because they depend on the specific context of use. (PS20) references the proposal in (PS3), supporting the need raised in it, on the basis that an in-use DQ assessment model is increasingly important since, as in (PS13), business value can only be estimated in its context of use. In this case, DQ requirements are strongly tied to the contextual DQ dimensions efficiency and adequacy.
In fact, the proposal in (PS27) is also motivated by producing value from Big Data analysis while minimizing DQ problems. This work also considers the quality-in-use models of (PS3; PS18) (3As and 3Cs, respectively), but the authors underline that, in these and other works, analyzing DQ only involves the preprocessing stage of Big Data analysis. They also argue that these DQ models mainly consider DQ on a single source and do not sufficiently take user preferences into account. Hence, the authors present their proposal as a more complete DQ model, because it alerts about DQ problems during the Big Data analysis stage without any preprocessing, and takes user preferences into account. Completeness and consistency are the DQ dimensions considered in this DQ model, and they are contextual according to the users’ perspective. Besides, the DQ metadata obtained with the DQ metrics associated to these DQ dimensions are limited by thresholds specified by users; therefore, these thresholds can change for each user.
In contrast, the results of (PS28), which presents a DQ profiling model, show that DQ profiling traces quality at the earliest stage of the Big Data life cycle, leading to DQ improvement. This proposal profiles a dataset in a DQ domain defined by a set of DQ requirements and data filtering needs. In turn, motivated by decision making, (PS34) focuses on DQ problems along with DQ measurement, associating the problems with DQ dimensions. Users’ DQ requirements give context to the DQ dimensions. It is also added that the lack of a combination of general and specific DQ dimensions for analysing DQ affects data fitness for use. The authors even point out that although data cleaning produces DQ improvement in the short term, it does not have a radical effect on DQ; therefore, constant DQ improvement is necessary.
To finish, we emphasize that the bibliography supports the fact that the term quality depends highly on the context in which it is applied, and that to assess DQ for a specific usage, DQ requirements must be described and their compliance checked (PS33; PS59). Moreover, the same data may be checked for accordance with different DQ requirements (PS59).
There is vast evidence that DQ assessment is context-dependent, since several research domains, such as Linked Data, Decision Making, Big Data and especially the DQ domain itself, present arguments for the importance of having DQ metrics that adapt to the needs of each reality. The bibliography claims that current DQ models do not take such needs into account, nor the particular demands of the different application domains, especially in the case of Big Data. Perhaps a common DQ model is not possible, since each DQ model should be defined taking into account the particular characteristics of each application domain. In addition, there is agreement on the influence of DQ requirements on a contextual DQ model since, according to the literature, they condition all the elements of such a model.
6.2. RQ2: How is context used within quality metrics for the main data quality dimensions?
In subsection 3.3 we already addressed the contextual DQ metrics identified in the analyzed PS. To answer this research question, we return to these PS for a more detailed analysis: we want to know why the authors consider certain DQ metrics to be contextual, which context components are considered, and how they are included in the definition of DQ metrics. Next, we present this analysis.
The proposals in (PS16; PS32) apply the fitness for use approach (wang-strong) to the DQ measurement task. (PS16) defines DQ metrics as objective or subjective, the latter when they are based on qualitative evaluations by information and/or data users. It is also mentioned that DQ measurement allows comparing the effective quality with predefined DQ requirements. Furthermore, in the case of (PS32), the authors underline that DQ requirements have a very important role when implementing a DQ project, because the project should meet the specified DQ requirements. In turn, in that task it is difficult to select appropriate DQ dimensions and their DQ metrics, since there is no agreement on the dimensions that exactly determine DQ. This idea is reinforced by the researchers of (PS35), who point out that DQ requirements may influence the selection of the set of dimensions and metrics to be considered for DQ assessment. In addition, as a conclusion of a literature review, the authors of (PS21) define DQ requirements as “the specification of a set of dimensions or characteristics of DQ that a set of data should meet for a specific task performed by a determined user”. Accordingly, when we specify DQ characteristics, some elements that can define a context must be taken into account, such as the specific task, the user, business rules, etc.
Taking Big Data quality issues into account, a proposal for context-dependent DQ assessment in (PS4) presents a DQ metric for evaluating confidence precision based on DQ requirements specified by users. At the same time, this DQ metric is defined based on certain DQ dimensions, such as completeness and distinctness. In a review (PS17), the authors recommend an evaluation scheme in which DQ metrics are also selected according to DQ dimensions, besides data and Big Data attributes. On the other hand, (PS28) proposes a Big Data quality profile repository that includes DQ requirements. This repository defines DQ dimensions and their DQ metrics. Besides, DQ evaluation includes data sources (with all that they imply: data, metadata, etc.), DQ requirements, and DQ evaluation scenarios. Also, in the Data Integration domain, (PS55) presents users with different roles who specify DQ requirements that later determine the selection of DQ metrics.
In the proposals (PS27; PS39), DQ requirements are represented as thresholds. In the former, measurement methods use thresholds (called quality limits) with which the system alerts users; these thresholds are specific to each measurement method and can be indicated by users. The latter uses thresholds specified by users to condition data source selection: data sources whose quality values do not satisfy the thresholds associated with each DQ dimension are discarded. In the case of (PS42), a set of DQ requirements for DQ metrics is specially defined, some of them stating that it must be possible to adjust a DQ metric to a particular application domain. In fact, the latter is verified by (PS15; PS43): in (PS15) the authors define DQ dimensions and DQ metrics whose definition and measurement process inherently depend on the application domain, resulting in a class of subjective DQ dimensions and DQ metrics, while (PS43) introduces a set of quality checks for creating application-domain-specific DQ metrics. According to (PS38), a quality criterion might be evaluated by multiple measures, depending on the information characteristics. They evaluate DQ based on the application domain, and the interaction between the user and the application is also considered for contextualizing DQ.
In the case of (PS43), data filtering needs are included in the definition of DQ metrics, and they are customized by users. In turn, the proposal in (PS51) presents DQ metrics created from business rules that represent conditional functional dependencies. Furthermore, (PS3) presents a 3As DQ-in-Use model in which the DQ dimensions (called DQ characteristics by the authors) suggested for Big Data analysis are contextual adequacy, temporal adequacy and operational adequacy. To measure the levels of Data Quality-in-Use, DQ requirements are considered in order to select the appropriate type of adequacy. Additionally, business rules are used as input to the DQ metrics and condition the measurement.
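To make the idea of deriving a DQ metric from a conditional functional dependency (CFD) concrete, consider the following sketch. It is not taken from (PS51); the rule ("for US records, zip code determines state") and the data are invented, and the metric is simply the fraction of applicable tuples not involved in a violation.

```python
def cfd_conformance(tuples, condition, determinant, dependent):
    """Fraction of tuples satisfying a CFD: where `condition` holds,
    equal `determinant` values must imply equal `dependent` values."""
    applicable = [t for t in tuples if condition(t)]
    if not applicable:
        return 1.0  # vacuously satisfied
    # Group applicable tuples by determinant; the dependent attribute
    # must take a single value within each group.
    groups = {}
    for t in applicable:
        groups.setdefault(t[determinant], set()).add(t[dependent])
    violating_keys = {k for k, vals in groups.items() if len(vals) > 1}
    violations = sum(1 for t in applicable if t[determinant] in violating_keys)
    return 1 - violations / len(applicable)

rows = [
    {"country": "US", "zip": "10001", "state": "NY"},
    {"country": "US", "zip": "10001", "state": "NJ"},  # violates the CFD
    {"country": "US", "zip": "94105", "state": "CA"},
    {"country": "FR", "zip": "75001", "state": ""},    # rule does not apply
]
score = cfd_conformance(rows, lambda t: t["country"] == "US", "zip", "state")
print(round(score, 2))  # 0.33: two of three applicable tuples conflict
```

The business rule thus plays the role of context: the same dataset would score differently under a different condition or dependency.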
For decision making, a methodology for DQ assessment in (PS9) defines DQ metrics based on the task at hand (called use-case in that work), data attributes and test criteria; the latter are used for characterizing datasets and DQ dimensions. With the same purpose, the authors of (PS14) introduce DQ metrics for the accuracy of a relational database. The syntactic accuracy assessment matches tuples from the table under evaluation against tuples of another table that contains the same, but correct, tuples. In this case, DQ metrics are defined based on other data that serve as a reference. Also within the relational model, the proposals in (PS19; PS25) are motivated by DQ assessment, in this case in a Data Warehouse. For the measurement, other data, which are not the contextualized data, are taken into account to define the context considered in DQ metrics; in particular, data in the DW dimensions are embedded in the contextual DQ metrics. Likewise, in (PS25), information from business rules and about the application domain is also embedded in DQ metrics.
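The referential matching behind syntactic accuracy, as described for (PS14), can be sketched in a few lines. This is an illustrative simplification under the assumption of exact tuple matching; the table contents are invented, and (PS14) may use more elaborate matching.

```python
def syntactic_accuracy(table, reference):
    """Fraction of tuples in `table` that exactly match some tuple in the
    reference table, which holds the correct versions of the same tuples."""
    reference_set = set(reference)  # the "other data" acting as context
    matched = sum(1 for t in table if t in reference_set)
    return matched / len(table) if table else 1.0

evaluated = [("Ana", "Montevideo"), ("Bob", "Pariss"), ("Eve", "Rome")]
correct   = [("Ana", "Montevideo"), ("Bob", "Paris"),  ("Eve", "Rome")]
print(syntactic_accuracy(evaluated, correct))  # 2 of 3 tuples match
```

The reference table is exactly the kind of "other data" mentioned above: the metric is meaningless without it, which is what makes it contextual.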
The approaches in (PS22; PS31) differ from the previous PS, since in these cases the authors investigate how DQ metrics for particular DQ dimensions can be aggregated to derive an overall DQ indicator. In the case of (PS22), values of completeness, validity, and currency are aggregated to derive an indicator for the dimension accuracy, while in (PS31) an indicator function is designed as the product of the results of the DQ metrics for completeness, readability and usefulness. Once again, DQ metrics are defined based on other data, which in this case are DQ metadata.
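A minimal sketch of the product-style aggregation attributed to (PS31): per-dimension metric results (assumed to lie in [0, 1]) are multiplied into a single indicator. The dimension names follow the text; the values are invented.

```python
from math import prod

def dq_indicator(metric_results):
    """Product aggregation: a poor score in any dimension drags the
    overall indicator down, and a zero anywhere zeroes the indicator."""
    return prod(metric_results.values())

results = {"completeness": 0.9, "readability": 0.8, "usefulness": 1.0}
print(round(dq_indicator(results), 2))  # 0.72
```

A product is a deliberately strict choice compared with, say, a weighted average: it encodes the view that data failing on one dimension cannot be compensated by excelling on another.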
It is difficult to identify which context components make DQ fit for use, since the literature suggests several context components to contextualize DQ dimensions and DQ metrics. In particular, regarding how context is used within DQ metrics, there is nothing concrete yet, because in most cases DQ metrics are considered contextual according to whether or not the measurements obtained with them conform to DQ requirements. In fact, we found few works (PS19; PS22; PS25; PS31; PS43) that explicitly include context components in the definition of DQ metrics.
On the other hand, among all the context components addressed, DQ requirements are the ones most considered by the different research domains to contextualize DQ, in particular when selecting DQ dimensions and defining their metrics. The latter is in line with the analysis performed for RQ1.
6.3. RQ3: How is context used within data quality concepts?
To answer this research question, and according to our review, we identify DQ concepts that usually consider data context, such as measurement tasks, DQ methodologies, DQ requirements and DQ problems. Next, we examine how context relates to these DQ concepts.
To begin, we consider the proposal in (PS36), which takes DQ actions related to business rules; DQ actions, in this case, refer to DQ tasks corresponding to measurement, evaluation and cleaning. Additionally, the authors of (PS11; PS57; PS41) address DQ assessment, focusing on data cleaning and motivated by data filtering needs. In this approach, other data (which are not the evaluated data), DQ requirements and business rules influence DQ evaluations. In turn, in a review carried out in (PS25), the authors observe that few works use context when performing DQ tasks such as data profiling, data cleaning or data evaluation, with DQ measurement being the task that most often considers context. The latter coincides with the results obtained in section 7, where we observe that it is at the measurement and evaluation stages of a DQ process that the components of the data context are most taken into account.
On the other hand, in (PS51) a DQ methodology is proposed for assessing DQ based on business rules. Also, (PS4) proposes a methodology that selects, based on user DQ requirements, the best configuration for DQ assessment. This coincides with the arguments of (PS16), which mentions that the role of DQ methodologies is to guide the complex decisions to be made, while at the same time being adapted to the application domain. Accordingly, (PS38) considers the application domain and user data for the purpose of presenting reliable information quality. We also identify DQ methodologies for particular research domains, for example those proposed by (PS35; PS36), which are adapted for assessing DQ in a Big Data domain. In the case of (PS25), a methodology is presented to define contextual DQ metrics in Data Warehouse systems. In other cases, particular methodologies are proposed for DQ assessment. For instance, (PS23) applies the six-sigma methodology (linderman2003six), addressing DQ tasks (measurement, assessment, and improvement) that are guided especially by DQ requirements. In the case of (PS13), a methodology for public sector organizations is based on the OPDCA (Observe, Plan, Do, Check, Adjust/Act) Shewhart-Deming cycle (PDCA); besides, DQ requirements, as well as laws and regulations specific to the application domain, condition DQ assessment.
Furthermore, (PS33) presents a theoretical methodology that describes principles of DQ and methods for its evaluation, which are carried out based on DQ requirements. Also in (PS10; PS56), DQ requirements are the starting point of DQ assessment. In (PS10) the proposed methodology identifies DQ dimensions and DQ metrics that arise from DQ requirements, focusing on their visualization to assess the overall DQ, while in (PS56) the methodology identifies DQ problems to perform DQ evaluations; in this case DQ requirements are called specifications. Regarding DQ problems, the authors of (PS26) highlight that they are an important source for understanding data filtering needs; moreover, they usually result from the violation of DQ requirements. The proposal in (PS29) asserts that general data problems within a context can result in information quality problems. In particular, the research in (PS24) classifies DQ problems using a conceptual model to determine an optimal investment in DQ projects, and some of these DQ problems are classified as context dependent. Context is also considered in (PS9) at the initial stage of a DQ process, whose final stages assess and improve DQ. In this case, the authors emphasize that a specific usage context or data-dependent task is defined; as we can see, task and context are used interchangeably.
As we have already seen, several works focus on satisfying DQ requirements to carry out DQ tasks such as measurement, evaluation and data cleaning. Mainly, DQ requirements vary according to users, application domains or the task at hand, in particular at the different stages of DQ methodologies, since each stage proposes different DQ actions. In addition, adapted DQ methodologies are suggested for assessing DQ, especially in the Big Data domain. Finally, the literature suggests that when we analyze general data problems within a context, these can become DQ problems.
This paper reports the protocol and results of an SLR on context in DQ management. The SLR methodology allowed the selection of 58 papers based on their relevance, which were analyzed with particular attention to the definition of context in DQ management. In addition, we focused on the main characteristics of the proposals, such as type of work, application domain, considered data model and proposed case study. Furthermore, we identified positive and negative points of each approach, taking them into account for our purpose, which is to provide guidance on the usage of data context in DQ management.
Regarding the general results of the systematic review, we highlight that the largest number of PS were returned by the Springer digital library, that the search string containing the most important DQ concepts provided the best results, and that most of the PS are recent, published since 2016. Models and frameworks were the most common types of work. In turn, PS that only address the DQ domain (called only DQ) were the most common proposals, which makes sense because our searches specifically targeted that domain. However, we also found several PS from the Big Data domain. Researchers of this community claim that current DQ models do not fit the needs and characteristics of their research domain. In fact, there are discrepancies among Big Data proposals, since in some cases the authors state that most of the literature does not address DQ at all stages of a Big Data project, but only at the initial stages.
With respect to our findings, we note that while the importance of context in DQ management is acknowledged in all the studied PS, only 6 of them present a formal context definition. Among them, 5 correspond to the DQ domain (i.e. only DQ) and 1 corresponds to the Linked Data domain. In turn, half of the selected PS, i.e. 29 PS, do not present any context definition, neither formal nor informal. Among the context components identified in our review, we observe that DQ requirements are the ones most often suggested to give context to DQ. To a lesser extent, data filtering needs, the application domain, business rules, other data (which are not the evaluated data), and the task at hand are also suggested as components of the context.
In relation to DQ processes (we consider a DQ process with 7 stages, presented in the subsection ), the measurement and evaluation stages seem to be the most contextual, since they are the stages for which the most context components are proposed. With respect to the representation of context components, rules are the most used; in particular, rules are written in natural language or using a logical language. Numeric indicators are also used to represent context components; for instance, DQ requirements are usually represented as DQ thresholds.
To conclude, based on the analysis performed in answering our research questions, there is vast evidence that DQ assessment is context-dependent, since several research domains argue the importance of having DQ models that suit their needs. This implies that DQ models must be contextual, and therefore so must DQ dimensions and their DQ metrics. However, in the literature there is no consensus on what exactly makes these DQ concepts contextual; in fact, there is not even agreement on the most appropriate DQ dimensions and DQ metrics for a DQ project. The latter makes sense: due to the contextual nature of DQ, DQ dimensions and DQ metrics must be chosen according to the data context of each DQ project. On the other hand, there is a tacit agreement that DQ requirements must be considered in all DQ management tasks, and that these can vary according to users, the application domain and the task at hand.
We also identified that context components are usually proposed for DQ evaluation, where measurements obtained through DQ metrics are compared with quality thresholds that represent DQ requirements. This does not fully coincide with what was analyzed in the PS, because most of the proposals affirm that context is necessary both when measuring DQ and when evaluating it. This means that there is awareness of the need for context, but it is not explicitly addressed, especially when DQ metrics are defined. In fact, although we have identified contextual DQ metrics, they are not defined in a generic way, but for particular domains. Therefore, considering this gap in the DQ domain, we are currently working in this direction: we are modeling the context for DQ management and, at the same time, defining a case study that supports the context modeling through definitions of contextual DQ metrics. As future work, we plan to formalize the proposed model and apply it at the different stages of a DQ process.