PACLP: a fine-grained partition-based access control policy language for provenance

by   Faen Zhang, et al.

Although the idea of partitioning provenance graphs for access control has been proposed before, employing segments of the provenance DAG for fine-grained access control to provenance data has not been thoroughly explored. We therefore take segments of a provenance graph, based on the extended OPM and defined using a variant of regular expressions, and use them in our fine-grained access control language. The language can not only return partial graphs in answer to access requests but also use segments as restrictions in order to screen targeted data.




1 Introduction

Data provenance logs the historical operations performed on documents. Provenance can be expressed as a directed acyclic graph (DAG) that illustrates how a data artifact is processed by an execution. In a provenance DAG under the Open Provenance Model (OPM) [5], nodes represent three main entity types, Artifact, Agent and Process, and edges represent the relationships between those entities. Provenance access control is considered a crucial research topic for big data security. Files and their provenance can differ in sensitivity, and users can request, and be granted, access to files and to provenance separately. In some situations, the provenance itself contains sensitive information that requires more protection than the document it is attached to. For instance, although a programming project may be published, its authors and the operations they executed may need to be kept secret to avoid leaking techniques. Access control over the provenance data itself is therefore required: it allows eligible users to access the provenance data and protects it from unauthorised access. Privacy and security of provenance are perceived as the main bottleneck to broad application of provenance [27][28][29].

Privacy helps individuals maintain their autonomy and individuality, and security is the protection of provenance from theft and damage, as well as from disruption or misdirection. Without protection for privacy and security, users cannot be convinced to trust applications of provenance.

Although there has been great progress in provenance access control, some difficulties remain. Traditional access control techniques, including role-based access control [25] and attribute-based access control [26], cannot be applied to provenance straightforwardly, because provenance is a type of metadata with a specific and diverse data structure. There is therefore a pressing need for access control languages designed for provenance. How to define fine-grained access control policies under a proper provenance model is the main concern of this paper.

Previous access control languages for provenance DAGs have not thoroughly employed the different types of segments in provenance graphs. Although Danger et al. [11] proposed returning provenance sub-graphs in answer to access queries, their approach relies on explicitly enumerated sets of nodes. This approach is not effective when the policies cover a large number of nodes. In addition, in some scenarios the set of nodes in a collection cannot be predicted in advance. This paper therefore proposes a mechanism that defines a collection of nodes by summarizing it. For example, a string of connected nodes can be defined in a policy by nominating a starting node and an ending node. More concretely, suppose a provenance graph logs an assignment written by a student and the operations the student carries out on the assignment. The provenance also records that the student submitted the assignment to a professor, together with the professor's actions to grade it, attach comments, make revisions, and so on. Students are not allowed to gain any information about the operations carried out by the markers. However, it cannot be known in advance, for a given submission, how many operations will be involved in the marking. A professor may access the submission multiple times, and it may even be marked by more than one staff member. A policy is therefore required that can block access to the portion of the provenance from the point of submission until the assignment is returned to the student by the professor, regardless of how many processes have taken place between those points.

In this paper, we propose a Partition-based Access Control Policy Language for Provenance (PACLP) by extending existing policy languages. PACLP enables the specification of partial provenance graphs as well as transformation scope, transformation mode, and transformation labels, in order to partition a provenance graph in a fine-grained way and define how to transform the collected nodes into a new graph. Our major contributions can be summarized as follows.

  • Attributes are stored on each node in a provenance graph to support more fine-grained access control policies; they can also be used in policies to specify which nodes a policy targets.

  • We use regular expressions to define multiple ways of partitioning a provenance DAG, including single node, node type, path, and subgraph, which can be used as conditions in access control policies.

  • We propose algorithms to retrieve the policies applicable to a request and to merge the results of individual policies into a final decision. The existing provenance DAG is then transformed into a new graph that is returned to the user.

2 Related Works

Several previous works have made significant contributions to provenance access control. Ni et al. [6] proposed an initial fine-grained provenance access control language based on XACML syntax, customized to a provenance model that records operational attributes. Subsequently, several papers proposed access control languages for DAG provenance models and tried to partition provenance graphs and return partial graphs to users [20][11]. In addition, SPARQL query templates were presented to answer queries that record attributes in their provenance.

Cadenhead et al. [18] extended the provenance access control language proposed by Ni et al. [6], introducing regular expressions to protect traditional data items as well as their relationships from unauthorized users. It is an XML-based policy language with an associated grammar based on provenance graphs. To evaluate the effectiveness of their policy language, they implemented a prototype based on their architecture using Semantic Web technologies.

Danger et al. [11] proposed a method that allows policies to define subgraphs that can be transformed through three levels of abstraction, and presented an algorithm for transforming provenance graphs and generating accessible versions of queries. Although they presented the idea of returning provenance sub-graphs in answer to access queries, their approach relies on explicitly enumerated node sets. This approach is not effective when the policies contain a large number of nodes. In addition, in some scenarios the set of nodes in a collection cannot be predicted in advance.

3 Provenance Access Control

3.1 The Basics of the policy language

Figure 1: The extended version of OPM: OPM Schema

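Given the OPM $\langle T, L, G \rangle$, an OPM instance is defined by a provenance graph $G_I = \langle V_I, E_I, \tau \rangle$, where $V_I$ is a set of entities and $E_I \subseteq V_I \times V_I \times L$. $\tau : V_I \to T$ is a function mapping an entity to its type. $G_I$ is valid if for each entity $v \in V_I$, $\tau(v) \in T$, and for each edge $(v_1, v_2, l) \in E_I$, $(\tau(v_1), \tau(v_2), l) \in E$. We extend the definition of OPM given in [20] and introduce the following definitions.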

Definition 1 (Open Provenance Model (OPM)). The extended OPM records how a piece of data is derived and is defined by a triple $\langle T, L, G \rangle$:

  • T is the set of vertex types: agent (Ag), artifact (A), process (P) and attribute (Att). As shown in Fig. 1, an artifact is drawn as an oval and is an object or a piece of data, such as “comments”. A process is an operation performed on a piece of data, such as “submit” and “review”. An agent is a subject that carries out operations, such as “professor”.

  • L is the set of relationship labels: used (u), wasGeneratedBy (wgb), wasControlledBy (wcb), wasTriggeredBy (wtb), wasDerivedFrom (wdf) and hasAttributes (ha). Each edge in a provenance graph is marked with one of these labels. Labels describe the relationships between vertices.

  • G is a labelled DAG, $G = \langle V, E \rangle$, where E defines the allowable relationships between the elements: E = { (P, A, used), (A, P, wgb), (P, Ag, wcb), (A, A, wdf), (P, P, wtb), (Ag, Att, ha), (P, Att, ha), (A, Att, ha) }
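
The validity condition for an OPM instance can be illustrated with a small check. The following is a minimal sketch, not the authors' implementation: the function name `is_valid_instance`, the dictionary representation of the typing function, and the sample vertices are assumptions made for illustration, and acyclicity is not checked.

```python
# Vertex types T and relationship labels L from Definition 1.
VERTEX_TYPES = {"Ag", "A", "P", "Att"}
LABELS = {"u", "wgb", "wcb", "wtb", "wdf", "ha"}

# Allowed (source type, target type, label) triples: the schema edge set E.
ALLOWED_EDGES = {
    ("P", "A", "u"), ("A", "P", "wgb"), ("P", "Ag", "wcb"),
    ("A", "A", "wdf"), ("P", "P", "wtb"),
    ("Ag", "Att", "ha"), ("P", "Att", "ha"), ("A", "Att", "ha"),
}


def is_valid_instance(vertex_type, edges):
    """Check an instance against the schema (hypothetical helper).

    vertex_type maps each entity to its type; edges is an iterable of
    (source, target, label) triples. Acyclicity is NOT verified here.
    """
    if any(t not in VERTEX_TYPES for t in vertex_type.values()):
        return False
    for src, dst, label in edges:
        if src not in vertex_type or dst not in vertex_type:
            return False
        if (vertex_type[src], vertex_type[dst], label) not in ALLOWED_EDGES:
            return False
    return True


# Hypothetical instance: an artifact generated by a process that an agent controls.
sample_types = {"o1v1": "A", "upload1": "P", "au1": "Ag"}
sample_edges = [("o1v1", "upload1", "wgb"), ("upload1", "au1", "wcb")]
print(is_valid_instance(sample_types, sample_edges))
```

An edge such as `("au1", "o1v1", "wgb")` would be rejected, because (Ag, A, wgb) is not among the allowed triples.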

Definition 5 (Provenance Path). A path $p = \{(v_1, v_2, \ldots, v_n) \mid n \geq 2\}$, starting from $v_1$ and ending at $v_n$, is a collection of vertices that forms a line in a provenance DAG. If all the vertices in the path are connected by cause edges (CE), it is a directed provenance path, indicating that all operations along the path occur in chronological order. A general provenance path may contain effect edges (EE), so the processes recorded in a general provenance path may not occur in chronological order.

(XPath Symbols). XPath is a query language defined by the World Wide Web Consortium (W3C) for selecting nodes from an XML document. XPath can also be used to compute values (e.g., strings, numbers, or Boolean values) from the content of an XML document. Provenance paths are defined using XPath-style expressions. A directed provenance path is written as an expression that names a starting vertex and an ending vertex; it describes the provenance path that starts at the former and ends at the latter, including the vertices between them.

Figure 2: Sample Provenance Path A and B

A keyword is used to distinguish directed paths from general paths.

  • The first example matches any provenance path that starts at one named Process vertex and ends at another. Since it is a directed path, all nodes in the path are connected by CE. In Fig. 2, the example expression matches the following provenance path, given as its vertex sequence followed by its edges:

    (upload1, o1v1, replace1, o1v2, submit1): ⟨upload1, o1v1⟩, ⟨o1v1, replace1⟩, ⟨replace1, o1v2⟩, ⟨o1v2, submit1⟩.

  • The second example matches any provenance path that starts at one named Artifact vertex and ends at another. Unlike a directed path, the path may contain EE. In Fig. 2, the example expression matches the following provenance path:

    (o2v1, review1(attri), o1v3, grade1, o4v1(attri)): ⟨o2v1, review1⟩, ⟨review1, o1v3⟩, ⟨o1v3, grade1⟩, ⟨grade1, o4v1⟩.

  • The last example matches any provenance path that starts at a named Agent vertex and ends at a named Artifact vertex. In the given example graph, there are three matching paths:

    (au1, upload1, o1v1, replace1, o1v2): ⟨au1, upload1⟩, ⟨upload1, o1v1⟩, ⟨o1v1, replace1⟩, ⟨replace1, o1v2⟩;

    (au1, replace1, o1v2): ⟨au1, replace1⟩, ⟨replace1, o1v2⟩;

    (au1, submit1, o1v2): ⟨au1, submit1⟩, ⟨submit1, o1v2⟩.
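
The difference between directed and general paths can be sketched with a simple path enumeration. This is an illustration under stated assumptions rather than the paper's matching procedure: the edge orientation in `SAMPLE_EDGES` is hypothetical, a directed path is assumed to follow edges in their stored direction only, and a general path is assumed to be allowed to traverse an edge in either direction.

```python
from collections import defaultdict


def simple_paths(adjacency, start, end):
    """Yield every path from start to end that repeats no vertex."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            yield path
            continue
        for nxt in adjacency.get(node, ()):
            if nxt not in path:
                stack.append((nxt, path + [nxt]))


def directed_paths(edges, start, end):
    """Paths that follow every edge in its stored direction."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)
    return list(simple_paths(adjacency, start, end))


def general_paths(edges, start, end):
    """Paths that may traverse an edge in either direction."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)
        adjacency[dst].append(src)
    return list(simple_paths(adjacency, start, end))


# Hypothetical orientation of the vertices named in the examples above.
SAMPLE_EDGES = [
    ("upload1", "o1v1"), ("o1v1", "replace1"),
    ("replace1", "o1v2"), ("o1v2", "submit1"),
    ("au1", "upload1"), ("au1", "replace1"), ("au1", "submit1"),
]

print(directed_paths(SAMPLE_EDGES, "upload1", "submit1"))
print(general_paths(SAMPLE_EDGES, "au1", "o1v2"))
```

With this assumed orientation, the directed query from upload1 to submit1 yields the single five-vertex path, and the general query from au1 to o1v2 yields three paths, one of which (au1, submit1, o1v2) traverses an edge against its stored direction.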

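Definition 6 (Subgraph). Let $G = \langle V, E \rangle$ and $S = \langle V_S, E_S \rangle$. S is a subgraph of G if $V_S \subseteq V$ and $E_S \subseteq E$. A subgraph is a set of vertices $\{(v_1, v_2, \ldots, v_n) \mid n \geq 2\}$ in a provenance DAG. A partition P of G is a connected subgraph of G.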

Figure 3: Sample Subgraph A and B

We define subgraphs by specifying vertex expressions. A subgraph containing a given vertex is the target of a policy. The policy defines a subgraph by nominating its starting node and its ending node. A subgraph can also start or end at an endpoint of the provenance graph, regardless of what that vertex is. Hence, the PACLP language provides expressions for the terminals of a provenance DAG: one for its starting point and one for its ending point. Specifically, a subgraph of the form (/following::*) starts at the beginning vertex of the provenance graph and ends at the given vertex, and a subgraph of the form (/preceding::*) starts at the given vertex and finishes at the end vertex of the provenance graph.

Fig. 3 highlights a subgraph that starts with “upload 1” and ends with “submit 1”. It collects all the nodes between the two terminals, including vertices of type Agent. The subgraph (/following::*) is another example, lying between a given vertex and the end of the provenance DAG.
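
One way to collect a subgraph from a nominated start and end vertex, without enumerating its members in the policy, is to intersect the vertices reachable from the start with the vertices from which the end is reachable. The sketch below assumes this reading; the function names and the marking scenario are hypothetical, and attached Agent or attribute vertices are not collected here.

```python
def reachable(adjacency, source):
    """Return all vertices reachable from source, including source itself."""
    seen, stack = {source}, [source]
    while stack:
        node = stack.pop()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def subgraph_between(edges, start, end):
    """Vertices lying on some directed path from start to end."""
    forward, backward = {}, {}
    for src, dst in edges:
        forward.setdefault(src, []).append(dst)
        backward.setdefault(dst, []).append(src)
    return reachable(forward, start) & reachable(backward, end)


# Hypothetical marking scenario: the number of marking steps between
# "submit" and "return" does not need to be known by the policy author.
MARKING_EDGES = [
    ("write", "submit"), ("submit", "grade1"), ("grade1", "comment1"),
    ("comment1", "grade2"), ("grade2", "return"), ("return", "read"),
]

print(sorted(subgraph_between(MARKING_EDGES, "submit", "return")))
```

Adding further marking steps between "submit" and "return" changes the collected set but not the policy expression, which is the property the student and professor example in the introduction calls for.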

4 Partition-based Access Control Language on Provenance (PACLP)

A Partition-based Access Control Policy Language is proposed to extend the access control language of [18], enabling a policy to allow or prohibit access to partitions. First, PACLP is tailored to the extended OPM, which stores the attributes that support more fine-grained policies. Second, an XPath-style representation of provenance partitions determines which collections of nodes in the source DAG can be accessed. Access to available information can therefore be maximized, rather than hiding the entire graph to protect the parts of a provenance graph that are unavailable. Third, because provenance partitions can provide clues about data sensitivity and vulnerability, they can be employed as conditions in a policy.

Figure 4: System Model

As shown in Fig. 4, the system model consists of four parts: Administrator, Server, Users, and Database. Users send queries to the Server to access provenance stored in the Database. The Administrator generates access control policies and sends them to the Server. The Server collects policies from administrators and (optionally) data producers; when it receives a query, it generates results based on the policies and delivers them to the Database. The Database transforms the target provenance graph based on the results to hide unavailable partitions and sends it to the users. Details of the model are discussed below.

Figure 5: PACLP Schema

4.1 Language Items

Our provenance access control policy language PACLP is built on XACML syntax and consists of Target, Condition, Obligation, Effect, and Transformation items. Each item contains one or more tags. To support more fine-grained control policies, PACLP is constructed over the extended OPM, which attaches attribute sets to the main entities. In addition, provenance partitions are employed as conditions that narrow down the applicable policies and determine whether the partitions are accessible.

Transformation is an important item in a policy, indicating how the provenance graph should be transformed to hide sensitive information. It specifies how each provenance partition should be processed, by replacement or by deletion. A Transformation consists of four elements: provenance partitions, transformation scope, transformation mode, and transformation label. We first show how to define provenance partitions in a provenance access control policy and give some examples.

  • Vertices. This collects vertices of one or more types in a provenance DAG. The types can be Agent, Process, Artifact, or Attributes. For example, data owners may allow read operations but keep the executor anonymous. As shown below, the agent vertices with values of “wasGradedBy” or “Graded” are collected.

vertices () vertices ()

  • Provenance Path. This is a line of vertices in the DAG. Two types of provenance path are defined based on the directions of the edges: directed paths and general paths. In a directed path, only cause edges connect the vertices from the origin to the destination; in a general path, effect edges may exist between the two endpoints. The following directed path example runs from node “wasGradedBy” or “Graded” to node “wasSubmittedBy” or “Submit”. In a directed path, the processes from the origin to the destination are listed in chronological order, because all nodes are linked by cause edges.

directed (b+ )

  • Subgraph. A subgraph is a collection of vertices with a specified origin and/or destination, and is represented by a subgraph expression. The first subgraph example below defines all operations performed in the provenance graph in 2016. The second example represents a partition from the vertex with a given value to the end of the entire graph, covering all the operations that have occurred in the graph since 2016.

subgraphs subgraphs

Transformation scope defines the scope of the nodes used for the transformation and accepts three possible values: Original, Conjunction, and Extension. Original means that the vertices defined by the access control policy are not extended to other vertices in the given provenance graph. Conjunction indicates that the set of vertices should be merged with connected nodes that belong to the same category according to the VCD. To facilitate graph transformation, the VCD lists categories of nodes and their corresponding labels; when the graph is transformed, the label replaces the removed vertex. If the neighbor nodes of the cluster belong to the same category, the cluster expands to include those neighbor nodes. Extension returns a set of clusters in a given provenance graph: for a given cluster of vertices, all vertices in the provenance graph that belong to the same category are collected as a set of clusters, regardless of whether they are connected to the given cluster.
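
The three scope values can be sketched as set operations over a cluster. This is a hedged illustration only: the VCD is assumed here to be a plain mapping from vertex to category name, `apply_scope` is a hypothetical function, and the sample vertices and categories are invented.

```python
def apply_scope(cluster, scope, edges, category):
    """Expand a cluster of vertices according to the transformation scope.

    category is an assumed stand-in for the VCD: vertex -> category name.
    edges is an iterable of (source, target) pairs.
    """
    cluster = set(cluster)
    categories = {category[v] for v in cluster if v in category}
    if scope == "original":
        # Keep exactly the vertices named by the policy.
        return cluster
    if scope == "extension":
        # Collect every vertex of the same category, connected or not.
        return cluster | {v for v, c in category.items() if c in categories}
    if scope == "conjunction":
        # Grow the cluster through neighbours of the same category.
        neighbours = {}
        for src, dst in edges:
            neighbours.setdefault(src, set()).add(dst)
            neighbours.setdefault(dst, set()).add(src)
        frontier = list(cluster)
        while frontier:
            node = frontier.pop()
            for nxt in neighbours.get(node, ()):
                if nxt not in cluster and category.get(nxt) in categories:
                    cluster.add(nxt)
                    frontier.append(nxt)
        return cluster
    raise ValueError(f"unknown scope: {scope}")


# Hypothetical graph: grade3 shares a category with grade1 but is only
# reachable through submit1, which belongs to a different category.
SCOPE_EDGES = [("grade1", "grade2"), ("grade2", "submit1"), ("submit1", "grade3")]
SCOPE_VCD = {"grade1": "marking", "grade2": "marking",
             "grade3": "marking", "submit1": "submission"}

for scope in ("original", "conjunction", "extension"):
    print(scope, sorted(apply_scope({"grade1"}, scope, SCOPE_EDGES, SCOPE_VCD)))
```

In this example Conjunction stops at submit1 because it belongs to a different category, while Extension also picks up grade3 even though it is not adjacent to the cluster.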

Transformation mode indicates how to handle the vertices collected by provenance access control policies. The two possible modes are “replace” and “remove”. In Replace mode, when the new provenance graph is produced, the vertex cluster is replaced with a label that points to the VCD. Labels are summary terms for vertex categories. In Remove mode, the clusters of vertices specified by the access control policy are removed, and only the edges outside the cluster remain.

Transformation label specifies how edge labels are handled when a cluster of nodes is removed or replaced. With the Original dependency, the two edges beside a removed node are merged by referring to the Edge Merging Table, so that the original dependency is maintained. With the Fault dependency, the original label is removed and replaced with “wasCausedBy”, to prevent the label from giving clues about the removed vertices.

The sample Transformation item defines two provenance partitions to be transformed. The first is a subgraph that starts at Artifact vertex o3v1 and ends at Artifact vertex o8v1. As the transformation scope is “original”, the subgraph does not include any other vertices. Since the transformation mode is “replace”, the subgraph is replaced with a label. In addition, all the edges connecting to the node cluster are changed to “wasCausedBy”. The second hidden partition is a Process vertex with the value “Submit” or “wasSubmittedBy”. If adjacent nodes belong to the same category as defined by the VCD, they are included in the partition, which is deleted according to the transformation mode. The original edges are kept, or merged by referring to the Edge Merging Table.

<Transformation>
  <partition> subgraphs () </partition>
  <scope> original </scope>
  <mode> replace </mode>
  <label> false dependency </label>
  <partition> vertices () </partition>
  <scope> conjunction </scope>
  <mode> remove </mode>
  <label> original dependency </label>
</Transformation>
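
The effect of the mode and label elements on a graph can be sketched as follows. This is a simplified illustration, not Algorithm 1: the function `transform`, the placeholder node name `hidden`, and the sample edges are assumptions, and the merging of edges through the Edge Merging Table is omitted because the table is not given in the text, so Remove mode here simply drops the edges that touch the cluster.

```python
def transform(edges, cluster, mode, label_policy, cluster_label="hidden"):
    """Hide a cluster of vertices in a list of (source, target, label) edges.

    mode is "replace" or "remove"; label_policy is "false dependency" or
    "original dependency". cluster_label is a hypothetical summary label.
    """
    cluster = set(cluster)
    result = []
    for src, dst, label in edges:
        src_in, dst_in = src in cluster, dst in cluster
        if src_in and dst_in:
            continue  # edges inside the cluster are hidden in both modes
        if not src_in and not dst_in:
            result.append((src, dst, label))  # untouched part of the graph
            continue
        if mode == "remove":
            continue  # boundary edges disappear (edge merging omitted)
        # Replace mode: redirect the boundary edge to the summary label.
        new_src = cluster_label if src_in else src
        new_dst = cluster_label if dst_in else dst
        new_label = "wasCausedBy" if label_policy == "false dependency" else label
        edge = (new_src, new_dst, new_label)
        if edge not in result:
            result.append(edge)
    return result


# Hypothetical fragment around the o3v1 ... o8v1 subgraph of the sample policy.
FRAGMENT = [
    ("o9v1", "o8v1", "wdf"), ("o8v1", "p5", "wgb"),
    ("p5", "o3v1", "u"), ("o3v1", "p1", "wgb"), ("p1", "au1", "wcb"),
]
HIDDEN = {"o8v1", "p5", "o3v1"}

print(transform(FRAGMENT, HIDDEN, "replace", "false dependency"))
print(transform(FRAGMENT, HIDDEN, "remove", "original dependency"))
```

In the replace case the two boundary edges are redirected to the placeholder and relabelled “wasCausedBy”, so the original labels give no hint about what was hidden; in the remove case only the edge between p1 and au1 survives.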

This paper proposes Algorithm 1 to generate a transformed view of provenance based on PACLP. The input to Algorithm 1 is the result of the access control policy and a target provenance graph, and the output is a transformed graph returned to the requester. The algorithm arranges the set of vertices specified by the policy into a cluster. The goal of Algorithm 2 is to retrieve all the policies in the system that are applicable to a request. The details of the two algorithms are given in the appendix.
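
Policy retrieval and result merging can be pictured with the following sketch. It does not reproduce Algorithm 2, whose details are in the appendix: the `Policy` fields, the matching rule (same subject and overlapping partition), and the deny-overrides combination are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical, simplified policy: who, which partition, which effect."""
    subject: str
    partition: frozenset
    effect: str  # "permit" or "deny"


def applicable_policies(policies, subject, requested):
    """Keep policies for this subject whose partition overlaps the request."""
    requested = frozenset(requested)
    return [p for p in policies
            if p.subject == subject and p.partition & requested]


def visible_vertices(policies, subject, requested):
    """Merge individual results; a deny overrides a permit (assumed rule)."""
    requested = frozenset(requested)
    permitted, denied = set(), set()
    for policy in applicable_policies(policies, subject, requested):
        target = permitted if policy.effect == "permit" else denied
        target.update(policy.partition & requested)
    # Vertices that no policy permits stay hidden by default (assumption).
    return permitted - denied


SAMPLE_POLICIES = [
    Policy("student", frozenset({"write", "submit", "grade1", "return"}), "permit"),
    Policy("student", frozenset({"grade1"}), "deny"),
    Policy("professor", frozenset({"grade1"}), "permit"),
]

print(sorted(visible_vertices(SAMPLE_POLICIES, "student",
                              {"write", "submit", "grade1"})))
```

The resulting vertex set would then be handed to the graph transformation step, which hides everything outside it.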

4.2 Evaluation

To evaluate the performance of PACLP, we designed an experiment that simulates policy generation, run on a virtual machine with 16 GB of memory and a 3.40 GHz CPU. We generated 20 sample provenance graphs and 200 policy conditions and tags, and selected 300 random provenance partitions from the sample provenance graphs. We counted the number of provenance partitions that can be expressed under PACLP and under LPAC. Because the provenance model stores entity attributes, PACLP can express more of the random sample provenance partitions than the other policy language. The figures show that PACLP is well suited to describing more complicated provenance partitions with more nodes. We then implemented the process of merging the results of various policies. The three scenarios selected are (1) all 20 sample policies have the effect Absolute Permit, (2) all 20 sample policies have the effect Deny, and (3) the 20 sample policies have randomly selected effects. We measured the time taken to simulate the process.

Figure 6: left: Comparison of the expressive ability of PACLP and LPAC. right: Time taken to combine policy results for different numbers of policies.

5 Conclusion

In this paper, we propose PACLP, a fine-grained provenance access control language that extends existing languages and is built on an extended OPM that stores attribute sets. Various types of partitions are defined using regular expressions. Our provenance access control language defines which partial graphs can be accessed or denied under given conditions and restrictions. This fine-grained access control policy model not only hides all sensitive vertices and edges in the provenance graph, but also maximizes the amount of qualifying information that remains accessible.


  • [1] Altintas I, Barney O, Jaeger-Frank E. Provenance collection support in the kepler scientific workflow system[C]//International Provenance and Annotation Workshop. Springer, Berlin, Heidelberg, 2006: 118-132.
  • [2] Ma X, Fox P, Tilmes C, et al. Capturing provenance of global change information[J]. Nature Climate Change, 2014, 4(6): 409.
  • [3] Chirigati F, Shasha D, Freire J. Reprozip: Using provenance to support computational reproducibility[C]//Presented as part of the 5th USENIX Workshop on the Theory and Practice of Provenance. 2013.
  • [4] Ramchurn S D, Huynh T D, Wu F, et al. A disaster response system based on human-agent collectives[J]. Journal of Artificial Intelligence Research, 2016, 57: 661-708.

  • [5] Moreau L, Clifford B, Freire J, et al. The open provenance model core specification (v1. 1)[J]. Future generation computer systems, 2011, 27(6): 743-756.
  • [6] Ni Q, Xu S, Bertino E, et al. An access control language for a general provenance model[C]//Workshop on Secure Data Management. Springer, Berlin, Heidelberg, 2009: 68-88.
  • [7] Danger R, Joy R C, Darlington J, et al. Access control for OPM provenance graphs[C]//International Provenance and Annotation Workshop. Springer, Berlin, Heidelberg, 2012: 233-235.
  • [8] Mannan M, van Oorschot P C. 3rd USENIX Workshop on Hot Topics in Security (HotSec '08)[J]. 2008.
  • [9] Moreau L, Plale B, Miles S, et al. The open provenance model (v1. 01)[J]. Technical Report 16148, Electronics and Computer Science, 2008.
  • [10] Nguyen D, Park J, Sandhu R. A provenance-based access control model for dynamic separation of duties[C]//2013 Eleventh Annual Conference on Privacy, Security and Trust. IEEE, 2013: 247-256.
  • [11] Danger R, Curcin V, Missier P, et al. Access control and view generation for provenance graphs[J]. Future Generation Computer Systems, 2015, 49: 8-27.
  • [12] Chen L, Edwards P, Nelson J D, et al. An access control model for protecting provenance graphs[C]//2015 13th Annual Conference on Privacy, Security and Trust (PST). IEEE, 2015: 125-132.
  • [13] Crampton J, Morisset C. PTaCL: A language for attribute-based access control in open systems[C]//International Conference on Principles of Security and Trust. Springer, Berlin, Heidelberg, 2012: 390-409.
  • [14] Benjelloun O, Das Sarma A, Halevy A, et al. Databases with uncertainty and lineage[J]. The VLDB Journal: The International Journal on Very Large Data Bases, 2008, 17(2): 243-264.
  • [15] Buneman P, Chapman A, Cheney J. Provenance management in curated databases[C]//Proceedings of the 2006 ACM SIGMOD international conference on Management of data. ACM, 2006: 539-550.
  • [16] Anand M K, Bowers S, McPhillips T, et al. Efficient provenance storage over nested data collections[C]//Proceedings of the 12th International Conference on Extending Database Technology: Advances in Database Technology. ACM, 2009: 958-969.
  • [17] Heinis T, Alonso G. Efficient lineage tracking for scientific workflows[C]//Proceedings of the 2008 ACM SIGMOD international conference on Management of data. ACM, 2008: 1007-1018.
  • [18] Cadenhead T, Khadilkar V, Kantarcioglu M, et al. A language for provenance access control[C]//Proceedings of the first ACM conference on Data and application security and privacy. ACM, 2011: 133-144.
  • [19] Moreau L. The foundations for provenance on the web[J]. Foundations and Trends in Web Science, 2010, 2(2-3): 99-241.
  • [20] Chen L, Edwards P, Nelson J D, et al. An access control model for protecting provenance graphs[C]//2015 13th Annual Conference on Privacy, Security and Trust (PST). IEEE, 2015: 125-132.
  • [21] Papakonstantinou V, Michou M, Fundulaki I, et al. Access control for RDF graphs using abstract models[C]//Proceedings of the 17th ACM symposium on Access Control Models and Technologies. ACM, 2012: 103-112.
  • [22] Kuwabara K, Yasunaga S. Use of metadata for access control and version management in RDF database[C]//International Conference on Knowledge-Based and Intelligent Information and Engineering Systems. Springer, Berlin, Heidelberg, 2011: 326-336.
  • [23] Khaled A, Husain M F, Khan L, et al. A token-based access control system for RDF data in the clouds[C]//2010 IEEE Second International Conference on Cloud Computing Technology and Science. IEEE, 2010: 104-111.
  • [24] Goyal V, Pandey O, Sahai A, et al. Attribute-based encryption for fine-grained access control of encrypted data[C]//Proceedings of the 13th ACM conference on Computer and communications security. ACM, 2006: 89-98.
  • [25] R.S. Sandhu, P.Samarati Access Control: Principles and Practice[C]// IEEE Communications Magazine(Sept.), 1994: 40-48
  • [26] Lawrence Kerr, Jim Alves-Foss Combining Mandatory and Attribute-Based Access Control[C]//49th Hawaii International Conference on System Sciences, HICSS 2016: 2616-2623
  • [27] Patrick D. McDaniel Data Provenance and Security[C]//IEEE Security & Privacy. IEEE, 2011: 83-85.
  • [28] Fahima Amin Bhuyan, Shiyong Lu, Robert G. Reynolds, Ishtiaq Ahmed, Jia Zhang Quality Analysis for Scientific Workflow Provenance Access Control Policies[C]//2018 IEEE International Conference on Services Computing, 2018: 261-264.
  • [29] Lorena González-Manzano, Mark Slaymaker, José María de Fuentes, Dimitris Vayenas SoNeUCONPro: An Access Control Model for Social Networks with Translucent User Provenance[C]// Security and Privacy in Communication Networks - SecureComm 2017 International Workshops, 2017: 234-252