The usage of cloud computing can bring many advantages to its users, e.g. flexible, pay-as-you-go storage solutions. Yet, when storing data in the cloud, sensitive and personal data needs to be tracked, e.g. regarding its replications and geolocations. Since the General Data Protection Regulation (GDPR) became enforceable in the European Union, processing and storing personal data in the cloud has become a liability for cloud service providers.
Keeping track of data flows in cloud systems, and enforcing data flow policies, is an essential requirement for practical reasons as well as for legal and compliance reasons, e.g. when personal data is processed and stored. First, the GDPR establishes various requirements for the processing of personal data. It requires data controllers to limit the storage of personal data (Art. 5) and to be able to demonstrate how they process personal data. Furthermore, it requires data controllers to provide certain rights to data subjects, such as the right to access, rectify and delete their personal data. Regarding the storage geolocation of personal data, it restricts the transfer of personal data to countries outside of the EU (Art. 44). The EU ePrivacy Regulation, the upcoming certifications based on the EU Cybersecurity Act, and other regulations establish comparable standards.
Also, practical requirements play a role for data controllers. For instance, tracking data flows can support data loss prevention, and data flow policy enforcement can ensure that personal data and business secrets cannot leave a defined trust boundary, preventing data breaches.
Cloud providers offer various services to monitor resources in the cloud, as well as to manage their life-cycle. For example, Amazon Web Services (AWS) allows users to configure life-cycle rules for data objects in S3 buckets. Microsoft Azure allows users to configure data replication for a storage account to another region for failover safety. Yet, these services are mostly designed for data redundancy and configuration monitoring rather than for tracking data objects with the purpose of fulfilling GDPR requirements and other compliance and practical requirements.
In this paper we demonstrate how today’s cloud providers fall short when it comes to tracking data flows and enforcing data flow policies, and propose a label-based tracker system to overcome these shortcomings. Other approaches have proposed, e.g., to tag data at a lower level of the software stack. In contrast, we propose to label data objects on the cloud’s control plane. We also propose a possible integration of the proposed tracker into cloud infrastructures, building on existing permission systems.
The paper’s contribution is threefold:
we present various data tracking mechanisms implemented in AWS and compare them to other cloud providers,
we propose a label-based data tracking mechanism and a data flow-sensitive cloud architecture that can be easily deployed with different cloud providers,
we present a prototypical implementation of this architecture using the Istio service mesh in Kubernetes and evaluate it in different scenarios.
The remainder of the paper is structured as follows. Section 2 describes our approach to solving specific tracking problems in cloud architectures. In Section 3, we generalize from these specific solutions to a general policy enforcement architecture. Section 4 supports this architecture with a prototypical implementation. In Section 5, we conclude our experimental work with some final remarks on solving data tracking problems in cloud environments. Section 6 describes related work, and the paper is concluded in Section 7.
2 Approach: Tracking Data in the Cloud
In this section, we present our novel approach for tracking data flows in cloud environments and, in particular, we refer to some AWS examples and also discuss the differences to Azure and Google Cloud Platform (GCP). We also show that comparable mechanisms, offered by cloud providers, are not sufficient to keep track of data flows and that instead, cloud users need to implement custom tracking solutions that are costly and complex.
Consider the following example scenario which we use throughout the paper as a running example. Company CloudFlow offers job applicants an online portal for uploading CVs and other documents. CloudFlow uses AWS to store the files in an S3 bucket. The files are also automatically replicated to several other buckets for backup purposes. The source files and their replications, however, are only allowed to be stored in the EU. Depending on whether an applicant is rejected or accepted, her CV must be deleted after six months or can be retained. If a CV needs to be deleted, not only the source file but also all replications need to be deleted. Yet, it can be a tedious and error-prone task to search through all existing replications of a single file.
Hence, a tracking system is needed which, however, is not offered by available cloud monitoring systems. From a high-level perspective, a tracking system should be able to retrieve information about data objects from a cloud service API, examine object properties (i.e. metadata or custom-defined labels) and validate them against an expected policy.
Figure 1 illustrates an overview of the proposed tracking system of which the main components are the Tracker and the Tracking Policies.
A tracking policy defines labels to be associated with objects and corresponding life-cycle rules. The policies are stored as a separate file. Life-cycle rules can define a retention period for a label or other metadata, as well as a deletion time. For instance, a rule may state that an object that has been labeled as [invoice] must be deleted after one year. At the same time, it may state that data objects carrying this label must be retained for at least six months.
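As an illustration, a parsed tracking policy could take the following shape in code; the label names and day counts are assumptions for the running example, not a format prescribed by any cloud provider:

```python
# Illustrative tracking policy, as it could look after parsing the
# policy file: each label maps to its life-cycle rule. The label names
# and periods are assumptions for the running example.
TRACKING_POLICY = {
    "invoice": {"retain_days_min": 180, "delete_after_days": 365},
    "CV":      {"retain_days_min": 0,   "delete_after_days": 180},
}

def rules_for_labels(labels):
    """Return the life-cycle rules that apply to an object's labels.

    Labels without a configured rule are simply not validated.
    """
    return {label: TRACKING_POLICY[label]
            for label in labels if label in TRACKING_POLICY}
```

A tracker would look up the rules for each discovered object's labels and validate the object against them.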
The Tracker first discovers existing data objects, i.e. existing storage entities and data objects are retrieved using the cloud APIs and their titles are parsed for attached labels. Then, the labels of the discovered items are checked against the policy rules. The Tracker verifies whether a data object still complies with its life-cycle policy or violates it.
The terms metadata and label refer to data that is associated with a data object or other resources. In the remainder of this paper, we use metadata to refer to already available information, as offered by AWS or other cloud providers, and labels to refer to custom-defined metadata which is necessary to enable some of our tracking mechanisms. The motivation for having custom-defined labels also stems from the fact that the number of tags that can be assigned to resources natively in AWS and Azure is limited.
For example, Amazon S3 bucket metadata reveals information about the bucket’s logging and permission configurations, the last modification date, and whether it is a replication object or not, indicated by the REPLICA flag. AWS allows users to replicate data objects or whole S3 buckets, e.g. for data recovery purposes. When a replication is specified for a bucket, every data object that is uploaded to the bucket is replicated to the specified replication bucket. Yet, users are not notified when a replication bucket is deleted. Also, this mechanism does not delete a replication bucket when its source bucket, i.e. the bucket holding the original data objects, is deleted. This leaves the possibility that a replication bucket containing sensitive data is forgotten.
The reason that the deletion of replication buckets is not enforced in S3 is that data objects in a replication bucket may stem from more than one source bucket. Still, such orphan data objects are not tracked reliably, which is why we developed a tracking mechanism that can identify such replication data objects.
In the reported scenario, desirable tracking requirements may be the following:
Replicated data objects shall not exist without a source object.
Replicated data objects shall not reside in a different geographic area than their source objects.
It shall be possible to validate a data object against a custom life cycle rule, for example to check if the object should be deleted.
In the next subsections, we describe in detail how we designed different tracking functionalities which aim at fulfilling these requirements.
2.1 Bucket and object replication tracking
This tracking functionality aims at preventing orphan replication objects. We use the terms replication object and replication bucket to describe replicated data objects and replicated buckets, respectively. Replications in S3 are created according to replication rules which specify which data objects shall be replicated to which replication bucket. As stated above, source buckets can be deleted without impacting their replications; still, deletion rules may apply to these replications, e.g., for compliance reasons. This check tracks replication objects in two ways: first, it verifies that for every replication bucket that is found, a source bucket still exists; and second, it verifies that for every replication object, its source object still exists.
To do this, it first retrieves a list of all S3 buckets and iterates through their data objects. Then, it checks whether a data object has the replication flag set to REPLICA in its metadata. If this is the case, two separate checks are performed. First, it is verified that a bucket exists with a replication rule pointing at the bucket the replication object is contained in. While this does not guarantee that every replication object’s source object still exists—only that some source bucket is replicated to this replication bucket—it can still identify orphan replication buckets. Second, it is verified that one of the buckets whose data objects are replicated to the replication bucket contains an object with the same object key, i.e. the object’s path and name uniquely identifying it within the bucket, as the replication object.
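The two verification steps can be sketched as follows, operating on a simplified in-memory view of the bucket listings as they could be retrieved via the S3 API; the data layout is our assumption for illustration:

```python
def find_orphan_replicas(buckets):
    """Identify orphan replication buckets and orphan replication objects.

    buckets: dict mapping bucket name -> {
        "replication_targets": [names of replication buckets],
        "objects": {object_key: {"replica": bool}},
    }
    (a simplified stand-in for the listings returned by the S3 API).

    Returns (orphan_buckets, orphan_objects): replication buckets with no
    remaining source bucket, and replica objects whose source object is gone.
    """
    orphan_buckets, orphan_objects = set(), []
    for name, bucket in buckets.items():
        replica_keys = [k for k, o in bucket["objects"].items()
                        if o.get("replica")]
        if not replica_keys:
            continue  # not a replication bucket
        # Check 1: is there still any source bucket replicating here?
        sources = [s for s, b in buckets.items()
                   if name in b.get("replication_targets", [])]
        if not sources:
            orphan_buckets.add(name)
        # Check 2: does the same object key still exist in a source bucket?
        for key in replica_keys:
            if not any(key in buckets[s]["objects"] for s in sources):
                orphan_objects.append((name, key))
    return orphan_buckets, orphan_objects
```

In a real deployment, the input dictionary would be built from paginated bucket and object listings rather than held in memory at once.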
A replication bucket can only be identified by iterating through its data objects and checking if there is at least one object that has the REPLICA state set in its metadata attributes, or by identifying a replication rule that specifies it as a replication bucket. Replication rules, however, are deleted as well when a source bucket is deleted, which is why it is necessary to check for replications on the object level, too.
2.2 Bucket replication geolocation tracking
This check verifies that every replication object resides in the same region group as its source object and aims at preventing the replication to disallowed geolocations.
We define a region group as a set of regions that AWS denotes with the same two starting letters, e.g., EU for Europe and US for the United States. Although the goal is to verify this for every data object, we verify the objects’ geolocation through the bucket they are saved in, since the information about objects’ geolocation cannot be extracted from objects but only from buckets (contrary to the replication state that is saved as metadata).
The check first retrieves a list of S3 buckets and checks for each bucket whether replication rules are specified in the tracker policies. If this is the case for a bucket, meaning that some or all of its contents are replicated to a different bucket, the specified replication bucket’s region group is compared to the source bucket’s region group.
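Since AWS region names such as eu-west-1 begin with the two-letter group prefix referred to above, the region-group comparison reduces to a simple prefix check; the helper names below are our own:

```python
def region_group(region: str) -> str:
    """Extract the region group from an AWS region name.

    AWS region names start with a geographic prefix separated by a
    hyphen, e.g. 'eu-west-1' or 'us-east-1'.
    """
    return region.split("-")[0].upper()

def replication_in_same_group(source_region: str, replica_region: str) -> bool:
    """Check that a replication stays within its source's region group."""
    return region_group(source_region) == region_group(replica_region)
```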
Notice that AWS S3 also offers a feature called Same-Region Replication that ensures that replications can only be created in the same region as their source objects. However, for failover reasons, it may still be important for cloud users to create replications that do not reside in the same region as the source objects, but still reside in the same region group. For GDPR compliance, for example, it may be relevant to keep the data in the EU group but still it may be reasonable to distribute replications across different regions within the EU.
2.3 Tracking data objects’ validity with custom labels
This check validates a data object against a custom tracking policy. The validation flow is similar to the previous examples, but, in order to manage a large number of data objects and specify different custom-defined rules, it is helpful to use custom-defined labels to express the semantics of different data attributes. This way, semantics are made visible on the cloud control plane where tracking or permission policies may be applied.
In the following, we first describe this mechanism and then we present a semi-automated way to manage and associate labels.
In particular, it works in three steps. First, a life-cycle policy is defined which specifies labels and corresponding life-cycle rules. Life-cycle rules can define a retention period for a label, as well as a deletion time. For instance, a rule may indicate that an object that has been associated with the label [CV] must be deleted after one year or that data objects carrying this label must be retained for at least six months.
Then, the existing data objects are discovered, i.e. existing storage entities and data objects are retrieved using the AWS APIs and their titles are parsed for attached labels.
In the last step, the labels of the discovered items are checked against the life-cycle configuration their labels correspond to in the custom policy. The result indicates whether a data object still complies with its life-cycle policy or violates it.
In the case that the data object which is checked is a replication object, the same configuration as for its source object is applied.
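A minimal sketch of the validation step follows; the label names and day counts are illustrative assumptions, and retention checks (which would work analogously) are omitted for brevity:

```python
from datetime import date

# Illustrative life-cycle rules per label; values are assumptions for
# the running example, not part of any cloud provider's API.
LIFECYCLE = {
    "CV": {"retain_days_min": 0, "delete_after_days": 180},
}

def validate(labels, created: date, today: date):
    """Return life-cycle violations for a data object.

    Replication objects would be validated with the same rules as
    their source object, as described above.
    """
    violations = []
    age_days = (today - created).days
    for label in labels:
        rule = LIFECYCLE.get(label)
        if rule is None:
            continue  # unclassified labels are not validated
        if age_days > rule["delete_after_days"]:
            violations.append((label, "should have been deleted"))
    return violations
```

Note that, as discussed in Section 2.4, S3 objects only expose a last-modified attribute, so in practice the creation date would have to be tracked by continuously monitoring uploads.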
Again considering the running example, and assuming that legal regulations demand the deletion of CVs of rejected applicants within six months, the respective S3 buckets have to be examined regularly to ensure that CVs and their replications are deleted within their expiration time. Associating a common label with objects to indicate whether a file is a CV makes it possible to build such automatic mechanisms, but doing so manually could be tedious and error-prone. Since there may be a very large number of objects to apply new labels to, we also propose a semi-automated mechanism which adds labels to objects based on keywords that are found in the object’s title. For example, the labeling mechanism can be configured to add the label [personal] to all data objects that contain the keywords CV or invoice in their titles. The mechanism and the semantics of custom labels are important to understand the proposed tracking mechanism, and are explained in the next subsection.
2.3.1 A semi-automated labeling mechanism
The labeling mechanism first lists all data objects using the respective APIs, then iterates through the data objects’ titles, scans them for pre-defined keywords, and attaches matches to the title in the form of a label. In case no keyword is found, the user is prompted to assign a label manually.
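The keyword scan itself can be sketched as follows; the keyword-to-label mapping is an illustrative assumption (cf. the [personal] example above):

```python
# Illustrative keyword-to-label mapping; in our system this mapping is
# defined in an Xtext schema rather than hard-coded.
KEYWORD_LABELS = {
    "cv": "personal",
    "invoice": "invoice",
}

def label_for_title(title):
    """Scan a data object's title for known keywords.

    Returns the label for the first matching keyword, or None to
    signal that the user must assign a label manually.
    """
    lowered = title.lower()
    for keyword, label in KEYWORD_LABELS.items():
        if keyword in lowered:
            return label
    return None
```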
This mechanism is implemented using the Xtext framework (https://www.eclipse.org/Xtext/) which supports the development of custom programming languages. Using Xtext, we defined a schema for the assignment of labels to data objects. When applying this schema, custom labels are added automatically to data objects’ titles to indicate, e.g., their level of criticality or person-relation.
Listing 1 shows an example schema that defines how labels can be assigned to data objects. It is based on the running example where the company CloudFlow handles three types of data: job applications, customer data and public data. The schema first defines the values (AssetValues) that can be assigned to a data object, e.g. low or personal to indicate the level of criticality or person-relation. Furthermore, types (AssetTypes) are defined that data objects may pertain to, i.e. the already mentioned types of data Application, Customer and Public. Next, keywords (Keywords) are listed that may be contained in the data objects’ titles. Hereafter, the schema defines rules which indicate which labels are associated with a file in case a specific keyword is found in its title. More specifically, the rules define which values (ValueIfKeyWord) and types (TypeIfKeyword) are assigned to a data object. In the example schema, when the keyword order is found in the title, the labels high and Customer are associated with the file. Note that all values, types and rules are completely custom-defined; different categories of labels can easily be defined and attached by simply configuring the schema. In this example, the label personal is assigned to any data object that contains the keyword CV in its title.
The labels assigned to data objects by such a mechanism can then be leveraged by data tracking mechanisms, e.g. to track the location of data objects more easily.
2.4 Discussion of available cloud providers’ offerings
Since the previous examples were written for AWS, in this section we discuss Microsoft’s and Google’s solutions for tracking objects and how the proposed approach could be applied in Microsoft Azure and the Google Cloud Platform (GCP), too.
2.4.1 Storage and data object replication tracking
For Azure storage accounts, replication is implemented by default. Users can only select the corresponding strategy, e.g., locally redundant storage, which is stored in the same data center as the source storage. Hence, the redundancy and its deletion are managed by Azure. Azure offers a further feature, called geo-replication, that allows users to choose a specific region to which the redundant storage shall be replicated (at the time of writing, this feature is still in preview).
GCP’s Cloud Storage also offers similar features, like regional or multi-region redundancy. Replications of storage accounts and their locations are therefore transparent to the user. If, however, more control over data replications is required, e.g. to choose the replication region, the replication and tracking mechanisms would have to be implemented manually with custom labels and external tracking systems.
2.4.2 Tracking custom retention periods
Azure allows users to configure life-cycle rules for blob containers, e.g. with the effect of moving data to an archive storage or deleting the blob (note that these life-cycle rules are different from the ones we use in the policies for our tracker system, where they define retention periods). Each option can be configured to be performed a certain number of days after the last modification and can be applied to the whole container or certain folders inside the container. Similar functionality is offered by GCP, where life-cycle rules can be configured to delete objects or change a bucket’s storage tier.
This functionality mainly aims at reducing costs for service users, allowing them to move data to a cheaper storage tier or to delete it when it is no longer needed. As in AWS S3, implementing custom retention periods is not possible. Only GCP offers the possibility to specify retention periods, e.g. object holds that prevent an object’s deletion for a certain time.
In conclusion, our experiments show that AWS and Azure as well as GCP provide mechanisms for data replication for their storage services—and partly even for retention policies—but fall short when one wants to track data objects throughout their life-cycle. In AWS S3, especially monitoring custom data retention periods is complex since data objects only provide a last-modified attribute, but not a creation date, such that buckets have to be monitored continuously for new data uploads to track their creation time. Also in Azure Blob storage, there is no efficient way of tracking replications if functionality is required that goes beyond what Azure provides by default, like distributing replications to different regions.
While services like AWS Config and Azure Policy can be used as monitoring tools (see section 6.2), they are designed for resource configuration monitoring rather than for data flow tracking. Implementing data tracking mechanisms nowadays is not straightforward as it requires complex customized architectures which can come at a considerable cost. In the next section, we show how the proposed tracking system can be generalized and deployed in different scenarios, including an integration into existing cloud infrastructures.
3 High-Level Data-Flow Tracking Architecture
In this section, we present a data-flow sensitive architecture for cloud systems which generalizes the label-based tracking mechanisms we proposed in the last section, together with different deployment models.
3.1 Generalized architecture
As outlined in the previous section, the tracker system offers different functionalities and components: a discovery functionality that retrieves a list of object metadata, a validation functionality that compares an object’s configuration to an expected configuration, and a policy which determines the tracking rules to be enforced.
In order to enforce tracking policies, we propose a generalized architecture for these functionalities which is inspired by modern policy enforcement architectures, such as the OASIS standard XACML for attribute-based authorization decisions, which consist of an enforcement point and a decision point.
In Figure 2, the proposed architecture is illustrated. The Tracker System is triggered by a data flow that may originate from a user of the cloud system (1). The Tracker API then makes a request to the Validator (2). The Validator in turn retrieves applicable policies (3) which specify tracking life-cycle rules. The Validator can request the object’s attribute values (metadata and labels) from the Discovery component (4) which retrieves those values from the respective cloud APIs (5).
Mapping this to the XACML framework, the Tracker API corresponds to a Policy Enforcement Point (PEP), the Validator corresponds to a Policy Decision Point (PDP), and the Discovery corresponds to a Policy Information Point (PIP), as also indicated in Figure 2.
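The interplay of the three components (with their XACML counterparts) can be sketched as follows; class and method names are our own illustration, not mandated by XACML or any cloud SDK:

```python
class Discovery:
    """PIP: retrieves an object's attribute values (labels)."""
    def __init__(self, cloud_metadata):
        # cloud_metadata stands in for calls to the respective cloud APIs.
        self.cloud_metadata = cloud_metadata

    def labels_of(self, obj):
        return self.cloud_metadata.get(obj, {}).get("labels", [])

class Validator:
    """PDP: decides whether a data flow complies with the policies."""
    def __init__(self, policies, discovery):
        # policies: label -> list of workloads allowed to access it.
        self.policies, self.discovery = policies, discovery

    def decide(self, workload, obj):
        labels = self.discovery.labels_of(obj)
        # Unlabeled objects, and labels without a policy, are allowed
        # (an assumption for this sketch).
        return all(workload in self.policies.get(label, [workload])
                   for label in labels)

class TrackerAPI:
    """PEP: enforces the decision on the incoming request."""
    def __init__(self, validator):
        self.validator = validator

    def request(self, workload, obj):
        return "allow" if self.validator.decide(workload, obj) else "deny"
```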
Generalizing the tracker architecture in this way allows it to be extended in a modular fashion. For example, policies can be added without impacting other components, and enforcement points may be triggered by different proxies and gateways provided by current technologies on their cloud control planes.
This policy enforcement approach allows to manage flows of data objects with similar properties on the control plane, rather than tracking and enforcing data flows on a lower layer of the software stack or using other mechanisms. A service provider could, for instance, decide to manage data flows by isolating resources using firewalls and resource groups. Yet, when changes need to be made, it becomes error-prone to change all corresponding firewall rules and group assignments.
Having the cloud provider implement an attribute-based access control (ABAC) system and expose labels on the control plane gives cloud users more flexibility in managing access control and simplifies the protection of data flows. Furthermore, managing data flows based on labels complements role-based access control (RBAC), since this approach allows separating the duties of developers from those of roles responsible for compliance, who do not need to read and understand code but only need to assign labels to resources and define rules for their circulation.
Note that for our proposed tracking system and architecture we do not impose a syntactic structure on the labels since cloud users may need to fit the labels’ syntax to their specific requirements.
3.2 Deployment models
Given the generalized architecture, various deployment models are feasible considering different use cases for cloud users and cloud providers. A cloud user could set up her own local deployment, not integrated in the cloud system, which queries external cloud APIs, similar to the high-level setup described in Section 2.
Considering the cloud provider, we identified two possible scenarios to deploy the proposed architecture. First, a cloud provider may offer the tracking system as-a-service to complement already available services in Azure and AWS (as discussed in Section 6.2).
Second, this architecture can be integrated into the existing infrastructure that cloud providers already implement, mainly as API gateways that can be used for security enforcement of, e.g., identity and access management (IAM) policies or RBAC authorization mechanisms.
In Azure, for example, this authorization architecture is implemented with the Azure Resource Manager (ARM). The ARM acts as a gateway API that receives requests, e.g. for creating a resource, from users and authenticates and authorizes them across services and regions.
Figure 3 illustrates this high-level architecture. It shows how clients using different kinds of mechanisms to access the cloud system, such as a dashboard interface, a CLI or an SDK, make requests to the central API gateway. This gateway performs authentication and authorization using role-based access control (RBAC) and is extensible. The label-based tracking mechanism may also be enforced here. This existing infrastructure authorizing user attributes can therefore be extended to track data attributes, defined in labels, as we will show in the next section.
4 Prototype Implementation
In this section, we describe our prototype implementation of the generalized tracking architecture. We first briefly describe the implementation of the tracker system as a local client deployed by a cloud user, and its deployment in the cloud. We then describe and evaluate an implementation of the integrated deployment model as described in Section 3.2 which can not only track data flows but also prevent non-compliant data flows.
For our prototype implementation we chose the integrated deployment model and set up a Kubernetes (https://kubernetes.io) cluster on virtual machines hosted in Microsoft Azure to imitate a cloud environment with multiple services which can communicate with each other, i.e. a service mesh. We base our implementation on existing mechanisms to track and enforce policies in service meshes, such as Istio (https://istio.io). Istio introduces several components into a cluster, including one for policy controls which is called Mixer. As depicted in Figure 4, the Mixer authorizes every message flow and can be used to deduce health metrics, track data flows and enforce data flow policies. However, out of the box, Istio is only able to make authorization decisions based on the metadata of the request, e.g., based on the HTTP headers the message contains or based on the attributes of source and destination on a service level. Since our approach envisions data flow tracking and policy enforcement based on the data’s semantics (represented by labels), we implemented an extension to the Mixer, which is called an Adapter.
While Istio itself claims to be platform-independent and several components, such as the service proxy Envoy (https://www.envoyproxy.io), can also be used standalone, there are some benefits to the deployment into Kubernetes. For one, it allows the automatic injection of so-called sidecars into Kubernetes Pods which act as proxies for the pods. They manage in- and outbound traffic and can relay network connections to the Mixer before they are forwarded to their intended destination. As such, they can be seen as a PEP.
Furthermore, the Mixer forwards traffic coming from the sidecar proxies according to its configuration to Adapters which process the messages. Figure 4 illustrates this showing a simplified architecture underlying the communication between services, their respective sidecar proxies and the central Mixer component. It also indicates the mapping of the Istio components to the tracker system described in Section 3: The service sidecars act as the Tracker API since they enforce an authorization request to the Mixer. Furthermore, the Mixer and Adapter correspond to the Validator since they make the decision about authorizing or rejecting a request. In our implementation, the Adapter additionally acts as the Discovery component since it also retrieves a requested file’s labels and makes the authorization decision on this basis.
4.1 Label-Based Authorization Adapter
Using the architecture as described above, we designed a label-based tracking and policy enforcement mechanism which can not only track data objects using a monitoring approach, but also track and authorize requests as they occur. It is implemented as an out-of-process Istio Adapter written in Go supporting the Istio authorization template. The Adapter can be configured with a list of workloads, or services, which are allowed to access data resources with a particular label.
Figure 4 shows an example authorization flow through this architecture: When the client service initiates a request to retrieve a file from the object storage service (OSS), it is relayed by the client’s proxy to the OSS proxy (1). The OSS proxy makes an authorization request to the Mixer (2) which in turn forwards the request to the Adapter. The Adapter parses the requested file name and inspects its labels by querying the introspection endpoint of the OSS (3). The introspection endpoint follows a simple URL pattern of /inspect/<resource> and returns the labels found at the <resource> using the HTTP header X-Data-Labels (4). The Adapter then compares the labels with its policy configuration which specifies the allowed workloads for certain labels. If the incoming workload is not among the list of allowed ones, the request is denied. If a resource does not carry any label, i.e. it is not classified, the request is automatically accepted. The decision is reported back to the proxy (5), and in case the request has been authorized, the proxy relays the file request to the OSS Files API and returns the response to the client service via its proxy (6).
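As a sketch of the Adapter’s decision logic (our prototype is written in Go; the comma-separated label encoding in the header and the handling of labels without a configured workload list are assumptions of this illustration):

```python
def authorize(workload, introspection_headers, allowed_workloads):
    """Decide an authorization request as the Adapter would.

    introspection_headers: response headers from /inspect/<resource>;
        the label header name X-Data-Labels follows the paper, the
        comma-separated encoding is an assumption.
    allowed_workloads: label -> list of workloads allowed to access it.
    """
    raw = introspection_headers.get("X-Data-Labels", "")
    labels = [l.strip() for l in raw.split(",") if l.strip()]
    if not labels:
        return True  # unclassified resources are automatically accepted
    # Labels without a configured workload list deny by default
    # (an assumption of this sketch).
    return all(workload in allowed_workloads.get(label, [])
               for label in labels)
```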
4.2 Exemplary Service Mesh
Our implementation again uses the running example of the company CloudFlow described in Section 2. Parts of CloudFlow’s exemplary service mesh are services for the job application portal (which handle CVs), while other parts handle financial data such as invoices, as well as public material about the company. The mesh is configured to track and enforce the following data labels:
personal—identifying personal data, stored in CVs.
public—identifying non-personal information, such as a brochure about the service offering.
invoice—indicating financial documents.
We implemented the object storage service as a REST-based service which serves as a storage backend to store CVs and other data the company might hold. In addition to the regular HTTP endpoints to retrieve files, the storage service also implements an introspection endpoint returning the labels of a specified PDF file to the Adapter. Access to the introspection endpoint is restricted to the IP address of the pod running our custom Adapter. Otherwise, the labels and the existence of a file might leak to other (unauthorized) services in the cluster.
4.3 Performance Evaluation
We compared three different scenarios to evaluate the proposed system’s performance, where one client service (CS) requests files from an object storage service (OSS) as described above. The latencies, plotted against the number of concurrent client connections, are illustrated in Figure 5.
In the first scenario, the two services in the Kubernetes cluster communicate directly with each other without relaying connections through proxies and without policy enforcement (no_proxies_no_mixer). In the second scenario, proxies are injected into the services’ Pods and requests are relayed through the proxies but are not subjected to policy enforcement by the Mixer (proxies_no_mixer). The third scenario is the one described above, where the CS requests a file from the OSS via its proxy, which makes an authorization request to the Mixer. The Adapter then retrieves the necessary labels from the introspection endpoint of the OSS and makes its authorization decision (proxies_mixer). However, both the Envoy proxy and the Mixer include caches from which policy decisions can be made. This is why the latencies for the third scenario differ only slightly from those of the second scenario.
To evaluate the essential overhead incurred by our solution, we measured the latency of a single policy check, including the call of the custom adapter to the OSS's introspection endpoint. To that end, we set up a local installation of the Mixer server, the custom adapter, the OSS and a Mixer client. In this setup, one authorization request from the Mixer client was answered with an average latency of 17.31 ms (with all caches disabled).
Multiple Mixer Pods can be deployed in the cluster to make policy enforcement scalable, and proxies can be assigned to these Mixer instances at a fine granularity. This scenario also transfers to existing cloud infrastructures, where the policy enforcement load may be distributed across many policy enforcement instances in different datacenters. For further performance statistics of Istio, the interested reader is referred to the Istio benchmark tests (https://istio.io/docs/ops/deployment/performance-and-scalability).
5 Final remarks
The architecture proposed in this paper represents an extension to existing cloud authorization mechanisms. Such a system does not require extensive code changes or the injection of monitoring components into virtual machines; it only acts upon custom labels defined in resources' metadata.
Replication checks, as presented earlier, can be managed using this service by assigning the necessary labels to data objects. When objects are copied, e.g. by a backup system, their labels are copied to the replicas, which, contrary to existing cloud providers' storage solutions, can thus automatically be covered by the tracking and enforcement mechanisms. In Section 2, we also presented a mechanism for tracking data retention policies. Note that the architecture proposed here does not aim at solving this problem. While monitoring services like Azure Policy (see Section 6.2) do not yet provide such functionality, we would argue that data retention policies should be part of such monitoring services: they monitor the state of resources and apply actions to them according to a policy, e.g. comparing a resource's age with a retention policy. Hence, they are better suited to enforce data retention than a service focused on data flows.
While the compliance of data geolocations could also be monitored using the monitoring services described above, this mechanism should be implemented as a data flow policy enforcement, since here a data flow itself is subject to authorization. One reason is that an undesired geolocation may already constitute a breach of law and should therefore be prevented rather than merely discovered. In our architecture, such a mechanism may be implemented by creating a corresponding label that indicates the allowed region and defining a policy on the flow of this label.
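The geolocation check described above can be sketched as a simple authorization function. The label format ("region:EU") and the region sets are assumptions for illustration, not the paper's actual policy syntax; the point is only that a flow is denied whenever any region label on the data forbids the destination.

```python
# Hedged sketch of a geolocation flow check (label names and region sets
# are illustrative assumptions).
ALLOWED_REGIONS = {
    "region:EU": {"eu-west-1", "eu-central-1"},  # hypothetical EU region set
}

def flow_allowed(labels, destination_region):
    """Authorize a data flow: every region label must permit the destination."""
    for label in labels:
        allowed = ALLOWED_REGIONS.get(label)
        if allowed is not None and destination_region not in allowed:
            return False  # the flow would breach the geolocation constraint
    return True
```

A policy enforcement point, such as the Mixer adapter described in Section 4, would evaluate such a check before the data leaves the trust boundary, so that a non-compliant transfer is prevented rather than merely logged.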
While our architecture is designed to manage data in bulk, it is possible to assign labels to data that indicate a granular relation to individual persons or devices, such that their flow across services can be logged, reconstructed and possibly subjected to data flow policies as well.
Furthermore, the proposed architecture supports the prevention of data breaches and enhances the auditability of cloud systems, since it exposes the semantics of data flows.
To manage who can attach, change, and delete labels, the existing IAM can be extended with respective roles and permissions to be assigned to users. This makes it possible not only to define which users may manage labels, but also to grant resources, such as virtual machines, the permission to change labels when they modify data objects, e.g. by applying anonymization functions.
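Such an IAM extension can be sketched as a role-to-permission mapping. The role and permission names below are illustrative assumptions; an actual IAM would express them in its own policy language and attach roles to both users and resource identities.

```python
# Minimal sketch of label-management permissions in an IAM
# (role and permission names are hypothetical).
ROLE_PERMISSIONS = {
    "label-admin":   {"labels:attach", "labels:change", "labels:delete"},
    "label-editor":  {"labels:attach", "labels:change"},
    "anonymizer-vm": {"labels:change"},  # a resource identity can hold a role too
}

def may(principal_roles, permission):
    """Check whether any of the principal's roles grants the permission."""
    return any(permission in ROLE_PERMISSIONS.get(role, set())
               for role in principal_roles)
```

The "anonymizer-vm" role illustrates the second use case from the text: a virtual machine that applies anonymization functions to data objects may change their labels, but may neither attach new ones nor delete them.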
6 Related Work
6.1 Information Flow Control
Our work is related to the topic of information flow control (IFC). In contrast to user-based access control, IFC refers to models of information-centric access control, which define rules for the transmission of data and information.
Most of these models also rely on labeling data in some way. For example, Pappas et al. propose a framework which relies on injecting virtual machine monitors into VMs' processes to tag data and track it across processes. Other works propose code annotations or customized operating systems to tag and track data. Pasquier et al. propose an information flow control system that enforces policies using Linux Security Modules. Suen et al. propose a data tracking mechanism that tracks data at the kernel level in both host machines and virtual machines to build an end-to-end provenance control system for the cloud. All in all, these approaches propose tracking mechanisms for lower levels of the software stack rather than building on existing high-level authorization mechanisms in cloud infrastructures.
CloudSafetyNet is a monitoring system directed at Platform-as-a-Service (PaaS) cloud systems. Its authors propose to include so-called security tags in HTTP headers. Yet, the tags identify an application rather than exposing the data's semantics, which would be needed to manage the data on the control plane.
Another work by Pasquier et al. proposes a cloud architecture in which a PaaS service provides applications' end-users with audit results regarding compliance with security policies. Yet, their approach neither targets the specific limitations current cloud providers face when tracking data objects, nor does it leverage the existing cloud authorization infrastructure.
6.2 Cloud Monitoring Services
Both AWS and Azure offer services for monitoring and modifying the state of existing resources. AWS Config (https://aws.amazon.com/config/) provides pre-defined rules, e.g. for auditing open SSH ports in security groups, and allows custom rules to be configured using the Lambda service, where fully customized scripts can be written in any language that Lambda supports. Hence, our experiments could also be implemented as custom AWS Config rules using standard AWS APIs. Still, Config rules can incur significant costs when applied to a large number of resources.
Azure Policy (https://docs.microsoft.com/en-us/azure/governance/policy/overview) offers the possibility to define policies using a specific format that includes a condition and a corresponding effect to be executed if the condition matches. The types of conditions and effects that can be specified, however, are limited and not customizable. At the time of writing, it is not possible to implement a policy in the Azure Policy service that audits properties of blob objects, e.g. their creation date. Yet, if the corresponding conditions were implemented, tracking replications, geolocations as well as custom retention periods would be possible using such policies.
7 Conclusion and Future Work
In this paper, we have presented several data tracking and validity mechanisms for data objects in cloud systems, e.g. replication tracking of AWS S3 data objects. We have shown that today's cloud providers do not offer sufficient functionality for tracking data flows and enforcing respective policies. On this basis, we have developed a model for a data-flow-sensitive cloud architecture which supports selected legal data protection requirements as well as practical requirements for data tracking.
Future work includes generalizing the tracker's rules into a reusable language that can express tracking rules. Regarding the prototype implementation, we plan to improve the security of the label-based policy enforcement by making labels tamper-proof, protecting against malicious use of data through changed labels.
-  Sören Preibusch. Information flow control for static enforcement of user-defined privacy policies. In IEEE International Symposium on Policies for Distributed Systems and Networks, pages 133–136. IEEE, 2011.
-  Nickolai Zeldovich, Silas Boyd-Wickizer, Eddie Kohler, and David Mazières. Making information flow explicit in HiStar. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, pages 263–278. USENIX Association, 2006.
-  OASIS Standard. eXtensible Access Control Markup Language (XACML) Version 3.0, 2005.
-  Vasilis Pappas, Vasileios P. Kemerlis, Angeliki Zavou, Michalis Polychronakis, and Angelos D. Keromytis. CloudFence: Data flow tracking as a cloud service. In International Workshop on Recent Advances in Intrusion Detection, pages 411–431. Springer, 2013.
-  Thomas Pasquier, Jean Bacon, Jatinder Singh, and David Eyers. Data-centric access control for cloud computing. In Proceedings of the 21st ACM on Symposium on Access Control Models and Technologies, pages 81–88. ACM, 2016.
-  Chun Hui Suen, Ryan K. L. Ko, Yu Shyang Tan, Peter Jagadpramana, and Bu Sung Lee. S2Logger: End-to-end data tracking mechanism for cloud data provenance. In 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications, pages 594–602. IEEE, 2013.
-  Christian Priebe, Divya Muthukumaran, Dan O'Keeffe, David Eyers, Brian Shand, Ruediger Kapitza, and Peter Pietzuch. CloudSafetyNet: Detecting data leakage between cloud tenants. In Proceedings of the 6th Edition of the ACM Workshop on Cloud Computing Security, pages 117–128. ACM, 2014.
-  Thomas FJ-M Pasquier and Julia E Powles. Expressing and enforcing location requirements in the cloud using information flow control. In 2015 IEEE International Conference on Cloud Engineering, pages 410–415. IEEE, 2015.