Facilities like the Large Hadron Collider (LHC) [lhc] at CERN, Geneva, with its attached experiments such as ALICE [alice], ATLAS [atlas], CMS [cms], and LHCb [lhcb], are producing very large amounts of experimental data. The total volume is already close to an exabyte and will continue to grow substantially over the next two decades. Thousands of scientists around the globe are analyzing these data in pursuit of evidence for new physics phenomena that are not predicted by the established theories. Often, scientific results are produced just in time for the next conference. In such a fast-paced environment at the cutting edge of research, one of the key challenges the collaborations face is the efficient and reliable management of data that are being taken and analyzed by a large number of collaborators. This is especially important because the experimental data are the core asset at the center of multi-billion dollar projects like the LHC.
The moment we accumulate a large volume of data, the question of how to manage it arises. Although this problem is old and well studied, no single solution or implementation has emerged. The reason is that data management has to address the specific set of requirements of the given environment. These requirements have a strong influence on the design of such solutions and their implementation.
In the case of the LHC experiments, one of the defining constraints is the distribution and types of the available data storage. The LHC experiments have a tiered computing approach, in which on the order of 100 geographically separated sites provide data storage. These sites are heterogeneous in terms of capacity, mass storage technology, network interconnectivity, level of support, etc. Large Tier-1 sites (seven for CMS) provide archival tape systems, on which data are permanently stored but not immediately accessible (cold storage). Disk pools in smaller Tier-2 sites and also in Tier-1 sites allow immediate read and write access, but are limited in capacity. Thus, nontrivial decisions have to be taken on which pieces of data to keep on disk, in how many copies, and where.
Another important factor in designing a data management product is how the data are actually consumed. Data usage in the experiments can be categorized into two big classes: production access, made by data processing tasks planned by the experimental collaboration to produce collaboration-wide common datasets (data in CMS are mainly organized in datasets, which are collections of files sharing semantic properties; see Section 2.2), and user analysis access, made by individual analysts. The biggest difference between the two classes is that production access is predictable while user analysis is inherently unpredictable. As an example, the re-processing of the data with updated calibrations is carefully planned and the necessary inputs can be staged from tape without causing any delay to the reprocessing schedule. On the other hand, a user might one day decide to analyze a dataset that has not been accessed for multiple years, or hundreds of users might want to read the same dataset at the same time. To avoid bottlenecks in the analysis tasks, data management must provide some slack in the form of distributed copies of datasets, and possess certain intelligence to keep that slack under control.
The initial approach in CMS towards data management was to ensure that datasets can be efficiently and safely transferred from one storage site to another, with a rich set of permissions to identify who is allowed to perform certain actions on the data. Sites were responsible for installing local software agents to execute transfers and communicate with the central agents about their progress. The intelligence about which data should be available at the sites was to be provided by data managers, who were individuals appointed by the subgroups of the collaboration. Each subgroup was assigned three to five specific Tier-2 sites to be filled with the datasets of its interest, claiming ownership of these datasets. Some coordination was required to decide who was in charge of the large datasets, for example the datasets containing at least one muon as determined by the CMS trigger system, because they are used by almost all physics analysis groups. Coordination of data ownership was quite time-consuming in certain cases.
For the first few years this concept worked, because there was enough disk space, a lot of interest and support from the sites and the data managers, and there were relatively few datasets. Over time, sites and data managers had fewer resources, and with the rapidly growing amount of data and number of datasets, the system became virtually unmanageable. Another important development was that the strict rule of re-processing the detector and Monte Carlo simulation data only at the Tier-1 sites placed a major bottleneck on the production process. Extending the re-processing to the Tier-2 sites meant a substantial increase in data transfers, which became impossible to support with the team available for computing operations. In short, there was a strong need for automation and intelligence, which was particularly evident in the computing operations organization at the time.
A study of this situation led to a number of key conclusions:
Users should not have to care where their analysis jobs run as long as they finish successfully and quickly,
subgroups did not want to and could not manage their own data,
sites did not want to manage the exact data content of their storage, and
data production systems needed an automatic way to spread the data across all production sites with the least amount of effort.
To address these points, we introduced an automated data management system, which we dubbed Dynamo. Dynamo was developed with the goal of eliminating, or at least minimizing, human interactions with the data management system, while at the same time optimizing the way the storage is used to hold the data for user analysis and for the data production system. In addition, a number of important simplifications and features were introduced to the data management model. To name a few:
Sites were opened to any datasets that users or production were interested in.
Data ownership by subgroups was deprecated and replaced by ownership by two main groups: Analysis and Production.
Policies were introduced to fill the disk space automatically with popular data replicas, while removing less popular data replicas.
A fully automated site consistency enforcement was introduced to address any failures in the data management system.
A fully automatic site evacuation was introduced to quickly and efficiently deal with major site failures.
An interface to the batch submission system was provided to automatically download data that are only available on tape to disk, when required by the users.
Dynamo is a software package which enables intelligent and flexible data management by incorporating a rich representation of the global storage system and its contents. The package can be used as a high-level intelligence layer of a data management software stack, only making the data placement decisions but not performing the actual file transfers and deletions, or as a full-stack standalone data management product. The data placement policies in Dynamo are expressed in a human-readable syntax and can be easily defined at runtime, increasing the transparency of data management to the collaborators while minimizing the necessary human intervention to the system.
In this article, we document the design and some implementation details of Dynamo. We then describe various Dynamo applications, which are the key components of the system that implement actual data management operations. Finally, we introduce real-world use cases of Dynamo in the CMS collaboration and at a single local analysis facility.
2 Overview of the system
2.1 System design
Dynamo is written in Python [python] with a modular architecture. The central component depends only on the Python standard library, to decouple the system core from specific storage, transfer, and metadata-managing technologies. Interfaces to various external services and to internal bookkeeping persistence are provided as plugins. A minimal set of plugins required to perform the standard tasks in a small-scale standalone environment is packaged together with the core software.
The core of the system is the Dynamo server process, which possesses the full image of the global storage system under its management, called the inventory. In the inventory, sites and data units (blocks and datasets, described in Section 2.2) are all represented as interlinked in-memory objects. By keeping the full image in RAM, the inventory allows fast execution of flexible and complex data placement algorithms. Persistence is still required for the inventory, but can be provided in any form with an appropriate plugin; one can even choose to save the inventory image as an ASCII text file if desired. In practice, a relational database is used as the persistence layer of the inventory, due to its natural support of frequent data insertion and update operations, which are required for taking real-time backups of the inventory. However, it should be stressed that Dynamo is not a database-centric application, which is its main distinction from existing data management solutions.
Individual data management tasks, such as identifying data units to be copied or deleted and initiating the actual file operations, are carried out by Dynamo applications, which are Python programs run as child processes of the server process. As child processes, applications inherit the inventory image from the server, accessible as a Python object in the program. There is no restriction on what an application can or must do, including whether or not to access the inventory or perform any data management task. At its core, from a technical perspective, Dynamo is an engine to execute an arbitrary Python program with one specific large structure (inventory) pre-loaded in memory.
Because an application is a child process of the server, any modifications it makes to the inventory within its address space are discarded automatically at the end of its execution, and are not visible from the server or the other applications that may be running concurrently. However, pre-authorized applications can communicate the changes they make to the inventory back to the server before process termination. Such applications are said to be write-enabled.
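The isolation described above can be illustrated with a minimal sketch based on `os.fork` and a pipe. It assumes a POSIX system; the names `inventory`, `run_application`, and `add_replica` are hypothetical and are not taken from the Dynamo code base, and pickling is only one possible serialization.

```python
import os
import pickle

# Hypothetical stand-in for the in-memory inventory held by the server.
inventory = {"datasets": {"/A/B/RAW": {"replicas": ["T2_X"]}}}


def run_application(app, write_enabled=False):
    """Run app(inventory) in a forked child; return the updates it commits."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: works on its own copy-on-write image of the inventory.
        os.close(read_fd)
        status = 1
        try:
            updates = app(inventory)
            payload = pickle.dumps(updates if write_enabled else None)
            with os.fdopen(write_fd, "wb") as pipe_out:
                pipe_out.write(payload)
            status = 0
        finally:
            os._exit(status)  # never fall back into the server code path
    # Parent (server): read what the child sent, then reap the child.
    os.close(write_fd)
    with os.fdopen(read_fd, "rb") as pipe_in:
        data = pipe_in.read()
    os.waitpid(pid, 0)
    return pickle.loads(data) if data else None


def add_replica(inv):
    # Modify the child's private image and report the changed object.
    inv["datasets"]["/A/B/RAW"]["replicas"].append("T2_Y")
    return {"/A/B/RAW": inv["datasets"]["/A/B/RAW"]}


# Read-only run: the change stays in the child and nothing comes back.
discarded = run_application(add_replica)
after_read_only = list(inventory["datasets"]["/A/B/RAW"]["replicas"])

# Write-enabled run: the server receives the updated object and embeds it.
committed = run_application(add_replica, write_enabled=True)
inventory["datasets"].update(committed)
```

In the read-only run the server image still lists a single replica afterwards; only the write-enabled run changes it, and only because the parent explicitly embeds the received object.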
Routinely performed tasks can be executed as an application sequence by the built-in application scheduler. Multiple concurrent application sequences can be registered to the scheduler.
The Dynamo server can also run a web server as a child process. The web server communicates with an external HTTP(S) server and exports a web page (HTML document) or provides a REST interface, depending on the requested URL. The actual content delivered through the web server is created by web server modules, which can be easily extended as needed.
Finally, Dynamo server processes can operate synchronously on multiple machines communicating with each other. With the use of a load-balancing mechanism such as keepalived [keepalived], linked parallel Dynamo instances can share tasks of running the applications and responding to HTTP requests. Furthermore, since each instance in this setup is a fully functional Dynamo server, failure of any of the nodes does not stop the service.
A schematic of the Dynamo system is shown in Figure 1.
In Dynamo, data are managed in a three-tiered hierarchy. At the bottom of the hierarchy is the file, which naturally maps to a POSIX file but can also represent other types of data units. A file in Dynamo is the atomic unit of data transfer and deletion. The system has knowledge only of whether a file exists fully in a given storage unit or not; there is no concept of any intermediate states such as partially copied files.
Files that share semantic properties are grouped into a dataset, which is the highest level of the hierarchy. For example, detector data from a continuous period in a year and Monte Carlo simulation sample for the same physics process are each organized as a dataset.
Because datasets can greatly vary in size, from a few gigabytes to a few hundred terabytes in the case of the CMS experiment, the intermediate grouping of blocks is introduced to facilitate various data management tasks. Blocks are non-overlapping subdivisions of datasets, consisting of one or more files. There is no guideline for how blocks should be formed, but the intention is that they are purely logistical units that are semantically indistinguishable within a dataset. A block is the algorithmic atomic unit of data in Dynamo. In other words, decisions to replicate, move, and delete data are taken on the level of either datasets or blocks, but not files. Blocks offer a balance between fine-grain control of data placement and management efficiency. As such, the typical size of blocks is left to be decided for each use case. In the CMS experiment, a typical block of a large dataset contains 5 to 10 files, adding up to 10 to 20 gigabytes in volume.
Computing clusters and other storage elements across the globe are represented as sites in Dynamo. Sites are only defined by their network endpoints for data transfer and deletion. Attributes such as the external network bandwidth, total storage capacity, and the number of associated compute cores that may utilize the data in the storage can be optionally assigned to sites.
A copy of a dataset or a block at a site is called a dataset or block replica. Following the hierarchy between blocks and datasets, a dataset replica at a site consists of replicas of the blocks of the dataset at the site. A block replica is considered complete if copies of all constituent files are at the site, and incomplete otherwise. Similarly, a dataset replica is incomplete if any of the constituent block replicas is incomplete. A dataset replica with no incomplete block replica is complete if all blocks of the dataset have a copy at the site, and partial if replicas of only a subset of the blocks exist.
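The replica states defined above can be written down as a short classification routine. This is a sketch only: the function names and the dictionary-of-sets representation are assumptions made for illustration, not the data model used in Dynamo.

```python
def block_replica_is_complete(block_files, files_at_site):
    """A block replica is complete if every file of the block is at the site."""
    return set(block_files) <= set(files_at_site)


def classify_dataset_replica(dataset_blocks, site_block_files):
    """Classify a dataset replica as 'incomplete', 'partial', or 'complete'.

    dataset_blocks:   {block name: set of all file names in the block}
    site_block_files: {block name: set of file names present at the site},
                      with one entry per block replica at the site
    """
    # Any incomplete block replica makes the dataset replica incomplete.
    for block, present in site_block_files.items():
        if not block_replica_is_complete(dataset_blocks[block], present):
            return "incomplete"
    # All block replicas are complete: complete if every block is there.
    if set(site_block_files) == set(dataset_blocks):
        return "complete"
    return "partial"


dataset = {"b1": {"f1", "f2"}, "b2": {"f3"}}
state_all = classify_dataset_replica(dataset, {"b1": {"f1", "f2"}, "b2": {"f3"}})
state_subset = classify_dataset_replica(dataset, {"b1": {"f1", "f2"}})
state_missing_file = classify_dataset_replica(dataset, {"b1": {"f1"}, "b2": {"f3"}})
```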
A partition of the entire global storage system is a group of block replicas defined by a set of rules. For example, a partition can be defined by replicas of blocks belonging to datasets with specific name patterns. Partitions do not have to be mutually exclusive. Sites may set quotas for different partitions at their storage elements. Quotas are however not enforced by the Dynamo system core, and it is up to the individual applications to decide to respect them.
Dynamo has a simple language set that consists of short human-readable predicates regarding datasets, blocks, and their replicas. The predicates may refer directly to attributes of the objects such as their last update timestamps, or can involve dynamically computed quantities such as the total number of replicas that currently exist in the overall system. The language set is called the policy language because its primary use is in setting data replication and deletion policies for the applications, but is available for any other part of the program. In fact, partitions are defined by predicates on block replicas using the policy language.
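To make the relation between predicates and partitions concrete, the following sketch models a partition as a conjunction of predicates on block replicas. It uses plain Python callables rather than the policy language itself, whose syntax is not reproduced here; `BlockReplica`, `make_partition`, and `partition_usage` are hypothetical names.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class BlockReplica:
    dataset: str
    block: str
    site: str
    group: str
    size: int  # bytes


def make_partition(*predicates):
    """Return a membership test: a replica belongs if all predicates hold."""
    return lambda replica: all(p(replica) for p in predicates)


# Two partitions that are not mutually exclusive.
analysis_raw = make_partition(
    lambda r: fnmatchcase(r.dataset, "/*/*/RAW"),  # dataset name pattern
    lambda r: r.group == "analysis",               # owning group
)
tier2 = make_partition(lambda r: r.site.startswith("T2_"))


def partition_usage(replicas, partition, site):
    """Volume a partition occupies at a site, e.g. to compare with a quota."""
    return sum(r.size for r in replicas if r.site == site and partition(r))


replicas = [
    BlockReplica("/Mu/Run1/RAW", "b1", "T2_A", "analysis", 10),
    BlockReplica("/Mu/Run1/AOD", "b1", "T2_A", "production", 5),
    BlockReplica("/Mu/Run1/RAW", "b2", "T1_B", "analysis", 7),
]
```

Since quotas are not enforced by the core, an application would call something like `partition_usage` itself and decide whether to respect the quota.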
One of the attributes of a block replica is its owning group. Ownership is an easy way to flag the use purpose of a data element. For example, in the CMS experiment, data managed by Dynamo are mostly used either for physics analysis or for production of derived-format data, with significantly different usage patterns. Therefore, block replicas are owned by analysis or production groups, and partitions and data management policies are set separately for the two ownership groups. Note that the block replica ownership is purely a logical concept within the Dynamo software and does not relate to file ownerships of managed data at the site storage elements.
3 Details of the system components
3.1 Dynamo server and the inventory
The main function of the Dynamo server is to manage the inventory and to launch the applications. The server process is designed to be run as a daemon in an infinite loop of checking for new applications to run, spawning an application child process if there is one, checking for inventory updates sent by write-enabled applications, and collecting completed applications.
The inventory object consists of simple Python dictionaries for datasets, sites, groups, and partitions, with the names of objects as the key and the objects themselves as values. The objects are interlinked to reconstruct their conceptual relationships. For example, a dataset object has a list of its replicas and a list of its constituent blocks as attributes, and the dataset replica and block objects each point back to the dataset object also as their associated dataset.
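A reduced sketch of such an interlinked structure is given below. The class and attribute names are illustrative assumptions and do not reproduce the actual Dynamo classes; groups and partitions are left empty for brevity.

```python
from dataclasses import dataclass, field


# eq=False keeps identity comparison, which suits objects that reference
# each other cyclically.
@dataclass(eq=False)
class Dataset:
    name: str
    blocks: list = field(default_factory=list)
    replicas: list = field(default_factory=list)


@dataclass(eq=False)
class Block:
    name: str
    dataset: Dataset  # back-reference to the owning dataset


@dataclass(eq=False)
class Site:
    name: str
    dataset_replicas: list = field(default_factory=list)


@dataclass(eq=False)
class DatasetReplica:
    dataset: Dataset  # back-reference to the dataset
    site: Site


class Inventory:
    def __init__(self):
        # Name -> object dictionaries, one per object type.
        self.datasets, self.sites = {}, {}
        self.groups, self.partitions = {}, {}

    def add_block(self, dataset_name, block_name):
        dataset = self.datasets.setdefault(dataset_name, Dataset(dataset_name))
        block = Block(block_name, dataset)
        dataset.blocks.append(block)
        return block

    def add_dataset_replica(self, dataset_name, site_name):
        dataset = self.datasets[dataset_name]
        site = self.sites.setdefault(site_name, Site(site_name))
        replica = DatasetReplica(dataset, site)
        dataset.replicas.append(replica)
        site.dataset_replicas.append(replica)
        return replica


inventory = Inventory()
block = inventory.add_block("/Muon/Run1/RAW", "block-0001")
replica = inventory.add_dataset_replica("/Muon/Run1/RAW", "T2_Example")
```

With the links in place, navigation in either direction is a matter of attribute access, for example `block.dataset.replicas[0].site.name`.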
The inventory is constructed in memory during the startup phase of the Dynamo server and kept until the server process is terminated. To keep the memory footprint of the inventory under control, file-level information is not kept in memory but is loaded from the persistence layer only when it is needed, such as when scheduling file transfers of a block, and is discarded immediately after use.
As mentioned in Section 2.1, the inventory can be updated by write-enabled applications. When a write-enabled application commits the changes it made to the inventory, updated objects are serialized and sent to the server process through a pipe at the end of the child process execution. The server process then deserializes the received objects, embeds them into the inventory image, and persists the updates immediately. New applications are not started during the update, but the ones that have been already running at the start of the inventory update keep running with the pre-update inventory image. The web server is restarted upon completion of the update.
3.2 Applications, scheduler, and interactive sessions
Dynamo application executables are single-file Python scripts that are submitted to the server and executed asynchronously. Any valid Python script will be accepted as an application. Submission is done through an SSL socket connection to a designated port the Dynamo server listens to, using a command-line client program included in the Dynamo package. The Python script is sent over the network or, if submitted from the machine the server is running on, copied from a local path. The submitter of the application is authenticated with their X.509 certificate. The certificate Distinguished Name must be authorized beforehand to run applications on the server. Once the user is authenticated and passes the authorization check, the application execution request is queued in the server and will be picked up in one of the server loop iterations.
In a production environment, Dynamo would repeatedly execute the same set of applications, such as transfer request processing and storage cleanup, with some intervals in between. The application scheduler is a component of the Dynamo server, running in an independent thread, that puts applications from predefined sequences into the Dynamo server execution queue automatically. Multiple sequences can be managed concurrently, allowing, for example, having one sequence that executes the transfer request processing with high frequency while scheduling a thorough consistency check of the global storage system once per week. To create a sequence managed by the scheduler, users submit a sequence definition file to the Dynamo server using the command-line client. The sequence definition file uses a simple syntax to specify the applications to run, the order of execution, idle time between the executions, exception handling (ignore the exception, repeat the failed application, or repeat the entire sequence), and how many times the sequence should be repeated.
Users can also start an interactive session over the socket connection using the same command-line client. These sessions also run as child processes of the server and therefore have the fully constructed inventory object available as a Python object. The interface for the interactive session resembles the prompt of the interactive mode of the Python interpreter. This feature is useful for inspecting the contents of the inventory or prototyping applications at a small scale.
3.3 Web server
The Dynamo web server is an optional child process of the Dynamo server. It communicates via FastCGI with an external HTTP(S) frontend server, which handles the HTTP requests and their SSL authentications. The web server first parses the requested URL of the incoming HTTP request passed from the frontend. The URL specifies whether a web page or a data service (REST API) is requested, and also the name of the module that provides the contents. If the module has restricted access, the request must have come over HTTPS, and the Distinguished Name of the user certificate is checked for authorization.
The identified module is then called with the full detail of the HTTP request, including the query string contained in the URL or posted in the HTTP request body. The module returns an HTML document string or a Python dictionary depending on whether a web page or a data service is requested. The web server formats the returned value from the module into the final string passed back to the HTTP frontend, to be sent to the requesting user.
The list of modules, and therefore available web services, is easily extendable. Modules are written as Python classes with certain methods. The author of the module only needs to provide a mapping from the module name in the URL to the class, which can be picked up by the web server without stopping the Dynamo server.
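The dispatch from URL to module can be sketched as follows. Everything here is an assumption for illustration: the URL layout, the `run` method, the `MODULES` mapping, and the JSON envelope are hypothetical and are not the interface defined by Dynamo.

```python
import json
from urllib.parse import parse_qs


class DatasetCount:
    """Hypothetical data-service module: returns a dictionary."""

    def run(self, request, inventory):
        prefix = request.get("prefix", [""])[0]
        names = [n for n in inventory["datasets"] if n.startswith(prefix)]
        return {"count": len(names)}


class Hello:
    """Hypothetical web-page module: returns an HTML document string."""

    def run(self, request, inventory):
        return "<html><body>%d sites</body></html>" % len(inventory["sites"])


# Mapping from (mode, module name in the URL) to the providing class.
MODULES = {("data", "datasets/count"): DatasetCount, ("web", "hello"): Hello}


def handle(path, query_string, inventory):
    """Route a request to its module and format the returned value."""
    mode, _, name = path.strip("/").partition("/")
    module = MODULES[(mode, name)]()
    result = module.run(parse_qs(query_string), inventory)
    if mode == "data":
        # Data services return dictionaries, serialized here as JSON.
        return json.dumps({"result": "OK", "data": result})
    return result  # web pages are already HTML strings


example_inventory = {"datasets": {"/A/x": None, "/B/y": None},
                     "sites": {"T2_X": None}}
data_reply = handle("/data/datasets/count", "prefix=/A", example_inventory)
page_reply = handle("/web/hello", "", example_inventory)
```

In this sketch, adding a service amounts to writing one class and adding one entry to the mapping.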
A child process of the web server is spawned for each HTTP request. While a thread-based web server is more efficient in terms of resource usage than one that spawns a process for each request, isolating each request in its own address space allows the modules to make temporary modifications to the inventory in the course of execution without interfering with other concurrently processed requests. For special modules that are intended to make permanent changes to the inventory, such as the API for data injection, the web server sends the updated objects to the Dynamo server through a pipe, in the same way that write-enabled applications send the updates. The web server process then restarts itself to reflect the change in the inventory in the server process.
3.4 Copy and delete operations
The interface to copy and delete operations is provided as a Python module within the Dynamo package. The interface abstracts the physical operations by representing all copies and deletions in terms of block replicas. The applications importing this Python module are responsible for configuring the interface with a proper plugin, which translates the information at the level of block replicas into operation descriptions that are understood by the operation backend that performs the actual transfer and deletion of files.
The copy and delete operations plugin has to implement only a few methods, making it straightforward for an experiment with an existing data management tool to adopt Dynamo as its higher-level layer. As long as the existing tool exposes its copy and deletion commands in an API, a plugin can be written and Dynamo can function completely agnostic to how the operations are performed.
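As an illustration of how small such a plugin can be, the sketch below wraps a pre-existing tool behind a two-method interface. The method names `schedule_copies` and `schedule_deletions`, the `RecordingClient` stand-in, and its `transfer` and `delete` calls are all hypothetical; the actual method set required by Dynamo is not reproduced here.

```python
from abc import ABC, abstractmethod
from collections import namedtuple

BlockReplica = namedtuple("BlockReplica", ["dataset", "block", "site"])


class FileOperationsPlugin(ABC):
    """Hypothetical plugin interface, expressed in terms of block replicas."""

    @abstractmethod
    def schedule_copies(self, block_replicas):
        """Request that the given block replicas be created."""

    @abstractmethod
    def schedule_deletions(self, block_replicas):
        """Request that the given block replicas be removed."""


class RecordingClient:
    """Stand-in for the API of an existing data management tool."""

    def __init__(self):
        self.calls = []

    def transfer(self, dataset, block, destination):
        self.calls.append(("transfer", dataset, block, destination))

    def delete(self, dataset, block, site):
        self.calls.append(("delete", dataset, block, site))


class ExistingToolPlugin(FileOperationsPlugin):
    """Translates block-replica operations into calls of the existing tool."""

    def __init__(self, client):
        self.client = client

    def schedule_copies(self, block_replicas):
        for r in block_replicas:
            self.client.transfer(r.dataset, r.block, r.site)

    def schedule_deletions(self, block_replicas):
        for r in block_replicas:
            self.client.delete(r.dataset, r.block, r.site)


client = RecordingClient()
plugin = ExistingToolPlugin(client)
plugin.schedule_copies([BlockReplica("/A/B/RAW", "b1", "T2_X")])
plugin.schedule_deletions([BlockReplica("/A/B/RAW", "b1", "T2_Y")])
```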
Multiple Dynamo servers, each equipped with its own inventory, can be linked into a single server cluster for load balancing and high availability. Linked nodes send heartbeat signals to each other, but have no dependence relation. Because there is no single point of failure, a crash of any of the nodes will not halt the operation of the server cluster.
Updates to one inventory are immediately broadcast to the linked servers to ensure synchronization. To avoid race conditions, the servers coordinate to allow the execution of only one write-enabled application at a time across the cluster. The heartbeat signals are used to make the servers aware of the other nodes in the cluster to coordinate with.
To create a load-balancing cluster where multiple nodes are accessed under a single host name in e.g. a round-robin mechanism, services such as keepalived must be run on top of the Dynamo cluster. Dynamo itself only provides the machinery to operate parallel linked server instances.
While the Dynamo server manages the inventory image, it is the individual applications that utilize the information in the inventory and carry out the actual data management tasks. As noted in Section 2.1, any valid Python program can become a Dynamo application, allowing the user of the system to define and execute arbitrary new tasks under the system. This section describes the default applications for common tasks that a data management system would perform. The source code for these applications is included in the standard Dynamo software package.
4.1 Data deletion: detox
The dynamic management of space adheres to two fundamental principles: firstly, the utilization should not come too close to the full available disk space, for reasons of flexibility and stability; secondly, leaving a substantial fraction of the available space empty is not economical. A suitably high utilization is therefore desired.
A deletion agent application, called detox, is run regularly to prevent storage sites from overflowing. detox evaluates a policy stack at run time to determine if deletions at a given site are necessary (the occupancy exceeds the maximum allowed level) and allowed. It selects replicas for deletion until the site occupancy has been brought down to a predefined value (minimum desired level).
Data attributes, which are freely configurable in the detox libraries, are evaluated and matched to the policy stack. This stack categorizes data as either cannot-be-deleted, can-be-deleted, or must-be-deleted according to the specific rules. These rules make use of data attributes like the popularity of a dataset or whether it has a replica on tape storage. The attributes are filled by dedicated producers at runtime. The policy stack file is a simple, human-readable text file and can be modified on-the-fly.
The following policy stack keeps storage sites between upper and lower watermarks of 90% and 85%, respectively, protects datasets that do not have a full copy on tape, and allows the deletion of data that have not been accessed by users within 200 days.

    On site.name in [*]
    When site.occupancy > 0.9
    Until site.occupancy < 0.85
    Delete dataset.status == INVALID
    Protect dataset.on_tape != FULL
    Dismiss dataset.usage_rank > 200
    Dismiss
    Order decreasing dataset.usage_rank increasing replica.size
Priority is given to the first matching condition. The default in this case is set to can-be-deleted (“dismiss”). Deletions are attempted only if a storage site goes above the upper watermark.
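The first-match evaluation and the watermark logic can be sketched in a few lines of Python. This is not the detox implementation: the `Replica` fields, the integer sizes in percent of capacity, and the choice to act on must-be-deleted replicas only once the upper watermark is exceeded are assumptions made to keep the example small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    dataset: str
    size: int        # in percent of the site capacity (assumption)
    status: str      # "VALID" or "INVALID"
    on_tape: str     # "FULL", "PARTIAL", or "NONE"
    usage_rank: int  # larger means not used for longer


# (decision, condition) pairs, evaluated top to bottom; the last is the default.
POLICY = [
    ("delete", lambda r: r.status == "INVALID"),
    ("protect", lambda r: r.on_tape != "FULL"),
    ("dismiss", lambda r: r.usage_rank > 200),
    ("dismiss", lambda r: True),
]


def decide(replica):
    """Return the decision of the first matching policy line."""
    for decision, condition in POLICY:
        if condition(replica):
            return decision


def run_detox(replicas, capacity=100, high=0.9, low=0.85):
    """Return dataset names to delete at one site, in deletion order."""
    used = sum(r.size for r in replicas)
    if used / capacity <= high:
        return []  # below the upper watermark: nothing to do
    must = [r for r in replicas if decide(r) == "delete"]
    # Order: decreasing usage_rank, then increasing replica size.
    may = sorted((r for r in replicas if decide(r) == "dismiss"),
                 key=lambda r: (-r.usage_rank, r.size))
    deleted = []
    for r in must:
        deleted.append(r)
        used -= r.size
    for r in may:
        if used / capacity < low:
            break  # lower watermark reached
        deleted.append(r)
        used -= r.size
    return [r.dataset for r in deleted]


site_replicas = [
    Replica("a", 30, "VALID", "FULL", 10),
    Replica("b", 25, "VALID", "NONE", 400),   # protected: no full tape copy
    Replica("c", 20, "VALID", "FULL", 300),
    Replica("d", 3, "INVALID", "FULL", 5),    # must be deleted
    Replica("e", 4, "VALID", "FULL", 300),
    Replica("f", 11, "VALID", "FULL", 50),
]
to_delete = run_detox(site_replicas)
```

For this example site at 93% occupancy, the invalid replica goes first, followed by the least recently used dismissable replicas, smallest first, until the occupancy drops below 85%.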
Custom locks preventing items from being deleted from sites can be placed with cURL via the lock API upon proper authorization. These locks are respected by detox once the corresponding line is added to the policy stack.
In Fig. 2 a snapshot of the disk utilization of a system of storage sites is shown after a detox cycle has run. In this cycle, 0.2 petabytes of can-be-deleted data has been deleted because the occupancy of the respective storage site was above the allowed upper watermark. Snapshot plots like this are generated in the detox web page, included as a default web server module in the Dynamo package.
The detox application can be run in simulation mode to easily gauge the effect of a new policy on the system state without actually performing the deletions. Using this feature, detox is also being used in the CMS experiment to plan, organize, and execute dedicated deletion campaigns to remove obsolete datasets from tape archives on a yearly basis.
4.2 Data replication: dealer
Various reasons exist for why a specific piece of data should be replicated at specific sites or unspecifically across the global storage pool: a high demand by users; (temporary) unreliability of specific storage sites; desire to evenly distribute critical datasets to prevent imbalances and therefore single points of failure in the system; recall from tape; initial data injection; etc.
An application called dealer is run in a regular cycle to evaluate the replication requests and determine the data copies to make. The application collects the requests from its various plugins, each representing a different reason for requiring data replications.
The different plugins are described briefly in the following.
The popularity plugin uses information about how frequently certain datasets are accessed by users to weigh the datasets accordingly when choosing which ones to propose for replication. The size of the dataset plays into the weight, as one does not want the same replication factor for very large datasets as for smaller ones simply because of storage availability. It should be stressed explicitly that this metric is a good place for the incorporation of machine learning algorithms, like reinforcement learning, to predict which datasets will be accessed in the near future and hence have them ready and available on multiple sites to facilitate their access for the user.
The balancer plugin aims at replicating data present only at a single site (“last copy”) which has a large fraction of protected data. It will propose to replicate these data at a second destination, so that the protected space can be freed up at the original site and the protected data are evenly distributed across the storage sites. This minimizes the risk of data unavailability and allows for a contingent of data at each site that can be deleted upon demand.
The enforcer plugin deals with static rules for replication. It tries to accommodate special rules for data placement. An example of such a rule would be “The replication factor for datasets of a given type on a given continent should equal 2”. Detox can also be informed about these rules and, in this case, will not delete a dataset replica if the replication factor would thereby become smaller than 2.
When a storage site is unavailable for an extended period of time, it is advisable to remove all data from the site so that user jobs do not try to access the data and get stuck or fail in the attempt. This is handled by the undertaker plugin, which clears out sites in the morgue state by creating replicas of their data on other storage sites. Figure 2 displays sites in a non-functional state as greyed out and cleaned out.
The request plugin allows third parties to request the replication of data, provided authorization was granted to them by Dynamo. For instance, this can be used for the injection of freshly produced data into the dynamically managed space. Another use case would be if a user requests a disk copy of a dataset that is only present on tape storage, making it accessible for data analysis.
The decision on which datasets to finally replicate is made among the proposed candidates at random, taking into account a configurable priority value assigned to the proposing plugin, until the target occupancy of the storage sites is met (also considering the projected volume of ongoing transfers) or until a certain threshold is reached which limits the amount of data replicated per dealer cycle.
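A minimal sketch of such a priority-weighted random selection is shown below. It models only the per-cycle volume limit and leaves out the target site occupancy and the projected volume of ongoing transfers; the function name, the tuple layout of a candidate, and the priority values are assumptions, and priorities are taken to be positive.

```python
import random


def select_replications(candidates, priorities, volume_limit, rng):
    """Pick candidates at random, weighted by the proposing plugin's priority.

    candidates: list of (plugin name, dataset name, size) tuples
    priorities: {plugin name: positive weight}
    Stops when the next pick would exceed the per-cycle volume limit.
    """
    pool = list(candidates)
    selected, total = [], 0
    while pool:
        weights = [priorities[plugin] for plugin, _, _ in pool]
        choice = rng.choices(pool, weights=weights, k=1)[0]
        if total + choice[2] > volume_limit:
            break
        pool.remove(choice)
        selected.append(choice)
        total += choice[2]
    return selected, total


proposals = [
    ("popularity", "/A/x/AOD", 40),
    ("balancer", "/B/y/RAW", 30),
    ("enforcer", "/C/z/AOD", 20),
    ("request", "/D/w/RAW", 50),
]
plugin_priorities = {"popularity": 1, "balancer": 1, "enforcer": 2, "request": 4}
chosen, volume = select_replications(
    proposals, plugin_priorities, volume_limit=100, rng=random.Random(1)
)
```

Passing the random number generator explicitly keeps a cycle reproducible, which is convenient when comparing priority settings.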
4.3 Site Consistency
The application dynamo-consistency checks the consistency between Dynamo’s inventory and files actually located at managed sites. Even though Dynamo controls and tracks the history of file transfers and deletions at its sites, a separate check is needed to ensure that files are not lost or accumulated due to user or system errors. Actual site storage content and the inventory can become inconsistent either when files that are supposed to be at a site according to the inventory are deleted or inaccessible (missing files) or when files that are not cataloged in the inventory exist (orphan files). Missing files cause failures of block transfer requests. Jobs that are assigned to run at a site with missing files and expect to read these files locally will fail or, if a fallback mechanism exists, will run inefficiently because they are forced to read the files remotely instead. Orphan files, on the other hand, lead to wasted disk space. dynamo-consistency can be run regularly to check consistency by listing the contents of each remote site and comparing the results to the inventory.
Sites managed by Dynamo may all employ different mass storage technologies and their remote interfaces. Currently, dynamo-consistency supports remote site listing using XRootD xrd Python bindings, xrdfs subshell, and the gfal-ls CLI of the GFAL2 library gfal2. These listers are easily extensible in Python, allowing for new site architectures to be checked by dynamo-consistency.
Files matching filtering criteria, which are configurable, are excluded from being listed as missing or orphan, even if they are inconsistent with the inventory. For example, a file with a recent modification time may appear as an orphan only because there is a time lag in updating the inventory, and thus should be exempt from listing. The filtering criteria should be tuned to the specific Dynamo installation.
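The modification-time criterion mentioned above could look like the following sketch. The function name, the mapping from path to modification time, and the grace period parameter are assumptions for illustration; the actual filters are configured per installation.

```python
def filter_recent(orphan_candidates, now, grace_seconds):
    """Drop orphan candidates that were modified too recently to judge.

    orphan_candidates: dict mapping file path -> modification time
                       (seconds since the epoch).
    now: current time in the same units.
    grace_seconds: files younger than this are exempt from listing.
    """
    return {
        path
        for path, mtime in orphan_candidates.items()
        if now - mtime > grace_seconds
    }
```

With a grace period comparable to the inventory update lag, a file that was written moments before the check is not reported as an orphan.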
Summaries of check results, as well as the statuses of running checks, are displayed on a web page. The page consists of a table that includes links to logs and lists of orphan and missing files. Cells are color coded to allow operators to quickly identify problematic sites. Historical summary data for each site are also accessible through this page.
4.4 File Operations: fom
The Dynamo software package contains an application for scheduling and monitoring file transfers and deletions named fom. As noted in Section 3.4, the transfer and deletion operation backend is decoupled from the Dynamo core, allowing experiments with existing file operations programs to retain them by writing a simple plugin upon adopting Dynamo. When no such program exists or a full-stack standalone operation of Dynamo is desired, fom can be used as the file operations backend.
To use fom, applications must be configured with a fom-specific plugin. These plugins translate the data copy and deletion operations, initiated by the applications and made in terms of block replicas, into file-level instructions which are recorded in an auxiliary database table.
Because fom is a Dynamo application, it cannot be run as a daemon, and therefore does not monitor the progress of file transfers and deletions continuously. In fact, fom delegates the management of transfers and deletions to a backend daemon program. At each execution, fom issues file transfer and deletion commands that are newly recorded in the auxiliary table to the backend, and collects from it the reports on operations started in previous execution iterations. The reports (success or failure) are then used to update the inventory. The backend daemon can either be FTS3 Ayllon_2014 or a standalone lightweight daemon (dynamo-fileopd), based on the GFAL2 library, included in the package.
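The execution pattern of fom can be sketched as a function that is invoked once per cycle rather than running continuously. Everything named below (the task table layout, the FakeBackend class, and the run_cycle function) is hypothetical and only mimics the division of labor between fom and its backend daemon; it does not use the FTS3 or dynamo-fileopd interfaces.

```python
class FakeBackend:
    """Stand-in for the backend daemon: every task succeeds by the next poll."""

    def __init__(self):
        self.running = {}

    def submit(self, task_id, operation):
        self.running[task_id] = operation

    def collect_reports(self):
        reports = {task_id: "success" for task_id in self.running}
        self.running.clear()
        return reports


def run_cycle(task_table, backend, inventory):
    """One fom-like execution: collect old reports, then submit new tasks."""
    # Reports on operations started in previous executions update the inventory.
    for task_id, outcome in backend.collect_reports().items():
        task = task_table[task_id]
        task["state"] = outcome
        if outcome == "success":
            if task["op"] == "copy":
                inventory.add(task["file"])
            else:  # deletion
                inventory.discard(task["file"])
    # Newly recorded tasks in the auxiliary table are handed to the backend.
    for task_id, task in task_table.items():
        if task["state"] == "new":
            backend.submit(task_id, task["op"])
            task["state"] = "submitted"
```

In this sketch, a task recorded before one execution is submitted during that execution, and its outcome reaches the inventory only during the following one.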
Transfer success and failure reports collected from the backend are also used to evaluate the quality of links between the sites. Figure 3 is a plot showing ongoing transfers between different sites, where the widths of the bands represent the total volume of scheduled transfers and the colors of the bands encode the historical link quality information. This diagram is available in a web page generated by one of the default web server modules in the Dynamo package.
Another web module exists to display the volume and rate of transfers as a time series. An example of the transfer volume history plot is in Figure 4.
4.5 REST API
Although they are not strictly Dynamo applications, Dynamo web server modules, and consequently the REST API, also run as child processes of the Dynamo server process with access to the inventory image. The REST API allows general users to access the information in the inventory through a number of remote calls described in this section. Because the inventory is fully loaded into RAM, responses to most API calls do not involve, e.g., database I/O and thus are fast.
There are two distinct types of API calls available. The first type invokes operations that modify the state of the inventory, such as copy and deletion of dataset and block replicas, or injection of new datasets and blocks. Only authorized users are allowed to execute these calls. These calls are blocking, i.e., parallel calls to modify the state of the inventory are serialized and executed one at a time. These calls are also blocked during the execution of write-enabled Dynamo applications. The second type of API calls allows general users to obtain various information about the inventory without changing its state. These calls do not have authorization restrictions and can be executed in parallel with any other web modules or applications running at that time.
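The distinction between the two call types can be illustrated with a thread-based analogy. Dynamo itself runs web modules as child processes, so the class below, its method names, and the use of a single lock are assumptions chosen only to show the serialization of authorized write calls next to unrestricted read calls.

```python
import threading

class InventoryApi:
    """Toy model: writes are authorized and serialized, reads are open."""

    def __init__(self, authorized_users):
        self._authorized = set(authorized_users)
        self._write_lock = threading.Lock()
        self._replicas = set()

    def add_replica(self, user, replica):
        # Write call: only authorized users, executed one at a time.
        if user not in self._authorized:
            raise PermissionError("user %r may not modify the inventory" % user)
        with self._write_lock:
            self._replicas.add(replica)

    def list_replicas(self):
        # Read call: no authorization check and no write lock taken.
        # (A real server would additionally need a consistent snapshot.)
        return sorted(self._replicas)
```

A write-enabled application could hold the same lock for the duration of its run, which would block write calls in the way described above while leaving read calls unaffected.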
The list of existing REST API URLs can be found in Appendix A.
4.5.1 Requests Analysis
All incoming user requests are sorted into two separate queues, which are analyzed to guide further development of the API. The first queue contains calls that are malformed or do not exist at the moment. In this way, users can signal to the developers what they would like to have available in the future. The second queue contains valid calls to existing functions. Analysis of the second queue can shed light on which calls are popular and which ones can possibly be made obsolete.
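A minimal sketch of such a two-queue analysis is given below. The set of known call names is a placeholder loosely modeled on the entries of Appendix A, and the function name is hypothetical; the real web server may record considerably more detail per request.

```python
from collections import Counter

# Hypothetical call names; the actual URL scheme is not reproduced here.
KNOWN_CALLS = {"groups", "nodes", "datasets", "subscriptions", "requestlist"}

def sort_requests(requested_calls):
    """Count requests per call name in a valid and an invalid queue."""
    valid, invalid = Counter(), Counter()
    for call in requested_calls:
        name = call.strip("/").lower()
        if name in KNOWN_CALLS:
            valid[name] += 1
        else:
            invalid[name] += 1
    return valid, invalid
```

The most frequent entries of the invalid counter hint at functionality users expect to exist, while entries of the valid counter with very low counts are candidates for retirement.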
4.5.2 DDoS Attack Prevention
The Dynamo web server has two layers of defense against distributed denial of service (DDoS) attacks. The first layer is a DenyHosts service that blocks well-recognized sources of attacks. The second layer analyzes the request queues mentioned in the previous section. If the frequency of correct or malformed requests from a single source exceeds a level that is deemed intrusive, the issuing address is automatically blacklisted in the firewall to prevent any further connection.
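The second layer can be pictured as a per-address sliding-window counter. The class below is a sketch under assumed parameters (a maximum request count within a time window) and keeps its blacklist in memory; the actual thresholds and the firewall integration used by Dynamo are not described in this paper.

```python
from collections import defaultdict, deque

class RateGuard:
    """Blacklist an address once it exceeds max_requests within the window."""

    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self.blacklist = set()
        self._history = defaultdict(deque)

    def allow(self, address, now):
        """Record a request at time `now`; return False if it is rejected."""
        if address in self.blacklist:
            return False
        times = self._history[address]
        times.append(now)
        # Forget requests that fell out of the sliding window.
        while times and now - times[0] > self.window_seconds:
            times.popleft()
        if len(times) > self.max_requests:
            self.blacklist.add(address)  # stand-in for a firewall rule
            return False
        return True
```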
5 Use Cases
5.1 CMS experiment
Dynamo has been in use by the CMS collaboration since the beginning of the LHC Run 2. This CMS instance handles several hundreds of petabytes of recorded and simulated experimental data stored across a worldwide computing grid, and has proven to work well at these scales and volumes. There are some noteworthy points from the operational experience.
First, loading the inventory at the startup phase of the Dynamo server is not instantaneous for a system of this scale, but completes within a manageable time. The CMS experiment has roughly datasets, blocks, dataset replicas, and block replicas, and the inventory construction takes approximately 15 minutes using a machine with an Intel® Xeon® Gold 6134 CPU and a MariaDB mariadb database on a solid-state drive for persistence. The constructed inventory has a size of approximately 8 gigabytes.
Construction of the inventory for the CMS experiment would require a substantial amount of time if done from scratch. With on the order of five machines running parallel Dynamo servers, there is very little risk of losing the information in the inventory. However, even in the case of a catastrophic failure, Dynamo can be started with no block and dataset replicas registered in the inventory, and dynamo-consistency can be used to detect which files, and thus which block and dataset replicas, are at each site. Since remotely listing the content of one of the largest CMS sites, with 20 petabytes of disk storage, takes roughly 50 hours with dynamo-consistency, such a recovery procedure (running many dynamo-consistency application instances in parallel) would take a few days. Alternatively, file listings could be produced locally at the sites in less than an hour and fed to the consistency agent, but this would require manual intervention.
Applications also do not execute instantaneously in the CMS instance, but complete within a practically reasonable time. For example, it takes at least 15 minutes to complete a full cycle of routine detox execution, in which the occupancy of roughly 60 sites is checked and the dataset replicas to delete are determined. Similarly, a routine dealer cycle evaluating replication requests from all of the plugins listed in Section 4.2 takes 10 minutes. Because the datasets in this instance are typically accessed by non-interactive batch jobs, an execution time scale of less than an hour is acceptable.
Figure 5 demonstrates that the CMS instance of Dynamo is able to operate stably at the required scale. The figure shows the monthly total of data volume transferred to and deleted from the CMS Tier-1 and Tier-2 sites by Dynamo for the year 2019. Several dozens of petabytes were moved and deleted per month. Here, deleted datasets are typically the unpopular ones, perhaps because of their age, and the transfers replaced them with high-demand datasets. Thus Dynamo creates a “data metabolism” of the CMS experiment to utilize the limited disk space most effectively. There are more deletions than transfers because the simulation datasets are constantly being generated at the sites, acting effectively as sourceless transfers.
While file deletion operations usually complete quickly, transferring terabyte-size datasets can take from several hours to even several days. Therefore, at any given moment, there is a queue of incomplete transfers in the CMS Dynamo instance. The dealer application has a feedback mechanism that suppresses new replications when the queue of pending transfers is too long, but if this mechanism is invoked too often, the system will be slow to respond to, e.g., a surge in the popularity of certain datasets. Therefore, a limit must be placed on the total volume of dataset replications to be requested in a single dealer execution to ensure a healthy data metabolism. Experience has shown that ordering at most 200 terabytes worth of replicas per dealer execution, repeated at intervals of roughly one hour, allows the creation of a sufficient number of new replication orders at each cycle while keeping the utilization of the transfer system high. Figure 6 shows a time series of the total volume of data replication (“Total”) scheduled by dealer and its subset that has not completed yet (“Missing”). As individual dataset replications progress, the missing volume is brought lower, and when the replication of a dataset completes, its volume is taken out of the total. In the figure, the total curve stays at a similar level because new replication requests are constantly being made, and the missing curve follows the total curve because the overall CMS storage system is able to handle this scale of transfers.
5.2 Local university research group
To evaluate the behavior of Dynamo in a different scenario, a full-stack instance is installed at a local university research group. This instance manages two storage sites, where one site is the “master” storage that holds all of the approximately 600 terabytes of data under management, and the other site, with a smaller capacity of 150 terabytes, is the cache storage for locally running jobs. Thus, the primary purpose of Dynamo in this instance is to keep the cache storage filled with datasets that are the most useful for the ongoing analyses at any given moment.
Managed data in this instance are organized into approximately datasets, with dataset sizes varying from a few gigabytes to a few tens of terabytes. There are blocks and files, with a typical file size of 2 gigabytes. At this scale, server startup (loading the inventory) completes in 20 seconds, and the execution of detox and dealer takes only a few seconds. This enables, in particular, running dealer on a one-minute cycle, making Dynamo respond to user demands for dataset caching in virtually real time.
A data management software named Dynamo was created to satisfy the operational needs of the CMS experiment. Dynamo consists of a main server, which holds the full image of the managed storage system in memory, and several applications, which perform the actual data management tasks. Its extensive web interface allows remote users to monitor the status of various operations and to interact with the system. While the system was designed with usage in the CMS experiment in mind, its architecture easily accommodates different use cases at a wide range of installation scales.
7 Software availability
Dynamo standard software package is available at
The dynamo-consistency application is available at
Acknowledgements. This material is based upon work supported by the U.S. National Science Foundation under Award Number PHY-1624356 and the U.S. Department of Energy Office of Science Office of Nuclear Physics under Award Number DE-SC0011939. Disclaimer: “This report was prepared as an account of work sponsored by an agency of the United States Government. Neither the United States Government nor any agency thereof, nor any of their employees, makes any warranty, express or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States Government or any agency thereof. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States Government or any agency thereof.” The authors thank the CMS collaboration for extensive feedback and support.
Appendix A List of APIs
Groups: A list of known groups.
Input options are:
optional: group (name of the group)
name: group name
id: group id
Nodes: A list of sites known to Dynamo.
Input options are:
node: Dynamo site list to filter on (*). Here and below, (*) marks options that accept a wildcard matching any sequence of characters. For example, ’T2_US*’ would match any site that starts with ’T2_US’.
noempty: filter out sites that do not host any data
name: Dynamo site name
se: node type, can be ‘Disk’ or ‘MSS’ (i.e., tape)
id: unique site id assigned intrinsically by Dynamo
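The wildcard semantics of the options marked with (*) resemble shell-style pattern matching, which can be reproduced with the Python standard library as in the sketch below. The helper name and the site names are illustrative only, and whether Dynamo itself relies on fnmatch is not stated here. Note that fnmatch also gives special meaning to ’?’ and ’[...]’, which may go beyond what the API accepts.

```python
from fnmatch import fnmatchcase

def filter_sites(site_names, pattern):
    """Return the site names matching a '*' wildcard pattern (case-sensitive)."""
    return [name for name in site_names if fnmatchcase(name, pattern)]
```

For instance, filtering the illustrative names T2_US_MIT, T2_DE_DESY, and T1_US_FNAL with the pattern ’T2_US*’ keeps only T2_US_MIT.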
Datasets: Basic information about datasets.
Input options are:
dataset: dataset name, can be multiple (*)
name: List of the matched datasets, including the full dataset name, size, number of files, status, and type.
Subscriptions: Show current subscriptions (dataset and block replicas) and their parameters.
Input options are:
dataset: Dataset name (*)
block: Block name (*)
node: Site name (*)
group: Group name
custodial: y or n, indicates if it is assigned to tape storage
dataset: List of datasets, each list item contains a dataset replica if a complete replica exists, and a list of blocks replicas if not
block: Each item in list of blocks contains a block replica
subscription: contains node (site name), id (site name), request (request id), node_files (number of files at this site), node_bytes (number of bytes at this site), group, time_create (when the replication request was made), percent_files (percentage of files at the site), and percent_bytes (percentage of bytes at the site).
RequestList: A list of requests.
Input options are:
request: request id (*)
node: name of the targeted site (*)
dataset: dataset name as a part of the request (*)
block: block name as a part of the request (*)
requested_by: requester name (*)
id: request id
time_create: time of the request creation
requested_by: requester name
list of sites: for each site
node_id: target site id
name: target site name