Modern distributed file systems usually decouple metadata access from reads and writes to exploit the throughput of data storage nodes and hide the latency of metadata operations in data transfers. This mechanism works well when multiple users read or write large files, as the metadata-to-data ratio is low. The time cost of metadata access is negligible when files are large, but when files are small, the remote procedure call (RPC) latency of accessing metadata becomes significant. As the number of small files is expected to reach and exceed billions, optimizing the RPC latency of metadata access is a known challenge for existing distributed file systems. While distributed file system metadata servers generate RPCs for many functions, our focus in this paper is on the access permission operation of the file system during open().
Distributed file systems usually verify access control on a given pathname argument before storage servers provide file operations. In other words, access to the file specified by a pathname requires checking its ownership and group information against the requested access mode, a combination of read, write, and execute. We call this file access control the permission check. The permission check is traditionally the first operation performed for any file system operation on a file and is the basis of most users' understanding of file system access control. Modern distributed file systems usually apply centralized single-node or multi-node metadata services to break the bottleneck caused by the serialization of permission checks. While distributed file systems such as Lustre (Cluster File Systems, 2002), GPFS (Schmuck and Haskin, 2002), GlusterFS (Gluster, 2011), and GFS (Ghemawat et al., 2003) benefit from better scalability and concurrent data access, the independent metadata services introduce additional RPCs to access a file. For example, Lustre usually generates at least three round-trip RPCs to access a file: open(), read() or write(), and close(). Although the RPC issued by close() can be executed asynchronously, the RPCs issued by open() and read()/write() still add a constant latency to Lustre.
Current studies (Chen et al., 2017, 2015, 2016) usually try to reduce the number of RPCs by combining open() with read()/write() requests or by aggregating a number of requests. However, these attempts require additional effort to modify applications' APIs due to their incompatibility with the standard POSIX API (Bok et al., 2017). Some studies aim to optimize metadata management efficiency by flattening metadata via a key-value store or by replicating metadata across WAN servers (Thomson and Abadi, 2015; Li et al., 2018; Xiao et al., 2015). Although these solutions mitigate the overhead of hierarchical metadata operations, they barely reduce the number of RPCs required for metadata access, which is the main reason existing distributed file systems perform poorly when accessing small files. Other studies improve the efficiency of accessing small files by reducing RPCs. For example, the Lustre group proposes the Data on MDT (a.k.a. DoM) mechanism, which aims at reducing RPCs for data access (Wiki, 2018). DoM reduces the number of RPCs needed to read small files by storing them on the metadata servers. However, when the number of small files keeps growing, this approach may drain precious storage capacity on the metadata servers and aggravate the already serious bottleneck of metadata access. Besides, this approach is not write-friendly because all writes to small files will congest the metadata servers.
open() is the remaining operation that generates an RPC for every file access. open() usually consists of two steps: Step 1 checks the permission of the requested file, and Step 2 records the open status of the file. Upon every open(), a client node (hereafter referred to as a client) in a distributed file system issues one RPC to check the permission on the server, while the server performs both steps and returns the information to the client. We can foresee that avoiding the RPC that checks file permissions between the client and the metadata server could improve the latency of distributed file systems when accessing enormous numbers of small files. Yet it is very challenging in existing distributed file systems to eliminate this RPC, since open() has to access the metadata server via the network to initiate operations on a file.
We propose BuffetFS, a user-level distributed file system that conceals the RPC issued by open() during file access initialization. BuffetFS disaggregates open(): it moves the permission check (Step 1) to the client side for better responsiveness and defers the recording of the file offset and flags (Step 2) to successor operations that must contact the metadata-holding servers anyway. In this way, BuffetFS restricts the number of RPCs to one for actual data accesses (i.e., read() or write()), which alleviates the latency of accessing a large number of small files. The core of BuffetFS is a mechanism that attaches files' permission information to their parent directory. Besides inode numbers and name strings, a BuffetFS directory also contains the permission information of all the files and sub-directories that belong to it. Each client in BuffetFS maintains an incomplete directory tree consisting of the directories accessed before and their children. In addition, each client holds the complete permission information in the directory tree. By doing so, BuffetFS balances the response time of open() against storage capacity. As a trade-off, the servers must keep all related clients updated when applications modify the permission of a file or directory, which usually does not occur frequently.
With open() disaggregated, BuffetFS only needs to manage servers that store file and directory data and does not require a centralized metadata server. Hence, a decentralized distributed file system becomes possible with BuffetFS. A client can locate a file or a directory from its inode number, which consists of a host-ID and a unique file-ID. Thus, a client can check a file's permission by itself and access the file without requesting its location and metadata from other clients.
The main contributions of this paper include: (1) we propose BuffetFS, a user-level file system that optimizes I/O performance for enormous numbers of small files in a distributed environment; BuffetFS is designed to reduce open() latency by eliminating the RPCs for the permission check; (2) we implement a BuffetFS prototype and a decentralized distributed file system sandbox; (3) we provide a comprehensive experimental study to evaluate the efficacy of the BuffetFS prototype.
The rest of the paper is organized as follows. Section 2 provides the background and motivation of this research. The design and implementation details of BuffetFS are presented in Section 3, which is followed by an evaluation of BuffetFS shown in Section 4. Section 5 summarizes the related work. Finally, Section 6 concludes this paper.
2. Motivation and Background
2.1. Motivation

Besides providing I/O services for traditional compute-intensive applications such as scientific simulations, modern distributed file systems have started to support applications such as machine learning and Big Data analysis, which access enormous numbers of small files. According to our observation on an object storage server (OSS) of a Lustre cluster in the TaihuLight supercomputing center, more than 90% of RPCs come from accessing small files. This Lustre cluster is mainly utilized by machine learning studies and the Beacon system (Yang et al., 2019), an I/O surveillance system for the whole TaihuLight supercomputer. The access statistics of this Lustre cluster indicate that more than 70% of metadata operations are open() and close(). As Lustre can execute close() asynchronously, open() becomes the major operation that causes remote I/O latency when accessing small files.
We further notice that the metadata of directories can be cached on the client side to optimize the response time; however, the permission of the target file in open() cannot be cached unless the file has been accessed before. How to optimize the RPC latency of the permission check motivates us to design a scheme that offloads this operation to the client side.
2.2. Background of open() operation
In this section, we briefly explain the general open() operation in two dominant types of file systems – local file systems and network-connected file systems.
For local file systems, the open() function invoked by an application is interpreted as an open() system call to the kernel. The kernel then parses the path string and finds the dentry and inode objects for each involved path component in turn. For every step of the directory component traversal, the kernel has to check the permission of the current component before proceeding to the next step. Only when permission on the current component is granted will the kernel move on to the next step. Upon reaching the target component, the kernel checks its complete permission according to the open() flags, while for the parent directory components, the kernel checks the execute permission only. The kernel then marks the referenced file as opened and returns a unique file descriptor to the application. Other objects such as the inode object, file object, and superblock object are updated accordingly.
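To make the traversal concrete, the following is a minimal sketch of component-by-component permission checking, assuming a toy in-memory inode model; the `Inode` class, `may_access`, and `path_walk` are our illustrative names, not kernel APIs, and only a single permission class is modeled:

```python
# Toy permission bits for one permission class (read/write/execute).
R_OK, W_OK, X_OK = 4, 2, 1

class Inode:
    """Illustrative in-memory inode: permission bits plus children."""
    def __init__(self, mode, is_dir, children=None):
        self.mode = mode              # e.g. 0o5 = r-x for this class
        self.is_dir = is_dir
        self.children = children or {}

def may_access(inode, want):
    # Grant only if every requested bit is set in the mode.
    return (inode.mode & want) == want

def path_walk(root, path, want):
    """Resolve /a/b/foo: every intermediate directory needs execute
    (search) permission; only the final component is checked against
    the full requested access mode, mirroring the kernel's behavior."""
    parts = [p for p in path.split("/") if p]
    node = root
    for comp in parts[:-1]:
        if not (node.is_dir and may_access(node, X_OK)):
            raise PermissionError(comp)
        node = node.children[comp]
    target = node.children[parts[-1]]
    if not may_access(target, want):
        raise PermissionError(parts[-1])
    return target
```

Opening "/a/b/foo" for reading thus checks execute on "a" and "b" but read on "foo" only.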
As for network-connected file systems such as distributed file systems, the kernel retains APIs for the steps mentioned for local file systems (i.e., requests for metadata and permission checks). For example, a Lustre client sets a flag for each dentry to mark its availability, in order to deal with concurrent modifications to the same directory or files by other remote Lustre clients. Distributed file systems usually maintain a global lock manager to preserve the data and metadata integrity of files. For example, Lustre implements the Lustre Distributed Lock Manager (a.k.a. LDLM) to ensure data consistency (LABORATORY, 2009). One side effect of global lock management is that it introduces external permission management. Hence, distributed file systems maintain metadata and a list of open files on the server side, which requires one RPC to perform open() on every target file.
However, there is no need to perform all the steps of open() on the server side. Recall the two major steps in open() (i.e., checking permissions and recording file status): only the permission check has to be executed immediately, while the status record can be postponed and performed asynchronously. If a client can perform the permission check for open() itself, one RPC to the server can be concealed.
3. The Design of BuffetFS
3.1. BuffetFS Architecture
Fig. 1 illustrates the architecture of BuffetFS. Serving as a userspace distributed file system, BuffetFS sits between the applications and an underlying file system. BuffetFS consists of three components: a BuffetFS Library (BLib), a BuffetFS Agent (BAgent), and a BuffetFS Server (BServer).
BLib serves as a dynamic library that intercepts and redirects the POSIX I/O requests from applications to the BAgent. Located on a client, a BAgent maintains an incomplete directory tree in which each tree node holds the permission status and pointers of files. A BAgent also maintains the context of each user process, including the PID, file descriptors, and file objects. Every client has exactly one BAgent.
BServer is deployed on a server node to manage actual file data. It collects file access requests from all the clients and maintains the files’ metadata. For the open() operation, a BServer maintains a list of opened files to ensure data consistency for concurrent file modifications from multiple clients.
3.2. Namespace and Metadata Handling
BuffetFS does not have a centralized metadata server, since permission checks are offloaded to clients. To manage a global namespace, BuffetFS redefines the inode number to contain three segments: (1) a hostID, representing the server that stores the actual file data; (2) a fileID, a number that uniquely identifies a file on the corresponding BServer; and (3) a version number of the server, which records exceptions of a server (e.g., reboot or restore). The BAgent on each client maintains a local configuration file that maps a tuple (hostID, version number) to a server address. Thus, every inode number on a client identifies the location of the corresponding file on a server.
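As an illustration, the three segments could be packed into a single 64-bit inode number as sketched below; the bit widths and the `server_map` contents are our assumptions for the sketch, not values specified by BuffetFS:

```python
# Hypothetical bit layout (widths are our assumption, not from the paper):
# [ hostID: 16 bits | version: 8 bits | fileID: 40 bits ]
HOST_BITS, VER_BITS, FILE_BITS = 16, 8, 40

def pack_ino(host_id, version, file_id):
    """Pack the three segments into one 64-bit inode number."""
    assert host_id < (1 << HOST_BITS) and version < (1 << VER_BITS) \
        and file_id < (1 << FILE_BITS)
    return (host_id << (VER_BITS + FILE_BITS)) | (version << FILE_BITS) | file_id

def unpack_ino(ino):
    file_id = ino & ((1 << FILE_BITS) - 1)
    version = (ino >> FILE_BITS) & ((1 << VER_BITS) - 1)
    host_id = ino >> (VER_BITS + FILE_BITS)
    return host_id, version, file_id

# Client-side configuration: (hostID, version) -> server address (illustrative).
server_map = {(3, 1): "10.0.0.3:9000"}

def locate(ino):
    """Resolve an inode number to the server holding the file's data."""
    host_id, version, _ = unpack_ino(ino)
    return server_map[(host_id, version)]
```

With such an encoding, the inode number alone is enough to route a request, so no lookup RPC to a metadata server is needed.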
Each file in BuffetFS owns a pair of front-end/back-end metadata, where the front-end metadata relates to the client user and the back-end metadata on the server side is for the management of actual files. Some front-end metadata is stored in the extended attributes of the actual file on the BServer to handle BuffetFS inode numbers and clients' permissions. Other than that, both the front-end and back-end metadata store the same information, including the last access time, modification time, and creation time.
In addition to the regular inode numbers and name strings for files and sub-directories, BuffetFS uses ten extra bytes per directory entry to store the permission information. The total extra space for a complete directory is commonly no more than a few hundred bytes, which is a negligible overhead.
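One plausible decomposition of those ten bytes, which is our assumption since the paper does not specify the layout, is 2 bytes of mode bits plus a 4-byte uid and a 4-byte gid appended to each packed directory entry:

```python
import struct

# Assumed layout of the 10 extra permission bytes per directory entry:
# 2-byte mode bits, 4-byte uid, 4-byte gid ("<" = little-endian, no padding).
PERM_FMT = "<H I I"   # 2 + 4 + 4 = 10 bytes

def pack_entry(ino, name, mode, uid, gid):
    """Serialize one directory entry: inode, name, then permission bytes."""
    name_b = name.encode()
    return struct.pack("<Q B", ino, len(name_b)) + name_b + \
           struct.pack(PERM_FMT, mode, uid, gid)

def unpack_entry(buf):
    ino, nlen = struct.unpack_from("<Q B", buf, 0)
    name = buf[9:9 + nlen].decode()
    mode, uid, gid = struct.unpack_from(PERM_FMT, buf, 9 + nlen)
    return ino, name, mode, uid, gid
```

Because the permission fields travel with the parent directory's entries, a client that has fetched the directory can check any child's permission without another RPC.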
3.3. Buffet I/O control flow
In this subsection, we discuss the control flow details of BuffetFS.
As presented in Fig. 2(a), when an application issues open(), BuffetFS first intercepts the operation in the BLib (a-1); the BLib then redirects the open() to the BAgent (a-2) and waits for the returned file descriptor (fd) from the BAgent (a-3). The BAgent traverses the path string in the locally cached directory tree to find the corresponding tree node. If the tree node is cached, the BAgent obtains the permission locally; otherwise, it obtains the complete location and permission data of the parent node and extends the cached directory tree. For the referenced file itself, the BAgent does not need to generate an RPC to collect its metadata from the server, since the file's permission is recorded in its parent directory. For example, when a user tries to open the file foo with the pathname "/a/b/foo" while the BAgent has cached the directories a/ and b/ locally in advance, the BAgent first obtains the data of b/ and inserts all of b/'s children into the cached tree, which includes foo's information, and then performs the permission check on foo. In the end, the BAgent picks and returns a valid fd to the BLib. Meanwhile, the BAgent marks the file object as incomplete-opened until the BServer finishes the rest of the open() operations.
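The lookup logic above can be sketched as follows; `TreeNode`, `open_local`, and the `fetch_dir` callback (standing in for the RPC that returns a directory's entries together with their permission bytes) are illustrative names of ours:

```python
class TreeNode:
    """One node of the BAgent's cached directory tree."""
    def __init__(self, perm, children=None):
        self.perm = perm              # permission info from the parent dir
        self.children = children      # None = directory contents not cached

def open_local(cache, path, fetch_dir):
    """Resolve `path` in the cached tree; on a miss, fetch the parent
    directory once and splice all of its children into the cache."""
    node, parts = cache, [p for p in path.split("/") if p]
    for i, comp in enumerate(parts):
        if node.children is None:     # cache miss on this directory
            fetched = fetch_dir("/" + "/".join(parts[:i]))
            node.children = {n: TreeNode(p) for n, p in fetched.items()}
        node = node.children[comp]
    return node.perm                  # checked locally, no open() RPC
```

Note that fetching b/ once populates every child of b/, so later opens of siblings of foo also resolve locally.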
Once the BLib receives and returns the fd to the application, BuffetFS can proceed to serve read() or write() operations (demonstrated in Fig. 2(b)). The BAgent attaches the incomplete-opened flag to the first read or write request issued by the same process (b-2) and sends it to the corresponding BServer based on the BuffetFS inode number (b-3). After parsing the RPC, the BServer executes the remaining open() operations (i.e., it updates the opened-file list) and then the read request. It identifies the back-end file by its fileID and returns the file data to the BAgent via one RPC (b-4). The BAgent finally returns the requested data (for read()) or the status (for write()) to the application (b-5).
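A sketch of this piggybacking, with illustrative names of ours (`BAgentFile`, `make_read_rpc`, `bserver_handle`) standing in for the BAgent/BServer RPC machinery:

```python
class BAgentFile:
    """Client-side file object after a local open()."""
    def __init__(self, ino):
        self.ino = ino
        self.incomplete_open = True   # server has not recorded the open yet

def make_read_rpc(f, offset, size):
    """Build a read RPC; only the first one carries the deferred-open flag."""
    rpc = {"op": "read", "ino": f.ino, "off": offset, "len": size,
           "complete_open": f.incomplete_open}
    f.incomplete_open = False
    return rpc

def bserver_handle(rpc, opened_files):
    """Server side: finish the deferred Step 2 of open(), then serve the read."""
    if rpc["complete_open"]:
        opened_files.add(rpc["ino"])  # update the opened-file list
    # ...serving the read itself is omitted in this sketch
```

This keeps the fast path at exactly one RPC: the open record rides along with the first data request instead of costing its own round trip.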
As for close(), the BAgent returns a signal immediately and performs an RPC asynchronously to inform the corresponding BServer to wrap up by removing the file object from the opened-file list.
To sum up, only the read() or write() RPC impacts the overall latency, while the RPC issued by close() can be hidden by asynchronous execution.
3.4. Metadata Modification and Consistency
Since BuffetFS executes permission checks on the client side, it introduces overhead for modifying file permissions. Upon any permission update, a BServer has to inform all related clients to invalidate the corresponding cache entries before executing the permission change.
For each directory, a BServer records a list of clients that cache the directory data. Upon any file permission change, a BServer therefore knows all the related clients. The BServer issues RPCs to inform the corresponding BAgents to invalidate the involved tree nodes. After receiving all the responses from the clients, the BServer then executes the permission modification.
If a BAgent tries to access an invalidated tree node, it asks the BServer for the updated permission. This mechanism ensures strong consistency of the metadata. Other metadata modifications, such as file renaming and file migration, theoretically cause similar overheads in both BuffetFS and conventional distributed file systems: they all need to ask the related clients to invalidate the corresponding locally cached metadata.
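The invalidate-then-modify sequence can be sketched as below; the `BServerPerms` class and the `notify` callback (modeling the invalidation RPC and its acknowledgment) are our illustrative constructs:

```python
class BServerPerms:
    """Server-side bookkeeping for client caches and permissions (sketch)."""
    def __init__(self, notify):
        self.watchers = {}     # directory -> set of client ids caching it
        self.perms = {}        # path -> permission bits
        self.notify = notify   # notify(client, path) models the invalidate RPC

    def record_cache(self, client, directory):
        """Remember that `client` has cached `directory`'s entries."""
        self.watchers.setdefault(directory, set()).add(client)

    def chmod(self, directory, path, new_perm):
        # 1) invalidate every client that caches the parent directory,
        #    waiting for each acknowledgment (synchronous in this sketch)
        for client in self.watchers.get(directory, set()):
            self.notify(client, path)
        # 2) only then apply the permission change
        self.perms[path] = new_perm
```

Applying the change only after all acknowledgments is what gives the strong consistency described above, at the cost of blocking the rare permission update.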
4. BuffetFS Evaluation
We carry out our evaluation on a Sunway TaihuLight HPC testbed cluster to test access latency and concurrency. Each node is equipped with two 2.6 GHz 8-core 16-thread Intel Xeon CPUs and 64 GB RAM, running CentOS v7.5. The cluster runs a Lustre (v2.10) file system with 1 MDS and 4 OSSes interconnected with InfiniBand. Lustre's storage targets consist of 12 HDDs arranged as two groups of RAID6 by a Sugon DS800-F20 storage system, providing 71 TB of HDD storage capacity. We deploy BuffetFS on top of ext4 in the cluster and compare its performance with Lustre.
We arrange our tests in three groups: BuffetFS, Lustre-Normal, and Lustre-DoM. Lustre-Normal stands for the common configuration of Lustre, while Lustre-DoM presents the performance of tests running on Lustre in DoM mode (Data on MDS). With DoM, a small file's data can be stored on the MDS, so clients can obtain both metadata and data from the MDS (see Section 5).
Figure 3 shows the performance of accessing a single small file with three operations: open(), read(), and close(). BuffetFS achieves the lowest latency in this test compared to the two Lustre cases. First, it conceals the round-trip RPC caused by open(), so that only the read() RPC affects the latency (close() is executed asynchronously). Second, BuffetFS manages file locks inside the BServer for concurrency, while Lustre distributes its file locks among all of its clients.
Further, we explore the performance of the three groups under concurrent access (Figure 4). We fork different numbers of processes, each of which randomly accesses 1000 files among 100,000 4 KB files. To eliminate the effect of the data cache and other internal mechanisms in Lustre, we regenerate the file set for each test.
BuffetFS requests the directory data once and builds the directory tree on the client, which means that subsequent accesses to other files in the same directory can benefit from the cached directory tree without requesting metadata from the BServer. In contrast, the two Lustre cases have to request metadata from the MDS for each open() operation. As a result, BuffetFS delivers standout performance in this case.
5. Related Work
In this section, we discuss other studies focusing on optimizing metadata access to improve the performance of distributed file systems.
IndexFS (Ren et al., 2014) is a layered cluster file system that optimizes the metadata service of common distributed file systems (e.g., HDFS (Shvachko et al., 2010), Lustre (Cluster File Systems, 2002)). It caches a partial directory tree on the clients and uses a short-term lease for each cached directory entry to ensure its consistency. The client cache holds the directory entries visited before; in other words, when the client accesses a file, the components on the file path are cached on the client. However, this strategy is not always effective; for example, Lustre itself keeps directory entries valid on the client after access. Besides, open operations cannot benefit from the IndexFS cache, since it does not cache the last path component.
As mentioned above, Lustre (whamcloud, 2021) keeps directory entries valid on a client after they are accessed. Subsequent visits to valid entries do not need to contact the Metadata Server (MDS). Besides, the Data on MDS (DoM) (Wiki, 2018) mechanism speeds up small file access in Lustre. It stores file data on the MDS as an extended attribute of the file. With DoM, when an open() request arrives, the MDS attaches the file data to the returned RPC so that the client no longer needs to access the Object Storage Server (OSS). However, DoM only optimizes the open()-read()-close() sequence, while open()-write()-close() does not benefit from it. Further, DoM occupies space on the MDS, which is expensive for Lustre.
6. Conclusion

In this paper, we identified a data access challenge faced by many distributed file systems when accessing a large number of small files. To address it, we developed a prototype of a user-level file system called BuffetFS, which is then integrated into a cluster. Further, we applied BuffetFS to optimize the RPCs for the permission check in open(). Our experimental results demonstrate that BuffetFS can noticeably reduce the latency of open() in distributed file systems.
BuffetFS currently supports the major I/O operations, including open(), read(), write(), and close(). As ongoing work, we are extending BuffetFS to support the full POSIX I/O API and internal optimizations such as full metadata caching.
References

- Bok et al., 2017. An efficient distributed caching for accessing small files in HDFS. Cluster Computing 20, pp. 3579–3592.
- Chen et al., 2015. Newer is sometimes better: an evaluation of NFSv4.1. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '15), New York, NY, USA, pp. 165–176.
- Chen et al., 2017. vNFS: maximizing NFS performance with compounds and vectorized I/O. In 15th USENIX Conference on File and Storage Technologies (FAST 17), Santa Clara, CA, pp. 301–314.
- Chen et al., 2016. SeMiNAS: a secure middleware for wide-area network-attached storage. In Proceedings of the 9th ACM International Systems and Storage Conference (SYSTOR '16), New York, NY, USA.
- Cluster File Systems, 2002. The Lustre file system.
- Ghemawat et al., 2003. The Google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP '03), New York, NY, USA, pp. 29–43.
- Gluster, 2011. GlusterFS.
- LABORATORY, 2009. The Lustre Distributed Lock Manager (LDLM).
- Li et al., 2018. A flattened metadata service for distributed file systems. IEEE Transactions on Parallel and Distributed Systems 29 (12), pp. 2641–2657.
- Ren et al., 2014. IndexFS: scaling file system metadata performance with stateless caching and bulk insertion. In SC '14: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 237–248.
- Schmuck and Haskin, 2002. GPFS: a shared-disk file system for large computing clusters. In Proceedings of the 1st USENIX Conference on File and Storage Technologies (FAST '02), Berkeley, CA, USA.
- Shvachko et al., 2010. The Hadoop distributed file system. In 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), pp. 1–10.
- Thomson and Abadi, 2015. CalvinFS: consistent WAN replication and scalable metadata management for distributed file systems. In 13th USENIX Conference on File and Storage Technologies (FAST 15), Santa Clara, CA, pp. 1–14.
- whamcloud, 2021. The Lustre file system.
- Wiki, 2018. Data on MDT (DoM). Lustre Wiki.
- Xiao et al., 2015. ShardFS vs. IndexFS: replication vs. caching strategies for distributed metadata management in cloud storage systems. In Proceedings of the Sixth ACM Symposium on Cloud Computing (SoCC '15), New York, NY, USA, pp. 236–249.
- Yang et al., 2019. End-to-end I/O monitoring on a leading supercomputer. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), Boston, MA, pp. 379–394.