1 Introduction

In recent years, the needs of applications in areas such as Artificial Intelligence and big data have evolved. These applications require improved I/O operations to avoid bottlenecks when accessing large amounts of data.

The ARCOS research group is developing an ad-hoc parallel and distributed file system, Expand Ad-Hoc, which is designed for these new needs.

In [7], an initial prototype of Expand Ad-Hoc was introduced together with a complete evaluation, and the system's scalability was tested. For this purpose, the Marenostrum IV supercomputer [4] was used with up to 128 compute nodes and files of up to 4 TiB.

In this new work, we have added fault tolerance support based on replication to the Expand Ad-Hoc parallel file system and evaluated this new feature on the HPC4AI Laboratory supercomputer in Torino [10] using up to 64 homogeneous compute nodes.

The main goal of evaluating Expand Ad-Hoc with fault tolerance is to study the file system’s performance and scalability when using different replication levels. This study used the IOR and DLIO benchmarks and a real deep-learning application.

The rest of the paper is structured as follows: Sect. 2 reviews the state of the art; Sect. 3 describes the main features of Expand Ad-Hoc; Sect. 4 describes the fault tolerance model designed for Expand Ad-Hoc; Sect. 5 presents the results of evaluating Expand Ad-Hoc with fault tolerance on the HPC4AI Laboratory supercomputer in Torino. Finally, Sect. 6 contains the main conclusions and future work.

2 State of the Art

This section reviews parallel file systems, particularly ad-hoc file systems with fault tolerance support. Some well-known examples of fault-tolerant parallel file systems are GPFS [14], Lustre [8], and BeeGFS [9].

The GPFS [14] parallel file system originated in the Tiger Shark file system, a 1993 project of IBM's Almaden Research Center. Although the original design targeted high-performance multimedia applications, its properties proved well suited for scientific computing, which led to its widespread use as a backend file system on supercomputers. GPFS stands out for its client-level caching capability, which allows asynchronous I/O operations. To ensure data consistency, it uses a distributed lock manager. GPFS also offers fault tolerance through RAID-5 or data replication, the latter only if it has been enabled in the configuration.

The Lustre [2] file system started as a research project at Carnegie Mellon University (CMU) in 1999. Since the Beowulf cluster project [13], Lustre has become one of the most widely used storage solutions for Linux cluster computing. Lustre provides POSIX semantics, allowing concurrent read and write access to its files. This parallel file system also offers fault tolerance based on data replication through RAID-1, which can be combined as RAID-0+1 or RAID-1+0.

Another parallel file system is BeeGFS [9], whose development started in 2005, with a first beta version presented in 2007. Its design focuses on HPC use, maximizing performance and scalability by distributing data and metadata among the storage nodes. Like the previous systems, BeeGFS is also based on POSIX semantics. BeeGFS supports replication-based fault tolerance through its buddy groups, in which data and metadata are replicated [1].

Ad-hoc file systems allow the storage of the compute nodes to be virtualized dynamically. These file systems can adapt to the available compute node resources and the application's needs. This also brings data closer to the application and allows data locality to be exploited, which reduces the load on the supercomputer's backend storage system, distributed and shared among all the compute nodes [3]. Some examples of ad-hoc file systems are BurstFS [20], GekkoFS [19], and BeeOND [11].

One of the earliest known ad-hoc file systems is BurstFS [20], which is specially designed and optimized for use as a local file system. BurstFS has some drawbacks, however; for example, it provides no fault tolerance mechanisms and is not fully POSIX-compliant.

Another example of an ad-hoc file system is GekkoFS [18], whose development started in 2017. Among its main features are the use of the RocksDB database [6] to store metadata and the Mercury RPC interface [17] to handle communications. However, this ad-hoc file system is not fault-tolerant.

Finally, BeeOND [11] is another ad-hoc file system; it allows different instances of the BeeGFS [9] file system to be deployed dynamically, taking advantage of BeeGFS's benefits. By using BeeGFS as the backend file system, it inherits the fault tolerance described above. A notable difference of BeeOND is that it is free of charge for personal use but paid for professional use, unlike Expand Ad-Hoc.

3 Expand Ad-Hoc

Expand Ad-Hoc [7] is a distributed and parallel file system based on ad-hoc servers, which is specially designed for the execution of data-intensive applications. In this section, we will briefly describe its main features.

3.1 Expand Ad-Hoc Design

The Expand Ad-Hoc design is based on the client/server paradigm, as shown in Fig. 1, where clients and servers communicate using MPI, facilitating their use in HPC environments.

Fig. 1. Expand Ad-Hoc architecture design.

Ad-hoc servers can be deployed either on the compute nodes where the application will be executed or on other nodes. In the first case, they can take advantage of data locality, explained later in Subsect. 3.4, offering better I/O performance.

Expand Ad-Hoc uses the compute node’s local storage (HDD, SSD, shared memory, etc.) through operating system services such as POSIX.

3.2 Metadata Management

Expand Ad-Hoc does not use a centralized metadata manager. Instead, each subfile stored on an ad-hoc server has a small header reserved for metadata. Although all subfiles reserve this space, only the master node's subfile actually contains the metadata, as shown in Fig. 2. Expand Ad-Hoc chooses the master node among those that form the partition by applying a hash function to the file name, which returns the number of that file's master node.
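
As an illustration, a minimal sketch of such hash-based master selection is shown below; the hash function (djb2) and the helper names (hash_path, xpn_master_node) are assumptions for the example, not the actual Expand Ad-Hoc implementation.

/* Sketch: deriving a file's master node from a hash of its path, so any
 * client can locate the metadata without a central metadata server. */
static unsigned long hash_path(const char *path) {
    unsigned long h = 5381;                  /* djb2 string hash */
    while (*path)
        h = h * 33 + (unsigned char)*path++;
    return h;
}

/* Index of the server that holds the file's metadata header. */
int xpn_master_node(const char *path, int num_servers) {
    return (int)(hash_path(path) % num_servers);
}

Because every client computes the same value from the file name alone, no communication is needed to locate a file's metadata.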

3.3 Parallel Access

Expand Ad-Hoc uses a virtual file handle that groups the subfiles associated with each ad-hoc server. This makes it possible to optimize I/O operations, because each operation is divided into several sub-operations carried out in parallel on the corresponding ad-hoc servers.

3.4 Data Locality

As mentioned above, ad-hoc servers can be deployed on the same compute nodes where the application runs. This allows the application to take advantage of data locality whenever a process uses data stored on its own compute node. When this happens, the Expand Ad-Hoc client can detect it and access the data on that node directly, as shown in Fig. 1. This optimization reduces overhead and improves latency.

3.5 System Call Interception Library

Expand Ad-Hoc provides a POSIX system call interception library that, through the LD_PRELOAD environment variable, allows existing applications to use the Expand Ad-Hoc file system without modifying their source code.

When a system call is intercepted, the library first checks whether the file the call refers to is stored in Expand Ad-Hoc. If so, the corresponding Expand Ad-Hoc API function is called; otherwise, the corresponding libc.so library call is executed.
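
A minimal sketch of this interception mechanism, assuming a hypothetical xpn_open API function and an illustrative /xpn/ partition prefix, could look as follows:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <string.h>
#include <sys/types.h>

#define XPN_PREFIX "/xpn/"   /* assumed partition prefix, for illustration */

/* Hypothetical Expand Ad-Hoc API function (assumption for this sketch). */
int xpn_open(const char *path, int flags, mode_t mode);

/* POSIX open() is actually variadic; a fixed third argument is a
 * simplification acceptable for a sketch. */
int open(const char *path, int flags, mode_t mode) {
    static int (*real_open)(const char *, int, mode_t) = NULL;
    if (!real_open)                        /* resolve libc's open lazily */
        real_open = (int (*)(const char *, int, mode_t))
                    dlsym(RTLD_NEXT, "open");

    if (strncmp(path, XPN_PREFIX, strlen(XPN_PREFIX)) == 0)
        return xpn_open(path, flags, mode);   /* file lives in Expand Ad-Hoc */

    return real_open(path, flags, mode);      /* fall through to libc */
}

Compiled as a shared library and loaded through LD_PRELOAD, such a wrapper is resolved before libc's open for every call the application makes.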

4 Expand Ad-Hoc Fault Tolerance Model

Originally, in Expand Ad-Hoc without fault tolerance (see Fig. 2), the local storage of several compute nodes can be combined to build a parallel partition. Each compute node runs an ad-hoc server that manages one or more directory trees where the distributed partition is stored. When a directory is created in Expand Ad-Hoc, it is created on all compute nodes. When a file is stored in Expand Ad-Hoc, a subfile is created on each server, and the file data is divided into blocks (the block size is defined at the partition level as the partition blocksize). These blocks are distributed among all the subfiles following a round-robin pattern starting at the node where the file metadata is stored (the master node).
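
Under this scheme, the server holding a given block can be computed directly from the block index; a minimal sketch, with illustrative names, is:

/* Sketch: round-robin placement without replication. Block b of a file
 * whose master node is 'master' lands on server (master + b) mod servers.
 * Names are illustrative, not the actual Expand Ad-Hoc code. */
int xpn_block_server(long block, int master, int num_servers) {
    return (int)((master + block) % num_servers);
}

/* For a byte offset, the block index follows from the partition blocksize. */
long xpn_block_of(long offset, long partition_blocksize) {
    return offset / partition_blocksize;
}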

Fig. 2. File structure and directory mapping in Expand Ad-Hoc.

Expand Ad-Hoc uses a fault tolerance model based on block replication. For a replication level N, each block is stored on N+1 different servers, ensuring that the system can tolerate the failure of up to N servers. Figure 3 shows the model with replication levels 1 and 2, illustrating how blocks are shared under replication.

With fault tolerance enabled, the block distribution algorithm first divides the file into blocks of equal size, then replicates each block N times (according to the replication level), and finally distributes all the blocks using the round-robin pattern. Figure 4 shows the result of this distribution with replication level 1 and the master node located on server Serv 0.
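
A sketch of this placement, consistent with the Fig. 4 example (4 servers, replication level 1, master on Serv 0) but with illustrative names, is:

/* Sketch: replica r of block b occupies position b*(N+1)+r in the
 * replicated block stream, which is then placed round-robin starting at
 * the master node. With repl_level = 0 this reduces to the unreplicated
 * mapping shown earlier. Names are illustrative. */
int xpn_replica_server(long block, int replica, int repl_level,
                       int master, int num_servers) {
    long slot = block * (repl_level + 1) + replica;
    return (int)((master + slot) % num_servers);
}

Under this sketch, the two copies of block 0 land on Serv 0 and Serv 1, those of block 1 on Serv 2 and Serv 3, and so on, wrapping around for the remaining blocks.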

Fig. 3. File structure with replication levels 1 and 2 in Expand Ad-Hoc.

Fig. 4. Block distribution using fault tolerance with 4 servers, replication level 1, and a file with 4 blocks.

4.1 Metadata Management

The metadata of a file in Expand Ad-Hoc is stored in a header at the beginning of its subfiles. This header is replicated on N nodes when N replicas are used. Figure 3 shows three nodes with replication levels 1 and 2. This allows the metadata access load, which would otherwise concentrate on the master node, to be shared among the nodes holding a replica, while also providing fault tolerance.

4.2 Read Optimizations

When blocks are replicated on different servers, there is a greater chance of finding the data locally, which enables read optimizations.

When a read operation is performed on a client node, Expand Ad-Hoc first checks whether a replica of the requested block is stored on that same node. Using this local replica takes advantage of locality. If no replica is found on the current client node, a server is chosen at random among those holding the replicated block (skipping servers marked as failed). The pseudocode of this algorithm can be seen in Listing 1.1.

Listing 1.1. Server selection algorithm for reads with replication.
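
A minimal C sketch consistent with this algorithm, reusing the illustrative xpn_replica_server helper from the placement sketch and assuming per-server failure flags, could be:

#include <stdlib.h>

/* From the placement sketch above (illustrative, not the real API). */
int xpn_replica_server(long block, int replica, int repl_level,
                       int master, int num_servers);

/* Sketch: prefer the local replica; otherwise pick a random healthy
 * server among those holding a copy. Returns -1 if no replica is
 * reachable. Names and failure flags are illustrative assumptions. */
int xpn_select_read_server(long block, int repl_level, int master,
                           int num_servers, int local_server,
                           const int *server_failed) {
    int candidates[repl_level + 1];   /* C99 variable-length array */
    int n = 0;

    for (int r = 0; r <= repl_level; r++) {
        int s = xpn_replica_server(block, r, repl_level, master, num_servers);
        if (server_failed[s])
            continue;                 /* skip servers marked as erroneous */
        if (s == local_server)
            return s;                 /* locality: read the local replica */
        candidates[n++] = s;
    }
    return (n > 0) ? candidates[rand() % n] : -1;
}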

5 Evaluation

This section presents the performance evaluation of Expand Ad-Hoc, with and without fault tolerance support, compared with the BeeGFS 7.3.1 and GekkoFS 0.9.1 file systems. For this evaluation, we used the IOR and DLIO benchmarks and a real deep learning application that trains a multispectral ResNet Convolutional Neural Network (CNN) on BigEarthNet, a large remote sensing dataset.

The evaluation was conducted on the HPC4AI Laboratory cluster in Torino, described in Table 1. This cluster has a total of 2448 processors and 8.50 TiB of main memory.

Table 1. Summary of the HPC4AI Laboratory

The default configuration recommended for each file system has been used. For the Expand Ad-Hoc parallel file system, the block size was 512 KiB, each node’s local storage was its SSD device, and the supercomputer’s default MPI configuration was used.

To evaluate the performance of Expand Ad-Hoc's fault tolerance, different replication levels are used, both without failures and with as many failed servers as the replication level allows. To simulate server errors, servers are stopped at the beginning of the test so that the clients cannot communicate with them, and they are marked as erroneous.

All tests used one Expand Ad-Hoc server per compute node (from 4 to 64). Each test was run as an individual job with exclusive access to the compute nodes, using the SLURM (Simple Linux Utility for Resource Management) queuing system [16].

5.1 IOR Evaluation

The open-source IOR benchmark is widely used to evaluate parallel and distributed file systems, as it makes it possible to simulate different I/O loads (read/write operations, shared/individual file access, etc.) and to measure the bandwidth achieved.

To evaluate Expand Ad-Hoc with the IOR benchmark, different configurations were used to obtain the write and read bandwidth for parallel accesses to a shared file:

  • Compute nodes: 4, 8, 16, 32, and 64.

  • Local storage: SSD.

  • Transfer size used in IOR: 64 KiB, 512 KiB and 1 MiB.

  • Client processes per compute node: 8.

  • Operations: read and write in parallel on a shared file.

  • Size written by each client process: 1 GiB (a maximum file size of 512 GiB).

  • File systems: BeeGFS, GekkoFS, and Expand Ad-Hoc.

  • Expand Ad-Hoc replication levels: 0, 1, 2, and 3.

Fig. 5. BeeGFS vs. GekkoFS vs. Expand Ad-Hoc with fault tolerance. Bandwidth (MiB/s) when writing and reading with different transfer sizes (64 KiB, 512 KiB, and 1 MiB) and compute node counts (4, 8, 16, 32, and 64), with 8 client processes per node and a shared file. Results on a logarithmic scale.

Fig. 6. BeeGFS vs. GekkoFS vs. Expand Ad-Hoc with fault tolerance and failed servers. Bandwidth (MiB/s) when writing and reading with different transfer sizes (64 KiB, 512 KiB, and 1 MiB) and compute node counts (4, 8, 16, 32, and 64), with 8 client processes per node and a shared file. Results on a logarithmic scale.

Figures 5 and 6 show the bandwidth (MiB/s, logarithmic scale) when writing and reading data using different transfer sizes on a shared file, without failed servers and with as many failed servers as the replication level allows, respectively. It should be noted that BeeGFS and GekkoFS do not provide data replication, which should be kept in mind when interpreting the following results.

The performance results obtained with the IOR benchmark show that, when using a shared file among all processes (see Fig. 5), Expand Ad-Hoc with replication level 0 obtains much better write and read performance than BeeGFS and GekkoFS. As the replication level grows in Expand Ad-Hoc, a gradual decrease in write bandwidth is seen, owing to the gradual growth of the amount of data to be written at each replication level. Despite this, with a transfer size of 64 KiB and replication level 2, better results are achieved than with BeeGFS and GekkoFS from 16 nodes onwards. With the other transfer sizes, Expand Ad-Hoc up to replication level 1 achieves better results than the other file systems.

Read bandwidth, in contrast, gradually improves with the replication level, thanks to the read optimizations explained in Subsect. 4.2. It should be noted that with 4 nodes the increase is much more significant. This is due to the increase in data locality: with replication level 3 there is total data locality, since each node holds a copy of every block, as seen in Fig. 3.

Regarding the results obtained with failed servers (see Fig. 6), the write bandwidth obtained without failures is maintained. The read bandwidth, however, is affected, for two reasons. First, the read optimizations cannot be used for the affected blocks: a block whose server has failed must be read from the remaining healthy server, with no possibility of choosing among several as the optimization does. Second, and as a consequence, that particular server becomes overloaded; as more read operations arrive at it, the bandwidth is reduced.

Despite the failures, Expand Ad-Hoc with fault tolerance and failed servers performs better than BeeGFS and GekkoFS on writes up to replication level 1. It also outperforms BeeGFS and GekkoFS on reads from 8 nodes onwards at all replication levels.

In summary, the evaluation with the IOR benchmark shows that, even with up to replication level 2 in Expand Ad-Hoc with fault tolerance and with as many failed servers as the replication level allows, better results are obtained than with BeeGFS and GekkoFS. It should also be noted that Expand Ad-Hoc shows better scalability than BeeGFS and GekkoFS.

5.2 DLIO Evaluation

The DLIO [5] benchmark was developed by the Argonne Leadership Computing Facility. It measures the performance of a file system by emulating the I/O behavior of deep learning scientific applications.

To evaluate Expand Ad-Hoc’s performance with fault tolerance and failing servers, we used the DLIO benchmark with the UNET3D workload. The following configuration was used:

  • Compute nodes: 4, 8, 16, 32, and 64.

  • Storage: Local SSD of each compute node.

  • UNET3D Workload: 3D medical image segmentation.

  • Dataset size: 36 GiB.

  • Epochs: 10.

  • File systems: BeeGFS and Expand Ad-Hoc.

  • Expand Ad-Hoc replication levels: 0, 1, 2, and 3.

Figure 7 shows the I/O bandwidth (MiB/s) obtained during the training performed by DLIO for the UNET3D workload on BeeGFS and on Expand Ad-Hoc with fault tolerance, with different replication levels and failed servers. It should be noted that it was not possible to obtain GekkoFS results for this test, because the tested version of this file system does not support this benchmark. The results show that the bandwidth obtained by Expand Ad-Hoc with fault tolerance is higher than that of BeeGFS in all configurations, even with failed servers.

Fig. 7. BeeGFS vs. Expand Ad-Hoc with fault tolerance and failed servers. Bandwidth (MiB/s) of the training performed by DLIO with 4, 8, 16, 32, and 64 compute nodes.

This behavior occurs because Expand Ad-Hoc takes advantage of the data locality explained above. Since UNET3D performs 10 epochs on the same dataset, and Expand Ad-Hoc is designed to use the nodes' local storage to hold the data each node will need during the execution of an application, remote accesses to the storage system shared by all nodes are avoided. Therefore, the more epochs are performed, the more remote accesses are avoided.

Superlinear growth is observed from 4 to 8 nodes and from 32 to 64 nodes; this may be because the server configuration and the application obtain better data locality in these cases.

It is important to highlight Expand Ad-Hoc's better scalability compared to BeeGFS in this benchmark, since the more compute nodes are used, the greater Expand Ad-Hoc's advantage. It should also be noted that, even with failed servers, the speeds are very similar to those obtained with all servers in good working condition.

5.3 Real Deep Learning Application Evaluation

Finally, we evaluated the performance of Expand Ad-Hoc in a real deep learning application provided by the FZJ Research Institute. This application uses Horovod to train a multispectral (not only RGB channels but also infrared) ResNet Convolutional Neural Network (CNN) on BigEarthNet, a large remote sensing dataset, while performing classification on a subset of the dataset. The classification problem is multi-label, meaning that more than one label can be associated with each sample [15]. In this application, the total amount of data read is proportional to the total number of processes involved, i.e., less data is read when fewer processes are used. The configurations used for these evaluations were:

  • Compute nodes: 4, 8, 16, 32, and 64.

  • Storage: Local SSD of each compute node.

  • Dataset size: 1.19 GiB train, 0.55 GiB validation, and 0.55 GiB test.

  • Epochs: 500.

  • File systems: BeeGFS and Expand Ad-Hoc.

  • Expand Ad-Hoc replication levels: 0, 1, 2, and 3.

As seen in Fig. 8, Expand Ad-Hoc achieves better execution times than BeeGFS from 16 nodes onwards, demonstrating better scalability, since the difference with respect to BeeGFS grows with the node count. It reduces the total training execution time on 64 nodes by about 50%. It is also worth mentioning that, despite having as many failed servers as the replication level allows, Expand Ad-Hoc's performance is not diminished in this application; on the contrary, with 4 nodes a considerable improvement is seen as locality increases. Note that with 64 nodes the execution time is longer because the training dataset is only 1.12 GiB; with a larger dataset, the time would be better than with 32 nodes.

Fig. 8. BeeGFS vs. Expand Ad-Hoc with fault tolerance. Columns show the execution time (s) and lines show the bandwidth (MiB/s) of the training performed by the real deep learning application with 4, 8, 16, 32, and 64 compute nodes.

6 Conclusions and Future Work

This paper introduces the Expand Ad-Hoc parallel file system with replication-based fault tolerance support for HPC environments. The system has been evaluated on the HPC4AI Laboratory cluster in Torino using the IOR and DLIO benchmarks and a real deep learning application.

As the evaluations presented in this paper show, the design of Expand Ad-Hoc with replication-based fault tolerance (including MPI-based communication with the data servers, locality exploitation, and the optimizations enabled by data replication) provides good scalability and, up to replication level 1, higher bandwidth than BeeGFS and GekkoFS. Expand Ad-Hoc has been shown to reduce execution time by 50% compared to the BeeGFS parallel file system on 64 compute nodes in a real deep learning application.

As future work, we propose studying new replication models for the file system and evaluating the fault tolerance models with more real applications on different platforms.