1 Introduction

In recent years, the needs of applications in areas such as Artificial Intelligence and big data have evolved. These applications require improved I/O operations to avoid bottlenecks when accessing large amounts of data.

The ARCOS research group is developing an ad-hoc parallel and distributed file system, Expand Ad-Hoc, which is designed for these new needs.

In [7], an initial prototype of Expand Ad-Hoc was introduced together with a complete evaluation, and the system's scalability was tested. For this purpose, the Marenostrum IV supercomputer [4] was used with up to 128 compute nodes and files of up to 4 TiB.

In this new work, we have added fault tolerance support based on replication to the Expand Ad-Hoc parallel file system and evaluated this new feature on the HPC4AI Laboratory supercomputer in Torino [10] using up to 64 homogeneous compute nodes.

The main goal of evaluating Expand Ad-Hoc with fault tolerance is to study the file system’s performance and scalability when using different replication levels. This study used the IOR and DLIO benchmarks and a real deep-learning application.

The rest of the paper is structured as follows: Sect. 2 reviews the state of the art; Sect. 3 describes the main features of Expand Ad-Hoc; Sect. 4 describes the fault tolerance model designed for Expand Ad-Hoc; Sect. 5 presents the results of evaluating Expand Ad-Hoc with fault tolerance on the HPC4AI Laboratory supercomputer in Torino. Finally, Sect. 6 contains the main conclusions and future work.

2 State of the Art

This section reviews parallel file systems, particularly ad-hoc file systems with fault tolerance support. Some well-known examples of fault-tolerant parallel file systems are GPFS [14], Lustre [8], and BeeGFS [9].

The GPFS [14] parallel file system originated in the Tiger Shark file system, a 1993 project of IBM's Almaden Research Center. Although the original design targeted high-performance multimedia applications, its properties proved well suited for scientific computing, which led to its widespread use as a backend file system on supercomputers. GPFS stands out for its client-level caching capability, which allows asynchronous I/O operations. To ensure data consistency, it uses a distributed lock manager. GPFS also offers fault tolerance through RAID-5 or data replication, the latter only if it has been enabled in the configuration.

The Lustre [2] file system started as a research project at Carnegie Mellon University (CMU) in 1999. Since the Beowulf cluster project [13], Lustre has become one of the most widely used storage solutions for Linux cluster computing. Lustre provides POSIX semantics, allowing concurrent read and write access to its files. This parallel file system also offers fault tolerance based on data replication through RAID-1, which can be combined as RAID-0+1 or RAID-1+0.

Another parallel file system is BeeGFS [9], whose development started in 2005, with a first beta version presented in 2007. Its design focuses on HPC use, maximizing performance and scalability by distributing data and metadata among the storage nodes. Like the previous systems, BeeGFS is also based on POSIX semantics. BeeGFS supports replication-based fault tolerance through its buddy groups, in which data and metadata are replicated [1].

Ad-hoc file systems allow the storage of the compute nodes to be virtualized dynamically. These file systems can adapt to the available compute node resources and the application's needs. This also brings data closer to the application and allows data locality to be exploited, which reduces the load on the supercomputer's backend storage system, distributed and shared among all the compute nodes [3]. Some examples of ad-hoc file systems are BurstFS [20], GekkoFS [19], and BeeOND [11].

One of the earliest known ad-hoc file systems is BurstFS [20], which is specially designed and optimized for use as a local file system. BurstFS has some drawbacks, however; for example, it provides no fault tolerance mechanisms and is not fully POSIX-compliant.

Another example of an ad-hoc file system is GekkoFS [18], whose development started in 2017. Among its main features are the use of the RocksDB database [6] to store metadata and the Mercury RPC interface [17] to handle communications. However, this ad-hoc file system is not fault-tolerant.

Finally, BeeOND [11] is another ad-hoc file system; it allows different instances of the BeeGFS [9] file system to be deployed dynamically, taking advantage of BeeGFS's benefits. By using BeeGFS as the backend file system, it inherits the fault tolerance described above. A notable difference of BeeOND is that it is free of charge for personal use but paid for professional use, unlike Expand Ad-Hoc.

3 Expand Ad-Hoc

Expand Ad-Hoc [7] is a distributed and parallel file system based on ad-hoc servers, which is specially designed for the execution of data-intensive applications. In this section, we will briefly describe its main features.

3.1 Expand Ad-Hoc Design

The Expand Ad-Hoc design is based on the client/server paradigm, as shown in Fig. 1, where clients and servers communicate using MPI, facilitating their use in HPC environments.

Fig. 1. Expand Ad-Hoc architecture design.

Ad-hoc servers can be deployed either on the compute nodes where the application will be executed or on other nodes. In the first case, they can take advantage of data locality, explained later in Subsect. 3.4, offering better I/O performance.

Expand Ad-Hoc uses the compute node’s local storage (HDD, SSD, shared memory, etc.) through operating system services such as POSIX.

3.2 Metadata Management

Expand Ad-Hoc does not use a centralized metadata manager. Instead, each subfile stored on an ad-hoc server has a small header reserved for metadata. Although all subfiles reserve this space, only the master node's subfile actually contains the metadata, as shown in Fig. 2. Expand Ad-Hoc chooses the master node among those that form the partition by applying a hash function to the file name, which returns the number of that file's master node.
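
As an illustration, a minimal sketch of such hash-based master selection is shown below; the hash function (djb2) and the helper names (hash_path, xpn_master_node) are assumptions for the example, not the actual Expand Ad-Hoc implementation.

/* Sketch: deriving a file's master node from a hash of its path, so any
 * client can locate the metadata without a central metadata server. */
static unsigned long hash_path(const char *path) {
    unsigned long h = 5381;                  /* djb2 string hash */
    while (*path)
        h = h * 33 + (unsigned char)*path++;
    return h;
}

/* Index of the server that holds the file's metadata header. */
int xpn_master_node(const char *path, int num_servers) {
    return (int)(hash_path(path) % num_servers);
}

Because every client computes the same value from the file name alone, no communication is needed to locate a file's metadata.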

3.3 Parallel Access

Expand Ad-Hoc uses a virtual file handle that groups the subfiles associated with each ad-hoc server. This makes it possible to optimize I/O operations, because each operation is divided into several sub-operations carried out in parallel on the corresponding ad-hoc servers.

3.4 Data Locality

As mentioned above, ad-hoc servers can be deployed on the same compute nodes where the application runs. This allows the application to take advantage of data locality whenever a process uses data stored on its own compute node. When this happens, the Expand Ad-Hoc client can detect it and access the data on that node directly, as shown in Fig. 1. This optimization reduces overhead and improves latency.

3.5 System Call Interception Library

Expand Ad-Hoc provides a POSIX system call interception library that, through the LD_PRELOAD environment variable, allows existing applications to use the Expand Ad-Hoc file system without modifying their source code.

When a system call is intercepted, the library first checks whether the file the call refers to is stored in Expand Ad-Hoc. If so, the corresponding Expand Ad-Hoc API function is called; otherwise, the corresponding libc.so library call is executed.
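
A minimal sketch of this interception mechanism, assuming a hypothetical xpn_open API function and an illustrative /xpn/ partition prefix, could look as follows:

#define _GNU_SOURCE
#include <dlfcn.h>
#include <string.h>
#include <sys/types.h>

#define XPN_PREFIX "/xpn/"   /* assumed partition prefix, for illustration */

/* Hypothetical Expand Ad-Hoc API function (assumption for this sketch). */
int xpn_open(const char *path, int flags, mode_t mode);

/* POSIX open() is actually variadic; a fixed third argument is a
 * simplification acceptable for a sketch. */
int open(const char *path, int flags, mode_t mode) {
    static int (*real_open)(const char *, int, mode_t) = NULL;
    if (!real_open)                        /* resolve libc's open lazily */
        real_open = (int (*)(const char *, int, mode_t))
                    dlsym(RTLD_NEXT, "open");

    if (strncmp(path, XPN_PREFIX, strlen(XPN_PREFIX)) == 0)
        return xpn_open(path, flags, mode);   /* file lives in Expand Ad-Hoc */

    return real_open(path, flags, mode);      /* fall through to libc */
}

Compiled as a shared library and loaded through LD_PRELOAD, such a wrapper is resolved before libc's open for every call the application makes.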

4 Expand Ad-Hoc Fault Tolerance Model

Originally, in Expand Ad-Hoc without fault tolerance (see Fig. 2), the local storage of several compute nodes can be combined to build a parallel partition. Each compute node runs an ad-hoc server that manages one or more directory trees where the distributed partition is stored. When a directory is created in Expand Ad-Hoc, it is created on all compute nodes. When a file is stored in Expand Ad-Hoc, a subfile is created on each server, and the file data is divided into blocks (the block size is defined at the partition level as the partition blocksize). These blocks are distributed among all the subfiles following a round-robin pattern starting at the node where the file metadata is stored (the master node).
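
Under this scheme, the server holding a given block can be computed directly from the block index; a minimal sketch, with illustrative names, is:

/* Sketch: round-robin placement without replication. Block b of a file
 * whose master node is 'master' lands on server (master + b) mod servers.
 * Names are illustrative, not the actual Expand Ad-Hoc code. */
int xpn_block_server(long block, int master, int num_servers) {
    return (int)((master + block) % num_servers);
}

/* For a byte offset, the block index follows from the partition blocksize. */
long xpn_block_of(long offset, long partition_blocksize) {
    return offset / partition_blocksize;
}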

Fig. 2. File structure and directory mapping in Expand Ad-Hoc.

Expand Ad-Hoc uses a fault tolerance model based on block replication. For a replication level N, each block is stored on N+1 different servers, ensuring that the system can tolerate the failure of up to N servers. Figure 3 shows the model with replication levels 1 and 2, illustrating how blocks are shared under replication.

With fault tolerance enabled, the block distribution algorithm first divides the file into blocks of equal size, then replicates each block N times (according to the replication level), and finally distributes all the blocks using the round-robin pattern. Figure 4 shows the result of this distribution with replication level 1 and the master node located on server Serv 0.
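
A sketch of this placement, consistent with the Fig. 4 example (4 servers, replication level 1, master on Serv 0) but with illustrative names, is:

/* Sketch: replica r of block b occupies position b*(N+1)+r in the
 * replicated block stream, which is then placed round-robin starting at
 * the master node. With repl_level = 0 this reduces to the unreplicated
 * mapping shown earlier. Names are illustrative. */
int xpn_replica_server(long block, int replica, int repl_level,
                       int master, int num_servers) {
    long slot = block * (repl_level + 1) + replica;
    return (int)((master + slot) % num_servers);
}

Under this sketch, the two copies of block 0 land on Serv 0 and Serv 1, those of block 1 on Serv 2 and Serv 3, and so on, wrapping around for the remaining blocks.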

Fig. 3. File structure with replication levels 1 and 2 in Expand Ad-Hoc.

Fig. 4. Block distribution using fault tolerance with 4 servers, replication level 1, and a file with 4 blocks.

4.1 Metadata Management

The metadata of a file in Expand Ad-Hoc is stored in a header at the beginning of its subfiles. This header is replicated on N nodes when N replicas are used. Figure 3 shows three nodes with replication levels 1 and 2. This allows the metadata access load, which would otherwise concentrate on the master node, to be shared among the nodes holding a replica, while also providing fault tolerance.

4.2 Read Optimizations

When blocks are replicated on different servers, there is a greater chance of finding the data locally, which enables read optimizations.

When a read operation is performed on a client node, Expand Ad-Hoc first checks whether a replica of the requested block is stored on that same node. Using this local replica takes advantage of locality. If no replica is found on the current client node, a server is chosen at random among those holding the replicated block (skipping servers marked as failed). The pseudocode of this algorithm can be seen in Listing 1.1.

Listing 1.1. Server selection algorithm for reads with replication.
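
A minimal C sketch consistent with this algorithm, reusing the illustrative xpn_replica_server helper from the placement sketch and assuming per-server failure flags, could be:

#include <stdlib.h>

/* From the placement sketch above (illustrative, not the real API). */
int xpn_replica_server(long block, int replica, int repl_level,
                       int master, int num_servers);

/* Sketch: prefer the local replica; otherwise pick a random healthy
 * server among those holding a copy. Returns -1 if no replica is
 * reachable. Names and failure flags are illustrative assumptions. */
int xpn_select_read_server(long block, int repl_level, int master,
                           int num_servers, int local_server,
                           const int *server_failed) {
    int candidates[repl_level + 1];   /* C99 variable-length array */
    int n = 0;

    for (int r = 0; r <= repl_level; r++) {
        int s = xpn_replica_server(block, r, repl_level, master, num_servers);
        if (server_failed[s])
            continue;                 /* skip servers marked as erroneous */
        if (s == local_server)
            return s;                 /* locality: read the local replica */
        candidates[n++] = s;
    }
    return (n > 0) ? candidates[rand() % n] : -1;
}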

5 Evaluation

This section presents the performance evaluation of Expand Ad-Hoc, with and without fault tolerance support, compared with the BeeGFS 7.3.1 and GekkoFS 0.9.1 file systems. For this evaluation, we used the IOR and DLIO benchmarks and a real deep learning application that trains a multispectral ResNet Convolutional Neural Network (CNN) on BigEarthNet, a large remote sensing dataset.

The evaluation was conducted on the HPC4AI Laboratory cluster in Torino, described in Table 1. This cluster has a total of 2448 processors and 8.50 TiB of main memory.

Table 1. Summary of the HPC4AI Laboratory

The default configuration recommended for each file system has been used. For the Expand Ad-Hoc parallel file system, the block size was 512 KiB, each node’s local storage was its SSD device, and the supercomputer’s default MPI configuration was used.

To evaluate the performance of Expand Ad-Hoc's fault tolerance, different replication levels are used, both without failures and with as many failed servers as the replication level allows. To simulate server errors, servers are stopped at the beginning of the test so that the clients cannot communicate with them, and they are marked as erroneous.

All tests used one Expand Ad-Hoc server per compute node (from 4 to 64). Each test was run as an individual job with exclusive access to the compute nodes, using the SLURM (Simple Linux Utility for Resource Management) queuing system [16].

5.1 IOR Evaluation

The open-source IOR benchmark is widely used to evaluate parallel and distributed file systems, as it makes it possible to simulate different I/O loads (read/write operations, shared/individual file access, etc.) and to measure the bandwidth achieved.

To evaluate Expand Ad-Hoc with the IOR benchmark, different configurations were used to obtain the write and read bandwidth for parallel accesses to a shared file:

  • Compute nodes: 4, 8, 16, 32, and 64.

  • Local storage: SSD.

  • Transfer size used in IOR: 64 KiB, 512 KiB and 1 MiB.

  • Client processes per compute node: 8.

  • Operations: read and write in parallel on a shared file.

  • Size written by each client process: 1 GiB (a maximum file size of 512 GiB).

  • File systems: BeeGFS, GekkoFS, and Expand Ad-Hoc.

  • Expand Ad-Hoc replication levels: 0, 1, 2, and 3.

Fig. 5. BeeGFS vs. GekkoFS vs. Expand Ad-Hoc with fault tolerance. Bandwidth (MiB/s) when writing and reading with different transfer sizes (64 KiB, 512 KiB, and 1 MiB) and compute node counts (4, 8, 16, 32, and 64), with 8 client processes per node and a shared file. Results on a logarithmic scale.

Fig. 6. BeeGFS vs. GekkoFS vs. Expand Ad-Hoc with fault tolerance and failed servers. Bandwidth (MiB/s) when writing and reading with different transfer sizes (64 KiB, 512 KiB, and 1 MiB) and compute node counts (4, 8, 16, 32, and 64), with 8 client processes per node and a shared file. Results on a logarithmic scale.

Figures 5 and 6 show the bandwidth (MiB/s, logarithmic scale) when writing and reading data using different transfer sizes on a shared file, without failed servers and with as many failed servers as the replication level allows, respectively. It should be noted that BeeGFS and GekkoFS do not provide data replication, which should be kept in mind when interpreting the following results.

The performance results obtained with the IOR benchmark show that, when using a shared file among all processes (see Fig. 5), Expand Ad-Hoc with replication level 0 obtains much better write and read performance than BeeGFS and GekkoFS. As the replication level grows in Expand Ad-Hoc, a gradual decrease in write bandwidth is seen, owing to the gradual growth of the amount of data to be written at each replication level. Despite this, with a transfer size of 64 KiB and replication level 2, better results are achieved than with BeeGFS and GekkoFS from 16 nodes onwards. With the other transfer sizes, Expand Ad-Hoc up to replication level 1 achieves better results than the other file systems.

Read bandwidth, in contrast, gradually improves with the replication level, thanks to the read optimizations explained in Subsect. 4.2. It should be noted that with 4 nodes the increase is much more significant. This is due to the increase in data locality: with replication level 3 there is total data locality, since each node holds a copy of every block, as seen in Fig. 3.

Regarding the results obtained with failed servers (see Fig. 6), the write bandwidth obtained without failures is maintained. The read bandwidth, however, is affected, for two reasons. First, the read optimizations cannot be used for the affected blocks: a block whose server has failed must be read from the remaining healthy server, with no possibility of choosing among several as the optimization does. Second, and as a consequence, that particular server becomes overloaded; as more read operations arrive at it, the bandwidth is reduced.

Despite the failures, Expand Ad-Hoc with fault tolerance and failed servers performs better than BeeGFS and GekkoFS on writes up to replication level 1. It also outperforms BeeGFS and GekkoFS on reads from 8 nodes onwards at all replication levels.

In summary, the evaluation with the IOR benchmark shows that, even with up to replication level 2 in Expand Ad-Hoc with fault tolerance and with as many failed servers as the replication level allows, better results are obtained than with BeeGFS and GekkoFS. It should also be noted that Expand Ad-Hoc shows better scalability than BeeGFS and GekkoFS.

5.2 DLIO Evaluation

The DLIO [5] benchmark was developed by the Argonne Leadership Computing Facility. It measures the performance of a file system by emulating the I/O behavior of deep learning scientific applications.

To evaluate Expand Ad-Hoc’s performance with fault tolerance and failing servers, we used the DLIO benchmark with the UNET3D workload. The following configuration was used:

  • Compute nodes: 4, 8, 16, 32, and 64.

  • Storage: Local SSD of each compute node.

  • UNET3D Workload: 3D medical image segmentation.

  • Dataset size: 36 GiB.

  • Epochs: 10.

  • File systems: BeeGFS and Expand Ad-Hoc.

  • Expand Ad-Hoc replication levels: 0, 1, 2, and 3.

Figure 7 shows the I/O bandwidth (MiB/s) obtained during the training performed by DLIO for the UNET3D workload on BeeGFS and on Expand Ad-Hoc with fault tolerance, with different replication levels and failed servers. It should be noted that it was not possible to obtain GekkoFS results for this test, because the tested version of this file system does not support this benchmark. The results show that the bandwidth obtained by Expand Ad-Hoc with fault tolerance is higher than that of BeeGFS in all configurations, even with failed servers.

Fig. 7. BeeGFS vs. Expand Ad-Hoc with fault tolerance and failed servers. Bandwidth (MiB/s) of the training performed by DLIO with 4, 8, 16, 32, and 64 compute nodes.

This behavior occurs because Expand Ad-Hoc takes advantage of the data locality explained above. Since UNET3D performs 10 epochs on the same dataset, and Expand Ad-Hoc is designed to use the nodes' local storage to hold the data each node will need during the execution of an application, remote accesses to the storage system shared by all nodes are avoided. Therefore, the more epochs are performed, the more remote accesses are avoided.

Superlinear growth is observed from 4 to 8 nodes and from 32 to 64 nodes; this may be because the server configuration and the application obtain better data locality in these cases.

It is important to highlight Expand Ad-Hoc's better scalability compared to BeeGFS in this benchmark, since the more compute nodes are used, the greater Expand Ad-Hoc's advantage. It should also be noted that, even with failed servers, the speeds are very similar to those obtained with all servers in good working condition.

5.3 Real Deep Learning Application Evaluation

Finally, we evaluated the performance of Expand Ad-Hoc in a real deep learning application provided by the FZJ Research Institute. This application uses Horovod to train a multispectral (not only RGB channels but also infrared) ResNet Convolutional Neural Network (CNN) on BigEarthNet, a large remote sensing dataset, while performing classification on a subset of the dataset. The classification problem is multi-label, meaning that more than one label can be associated with each sample [15]. In this application, the total amount of data read is proportional to the total number of processes involved, i.e., less data is read when fewer processes are used. The configurations used for these evaluations were:

  • Compute nodes: 4, 8, 16, 32, and 64.

  • Storage: Local SSD of each compute node.

  • Dataset size: 1.19 GiB train, 0.55 GiB validation, and 0.55 GiB test.

  • Epochs: 500.

  • File systems: BeeGFS and Expand Ad-Hoc.

  • Expand Ad-Hoc replication levels: 0, 1, 2, and 3.

As seen in Fig. 8, Expand Ad-Hoc achieves better execution times than BeeGFS from 16 nodes onwards, demonstrating better scalability, since the difference with respect to BeeGFS grows with the node count. It reduces the total training execution time on 64 nodes by about 50%. It is also worth mentioning that, despite having as many failed servers as the replication level allows, Expand Ad-Hoc's performance is not diminished in this application; on the contrary, with 4 nodes a considerable improvement is seen as locality increases. Note that with 64 nodes the execution time is longer because the training dataset is only 1.12 GiB; with a larger dataset, the time would be better than with 32 nodes.

Fig. 8. BeeGFS vs. Expand Ad-Hoc with fault tolerance. Columns show the execution time (s) and lines show the bandwidth (MiB/s) of the training performed by the real deep learning application with 4, 8, 16, 32, and 64 compute nodes.

6 Conclusions and Future Work

This paper introduces the Expand Ad-Hoc parallel file system with replication-based fault tolerance support for HPC environments. The system has been evaluated on the HPC4AI Laboratory cluster in Torino using the IOR and DLIO benchmarks and a real deep learning application.

As the evaluations presented in this paper show, the design of Expand Ad-Hoc with replication-based fault tolerance (including MPI-based communication with the data servers, locality exploitation, and the optimizations enabled by data replication) provides good scalability and, up to replication level 1, higher bandwidth than BeeGFS and GekkoFS. Expand Ad-Hoc has been shown to reduce execution time by 50% compared to the BeeGFS parallel file system on 64 compute nodes in a real deep learning application.

As future work, we propose studying new replication models for the file system and evaluating the fault tolerance models with more real applications on different platforms.