JCP 2006 Vol.1(8): 43-54 ISSN: 1796-203X
doi: 10.4304/jcp.1.8.43-54
Symmetric Active/Active High Availability for High-Performance Computing System Services
Christian Engelmann1, Stephen L. Scott2, Chokchai Leangsuksun3, Xubin He4
1Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN, USA
2Department of Computer Science, The University of Reading, Reading, UK
3Computer Science Department, Louisiana Tech University, Ruston, LA, USA
4Department of Electrical and Computer Engineering, Tennessee Tech University, Cookeville, TN, USA
Abstract—This work aims to pave the way for high availability in high-performance computing (HPC) by focusing on efficient redundancy strategies for head and service nodes. These nodes represent single points of failure and control for an entire HPC system, as a failure renders the system inaccessible and unmanageable until repair. The presented approach introduces two distinct replication methods, internal and external, for providing symmetric active/active high availability for multiple redundant head and service nodes running in virtual synchrony. Both methods utilize an existing process group communication system for service group membership management and reliable, totally ordered message delivery. Presented results of a prototype implementation that offers symmetric active/active replication for HPC job and resource management using external replication show that the highest level of availability can be provided with an acceptable performance trade-off.
Index Terms—high-performance computing, high availability, virtual synchrony, group communication
Cite: Christian Engelmann, Stephen L. Scott, Chokchai Leangsuksun, Xubin He, "Symmetric Active/Active High Availability for High-Performance Computing System Services," Journal of Computers, vol. 1, no. 8, pp. 43-54, 2006.