When we discuss HPC cluster architectures with customers, we are sometimes asked about HA (high availability) configurations.

Many classical HA configurations are overly complex. Complex configurations lead to complex bugs and complex debugging, and they often create more downtime than the rare issue they are meant to guard against. Catastrophic failures of a single point of failure (SPF) are extremely rare.

For customers who need extreme availability, we recommend a “multi-island” architecture that eliminates every SPF through a simple design, one that resists complexity and has been proven with customers over time.

There are some types of simple redundancy mechanisms that offer a good complexity/benefit tradeoff, such as scheduler failover.

For example, the SLURM scheduler offers a very simple failover mechanism. The only requirements are that another machine (typically the cluster login node) runs a SLURM controller, and that the two machines share a state directory over NFS.

The diagram below shows this architecture.

Slurm Failover

When the primary SLURM controller is unavailable, the backup controller transparently takes over. It queues data while it is in control, and when the primary head node becomes available again, it writes back any needed data and hands control back to the primary. All this requires is the shared state directory.

The following four settings enable HA in SLURM:

BackupController=[backup name]
BackupAddr=[backup address]
StateSaveLocation=[shared directory]
AccountingStorageBackupHost=[backup name]
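
As a quick sanity check, the settings listed above can be verified in a configuration file before restarting the controllers. The snippet below is a minimal sketch, not an official SLURM tool: the function name `missing_ha_keys` is hypothetical, and it only checks that each key is present, not that its value is correct.

```shell
#!/bin/sh
# Sketch: report which of the four HA-related settings are absent from a slurm.conf.
# missing_ha_keys is a hypothetical helper name, not part of SLURM.

missing_ha_keys() {
    conf="$1"
    for key in BackupController BackupAddr StateSaveLocation AccountingStorageBackupHost; do
        # Parameter names are matched case-insensitively; commented-out lines do not count.
        grep -Eqi "^[[:space:]]*${key}[[:space:]]*=" "$conf" || echo "$key"
    done
}
```

Calling `missing_ha_keys /etc/slurm/slurm.conf` (the path is an assumption and varies by installation) prints one missing key per line, and prints nothing when all four are set.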

Failover is automatic, but you can also force a takeover:

 scontrol takeover
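
Before or after a takeover, it can help to confirm how many controllers are responding. The `scontrol ping` command reports the state of each controller. The sketch below summarizes that output, assuming each line ends in `is UP` or `is DOWN` (for example `Slurmctld(primary) at head1 is UP`). That format should be checked against your SLURM version. The function names and the status words are hypothetical.

```shell
#!/bin/sh
# Sketch: summarize controller redundancy from `scontrol ping`-style output.
# Assumption: one line per controller, ending in "is UP" or "is DOWN".

# Read ping output on stdin and print the number of controllers reporting UP.
count_controllers_up() {
    # grep -c prints 0 and exits non-zero when nothing matches; keep the exit status clean.
    grep -c ' is UP$' || true
}

# Map a count of responding controllers to a one-word summary.
ha_status() {
    case "$1" in
        0) echo "down" ;;
        1) echo "degraded" ;;
        *) echo "redundant" ;;
    esac
}
```

On a cluster, this would be used as `ha_status "$(scontrol ping | count_controllers_up)"`. A result of `degraded` means only one of the two controllers is answering.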

The SLURM approach to HA aligns with the production philosophy TotalCAE has learned over the last twenty years: make everything as simple as possible. Simple solutions have fewer and simpler issues, which translates to higher uptime and availability.