An SSD Primer for On-prem and Cloud HPC

Most clients are familiar with the SSDs in their laptops and with the great performance they deliver for CAE applications. There are tradeoffs to consider when using SSDs for HPC, both on-prem and in the cloud, that can bite you if you aren’t aware of them.

SSD Reliability: Don’t Drive The Tires Bald.

It is often assumed that an SSD must be more reliable than a hard drive (i.e., a spinning disk) because it has no moving parts. However, a recent study from Backblaze showed that SSDs are not actually more reliable than their spinning-disk counterparts.

One possible reason is that SSDs wear out by design. An SSD is built from NAND flash memory, and the 1’s and 0’s are written using voltage. Each voltage-based write wears the memory cells a little more, so the more you write, the faster the drive wears out. This is similar to the tires on your car: eventually you will wear the tread down through use. Most hardware vendors do not cover an SSD that has failed because it is “worn out” as part of their standard warranty.

Different types of SSD handle this wear differently. Although there are no moving parts, the drive’s firmware controls wear by spreading writes out across the drive, a technique known as wear leveling. In fact, most drives have more physical capacity than they advertise; as cells wear out, they are replaced from this “spare” capacity. For example, a 480GB Intel SSD has additional physical capacity that you can’t directly use or access, because it is reserved for replacing worn-out cells.
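To make the idea of spreading writes concrete, here is a toy simulation. It is only a sketch under simplifying assumptions (the function name `simulate_wear`, the block counts, and the "hot block" workload are all made up for illustration) and it does not represent how any real SSD controller is implemented.

```python
import random


def simulate_wear(num_blocks, writes, hot_fraction=0.1, wear_leveling=True, seed=0):
    """Return per-block erase counts after `writes` writes (toy model).

    Without wear leveling, every write lands on a small set of "hot" blocks.
    With wear leveling, each write is redirected to the least-worn block.
    """
    rng = random.Random(seed)
    erase_counts = [0] * num_blocks
    hot_blocks = max(1, int(num_blocks * hot_fraction))
    for _ in range(writes):
        if wear_leveling:
            # Redirect the write to the block with the fewest erases so far.
            target = min(range(num_blocks), key=erase_counts.__getitem__)
        else:
            # The workload keeps rewriting the same few blocks.
            target = rng.randrange(hot_blocks)
        erase_counts[target] += 1
    return erase_counts


if __name__ == "__main__":
    leveled = simulate_wear(100, 10_000, wear_leveling=True)
    unleveled = simulate_wear(100, 10_000, wear_leveling=False)
    print("most-worn block with wear leveling:   ", max(leveled))
    print("most-worn block without wear leveling:", max(unleveled))
```

The same number of writes is performed in both cases; the difference is how worn the single worst block ends up, which is what determines when cells start needing replacement from the spare capacity.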

When you see a drive rated as “read-intensive” or “write-intensive”, the rating describes how much write wear and tear the drive can handle, analogous to how much tread comes on the tire. Some flash storage tolerates more rewrites, and faster writes, than others. You will hear terms like Single-Level Cell (SLC) flash storage, which has the fastest write speeds but smaller capacities. Other types, including MLC, TLC, and QLC, tend to have higher capacities with reduced write speeds.
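A rough way to reason about "how much tread is on the tire" is to divide a drive's rated write endurance by how much your workload writes per day. The sketch below assumes the vendor publishes endurance as terabytes written (TBW); the function name and all of the numbers are hypothetical and chosen only to show the arithmetic.

```python
def years_until_worn_out(rated_tbw, writes_gb_per_day):
    """Estimate years until the rated write endurance is used up.

    rated_tbw:          vendor endurance rating in terabytes written (assumed input)
    writes_gb_per_day:  average data written per day by the workload, in GB
    """
    if writes_gb_per_day <= 0:
        raise ValueError("writes_gb_per_day must be positive")
    total_gb = rated_tbw * 1000  # decimal units: 1 TB = 1000 GB
    return total_gb / writes_gb_per_day / 365


if __name__ == "__main__":
    scratch_writes_gb_per_day = 2000  # hypothetical HPC scratch workload
    for label, tbw in [("read-intensive", 900), ("write-intensive", 8000)]:
        years = years_until_worn_out(tbw, scratch_writes_gb_per_day)
        print(f"{label} drive ({tbw} TBW): about {years:.1f} years")
```

This ignores write amplification and other real-world effects, so treat the result as an order-of-magnitude check when choosing between drive classes rather than a prediction.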

SSD Recoverability. You Deleted All Files With No Backups.

Note that if you accidentally delete a bunch of data housed on an SSD, data recovery can be more complicated. Many SSDs use a command called TRIM, with which the operating system tells the drive which data blocks are no longer in use so that the drive can erase them. Once trimmed blocks are erased, the data is gone (unlike a hard drive, where deleting a file removes only the pointer to the data and not the actual data blocks). For this reason, data deleted from an SSD with TRIM may be lost permanently and not be recoverable even by professional data recovery firms.
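If you want to know whether a given Linux block device advertises TRIM (called "discard" in Linux), one place to look is the `queue/discard_max_bytes` file under `/sys/block`; a non-zero value means the device reports discard support. The helper below is a minimal sketch: the function name is made up, and it only tells you what the device advertises, not whether your filesystem is actually issuing TRIM.

```python
from pathlib import Path


def supports_discard(device, sysfs_root="/sys/block"):
    """Return True if the Linux block device advertises discard (TRIM) support.

    Reads <sysfs_root>/<device>/queue/discard_max_bytes; a value greater
    than zero means the device reports that it accepts discard requests.
    Returns False if the file is missing or unreadable.
    """
    path = Path(sysfs_root) / device / "queue" / "discard_max_bytes"
    try:
        return int(path.read_text().strip()) > 0
    except (OSError, ValueError):
        return False


if __name__ == "__main__":
    root = Path("/sys/block")
    if root.is_dir():
        for dev in sorted(p.name for p in root.iterdir()):
            print(f"{dev}: discard supported = {supports_discard(dev)}")
    else:
        print("No /sys/block here; this check only applies to Linux.")
```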

SSD Firmware. Bugs, and Not the Cute Kind.


SSDs also run firmware, which has been the subject of various serious bugs. For example, in 2019 there was a flaw in enterprise SSDs from some vendors that caused them to fail at exactly 32,768 hours of operation. Even RAID will not protect you in that case, because drives put into service together reach that hour count at about the same time. In other cases, embedded SSDs in some recent InfiniBand switches wore out prematurely when running certain firmware versions, eventually bricking the device.
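To see why RAID is little help against an hour-count bug, it is worth doing the arithmetic. The sketch below converts 32,768 hours to calendar time and shows how close together drives in the same array would hit the limit. The drive names and power-on hours are hypothetical; on a real system you would read power-on hours from the drive's SMART data. (32,768 is 2**15, which looks like a fixed-width counter limit, but that is an inference here and not something confirmed above.)

```python
FAILURE_HOURS = 32_768  # hour count quoted above; equal to 2**15


def hours_remaining(power_on_hours):
    """Hours left before a drive reaches the failure hour count."""
    return max(0, FAILURE_HOURS - power_on_hours)


def as_years_days_hours(hours):
    """Split an hour count into (years, days, hours) using 365-day years."""
    days, rem_hours = divmod(hours, 24)
    years, rem_days = divmod(days, 365)
    return years, rem_days, rem_hours


if __name__ == "__main__":
    years, days, hours = as_years_days_hours(FAILURE_HOURS)
    print(f"{FAILURE_HOURS} hours = {years} years, {days} days, {hours} hours")

    # Hypothetical RAID set whose drives were installed at about the same time.
    array = {"disk0": 30_000, "disk1": 30_000, "disk2": 30_010}
    for name, poh in array.items():
        print(f"{name}: {hours_remaining(poh)} hours remaining")
```

Because the drives were powered on together, they all reach the limit within hours of each other, which is well inside the time a rebuild or replacement would take.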

What About SSDs on the Cloud?

You might be thinking that you don’t have to worry about all of this detail on the cloud. However, there are other concerns, unique to the cloud, to be aware of:

SSDs May Be Throttled: Credit-Based I/O

Some types of cloud SSD have credit-based performance, so you may be throttled when you need performance most. You could see blazing speed for brief periods while testing your application on this type of SSD, then go to production and have a sustained burst of activity bring your application down once you run out of I/O credits.

When your storage runs out of I/O credits, performance is throttled back to the baseline level. Many people new to the cloud do not realize that many cloud SSD types are metered this way, and do not monitor I/O credits until they get hit by this issue.

Your options are to increase the storage size to get more I/O credits, or to choose a different type of cloud SSD that lets you provision the specific amount of IOPS you need.
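The credit behavior described above can be sketched as a simple token bucket: the volume earns credits at its baseline rate, spends one credit per I/O, and can burst only while credits remain. This is a toy model; the function name and every number in it are hypothetical and do not correspond to any particular provider's limits or accounting rules.

```python
def simulate_burst_volume(demand_iops, baseline_iops, burst_iops,
                          initial_credits, max_credits=None):
    """Return the IOPS actually delivered each second (toy token bucket).

    demand_iops:     list of requested IOPS, one entry per second
    baseline_iops:   credits earned per second (also the throttled speed)
    burst_iops:      maximum IOPS while credits are available
    initial_credits: credits in the bucket at the start
    max_credits:     bucket capacity (defaults to initial_credits)
    """
    if max_credits is None:
        max_credits = initial_credits
    credits = initial_credits
    delivered = []
    for demand in demand_iops:
        # Earn credits at the baseline rate, up to the bucket capacity.
        credits = min(max_credits, credits + baseline_iops)
        # Serve as much as demand, the burst cap, and the credits allow.
        served = min(demand, burst_iops, credits)
        credits -= served
        delivered.append(served)
    return delivered


if __name__ == "__main__":
    demand = [3000] * 6  # a sustained burst of activity
    small = simulate_burst_volume(demand, baseline_iops=300,
                                  burst_iops=3000, initial_credits=10_000)
    large = simulate_burst_volume(demand, baseline_iops=900,
                                  burst_iops=3000, initial_credits=10_000)
    print("smaller volume:", small)
    print("larger volume: ", large)
```

In this model a short test run sees full burst speed, while a sustained load drains the bucket and drops to the baseline rate. Raising the baseline, which is what a larger volume buys you in the model, delays the drop and softens it, but only a volume type with provisioned IOPS removes it.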

Choose your SSD based on Workload Requirements.

A cloud provider may offer four or more types of SSD volume, each with different tradeoffs, so it is important to understand your use case and pick the right type to avoid unexpected performance issues.

Know the Costs

SSDs on the cloud cost more than magnetic/standard hard disks, so you may not want to pay the premium if you don’t require the performance. On the cloud, the choice of disk type doesn’t impact reliability: both options have an annual failure rate of around 0.1% to 0.2%, which is very low.
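A low per-volume failure rate still adds up across a cluster. The sketch below estimates the chance that at least one volume in a fleet fails in a year, using the 0.1% and 0.2% figures above. It assumes failures are independent, which is a simplification, and the fleet size of 100 volumes is hypothetical.

```python
def prob_any_failure(annual_failure_rate, volumes, years=1.0):
    """Probability that at least one of `volumes` fails within `years`.

    Assumes each volume fails independently with the given annual rate.
    """
    survive_one = (1.0 - annual_failure_rate) ** years
    return 1.0 - survive_one ** volumes


if __name__ == "__main__":
    for afr in (0.001, 0.002):
        p = prob_any_failure(afr, volumes=100)
        print(f"AFR {afr:.1%}, 100 volumes: {p:.1%} chance of at least one failure per year")
```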

Know the Availability and Durability

In some cases, the choice of SSD (or not SSD) will affect the availability and durability of your cloud data. For example, the data may be only locally redundant and could be lost if an entire cloud data center is lost. It is important to know how your choices affect availability and, as always, be sure to have 3-2-1 backups of critical data.
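The 3-2-1 rule means at least three copies of the data, on at least two different types of media, with at least one copy offsite. A quick way to sanity-check a backup plan against it is sketched below; the `BackupCopy` type and the example plan are hypothetical and only illustrate the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str      # label for this copy of the data
    media: str     # e.g. "ssd", "hdd", "tape", "object-storage"
    offsite: bool  # True if stored away from the primary site


def meets_3_2_1(copies):
    """Check a list of copies against the 3-2-1 backup rule."""
    enough_copies = len(copies) >= 3
    enough_media = len({c.media for c in copies}) >= 2
    has_offsite = any(c.offsite for c in copies)
    return enough_copies and enough_media and has_offsite


if __name__ == "__main__":
    plan = [
        BackupCopy("primary cloud SSD volume", "ssd", offsite=False),
        BackupCopy("snapshot in same data center", "ssd", offsite=False),
        BackupCopy("copy in another region", "object-storage", offsite=True),
    ]
    print("plan meets 3-2-1:", meets_3_2_1(plan))
```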

Summary

SSDs can be a great choice for many types of CAE workloads both on-prem and in the cloud, but they do come with some tradeoffs. Feel free to reach out to us about our TotalCAE managed HPC cluster appliances and cloud offerings to find the best design for your needs.