In the not so new post Why SSD-based arrays are a bad idea, Robin Harris wonders what is the right form factor for flash-based disk arrays. My take on this subject is that form factor matters, but the right form factor is derived from your main design target. If for example your main target is latency then you are better off using DRAM/PCIe interfaces as the SAS/SATA interface introduce some latency and limits your control over the IO path. This applies to tier 0 systems. If you are more in the tier 1 area where some latency penalty can be traded for enterprise level functionalities such as cost, dedup, snapshots and replication, then SSD is probably your best choice. Why is that? let’s go over Robin’s arguments list one by one:
- Latency – not as important, the extra ~100 usec is not significant for most use cases
- Bandwidth – even though SSD are not the best form factor to drain the juice out of the flash chips, it is meaningless for most tier 1 systems because the bottleneck is not within the SSD level! As the array becomes smarter, most bottlenecks are shifted to the compute elements, memory buses, and IO buses in the system.
- Reliability – here I have to disagree with Robin. It is true that DIMMs are more reliable than disks and probably also more than SSDs, but Robin assume that a system that uses flash chips instead of SSDs is more reliable. This is not necessarily true! flash chips do not handle many (if not most) flash related issues in the chip level. They rely on an external controller and other components to perform critical tasks such as wear leveling, bad block handling and media scrubbing. Moreover, implementing such controllers by yourself assume that you are smarter than the SSD manufacturers and/or can produce some gains out of your low-level control. Personally I think believe in both assumptions. Anyhow, once you start implementing a flash controller you are getting to the same problems as SSD systems and would also get to the same level of reliability. There is a small part where Robin may have a point – if you don’t work with SSDs you can pass the local SCSI stack. But even that is questionable because not everything in the SCSI stack is wrong…
- Flexibility – as a SSD based disk array designer I can tell you that the SSD nature of the SSD didn’t cause us so many problems as you may assume even though we designed everything from scratch. This is because flash is still a block access media, and that’s what really count.
- Cost – as I already wrote, flash chips require flash controllers and other resources (RAM, compute) so the comparison Robin did to DRAM is not really apples to apples comparison. That said, it is possible that you can reduce the cost of the flash control sub-system, but as enterprise level SSDs starts to commodity, the large numbers economy is against such approach.
In fact I am willing to claim that even for tier 0 kind of systems it is not trivial to assume that flash chips/PCIe based design is better than SSD based design, because once you start make your system smarter and implement advanced functions, the device latency start to be insignificant. If you need a very “raw” performance box, flash chips desing/PCIe card may be better choice but then a server local PCIe card will be even better…
Disk array side read caches are much less effective than you can expect due to a phenomena a friend of mine calls “Cache shadowing“. This happens because many if not most IO oriented applications have an application level cache and in addition you can frequently find at least one more level of cache (OS block/file system cache) before the read request reaches the disk array. The application and OS level enjoy several advantages over the disk array cache:
- They have knowledge about the application and therefore can be have smarter (i.e be more efficient) cache algorithms, and
- The aggregate size of the memory used for caching by all client may be much larger than the disk array cache even if the disk array is equipped with a very large cache, and most important,
- The application level cache and OS cache handle the read request before the disk array has a chance to see it. This allows them to exploit the best of the spatial locality and the temporal locality that all cache algorithms rely on (see Locality of reference).
The above points may lead to a situation where the read requests that do reach the disk array after passing through the external (from the disk array point of view) cache level, are very hard to cache, or in other words the application/server caches shadow the disk array cache (a nice metaphor, isn’t it?). In this post I want to discuss how all-flash disk arrays affect this cache shadowing phenomena, and to suggest situations were cache shadowing is less dominant.
First of all, all flash disk arrays add another factor against the read cache – the flash media is so fast (i.e. has low latency) that you don’t need read cache! Remember that read caches are invented mainly to hide the latency of the (slow) HDD media. As the read latency from flash media may be as low as 50 usec (or lower), the benefit of hiding this latency is minimized, or even eliminated.
So it is the end of the read cache as we know it? Yes and no. Yes, straight forward data caches are much less required anymore. No becuase other types of read caches, such as content aware cache are still effective.
Content aware caches are caches that cache blocks by their content and not by their (logical or physical) addresses. Such caches can be efficient when the disk array encounters a use case where the same content is read though large number of addresses. Sound complex? Here is an example: lets say the disk array stores a VMFS LUN with 50 Win7 VMs (full clones), and all VMs are booted in parallel (i.e. a “boot storm”). Most IOs during the boot process are read IOs (see “understanding how…”) and each VM reads its own set of (OS) files from its own set of locations (this is not the case in linked clone, but lets put that aside for a moment). You may be not very surprised to know that the content of these OS files are almost the same across all VMs. Normal address based cache is not be very efficient in such use case because the aggregate amount of data and the number of data block locations read during this boot storm may be very large, but content aware cache ignores the addresses and consider only the content which repeat itself across the VMs. In such case the disk array content aware cache has “unfair advantage” over the local severs’ cache (The Win7 VMs’ cache in this example) and therefore can be very effective.
Of course such content aware caches are not very common in the current generation of disk arrays, but this is going to change in the next generation disk arrays.