Category Archives: Enterprise Storage

Are SSD based arrays a bad idea?

In the not so new post Why SSD-based arrays are a bad idea, Robin Harris wonders what the right form factor is for flash-based disk arrays. My take on this subject is that form factor matters, but the right form factor is derived from your main design target. If, for example, your main target is latency, then you are better off using DRAM/PCIe interfaces, as the SAS/SATA interfaces introduce some latency and limit your control over the IO path. This applies to tier 0 systems. If you are more in the tier 1 area, where some latency penalty can be traded for lower cost and enterprise-level functionality such as dedup, snapshots and replication, then SSD is probably your best choice. Why is that? Let’s go over Robin’s list of arguments one by one:

  • Latency – not as important, the extra ~100 usec is not significant for most use cases
  • Bandwidth – even though SSDs are not the best form factor to drain the juice out of the flash chips, this is meaningless for most tier 1 systems because the bottleneck is not at the SSD level! As the array becomes smarter, most bottlenecks shift to the compute elements, memory buses, and IO buses in the system.
  • Reliability – here I have to disagree with Robin. It is true that DIMMs are more reliable than disks and probably also more reliable than SSDs, but Robin assumes that a system that uses flash chips instead of SSDs is more reliable. This is not necessarily true! Flash chips do not handle many (if not most) flash-related issues at the chip level. They rely on an external controller and other components to perform critical tasks such as wear leveling, bad block handling and media scrubbing. Moreover, implementing such controllers by yourself assumes that you are smarter than the SSD manufacturers and/or can produce some gains out of your low-level control. Personally, I do not believe in either assumption. Anyhow, once you start implementing a flash controller you run into the same problems as SSD systems and would also end up at the same level of reliability. There is a small part where Robin may have a point – if you don’t work with SSDs you can bypass the local SCSI stack. But even that is questionable because not everything in the SCSI stack is wrong…
  • Flexibility – as an SSD-based disk array designer I can tell you that the SSD form factor didn’t cause us as many problems as you might assume, even though we designed everything from scratch. This is because flash is still a block-access medium, and that’s what really counts.
  • Cost – as I already wrote, flash chips require flash controllers and other resources (RAM, compute), so the comparison Robin made to DRAM is not really an apples-to-apples comparison. That said, it is possible that you can reduce the cost of the flash control sub-system, but as enterprise-level SSDs become a commodity, economies of scale work against such an approach.

In fact I am willing to claim that even for tier 0 kinds of systems it is not trivial to assume that a flash chip/PCIe-based design is better than an SSD-based design, because once you start making your system smarter and implementing advanced functions, the device latency starts to be insignificant. If you need a very “raw” performance box, a flash chip design/PCIe card may be the better choice, but then a server-local PCIe card will be even better…



Filed under all-flash disk array, Enterprise Storage, ssd

Cache In The Shadows

Disk array side read caches are much less effective than you might expect due to a phenomenon a friend of mine calls “cache shadowing”. This happens because many, if not most, IO-oriented applications have an application-level cache, and in addition you can frequently find at least one more level of cache (the OS block/file system cache) before the read request reaches the disk array. The application and OS level caches enjoy several advantages over the disk array cache:

  1. They have knowledge about the application and can therefore use smarter (i.e. more efficient) cache algorithms, and
  2. The aggregate size of the memory used for caching by all clients may be much larger than the disk array cache, even if the disk array is equipped with a very large cache, and most importantly,
  3. The application level cache and OS cache handle the read request before the disk array has a chance to see it. This allows them to exploit the best of the spatial locality and the temporal locality that all cache algorithms rely on (see Locality of reference).

The above points may lead to a situation where the read requests that do reach the disk array, after passing through the external (from the disk array’s point of view) cache levels, are very hard to cache. In other words, the application/server caches shadow the disk array cache (a nice metaphor, isn’t it?). In this post I want to discuss how all-flash disk arrays affect this cache shadowing phenomenon, and to suggest situations where cache shadowing is less dominant.

First of all, all-flash disk arrays add another factor against the read cache – the flash media is so fast (i.e. has such low latency) that you don’t need a read cache! Remember that read caches were invented mainly to hide the latency of the (slow) HDD media. As the read latency of flash media may be as low as 50 usec (or lower), the benefit of hiding this latency is minimized, or even eliminated.

So is it the end of the read cache as we know it? Yes and no. Yes, because straightforward data caches are much less necessary now. No, because other types of read caches, such as content aware caches, are still effective.

Content aware caches are caches that cache blocks by their content and not by their (logical or physical) addresses. Such caches can be efficient when the disk array encounters a use case where the same content is read through a large number of addresses. Sounds complex? Here is an example: let’s say the disk array stores a VMFS LUN with 50 Win7 VMs (full clones), and all VMs are booted in parallel (i.e. a “boot storm”). Most IOs during the boot process are read IOs (see “understanding how…”) and each VM reads its own set of (OS) files from its own set of locations (this is not the case with linked clones, but let’s put that aside for a moment). You may not be very surprised to learn that the content of these OS files is almost the same across all VMs. A normal address-based cache is not very efficient in such a use case, because the aggregate amount of data and the number of data block locations read during this boot storm may be very large, but a content aware cache ignores the addresses and considers only the content, which repeats itself across the VMs. In such a case the disk array’s content aware cache has an “unfair advantage” over the local servers’ caches (the Win7 VMs’ caches in this example) and therefore can be very effective.
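To make the boot storm example a bit more tangible, here is a minimal Python sketch of a content aware cache. It is only an illustration under my own assumptions: the class name, the use of SHA-256 as the fingerprint, the LRU policy and the toy "backend" dictionary are all hypothetical, and a real array would take the fingerprints from its dedup metadata rather than compute them like this.

```python
import hashlib
from collections import OrderedDict


class ContentAwareCache:
    """Toy read cache keyed by block content (fingerprint), not by address."""

    def __init__(self, capacity_blocks):
        self.capacity = capacity_blocks
        self.addr_to_fp = {}         # logical address -> content fingerprint
        self.blocks = OrderedDict()  # fingerprint -> data, kept in LRU order
        self.hits = 0
        self.misses = 0

    def write(self, addr, data):
        # Assumption: every block is fingerprinted when it is written.
        self.addr_to_fp[addr] = hashlib.sha256(data).hexdigest()

    def read(self, addr, backend):
        fp = self.addr_to_fp[addr]
        if fp in self.blocks:
            # Same content was already read, possibly through another address.
            self.blocks.move_to_end(fp)
            self.hits += 1
            return self.blocks[fp]
        self.misses += 1
        data = backend[addr]  # the "slow" path to the media
        self.blocks[fp] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used
        return data


def boot_storm(num_vms=50, os_blocks=100):
    """Each VM reads identical OS content from its own private addresses."""
    backend = {}
    cache = ContentAwareCache(capacity_blocks=os_blocks)
    for vm in range(num_vms):
        for blk in range(os_blocks):
            data = b"os-block-%d" % blk  # same content in every full clone
            backend[(vm, blk)] = data
            cache.write((vm, blk), data)
    for vm in range(num_vms):
        for blk in range(os_blocks):
            cache.read((vm, blk), backend)
    return cache.hits, cache.misses
```

In this toy run only the first VM misses; the other 49 VMs hit the cache even though they never read an address that was read before, which is exactly what an address-based cache cannot do.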

Of course such  content aware caches are not very common in the current generation of disk arrays, but this is going to change in the next generation disk arrays.


Filed under all-flash disk array, Enterprise Storage, Storage architectures

VAAI is great, not just for VMware!

VMware’s disk array offloading verbs (VAAI) seem to be a major success – any self-respecting storage vendor is implementing these verbs, so in a year or two they will be pretty common. I think that a very important fact about VAAI is that it uses standard T10 SCSI commands (note that in vSphere 4 you would need a vendor-specific plugin, but in vSphere 5 the T10 verbs are supported without any plugin). As T10 verbs are just standard SCSI commands, nothing limits the use of these verbs to VMware environments. This makes the four existing verbs very useful for many use cases that are not VMware related:

Extended copy (xcopy): a server side copy mechanism. By itself xcopy is not new to the SCSI standard, but its general form (the asynchronous one) is so complex that it is hardly implemented and therefore hardly used. VMware was brilliant enough to find a way to simplify xcopy by using the hidden synchronous version of it (this version is hidden so well that I had to read the spec several times to convince myself that such a mode exists). The result is that now every VAAI-capable array has a very useful and simple server side copy verb that can be used for things such as:

  • User file data copy – requires help from the file system but can offload the entire data operation to the disk array!
  • Snapshot copy on write – if the COW grains are relatively large, XCOPY may offload much of the overhead of the COW operation.
  • Volume mirroring/BCV style copies – during resync and other operations

Write same: the storage form of memset(). It is used by VMware to initialize storage spaces to zero. There are many similar cases in general purpose systems that can use it for initialization or similar tasks.

Compare and Write (ATS): the storage form of compare and swap. This is a very cool verb because it opens the world of “lockless” synchronization algorithms to any distributed application or system. “Lockless” algorithms are much more efficient than the current lock-based reserve/release or persistent reservation mechanisms. I really hope distributed file systems, clustering software, databases and other applications will use this verb.

Unmap (“trim”): this verb tells a thin provisioning capable storage system (and most of today’s storage systems are) that a specific area is not used by the file system or other application. Without it, the entire idea of thin provisioning is a bit pointless if a filesystem is used on top of the volume – over time the filesystem writes to the entire volume space, which forces the storage to allocate space for it, and the space saving is lost. The concept that the file system should inform the volume beneath it that it is not using a specific storage area is already well known and accepted: NTFS and ext4 (and maybe other file systems too) can send TRIM commands if they know that they are working above an SSD. This is exactly what is needed for any thin provisioning capable storage as well. I have high hopes that implementing such UNMAP support is already on the todo lists of many file system developers. (BTW, I am not claiming that TRIM and UNMAP are the same. I know they are completely different. I am claiming that from the filesystem’s view they are the same.)
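Going back to Compare and Write for a moment, the following Python sketch shows the kind of “lockless” update loop that the verb enables. It is a simulation only: ToyVolume, compare_and_write() and increment_counter() are hypothetical names of mine and not a real SCSI or VAAI API. The internal threading.Lock merely models the fact that the array performs the compare and the write as one atomic step; the "hosts" themselves never take a lock or a reservation.

```python
import threading


class ToyVolume:
    """In-memory stand-in for a LUN with an atomic compare-and-write verb."""

    def __init__(self, num_blocks, block_size=512):
        self.blocks = [bytes(block_size) for _ in range(num_blocks)]
        # Models atomicity inside the array; it is not a host-side lock.
        self._atomic = threading.Lock()

    def read(self, lba):
        return self.blocks[lba]

    def compare_and_write(self, lba, expected, new):
        """Write `new` only if the block still equals `expected`."""
        with self._atomic:
            if self.blocks[lba] != expected:
                return False  # miscompare: somebody else updated the block
            self.blocks[lba] = new
            return True


def increment_counter(volume, lba):
    """Optimistic update: read, modify, compare-and-write, retry on failure."""
    while True:
        old = volume.read(lba)
        value = int.from_bytes(old[:8], "big") + 1
        new = value.to_bytes(8, "big") + old[8:]
        if volume.compare_and_write(lba, old, new):
            return value


def run_hosts(num_hosts=8, increments=200):
    """Several 'hosts' (threads) update the same on-disk counter."""
    volume = ToyVolume(num_blocks=1)

    def host():
        for _ in range(increments):
            increment_counter(volume, 0)

    threads = [threading.Thread(target=host) for _ in range(num_hosts)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return int.from_bytes(volume.read(0)[:8], "big")
```

No update is lost even though no host ever holds a reservation on the block: a host that loses the race simply rereads and retries.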

An additional note: even within VMware systems, VAAI verbs can be used in many more places than they are today. I hope to write an additional post on such cases.


Filed under Enterprise Storage, Virtualization

SSD Dedup and VDI

I found this nice Symantec blog post about the SSD+Dedup+VDI issues on the DCIG site. Basically I agree with its main claim that SSD+Dedup is a good match for VDI. On the other hand, I think that the 3 potential “pitfalls” mentioned in the post are probably relevant for a naive storage system, and much less so for an enterprise level disk array. Here is why (the bulleted parts are citations from the original post):

  • Write I/O performance to SSDs is not nearly as good as read I/Os. SSD read I/O performance is measured in microseconds. So while SSD write I/O performance is still faster than writes to hard disk drives (HDDs), writes to SSDs will not deliver nearly the same performance boost as read I/Os plus write I/O performance on SSDs is known to degrade over time.
This claim is true only for non-enterprise level SSDs. The write performance of enterprise level SSDs suffers much less from degradation, and thanks to their internal NVRAM, the write latency is as good as the read latency, if not better. Furthermore, most disk arrays have non-trivial logic and enough resources to handle these issues even if the SSDs cannot.
  • SSDs are still 10x the cost of HDDs. Even with the benefits provided by deduplication an organization may still not be able to justify completely replacing HDDs with SSDs which leads to a third problem.
There is no doubt that SSDs are at least 10x more expensive than HDDs in terms of $/GB. But when comparing the complete solution cost the outcome is different. In many VDI systems the real main storage constraint is IOPS and not capacity. This means that an HDD based solution may need to over-provision the system capacity and/or use small disks so that it has enough (HDD) spindles to satisfy the IOPS requirements. In this case, the real game is IOPS/$, where SSDs win big time. Together with the dedup oriented space reduction, the total solution’s cost may be very attractive.
  • Using deduplication can result in fragmentation. As new data is ingested and deduplicated, data is placed further and further apart. While fragmentation may not matter when all data is stored on SSDs, if HDDs are still used as part of the solution, this can result in reads taking longer to complete.

Basically I agree, but again the disk array logic may mitigate at least some of the problem. Of course a 100% SSD solution is better (much better in some cases), but the problem is that such solutions are still very rare, if they exist at all.
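To make the earlier IOPS/$ point concrete, here is a small back-of-the-envelope calculation in Python. Every number in it (drive sizes, prices, per-drive IOPS, the dedup ratio and the workload itself) is a made-up assumption chosen only to illustrate the reasoning; it is not taken from the Symantec post or from any real price list.

```python
import math


def drives_needed(capacity_gb, iops, drive_gb, drive_iops):
    """A solution needs enough drives for BOTH the capacity and the IOPS."""
    for_capacity = math.ceil(capacity_gb / drive_gb)
    for_iops = math.ceil(iops / drive_iops)
    return max(for_capacity, for_iops)


def solution_cost(capacity_gb, iops, drive_gb, drive_iops, drive_price,
                  dedup_ratio=1.0):
    """Return (number of drives, total drive cost) for a workload."""
    effective_gb = capacity_gb / dedup_ratio  # dedup shrinks stored capacity
    n = drives_needed(effective_gb, iops, drive_gb, drive_iops)
    return n, n * drive_price


# Hypothetical VDI workload: 10 TB of logical data, 50,000 IOPS.
WORKLOAD = dict(capacity_gb=10_000, iops=50_000)

# Hypothetical HDD: 600 GB, 200 IOPS, $300 (i.e. $0.5/GB).
hdd = solution_cost(drive_gb=600, drive_iops=200, drive_price=300, **WORKLOAD)

# Hypothetical SSD: 200 GB, 20,000 IOPS, $1000 (i.e. $5/GB, 10x the HDD),
# with an assumed 5:1 dedup ratio for the VDI images.
ssd = solution_cost(drive_gb=200, drive_iops=20_000, drive_price=1000,
                    dedup_ratio=5.0, **WORKLOAD)

print("HDD solution: %d drives, $%d" % hdd)
print("SSD solution: %d drives, $%d" % ssd)
```

With these assumed numbers the HDD solution is sized by IOPS (250 spindles, most of whose capacity is wasted), while the SSD solution is sized by capacity after dedup (10 drives), so the SSD solution ends up cheaper despite the 10x $/GB gap.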


Filed under Enterprise Storage, ssd, Storage architectures, VDI, Virtualization

Storage dinosaurs – beware, here comes the SSD!

Magnetic spinning disks (hard drives – HDDs) have been the dominant primary storage device in the storage world for at least 50 years. This era is over! Within a few years (more or less) the SSD will replace the HDD as the primary storage device. Moreover, in the near future we will already be able to see real SSD based storage systems (i.e. systems that use SSDs as the primary device and not as a cache/tier-zero device). When these systems become available, some very fundamental ground rules of storage and storage performance thinking will have to change. In this post I want to present several common conventions and areas that are heavily affected by “HDD thinking” and may change when SSD systems start to impact the storage market.

Storage devices are sloooooow!

Compared to any other major computer component, HDDs are slow. Their bandwidth is modest and their latency is disastrous (see also the sequential/random access section below). This makes the HDD, as the major primary storage device, the major performance bottleneck of many systems, and of most storage systems. Many storage related and some non-storage related sub-systems are under-optimized simply because the HDD bottlenecks conceal other sub-systems’ performance problems.

The SSD effect:

SSDs are at least an order of magnitude faster than HDDs. They are still not comparable to RAM, but connect several enterprise level SSDs to your system, and you will have enough “juice” to hit other system component bottlenecks (memory bus, interface bus, interrupt handling, pure CPU issues, etc.). This changes the entire performance balance of many systems.

Random vs. Sequential

Due to its physical design, a typical HDD can efficiently access contiguous (sequential) areas on the media, while access to any non-successive (random) areas is very inefficient. A typical HDD can stream about 100MB/sec (or more) of successive data, but can access no more than about 200 different areas on the media per second, such that accessing random 4k blocks reduces the resulting bandwidth to about 800KB/sec!

This bi-modal behavior affected the storage system “thinking” very deeply to a point that almost every application that accesses storage resources is tuned to this behavior. This means that:

  • Most applications attempt to minimize the random storage accesses and prefer sequential accesses if they can, and/or attempt to access approximately “close” areas.
  • Applications that need good random bandwidth attempt to access the storage using big blocks. This helps because accessing a random 512 bytes “costs” about the same as accessing 32KB of data, since the data transfer time from/to an HDD is relatively small compared to the seek time (movement of the disk’s head) and the rotational latency (the time the disk has to wait until the disk platter(s) rotate such that the data block is under the head).
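The numbers above follow from a very simple service time model, sketched below in Python. The 5 msec positioning time (seek plus rotational latency) is my own round assumption, picked so that it matches the "about 200 accesses per second" figure; the 100MB/sec streaming rate is the one quoted above.

```python
def hdd_service_time_ms(block_bytes, positioning_ms=5.0, stream_mb_per_s=100.0):
    """Time for one random IO: head positioning plus data transfer."""
    transfer_ms = block_bytes / (stream_mb_per_s * 1_000_000) * 1000
    return positioning_ms + transfer_ms


def random_bandwidth_kb_per_s(block_bytes, **kwargs):
    """Bandwidth achieved when every IO lands on a random location."""
    iops = 1000 / hdd_service_time_ms(block_bytes, **kwargs)
    return iops * block_bytes / 1000


for size in (512, 4096, 32 * 1024):
    print("%6d bytes: %.2f msec per IO, %8.0f KB/sec random bandwidth"
          % (size, hdd_service_time_ms(size), random_bandwidth_kb_per_s(size)))
```

Under these assumptions a random 4k workload gets roughly 800KB/sec out of a disk that can stream 100MB/sec, and a 32KB random IO takes only a few percent longer than a 512 byte one, which is why big blocks are the cheap way to buy random bandwidth on HDDs.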

The SSD effect:

The media used to build SSDs is mostly RAM or flash. Both provide very good random access latency and bandwidth that are comparable to (but still not as good as) the SSD’s sequential access latency and bandwidth. Flash media has other limitations (no direct rewrite ability) that force SSD designers to implement complex redirect-on-write schemes. In most cases, sequential write access is much easier to handle than random write access, so the bi-modal behavior is somewhat retained. On the other hand, read operations are much less affected by this complexity, and enterprise level SSDs are built in a way that minimizes the sequential/random access performance gap. The sequential write performance of a typical desktop SSD is about an order of magnitude better than its random write performance. For enterprise level SSDs, the gap may narrow down to a factor of about 2, and in many cases the random performance is good enough (i.e. it is not the performance bottleneck of the system). More specifically, random access patterns using small blocks are not performance horrors anymore.

Storage related data structures

When applications need access to data on HDDs they tend to organize it in a way that is optimized for the random/sequential bi-modality. For example, data chunks that have a good chance of being accessed together, or at least within a short period of time, are stored close to each other (or in storage talk – identify “locality of reference” and transform temporal locality into spatial locality). Applications also tend to use data structures that are optimized for such locality of reference (such as B-Trees) and avoid data structures that are not (such as hash tables). Such data structures by themselves may introduce additional overheads for random/sparse access patterns, and by that create a self-reinforcing cycle where the motivation to use sequential accesses keeps getting bigger and bigger.

The SSD effect:

Data structures that are meant to exploit the data’s locality of reference still work for SSDs. But as the sequential/random access gap is much smaller, such data structures may cease to be the best structures for many applications, as there are other (application dependent) issues that should be the focus of the data structure optimizations instead of the common storage locality of reference. For example, sparse data structures such as hash tables may be much more applicable for some use cases than the currently used data structures.

Read and write caches

HDD random access latencies are huge compared to most other computer sub-systems. For example, a typical round-trip time of a packet in a modern 1Gb Ethernet LAN is about 100 usec. The typical latency of a random 4k HDD IO (read or write) operation is about 4-10 msec (i.e. 40-100 times slower). Read and write caches attempt to reduce or at least hide some of the HDD random IO latency:

  • Read caches keep a copy of the frequently accessed data blocks on some faster media, most commonly RAM. During the read IO flow, the requested read (offset) is looked up in the read cache and if it is found the slow IO operation is completely avoided.
  • Write caches keep the last written data in much faster media (e.g. RAM/NVRAM) and write it back (“destage”) to the HDD layer in the background, after the user write is acknowledged. This “hides” most of the write latency and lets the system optimize the write (“destage”) bandwidth by applying write-coalescing (also known as “write combining”) and write-reordering (e.g. “elevator algorithm”) techniques.

Of course the effect of both caches is limited:

  • Read caches are effective only if the user data access pattern has enough locality of reference and the total dataset is small enough.
  • Write caches are effective only when the destage bandwidth is higher than the user data write bandwidth. For most HDD systems this is not the case, so write caches are good at handling short spikes of writes, but when the cache buffer fills up, the user write latency falls back to the HDD write latency (or to be more exact, to the destage latency).

The SSD effect:

For some SSD based systems the traditional read cache should be much less important compared to HDD systems. The read operation latency in many SSDs is so low (about 50 usec) that it ceases to be the dominant part of the system’s read operation latency, so the reasoning behind the read cache is much weaker.

Regarding write caches, most SSDs have internal (battery or otherwise backed up) write caches and in addition, the base write latency is much lower than the HDD write latency, making the write cache much less important too. Still, as the SSD’s write latency is relatively high compared to the read latency, a write-behind buffer can be used to hide this write latency. Furthermore, unlike HDD systems, it should be relatively easy to build a system such that the destage bandwidth is high enough to hide the media write latency from the user even during very long full bandwidth writes.


In the early 90’s Mendel Rosenblum and J. K. Ousterhout published a very important article named “The Design and Implementation of a Log-Structured File System” where, in short, they claimed the following: read latency can be effectively handled by read caches. This leaves the write latency and throughput as the major problem for many storage systems. The article suggested the following: instead of having a traditional mapping structure, such as B-Trees, to map logical locations to physical locations, use a DB oriented technique called logging. When logging is used, each new user write operation writes its data to a contiguous buffer on the disk, ignoring the existing location of the user data on the physical media. This replaces the common write-in-place technique with a relocate-on-write technique. The motivation is to exploit the relatively fast HDD sequential write (at the expense of read efficiency, complexity and defragmentation problems). This technique has many advantages over traditional write-in-place schemes, such as better consistency, fast recovery, RAID optimizations (i.e. the ability to write only full RAID stripes), snapshot implementations, etc. (It also has some major drawbacks.)
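The relocate-on-write idea can be sketched in Python as follows. This is only a toy model of the general technique under my own assumptions (a Python list standing in for the on-disk log, a dictionary standing in for the mapping, hypothetical names throughout); it is not the structure used in the article or in any particular product.

```python
class LogStructuredVolume:
    """Toy relocate-on-write volume: all writes are appended to a log."""

    def __init__(self):
        self.log = []    # append-only "physical" log of (lba, data) records
        self.index = {}  # lba -> position of the latest record in the log

    def write(self, lba, data):
        # Never overwrite in place: append, then repoint the mapping.
        self.index[lba] = len(self.log)
        self.log.append((lba, data))

    def read(self, lba):
        # Reads pay for the indirection through the mapping.
        return self.log[self.index[lba]][1]

    def live_fraction(self):
        """Share of the log that still holds current data."""
        return len(self.index) / len(self.log) if self.log else 1.0

    def clean(self):
        """Garbage collection: copy only the live records to a fresh log."""
        new_log = []
        for lba, pos in sorted(self.index.items(), key=lambda kv: kv[1]):
            self.index[lba] = len(new_log)
            new_log.append(self.log[pos])
        self.log = new_log


vol = LogStructuredVolume()
vol.write(10, b"a")
vol.write(99, b"b")
vol.write(10, b"c")  # a rewrite of block 10 lands at the end of the log
print([lba for lba, _ in vol.log], vol.read(10), vol.live_fraction())
```

Random user writes (blocks 10, 99, 10) become one sequential stream on the media, which is the whole point for HDDs, and the stale record left behind is the price: it stays in the log until clean() reclaims it, which is where the defragmentation problems mentioned above come from.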

The SSD effect:

The original motivation for logging is not valid anymore: random write performance is not as big a problem for SSDs as it is for HDDs, such that the additional complexity may not be worth the performance gain. Still, logging techniques may be relevant for SSD based systems in the following cases:

  • Storage systems that directly manage flash components may use logging to solve (also) several flash problems, most importantly the no direct re-write operation support problem and the wear leveling problem.
  • Storage systems may use logging to implement better RAID schemes, and as base for advanced functionalities (snapshots, replication, etc.)


Filed under Enterprise Storage, ssd, Storage architectures

SSD’s internal complexity – a challenge or an opportunity?

SSDs are much more complex than the magnetic disks they are meant to replace. By complex I mean that they use much more complex logic. This is required to overcome the fundamental problems of the flash media:

  • Flash media does not support the simple read block/rewrite block semantics that magnetic disks do. Instead it uses read page, write page (but no rewrite), and erase block (block >> page) semantics. Pretty complex mapping and buffering techniques must be used to simulate read/rewrite semantics.
  • A flash page can be modified (== erased and programmed) only a limited number of times. This requires “wear leveling” logic that attempts to spread the modifications across the entire block/page set.
  • Flash media suffers from various sources of data related errors (including errors due to read operations, writes to physically close locations, etc.). Advanced recovery mechanisms must be implemented to overcome these errors.

As always, nothing comes without a price. The cost of this complexity is:

  • It consumes more resources
  • It requires much more effort to develop and test SSDs and accordingly more time to stabilize/productize them
  • It makes SSD performance much more complex to analyze and predict
On the other hand, the fact that SSDs have many more resources than a common magnetic disk and that SSDs are equipped with advanced data structures and mechanisms also makes them an enabler for new advanced functions and features. Indeed, you can already see features such as encryption and compression implemented within desktop class SSDs (e.g. Intel’s 320 SSD).
These features are only the start. I believe that SSDs can implement “hardware” assists to offload some of the storage feature implementation load and that such assists may be required to build the next gen SSD based storage systems.

For example, because a single magnetic disk (HDD) provides only (up to) 250 IOPS, HDD based RAID systems implement the RAID logic above the disks, namely in a “RAID controller”. But a single (enterprise level) SSD provides tens of thousands of IOPS, making such an architecture much less reasonable. 50K IOPS (4k blocks) SSDs in a 16 disk RAID-6 system require that the RAID controller process 50K IOPS * 16 disks * 4 KB = 3.2 GB/s of data (a mix of reads, writes and XORs). These numbers are far above the performance envelope of current RAID controllers (see for example this post: Accelerating System Performance with SSD RAID Arrays).
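The arithmetic behind this claim is trivial, but putting the HDD and SSD cases side by side shows how big the jump is. The Python helper below is a hypothetical one of my own; it uses decimal units (1 GB = 1,000,000 KB) and treats a "4k" block as 4 KB, which is the same rounding used above.

```python
def raid_data_rate_gb_per_s(iops_per_disk, num_disks, block_kb):
    """Data rate the RAID controller must handle at full disk IOPS."""
    kb_per_s = iops_per_disk * num_disks * block_kb
    return kb_per_s / 1_000_000


hdd_rate = raid_data_rate_gb_per_s(iops_per_disk=250, num_disks=16, block_kb=4)
ssd_rate = raid_data_rate_gb_per_s(iops_per_disk=50_000, num_disks=16, block_kb=4)
print("16 HDDs: %.3f GB/s, 16 SSDs: %.1f GB/s" % (hdd_rate, ssd_rate))
```

The same 16 disk group goes from about 0.016 GB/s with HDDs to 3.2 GB/s with SSDs, a factor of 200 that lands entirely on the RAID controller.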

If the RAID controller could leverage the SSDs to do most of the data crunching, it would need to be “only” a flow manager and coordinator. The same logic can be relevant for other storage functions too: snapshots, data transfers, copy on write, compare and write, etc. There are also some downsides to it: such an architecture is much more complex than the current layered architecture and may need new disk/bus side abilities such as disk to disk transfers and others. Furthermore, such SSD side assists must be standardized to enable various RAID controller vendors to use them. Still, I believe that this architecture is much more cost-effective and scalable.
To sum this up, I don’t know if SSDs with storage function assists will ever exist. I hope they will because I think it is the correct and economical way to go.


Filed under Enterprise Storage, ssd