Are SSD based arrays a bad idea?

In the not-so-new post Why SSD-based arrays are a bad idea, Robin Harris wonders what the right form factor is for flash-based disk arrays. My take on this subject is that form factor matters, but the right form factor is derived from your main design target. If, for example, your main target is latency, then you are better off using DRAM/PCIe interfaces, as the SAS/SATA interfaces introduce some latency and limit your control over the IO path. This applies to tier 0 systems. If you are more in the tier 1 area, where some latency penalty can be traded for cost and for enterprise level functionality such as dedup, snapshots and replication, then SSD is probably your best choice. Why is that? Let’s go over Robin’s list of arguments one by one:

  • Latency – not as important; the extra ~100 usec is not significant for most use cases.
  • Bandwidth – even though SSDs are not the best form factor for squeezing the most out of the flash chips, this is meaningless for most tier 1 systems because the bottleneck is not at the SSD level! As the array becomes smarter, most bottlenecks shift to the compute elements, memory buses, and IO buses in the system.
  • Reliability – here I have to disagree with Robin. It is true that DIMMs are more reliable than disks, and probably also more reliable than SSDs, but Robin assumes that a system that uses flash chips instead of SSDs is more reliable. This is not necessarily true! Flash chips do not handle many (if not most) flash related issues at the chip level. They rely on an external controller and other components to perform critical tasks such as wear leveling, bad block handling and media scrubbing. Moreover, implementing such controllers yourself assumes that you are smarter than the SSD manufacturers and/or can produce some gains out of your low-level control. Personally, I am skeptical of both assumptions. Anyhow, once you start implementing a flash controller you run into the same problems as SSD based systems, and you end up at the same level of reliability. There is one small part where Robin may have a point: if you don’t work with SSDs you can bypass the local SCSI stack. But even that is questionable, because not everything in the SCSI stack is wrong…
  • Flexibility – as an SSD based disk array designer I can tell you that the disk-like nature of the SSD didn’t cause us as many problems as you may assume, even though we designed everything from scratch. This is because flash is still a block access medium, and that’s what really counts.
  • Cost – as I already wrote, flash chips require flash controllers and other resources (RAM, compute), so the comparison Robin made to DRAM is not really an apples to apples comparison. That said, it is possible to reduce the cost of the flash control sub-system, but as enterprise level SSDs become a commodity, the economies of scale work against such an approach.

In fact, I am willing to claim that even for tier 0 kinds of systems it is not safe to assume that a flash chip/PCIe based design is better than an SSD based design, because once you start making your system smarter and implementing advanced functions, the device latency starts to be insignificant. If you need a very “raw” performance box, a flash chip design/PCIe card may be the better choice, but then a server local PCIe card will be even better…


Filed under all-flash disk array, Enterprise Storage, ssd

Cache In The Shadows

Disk array side read caches are much less effective than you might expect due to a phenomenon a friend of mine calls “cache shadowing”. This happens because many, if not most, IO oriented applications have an application level cache, and in addition you can frequently find at least one more level of cache (the OS block/file system cache) before the read request reaches the disk array. The application and OS level caches enjoy several advantages over the disk array cache:

  1. They have knowledge about the application and therefore can have smarter (i.e. more efficient) cache algorithms, and
  2. The aggregate size of the memory used for caching by all clients may be much larger than the disk array cache, even if the disk array is equipped with a very large cache, and most important,
  3. The application level cache and OS cache handle the read request before the disk array has a chance to see it. This allows them to exploit the best of the spatial locality and the temporal locality that all cache algorithms rely on (see Locality of reference).

The above points may lead to a situation where the read requests that do reach the disk array, after passing through the external (from the disk array’s point of view) cache levels, are very hard to cache. In other words, the application/server caches shadow the disk array cache (a nice metaphor, isn’t it?). In this post I want to discuss how all-flash disk arrays affect this cache shadowing phenomenon, and to suggest situations where cache shadowing is less dominant.
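The shadowing effect can be illustrated with a small simulation. This is only a sketch under stated assumptions: both caches are plain address-based LRU caches, the workload (a small hot set interleaved with a one-pass cold scan) is made up for illustration, and names such as `LRUCache`, `build_workload` and `simulate` are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny address-based LRU cache that counts hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read(self, address):
        """Return True on a hit; on a miss insert the block and evict the LRU one."""
        if address in self.blocks:
            self.blocks.move_to_end(address)
            self.hits += 1
            return True
        self.misses += 1
        self.blocks[address] = True
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)
        return False


def build_workload(rounds=1000, hot_blocks=8):
    """Assumed workload: a small hot set interleaved with never-repeated cold reads."""
    workload = []
    for i in range(rounds):
        workload.append(("hot", i % hot_blocks))
        workload.append(("cold", i))
    return workload


def simulate(workload, host_capacity, array_capacity):
    """Reads hit the host (application/OS) cache first; only misses reach the array."""
    host = LRUCache(host_capacity)
    array = LRUCache(array_capacity)
    for address in workload:
        if not host.read(address):
            array.read(address)
    return host, array


if __name__ == "__main__":
    workload = build_workload()
    host, array = simulate(workload, host_capacity=16, array_capacity=16)
    print("with host cache:    host hits", host.hits, "array hits", array.hits)
    host, array = simulate(workload, host_capacity=0, array_capacity=16)
    print("without host cache: host hits", host.hits, "array hits", array.hits)
```

With the host cache in place, all the reusable (hot) reads are absorbed before they reach the array, and the array cache only sees the one-pass cold reads it cannot do anything with. With the host cache disabled, the very same array cache catches the hot set.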

First of all, all-flash disk arrays add another factor against the read cache: the flash media is so fast (i.e. has such low latency) that you hardly need a read cache! Remember that read caches were invented mainly to hide the latency of the (slow) HDD media. As the read latency of flash media may be as low as 50 usec (or lower), the benefit of hiding this latency is minimized, or even eliminated.
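A rough way to see this is the usual average latency formula: hit ratio times cache latency plus miss ratio times media latency. The sketch below uses the 50 usec flash figure from the text; the 5000 usec HDD latency, the 10 usec cache hit latency and the 50% hit ratio are assumptions picked only for illustration.

```python
def effective_latency_us(hit_ratio, cache_latency_us, media_latency_us):
    """Average read latency seen by the host for a given cache hit ratio."""
    return hit_ratio * cache_latency_us + (1.0 - hit_ratio) * media_latency_us


# Assumed figures (only the 50 usec flash latency comes from the text).
CACHE_HIT_US = 10.0
HDD_US = 5000.0
FLASH_US = 50.0
HIT_RATIO = 0.5

hdd_saving = HDD_US - effective_latency_us(HIT_RATIO, CACHE_HIT_US, HDD_US)
flash_saving = FLASH_US - effective_latency_us(HIT_RATIO, CACHE_HIT_US, FLASH_US)

if __name__ == "__main__":
    print("average latency saved in front of HDD:   %.0f usec" % hdd_saving)
    print("average latency saved in front of flash: %.0f usec" % flash_saving)
```

Under these assumptions the same cache saves about 2.5 msec per read on average in front of HDDs, but only about 20 usec in front of flash.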

So is it the end of the read cache as we know it? Yes and no. Yes, because straightforward data caches are much less necessary. No, because other types of read caches, such as content aware caches, are still effective.

Content aware caches are caches that cache blocks by their content and not by their (logical or physical) addresses. Such caches can be efficient when the disk array encounters a use case where the same content is read through a large number of addresses. Sounds complex? Here is an example: let’s say the disk array stores a VMFS LUN with 50 Win7 VMs (full clones), and all VMs are booted in parallel (i.e. a “boot storm”). Most IOs during the boot process are read IOs (see “understanding how…”), and each VM reads its own set of (OS) files from its own set of locations (this is not the case with linked clones, but let’s put that aside for a moment). You may not be very surprised to learn that the content of these OS files is almost the same across all VMs. A normal address based cache is not very efficient in such a use case, because the aggregate amount of data and the number of data block locations read during this boot storm may be very large, but a content aware cache ignores the addresses and considers only the content, which repeats itself across the VMs. In such a case the disk array’s content aware cache has an “unfair advantage” over the local servers’ caches (the Win7 VMs’ caches in this example) and therefore can be very effective.
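The boot storm example can be sketched as follows. This is an illustration under assumptions, not a description of any real array: it assumes the array keeps an address-to-fingerprint map (as a dedup capable array might), uses SHA-256 as the fingerprint, treats both caches as LRU caches of the same size, and makes the OS content identical across the VMs. The names `boot_storm`, `lookup` and `fingerprint` are hypothetical.

```python
import hashlib
from collections import OrderedDict


def fingerprint(data):
    """Content fingerprint of a block (SHA-256 is an assumption of this sketch)."""
    return hashlib.sha256(data).hexdigest()


def lookup(cache, key, capacity):
    """LRU lookup: return 1 on a hit, 0 on a miss (the key is then inserted)."""
    if key in cache:
        cache.move_to_end(key)
        return 1
    cache[key] = True
    if len(cache) > capacity:
        cache.popitem(last=False)
    return 0


def boot_storm(num_vms=50, blocks_per_vm=100, cache_blocks=100):
    """Return (address_cache_hits, content_cache_hits) for a parallel boot."""
    # Full clones: each VM has its own addresses, but the same OS content.
    volume = {}
    for vm in range(num_vms):
        for block in range(blocks_per_vm):
            volume[(vm, block)] = b"os-file-block-%d" % block

    # Assumed array metadata: which content lives at which address.
    fingerprint_of = {addr: fingerprint(data) for addr, data in volume.items()}

    address_cache = OrderedDict()
    content_cache = OrderedDict()
    address_hits = 0
    content_hits = 0
    # All VMs boot in parallel: everyone reads block 0, then block 1, and so on.
    for block in range(blocks_per_vm):
        for vm in range(num_vms):
            addr = (vm, block)
            address_hits += lookup(address_cache, addr, cache_blocks)
            content_hits += lookup(content_cache, fingerprint_of[addr], cache_blocks)
    return address_hits, content_hits


if __name__ == "__main__":
    print(boot_storm())
```

In this idealized setup the address based cache never hits, because every address is read exactly once, while the content aware cache misses only on the first VM to read each block.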

Of course, such content aware caches are not very common in the current generation of disk arrays, but this is going to change in the next generation.

Filed under all-flash disk array, Enterprise Storage, Storage architectures

VDI IO, Deep dive

A friend of mine just sent me a writeup I was not aware of before: Windows 7 IOPS for VDI: a Deep Dive. I think it is a great writeup, and it is completely in line with most (if not all) of my own observations. The most important issues, in my opinion, are:

  • Desktops are not playing nice with each other, generating storage spikes
  • Boot, login and even applications’ load can generate very high IO spikes (thousands of IOPS for short periods)
  • Non persistent desktops are very complex to setup and maintain
  • Storage planning for VDI tends to underestimate the real needs

I am particularly happy that the writeup is backed up by real customer side experience. For example, I am glad to find out that many enterprises consider non-persistent desktops to be too complex. As I wrote here before, I feel that the VDI stack, especially when it comes to non-persistent desktops, is far too complex. Note also that the VM boot/application load analysis results are very similar to the results I posted here.

Regarding the solution, I have of course a different view, but that comes with the territory…

Filed under Uncategorized

Clones are not (writable) snapshots!

Everyone who has ever used server or desktop virtualization has probably used clones. Even though “clone” is not a well defined storage term, in most cases it is used to describe a data (image) copy. Technically this “copy” can be achieved using several technologies: copy clones, snapshot based clones, mirror based clones (BCV), etc. VMware uses the term “full clone” to describe a “copy clone”, while clones that use a delta/copy on write (snapshot) mechanism are named linked clones. Some people treat clones only as linked clones and/or writable snapshots. NetApp has a feature called FlexClone that is just a writable snapshot.

My view on this is that the term “clone” (as it is used in virtualization systems) should describe the use case and not the technology. Even though snapshots and clones may use the same underlying technology, their use cases and usage patterns are not the same. For example, in many systems the snapshots’ source volume is more important to the user than its snapshots and has a preferred status over them (the backup scenario). Technically, the source volume is often fully provisioned and has strict space accounting and a manual removal policy, while the snapshots are likely to be thin-provisioned (“space efficient”), may have an automatic removal (expiration/exhaustion) policy, and may use soft/heuristic space management (see XIV for example).

This preferred source scheme will not work for clones. In many cases the source of the clones is just a template that is never used by itself, so you can store it on a much less powerful storage tier, and once you have finished generating the clones, you can delete it if you want. The resulting clones are much more important than the template, so if space runs out you may delete a few old templates, but you won’t remove the clones if they are in use, as each of them is a standalone VM image.

This can be demonstrated by VMware’s linked clones, which are implemented as writable snapshots on top of a read-only base. When you generate a linked clone pool using VMware View Manager, the manager creates a read-only full clone (the “replica”) and takes snapshots from it. This (clever) approach hides the snapshot source, and in most cases you don’t directly manage or use replicas. The base template has no role after the cloning ends and can be deleted.

Another major difference is the creation pattern: snapshot creation events tend to be periodic (backup/data set separation scenarios), while clone creation (at least for VDI use cases) tends to be bursty, meaning that each time clones are created from a base (template/replica), many of them are created at once.

This means that if you draw a graph of source-to-snapshot (or clone) creation over time, a typical snapshot graph will be a dense and deep tree (see below), while the equivalent typical clone tree will be very shallow but with a big span out factor (maybe it should be called a clone bush 😉 ). The following diagrams depict such graphs:

Typical snapshots tree

Typical clones tree

Due to these differences, even though under the hood snapshots and (linked) clones may be implemented using the same technologies, they shouldn’t be implemented in the same way, as many (if not most) implementation assumptions for snapshots are not valid for clones and vice versa!

A very good example of such an assumption is the span out level. Many snapshots are implemented as follows: the source has its own guaranteed space, and each snapshot has its own delta space. When a block in the source is modified, the old block is copied to the snapshots’ delta spaces (copy on write). This common technique is very efficient for a (primary) source and (secondary) snapshot scheme, but it also assumes that the span out level is low, because the old content of the modified block has to be copied to each snapshot’s delta space. Imagine what will happen if you have 1000 (snapshot based) clones created from the same source!

If we go back to VMware’s linked clone case, the read-only replica is what enables VMware to generate many writable snapshots on top of a single source. The original snapshot mechanism cannot do that!
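The span out problem can be made concrete with a toy model. This is a sketch of the scheme as described above (each snapshot has its own delta space and receives its own copy of an overwritten block) next to writable snapshots over a read-only base. The class names `CowSource` and `LinkedClone` are hypothetical, and real implementations may share preserved blocks between snapshots.

```python
class CowSource:
    """Source volume with classic copy-on-write snapshots (one delta space each)."""

    def __init__(self, blocks):
        self.blocks = list(blocks)
        self.snapshot_deltas = []
        self.copies = 0  # number of old-block copies triggered by source writes

    def take_snapshot(self):
        self.snapshot_deltas.append({})

    def write(self, index, data):
        # Preserve the old block in every snapshot that has not saved it yet.
        for delta in self.snapshot_deltas:
            if index not in delta:
                delta[index] = self.blocks[index]
                self.copies += 1
        self.blocks[index] = data

    def read_snapshot(self, snapshot, index):
        return self.snapshot_deltas[snapshot].get(index, self.blocks[index])


class LinkedClone:
    """Writable snapshot on top of a shared read-only base (the 'replica')."""

    def __init__(self, base):
        self.base = base  # a tuple, so it cannot be modified
        self.delta = {}

    def write(self, index, data):
        self.delta[index] = data  # touches only this clone's delta space

    def read(self, index):
        return self.delta.get(index, self.base[index])


if __name__ == "__main__":
    source = CowSource([b"a", b"b", b"c", b"d"])
    for _ in range(1000):
        source.take_snapshot()
    source.write(0, b"x")
    print("copies caused by one source write:", source.copies)

    base = (b"a", b"b", b"c", b"d")
    clones = [LinkedClone(base) for _ in range(1000)]
    clones[0].write(0, b"x")
    print("delta entries after one clone write:", sum(len(c.delta) for c in clones))
```

In the first model a single write to the source fans out into one copy per snapshot. In the second, the base never changes, so a write costs one delta entry no matter how many clones exist.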

To sum this post up I want to claim that:

  1. Clones (even linked) are not snapshots
  2. Most (maybe even all) storage systems do not implement the clone use case, but rather just the snapshot use case
  3. It is time for storage systems to implement clones

Filed under VDI, Virtualization

VAAI is great, not just for VMware!

VMware’s disk array offloading verbs (VAAI) seem to be a major success: any self-respecting storage vendor is implementing these verbs, so in a year or two they will be pretty common. I think a very important fact about VAAI is that it uses standard T10 SCSI commands (note that in vSphere 4 you need a vendor specific plugin, but in vSphere 5 the T10 verbs are supported without any plugin). As the T10 verbs are just standard SCSI commands, nothing limits their use to VMware environments. This makes the four existing verbs very useful for many use cases that are not VMware related:

Extended copy (xcopy): a server side copy mechanism. By itself xcopy is not new to the SCSI standard, but its general (asynchronous) form is so complex that it is hardly implemented and therefore hardly used. VMware was brilliant enough to find a way to simplify xcopy by using its hidden synchronous version (this version is hidden so well that I had to read the spec several times to convince myself that such a mode exists). The result is that now every VAAI capable array has a very useful and simple server side copy verb that can be used for things such as:

  • User file data copy – requires help from the file system, but can offload the entire data operation to the disk array!
  • Snapshot copy on write – if the COW grains are relatively large, XCOPY may offload much of the overhead of the COW operation.
  • Volume mirroring/BCV style copies – during resync and other operations

Write same: the storage form of memset(). It is used by VMware to initialize storage space to zero. There are many similar cases in general purpose systems that can use it for initialization or similar tasks.

Compare and Write (ATS): the storage form of compare and swap. This is a very cool verb because it opens the world of “lockless” synchronization algorithms to any distributed application or system. “Lockless” algorithms are much more efficient than the current lock based reserve/release or persistent reservation mechanisms. I really hope distributed file systems, clustering software, databases and other applications will use this verb.
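To show how a compare and swap style verb enables this kind of synchronization, here is a toy in-memory model. It is not the SCSI command itself: `BlockDevice`, `try_acquire`, `release` and `contend` are hypothetical names, and the internal `threading.Lock` only stands in for the atomicity that the array is assumed to guarantee for the verb.

```python
import threading


class BlockDevice:
    """Toy shared LUN exposing a compare-and-write style operation."""

    def __init__(self, num_blocks, block_size=512):
        self.block_size = block_size
        self._blocks = [bytes(block_size) for _ in range(num_blocks)]
        # Stands in for the array-side atomicity of the verb (an assumption).
        self._atomic = threading.Lock()

    def read(self, lba):
        return self._blocks[lba]

    def compare_and_write(self, lba, expected, new):
        """Write `new` only if the block still equals `expected`; report success."""
        with self._atomic:
            if self._blocks[lba] != expected:
                return False
            self._blocks[lba] = new
            return True


def _owner_block(device, owner):
    """Encode an owner tag as a full, zero-padded block."""
    return owner.ljust(device.block_size, b"\0")


def try_acquire(device, lba, owner):
    """Claim the on-disk lock block if it is currently free (all zeros)."""
    free = bytes(device.block_size)
    return device.compare_and_write(lba, free, _owner_block(device, owner))


def release(device, lba, owner):
    """Free the lock block, but only if `owner` still holds it."""
    free = bytes(device.block_size)
    return device.compare_and_write(lba, _owner_block(device, owner), free)


def contend(device, lba, owners):
    """Let several 'hosts' race for the same lock block; return the winners."""
    winners = []

    def attempt(owner):
        if try_acquire(device, lba, owner):
            winners.append(owner)

    threads = [threading.Thread(target=attempt, args=(o,)) for o in owners]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners


if __name__ == "__main__":
    lun = BlockDevice(num_blocks=4)
    hosts = [b"host-%d" % i for i in range(8)]
    print("winners:", contend(lun, 0, hosts))
```

No host has to reserve the whole device: each one issues a single conditional write, and exactly one of the racing writes succeeds. The losers can simply retry later or back off.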

Unmap (“trim”): this verb tells a thin provisioning capable storage system (and most of today’s storage systems are) that a specific area is not used by the file system or other application. Without it, the entire idea of thin provisioning is a bit pointless if a filesystem is used on top of the volume: over time the filesystem writes to the entire volume space, which forces the storage to allocate space for it, and the space saving is lost. The concept that the file system should inform the volume beneath it that it is not using a specific storage area is already well known and accepted: NTFS and ext4 (maybe other file systems too) can send TRIM commands if they know that they are working above an SSD. This is exactly what is needed for any thin provisioning capable storage as well. I have high hopes that implementing such UNMAP support is already on the todo lists of many file system developers. (BTW, I am not claiming that TRIM and UNMAP are the same. I know they are completely different. I am claiming that from the filesystem’s view they are the same.)
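The effect on space accounting can be sketched with a toy thin volume. This is an illustration under assumptions: `ThinVolume` is a hypothetical model that allocates a block on first write and releases it on unmap, and the "file system" is reduced to a few writes and deletes.

```python
class ThinVolume:
    """Toy thin-provisioned volume: space is allocated only when a block is written."""

    def __init__(self, virtual_blocks, block_size=512):
        self.virtual_blocks = virtual_blocks
        self.block_size = block_size
        self._allocated = {}

    def write(self, lba, data):
        if not 0 <= lba < self.virtual_blocks:
            raise IndexError("LBA out of range")
        self._allocated[lba] = data

    def read(self, lba):
        # Unallocated blocks read back as zeros.
        return self._allocated.get(lba, bytes(self.block_size))

    def unmap(self, lba, count):
        """Tell the volume that blocks [lba, lba + count) are no longer in use."""
        for block in range(lba, lba + count):
            self._allocated.pop(block, None)

    def allocated_blocks(self):
        return len(self._allocated)


if __name__ == "__main__":
    volume = ThinVolume(virtual_blocks=1000)

    # The file system writes file A to blocks 0..99, then deletes it.
    for lba in range(100):
        volume.write(lba, b"A" * 512)
    # A delete only changes file system metadata; the volume still holds 100 blocks.
    print("after deleting A without unmap:", volume.allocated_blocks())

    # File B lands on fresh blocks 100..199, so the allocation keeps growing.
    for lba in range(100, 200):
        volume.write(lba, b"B" * 512)
    print("after writing B:", volume.allocated_blocks())

    # With unmap, the space of the deleted file goes back to the pool.
    volume.unmap(0, 100)
    print("after unmapping A's blocks:", volume.allocated_blocks())
```

Without the unmap call, the volume has no way of knowing that the first 100 blocks are garbage, so its allocation can only grow toward the full virtual size.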

An additional note: even within VMware systems, VAAI verbs can be used in many more places than they are today. I hope to write an additional post on such cases.

Filed under Enterprise Storage, Virtualization

Desktop virtualization (VDI): is it too complex?

I have been following VDI technologies and solutions since people started to talk about them (around 2003), and I even participated in VDI technology development in my days at Qumranet. After 8 years, I am reviewing the current VDI solutions, and I have one very clear observation: they are far too complex. With the complexity come high operating costs (OPEX) and expensive setups (i.e. high CAPEX). I think that something went very wrong with VDI along the way. Just to be clear, I am not criticizing a specific solution. I think that the dominant VDI architecture is just wrong, regardless of the vendor. As I see it, VDI solutions are built like this:

Take a server virtualization technology, use it to run many desktops on each physical host, add a decent remoting protocol, multimedia acceleration (optionally also WAN acceleration),  desktop to user broker, user (login) portal and/or other access control, several provisioning mechanisms, several update/patch mechanisms, several image cleanup mechanisms, application virtualization, profile virtualization, application streaming, user data redirection, antivirus accelerator, a management console to manage pools, another one to manage applications, storage solution for the storage storms and network solution for the network storms. If I didn’t miss something critical (and I am sure I did), you have a VDI solution. Oops! I totally forgot the OS, the system utilities, and the applications (but they are old news ;-))…

The above seems to be a good base for another Carlin style gig (see Modern man) but it can’t be a good basis for a solid enterprise level solution.

I have many thoughts on why this is so, and on what the solution is, but this has to wait for another post.

Filed under VDI

SSD Dedup and VDI

I found this nice Symantec blog post about SSD+Dedup+VDI issues on the DCIG site. Basically I agree with its main claim that SSD+Dedup is a good match for VDI. On the other hand, I think that the 3 potential “pitfalls” mentioned in the post are probably relevant for a naive storage system, and much less so for an enterprise level disk array. Here is why (the bulleted parts are citations from the original post):

  • Write I/O performance to SSDs is not nearly as good as read I/Os. SSD read I/O performance is measured in microseconds. So while SSD write I/O performance is still faster than writes to hard disk drives (HDDs), writes to SSDs will not deliver nearly the same performance boost as read I/Os plus write I/O performance on SSDs is known to degrade over time.
This claim is true only for non-enterprise level SSDs. Enterprise level SSDs suffer much less from write performance degradation, and thanks to their internal NVRAM, their write latency is as good as their read latency, if not better. Furthermore, most disk arrays have non-trivial logic and enough resources to handle these issues even if the SSDs cannot.
  • SSDs are still 10x the cost of HDDs. Even with the benefits provided by deduplication an organization may still not be able to justify completely replacing HDDs with SSDs which leads to a third problem.
There is no doubt that SSDs are at least 10x more expensive than HDDs in terms of $/GB. But when comparing the cost of the complete solution, the outcome is different. In many VDI systems the real storage constraint is IOPS and not capacity. This means that an HDD based solution may need to over-provision the system capacity and/or use small disks so that it has enough (HDD) spindles to satisfy the IOPS requirements. In this case, the real game is IOPS/$, where SSDs win big time. Together with the dedup oriented space reduction, the total solution’s cost may be very attractive.
  • Using deduplication can result in fragmentation. As new data is ingested and deduplicated, data is placed further and further apart. While fragmentation may not matter when all data is stored on SSDs, if HDDs are still used as part of the solution, this can result in reads taking longer to complete.

Basically I agree, but again the disk array logic may mitigate at least some of the problem. Of course a 100% SSD solution is better (much better in some cases), but the problem is that such solutions are still very rare, if they exist at all.
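The IOPS/$ argument from the second point can be turned into a back-of-the-envelope calculation. Every number below is an assumption chosen only to illustrate the reasoning (1000 desktops, 20 GB and 50 peak IOPS each, made-up drive sizes, speeds and prices, a 5:1 dedup ratio), and the model ignores RAID overhead, write penalties and everything else a real sizing exercise would include.

```python
def drives_needed(capacity_gb, iops, drive_capacity_gb, drive_iops):
    """Drives required to satisfy both the capacity and the IOPS requirement."""
    by_capacity = -(-capacity_gb // drive_capacity_gb)  # integer ceiling division
    by_iops = -(-iops // drive_iops)
    return max(by_capacity, by_iops)


def solution_cost(capacity_gb, iops, drive_capacity_gb, drive_iops, drive_price):
    return drives_needed(capacity_gb, iops, drive_capacity_gb, drive_iops) * drive_price


# Assumed workload: 1000 desktops, 20 GB and 50 peak IOPS each.
CAPACITY_GB = 1000 * 20
PEAK_IOPS = 1000 * 50

# Assumed drives: the SSD is 10x more expensive per GB than the HDD.
HDD = dict(drive_capacity_gb=600, drive_iops=200, drive_price=300)     # $0.5/GB
SSD = dict(drive_capacity_gb=400, drive_iops=20000, drive_price=2000)  # $5/GB
DEDUP_RATIO = 5  # assumed space reduction on the SSD solution

hdd_cost = solution_cost(CAPACITY_GB, PEAK_IOPS, **HDD)
ssd_cost = solution_cost(CAPACITY_GB // DEDUP_RATIO, PEAK_IOPS, **SSD)

if __name__ == "__main__":
    print("HDD solution: $%d" % hdd_cost)
    print("SSD + dedup solution: $%d" % ssd_cost)
```

With these made-up figures the HDD configuration is sized by IOPS (250 spindles where 34 would cover the capacity), while the SSD configuration is sized by the deduplicated capacity, and it ends up cheaper despite the 10x $/GB gap.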

Filed under Enterprise Storage, ssd, Storage architectures, VDI, Virtualization