Category Archives: Virtualization

Clones are not (writable) snapshots!

Everyone who has ever used server or desktop virtualization has probably used clones. Even though “clone” is not a well-defined storage term, in most cases it is used to describe a data (image) copy. Technically this “copy” can be achieved using several technologies: full copies, snapshot-based (delta/copy-on-write) copies, mirror-based (BCV) copies, etc. VMware uses the term “full clone” to describe a plain copy, while clones that use a delta/copy-on-write (snapshot) mechanism are called linked clones. Some people treat clones only as linked clones and/or writable snapshots. NetApp has a feature called FlexClone that is just a writable snapshot.

My view is that the term “clone” (as used in virtualization systems) should describe the use case, not the technology. Even though snapshots and clones may use the same underlying technology, their use cases and usage patterns are not the same. For example, in many systems the snapshots’ source volume is more important to the user than its snapshots and has a preferred status over them (the backup scenario). Technically, the source volume is often fully provisioned and has strict space accounting and a manual removal policy, while the snapshots are likely to be thin-provisioned (“space efficient”), may have an automatic removal (expiration/exhaustion) policy, and may get only soft/heuristic space management (see XIV for example).

This preferred-source scheme does not work for clones; in many cases the source of the clones is just a template that is never used by itself, so you can store it on a much less powerful storage tier, and once you have finished generating the clones you can delete it if you want. The outcome of the cloning is much more important than the template: if space runs out you may delete a few old templates, but you won’t remove the clones while they are in use – each of them is a standalone VM image.

This can be demonstrated by VMware’s linked clones, which are implemented as writable snapshots on top of a read-only base. When you generate a linked-clone pool using VMware View Manager, the manager creates a read-only full clone (the “replica”) and takes snapshots of it. This (clever) scheme hides the snapshot source, and in most cases you don’t directly manage or use replicas. The base template has no role after the cloning ends and can be deleted.

Another major difference is the creation pattern: snapshot-creation events tend to be periodic (backup/data-set-separation scenarios), while clone creation (at least for VDI use cases) tends to be bursty – each time clones are created from a base (template/replica), many of them are created at once.

This means that if you build a graph of source-snapshot (or clone) creation over time, a typical snapshot graph will be a dense, long tree (see below), while the equivalent clone tree will be very shallow but with a big span-out factor (maybe it should be called a clone bush 😉 ). The following diagrams depict such graphs:

Typical snapshots tree

Typical clones tree

Due to these differences, even though under the hood snapshots and (linked) clones may be implemented using the same technologies, they should not be implemented in the same way: many (if not most) implementation assumptions for snapshots are not valid for clones, and vice versa!

A very good example of such an assumption is the span-out level. Many snapshots are implemented as follows: the source has its own guaranteed space, and each snapshot has its own delta space. When a block in the source is modified, the old block is copied to the snapshots’ delta spaces (copy-on-write). This common technique is very efficient for the (primary) source and (secondary) snapshot scheme, but it also assumes that the span-out level is low – because the modified block has to be copied to each snapshot’s delta space. Imagine what happens if you have 1000 (snapshot-based) clones created from the same source!
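A toy model (not any vendor’s actual implementation) makes the fan-out cost concrete: the first overwrite of a source block costs one copy per dependent snapshot.

```python
# Toy copy-on-write model: each snapshot keeps its own delta space,
# so the first overwrite of a source block is copied to every snapshot.
class CowSource:
    def __init__(self, nblocks):
        self.blocks = [b"\x00" * 512] * nblocks
        self.snapshots = []          # list of per-snapshot delta dicts

    def snapshot(self):
        delta = {}                   # block index -> preserved old data
        self.snapshots.append(delta)
        return delta

    def write(self, idx, data):
        copies = 0
        for delta in self.snapshots:
            if idx not in delta:     # first overwrite since that snapshot
                delta[idx] = self.blocks[idx]
                copies += 1
        self.blocks[idx] = data
        return copies                # extra copy I/O caused by this write

src = CowSource(nblocks=8)
for _ in range(1000):                # 1000 "clones" done as plain snapshots
    src.snapshot()
print(src.write(0, b"\x01" * 512))   # → 1000: one write triggers 1000 copies
```

One guest write becomes a thousand backend copies – exactly the pathology the post describes.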

If we go back to VMware’s linked-clone case, the read-only replica is what enables VMware to generate many writable snapshots on top of a single source. The original snapshot mechanism cannot do that!
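The contrast can be sketched with a toy redirect-on-write clone: because the base is read-only and shared, a clone’s write touches only its own delta, no matter how many sibling clones exist.

```python
# Toy redirect-on-write clone: the base is read-only, every clone keeps
# its own private delta, and a write never copies anything to siblings.
class Clone:
    def __init__(self, base):
        self.base = base              # shared, immutable "replica"
        self.delta = {}               # this clone's private writes

    def write(self, idx, data):
        self.delta[idx] = data        # O(1), independent of clone count

    def read(self, idx):
        return self.delta.get(idx, self.base[idx])

base = [b"\x00" * 512] * 8            # the read-only replica
clones = [Clone(base) for _ in range(1000)]
clones[0].write(0, b"\x01" * 512)     # touches only clone 0's delta
print(clones[1].read(0) == base[0])   # → True: siblings are unaffected
```

Same underlying copy-on-write idea, but the write cost no longer scales with the number of clones.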

To sum this post up I want to claim that:

  1. Clones (even linked ones) are not snapshots
  2. Most (if not all) storage systems implement only the snapshot use case, not the clone use case
  3. It is time for storage systems to implement clones


Filed under VDI, Virtualization

VAAI is great, not just for VMware!

VMware’s disk-array offloading verbs (VAAI) seem to be a major success – every self-respecting storage vendor is implementing these verbs, so in a year or two they will be pretty common. I think a very important fact about VAAI is that it uses standard T10 SCSI commands (note that in vSphere 4 you would need a vendor-specific plugin, but in vSphere 5 the T10 verbs are supported without any plugin). As the T10 verbs are just standard SCSI commands, nothing limits their use to VMware environments. This makes the four existing verbs very useful for many use cases that are not VMware related:

Extended copy (XCOPY): a server-side copy mechanism. By itself XCOPY is not new to the SCSI standard, but its general (asynchronous) form is so complex that it is hardly ever implemented and therefore hardly ever used. VMware was brilliant enough to find a way to simplify XCOPY by using its hidden synchronous version (this version is hidden so well that I had to read the spec several times to convince myself that such a mode exists). The result is that now every VAAI-able array has a very useful and simple server-side copy verb that can be used for things such as:

  • User file-data copy – requires help from the file system, but can offload the entire data operation to the disk array!
  • Snapshot copy-on-write – if the COW grains are relatively large, XCOPY may offload much of the overhead of the COW operation.
  • Volume mirroring/BCV-style copies – during (re)sync and other bulk-copy operations.
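To see what the offload buys, here is a toy sketch (the interfaces are made up and look nothing like the real SCSI CDBs) contrasting a host-mediated copy with an XCOPY-style offload:

```python
# Hypothetical sketch: host-mediated copy vs. XCOPY-style offload.
# With XCOPY the host sends one descriptor; the array moves the data.

def host_copy(array, src_lun, src_lba, dst_lun, dst_lba, nblocks):
    """Classic path: every block crosses the SAN twice (read + write)."""
    for i in range(nblocks):
        block = array[src_lun][src_lba + i]   # READ over the SAN
        array[dst_lun][dst_lba + i] = block   # WRITE over the SAN
    return 2 * nblocks                        # SAN transfers used

def xcopy(array, src_lun, src_lba, dst_lun, dst_lba, nblocks):
    """Offloaded path: one command, the copy happens inside the array."""
    array[dst_lun][dst_lba:dst_lba + nblocks] = \
        array[src_lun][src_lba:src_lba + nblocks]
    return 1                                  # one command over the SAN

array = {0: [b"A"] * 16, 1: [b"\x00"] * 16}   # two tiny "LUNs"
print(xcopy(array, 0, 0, 1, 0, 8))            # → 1: data never left the array
```

The ratio only gets better as the copied region grows, which is why cloning and Storage vMotion benefit so much from it.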

Write same: the storage form of memset(). Used by VMware to initialize storage spaces to zero. There are many similar cases in general-purpose systems that can use it for initialization or similar tasks.
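In the same toy style, WRITE SAME is a memset() pushed to the array: one small command carrying a single block pattern plus a range, instead of streaming the pattern over the SAN (the interface below is made up, for illustration only):

```python
def write_same(lun, lba, nblocks, pattern):
    # One command descriptor: the array replicates `pattern` across the
    # whole range internally, no bulk data transfer from the host.
    lun[lba:lba + nblocks] = [pattern] * nblocks

lun = [b"\xff" * 512] * 1024              # a tiny "LUN" of 512-byte blocks
write_same(lun, 0, 1024, b"\x00" * 512)   # zero a whole virtual-disk region
print(lun[0] == b"\x00" * 512)            # → True
```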

Compare and Write (ATS): the storage form of compare-and-swap. This is a very cool verb because it opens the world of “lockless” synchronization algorithms to any distributed application or system. “Lockless” algorithms are much more efficient than the current lock-based reserve/release or persistent-reservation mechanisms. I really hope distributed file systems, clustering software, databases and other applications will use this verb.
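A toy model of ATS as an on-disk lock word (similar in spirit to how VMFS uses it for metadata locking; the interface here is invented):

```python
# Toy ATS (compare-and-write) model: atomically replace a "lock" block
# only if it still holds the expected contents.
import threading

_atomic = threading.Lock()            # stands in for the array's atomicity

def compare_and_write(lun, lba, expected, new):
    with _atomic:                     # the array guarantees this atomically
        if lun[lba] == expected:
            lun[lba] = new
            return True               # GOOD status: we own the resource
        return False                  # miscompare: someone else got there

FREE, OWNED = b"free", b"owned-by-host-A"
lun = {7: FREE}                       # block 7 is the on-disk lock word
print(compare_and_write(lun, 7, FREE, OWNED))   # → True (lock acquired)
print(compare_and_write(lun, 7, FREE, OWNED))   # → False (already taken)
```

Note the key property: a losing host learns it lost in a single round trip, without ever reserving the whole LUN the way reserve/release does.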

Unmap (“trim”): this verb tells a thin-provisioning-capable storage system (and most of today’s storage systems are) that a specific area is not used by the file system or other application. Without it, the entire idea of thin provisioning is a bit pointless if a filesystem is used on top of the volume – over time the filesystem writes to the entire volume space, which forces the storage to allocate space for it, and the space saving is lost. The concept that the file system should inform the volume beneath it that it is not using a specific storage area is already well known and accepted: NTFS and ext4 (and maybe other file systems too) can send TRIM commands if they know they are working above an SSD. This is exactly what is needed for any thin-provisioning-capable storage. I have high hopes that implementing such UNMAP support is already on the todo lists of many file-system developers. (BTW, I am not claiming that TRIM and UNMAP are the same. I know they are completely different. I am claiming that from the filesystem’s view they are the same.)
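As a toy illustration (not any real array’s allocation logic), thin provisioning plus UNMAP is allocate-on-write with an explicit reclaim path:

```python
# Toy thin-provisioned volume: backing space is allocated on first write
# and only ever given back when the filesystem sends an UNMAP/TRIM.
class ThinVolume:
    def __init__(self):
        self.backing = {}             # lba -> data; dict size = space used

    def write(self, lba, data):
        self.backing[lba] = data      # allocate on demand

    def unmap(self, lba, nblocks):
        for i in range(lba, lba + nblocks):
            self.backing.pop(i, None) # reclaim backing space

    def used(self):
        return len(self.backing)

vol = ThinVolume()
for lba in range(100):                # filesystem churns over the volume...
    vol.write(lba, b"x")
vol.unmap(0, 90)                      # ...then reports 90 blocks as free
print(vol.used())                     # → 10
```

Without the `unmap()` call, `used()` stays at 100 forever – the “thin” volume has silently become a fat one.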

An additional note: even within a VMware system, VAAI verbs can be used in many more places than they are today. I hope to write an additional post on such cases.


Filed under Enterprise Storage, Virtualization

SSD Dedup and VDI

I found this nice Symantec blog post about the SSD+dedup+VDI issues on the DCIG site. Basically I agree with its main claim that SSD+dedup is a good match for VDI. On the other hand, I think that the three potential “pitfalls” mentioned in the post are probably relevant for a naive storage system, and much less so for an enterprise-level disk array. Here is why (the blue parts are citations from the original post):

  • Write I/O performance to SSDs is not nearly as good as read I/Os. SSD read I/O performance is measured in microseconds. So while SSD write I/O performance is still faster than writes to hard disk drives (HDDs), writes to SSDs will not deliver nearly the same performance boost as read I/Os plus write I/O performance on SSDs is known to degrade over time.
This claim is true only for non-enterprise-level SSDs. Enterprise-level SSDs suffer much less from write-performance degradation, and thanks to their internal NVRAM, their write latency is as good as their read latency, if not better. Furthermore, most disk arrays have non-trivial logic and enough resources to handle these issues even if the SSDs cannot.
  • SSDs are still 10x the cost of HDDs. Even with the benefits provided by deduplication an organization may still not be able to justify completely replacing HDDs with SSDs which leads to a third problem.
There is no doubt that SSDs are at least 10x more expensive than HDDs in terms of $/GB. But when comparing the complete solution cost, the outcome is different. In many VDI systems the real storage constraint is IOPS, not capacity. This means that an HDD-based solution may need to over-provision the system capacity and/or use small disks so that there are enough (HDD) spindles to satisfy the IOPS requirements. In this case, the real game is IOPS/$, where SSDs win big time. Together with the dedup-driven space reduction, the total solution’s cost may be very attractive.
  • Using deduplication can result in fragmentation. As new data is ingested and deduplicated, data is placed further and further apart. While fragmentation may not matter when all data is stored on SSDs, if HDDs are still used as part of the solution, this can result in reads taking longer to complete.

Basically I agree, but again the disk array logic may mitigate at least some of the problem. Of course a 100% SSD solution is better (much better in some cases), but the problem is that such solutions are still very rare, if they exist at all.
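The IOPS-vs-capacity argument above can be put into back-of-envelope numbers (all figures below are illustrative assumptions, not quotes from any price list):

```python
# Back-of-envelope sizing: a VDI pool is usually sized for IOPS, not
# capacity. All drive specs and prices here are made-up round numbers.
import math

desktops, iops_per_desktop = 1000, 10
required_iops = desktops * iops_per_desktop          # 10,000 IOPS

hdd_iops, hdd_cost = 150, 300        # 15k RPM spindle, $ per drive
ssd_iops, ssd_cost = 20000, 2000     # enterprise SSD, $ per drive

hdds = math.ceil(required_iops / hdd_iops)           # spindles for IOPS
ssds = math.ceil(required_iops / ssd_iops)
print(hdds, hdds * hdd_cost)         # → 67 20100
print(ssds, ssds * ssd_cost)         # → 1 2000 (before redundancy/capacity)
```

Even before dedup shrinks the capacity requirement, the IOPS-driven HDD configuration costs an order of magnitude more than the SSD one under these assumptions.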


Filed under Enterprise Storage, ssd, Storage architectures, VDI, Virtualization

A day in the life of a virtualized storage request

Storage flow – VM to disk array

Server/desktop virtualization systems have so many advantages that in a few years it may be almost impossible even to imagine how one built data centers without them. However, nothing comes without a price. In this post I want to describe the complexity of virtualized storage flows. Below is a (simplified) virtualized storage request flow, starting from the application and ending at the physical storage device (disk or whatever). This flow matches a real flow within a VMware ESX host connected to a shared disk array via FC SAN (depicted in the attached illustration). The equivalent flow for other storage architectures (iSCSI, NAS) and for other virtualization systems (Xen, Hyper-V) is similar (but not identical).

  1. An application within the VM issues a storage-related request, for example READ a chunk of data located at offset X in file Y.
  2. The VM’s OS processes this request using its internal (kernel-side) file system logic. In this layer the request’s parameters are used to find the actual allocation unit, check permissions, manage the data cache, etc. Assuming the data is not in the cache, a new READ request is formed to read (part of) the allocation unit from the underlying block layer.
  3. The block layer typically includes a caching layer, a queuing system and even some storage virtualization features if a logical volume manager (LVM) is used (very common on Unix systems, less so on Windows). Assuming the block is not found in the block-level cache, it is mapped to the “real” device and offset (if LVM is used) and then sent to the appropriate disk/device driver.
  4. The disk driver is typically a para-virtualized SCSI/IDE driver. This driver uses hypervisor-specific methods to efficiently “push” the request to the virtualized SCSI/IDE bus (not depicted in the illustration).
  5. The hypervisor passes the request to the virtual disk logic (VMDK in VMware). This logic may be a trivial 1:1 mapping if the virtual disk is a simple “raw” disk. On the other hand, if the disk is a snapshot disk or a delta disk, it needs pretty complex copy-on-write (or redirect-on-write), thin provisioning and remapping logic. In the latter case another layer of mapping has to be used, and the original offset is translated to the “physical” offset. Then, with or without translation, the request is sent to the shared storage management layer – VMFS in VMware’s case. Note that data caching is problematic at this level. Read caching is OK as long as you know your source is read-only or managed by this host. Write caching (“write behind”) is problematic due to crash consistency and is avoided in most cases.
  6. VMFS is in fact a simple clustered LVM with a file system personality. VMware uses this “file system” to implement a very simple file-based image repository (very elegant, IMHO). Once the request is processed by VMFS, it is mapped to an extent within a LUN (by default extents are 1MB data blocks). The READ request is therefore translated to the target LUN and an offset (LBA) within it, and the translated request is queued into the target LUN’s queue (here I simplify things a bit).
  7. The multipath driver virtualizes several “paths” to the destination LUN to implement path fault tolerance and path load balancing. In our case it examines the request and decides which path this request is going to be queued to, according to some predefined policies and polling mechanisms (this issue deserves a few books of its own, so let’s leave it at this level). Each path is a physical and/or logical route in the SAN from the virtualization host’s HBA to the target storage system (not depicted in the illustration).
  8. The HBA driver interfaces with the physical HBA device to queue the request.
  9. The request is sent via the SAN to the target disk array (port). Let’s ignore the entire queuing system and SAN-side virtualization systems – we have enough on our plate.
  10. The target disk array is by itself a very complex system, with many virtualization mechanisms that hide its internal storage devices, share its resources and implement a wealth of advanced storage functionalities/copy services, depending on the vendor and the specific model. Anyhow, somehow the request is processed, the data is read from the cache or from one or more physical storage devices, and the request is answered.
  11. The reply travels the same via dolorosa, but in the other direction.


Two notes:

  1. The above describes a Fibre-Channel deployment. There are many other deployments.
  2. All layers are greatly simplified. The reality is much more complex.
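The chain of address translations in the steps above can be sketched in a few lines (every number and layout here is made up; the real layers are far richer):

```python
# Minimal sketch of the address translations a READ goes through.
def fs_map(file_block):               # step 2: guest FS, file -> volume LBA
    return 100 + file_block           # pretend the file starts at LBA 100

def lvm_map(lba):                     # step 3: guest LVM extent remapping
    return lba + 1000

def vmdk_map(lba):                    # step 5: virtual disk delta/thin remap
    return lba + 16

def vmfs_map(file_lba):               # step 6: VMFS extent table -> LUN+LBA
    extent = 2048                     # 1MB extents of 512-byte blocks
    extents = {0: ("LUN7", 40960)}    # extent index -> (LUN, base LBA)
    lun, base = extents[file_lba // extent]
    return lun, base + file_lba % extent

lun, lba = vmfs_map(vmdk_map(lvm_map(fs_map(8))))
print(lun, lba)                       # → LUN7 42084, the address the array sees
```

Four translations before the request even reaches the multipath layer – and the array then runs its own mapping stack on the other side.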

Here is a summary table:

Entity | Layer | Mapping | Caching | Snapshots/Copy services | Allocation/Thin provisioning
---|---|---|---|---|---
Guest | File system | Yes | Yes (md + data) | Yes (limited) | Yes, fine granularity (~4k)
Guest | Block layer | No (LVM: extent mapping) | Yes | No (LVM: snapshot, mirroring, replication) | No (LVM: extent granularity ~1MB)
Guest | Disk driver | No | No | No | No
Hypervisor | Virtual disk logic | Yes (non-“raw” formats) | No | Yes (snapshots, deltas) | Yes (VMDK uses 64k)
Hypervisor | Shared storage logic (VMFS) | Yes (extent mapping) | No | No (but still possible) | Yes (extent based)
Hypervisor | Multipath | Yes (virtual device to several devices representing paths) | No | No | No
Hypervisor | HBA driver | No (short of NPIV, PCI-V) | No | No | No
SAN | ? | ? | ? | ? | ?
Disk array | Disk array logic | Yes | Yes | Yes | Yes

I don’t know what you think, but I am always amazed that it works.

If you really want to understand the real complexity, you can read one of the reference storage architecture documents. For example:

“NetApp and VMware vSphere Storage Best Practices”

A very good blog has many posts about this subject. For example: presentation-on-vmware-vsphere-4-1-storage-features

And another: mythbusters-esxesxi-caching-io.


Filed under Storage architectures, Virtualization