Clones are not (writable) snapshots!

Everyone who has ever used server or desktop virtualization has probably used clones. Even though “clone” is not a well-defined storage term, in most cases it is used to describe a data (image) copy. Technically this “copy” can be achieved using several technologies: full copies, snapshot-based copies, mirror-based copies (BCV), etc. VMware uses the term “Full clone” to describe a copy clone, while clones that use a delta/copy-on-write (snapshot) mechanism are called linked clones. Some people treat clones only as linked clones and/or writable snapshots. NetApp has a feature called FlexClone that is just a writable snapshot.

My view is that the term “clone” (as it is used in virtualization systems) should describe the use case and not the technology. Even though snapshots and clones may use the same underlying technology, their use cases and usage patterns are not the same. For example, in many systems the snapshots’ source volume is more important to the user than its snapshots and has a preferred status over them (the backup scenario). Technically, the source volume is often fully provisioned, with strict space accounting and a manual removal policy, while the snapshots are likely to be thin provisioned (“space efficient”), may have an automatic removal (expiration/exhaustion) policy, and get only soft/heuristic space management (see XIV for example).

This preferred-source scheme will not work for clones; in many cases the source of the clones is just a template that is never used by itself, so you can store it on a much less powerful storage tier, and once you have finished generating the clones, you can delete it if you want. The outcome of the cloning is much more important than the template: if space runs out you may delete a few old templates, but you won't remove the clones while they are in use – each of them is a standalone VM image.

This can be demonstrated by VMware's linked clones, which are implemented as writable snapshots on top of a read-only base. When you generate a linked-clone pool using VMware View Manager, the manager creates a read-only full clone (the “replica”) and takes the snapshots from it. This clever approach hides the snapshot source, and in most cases you don't directly manage or use replicas. The base template has no role after the cloning ends and can be deleted.
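
To make this concrete, here is a minimal sketch of a writable snapshot built over a shared read-only replica: reads fall through to the replica, writes are redirected into a per-clone delta. This is my own illustration, not VMware's actual on-disk format; the class names, block size, and in-memory maps are all invented for the example.

```python
# Illustrative only: a linked clone modeled as a writable snapshot over a
# shared read-only replica (redirect-on-write).

BLOCK_SIZE = 4096

class Replica:
    """Read-only base image shared by all linked clones."""
    def __init__(self, blocks):
        self.blocks = blocks                      # block number -> bytes

    def read(self, blkno):
        return self.blocks.get(blkno, b"\x00" * BLOCK_SIZE)

class LinkedClone:
    """Writable snapshot: reads fall through to the shared replica,
    writes land only in this clone's private (sparse) delta."""
    def __init__(self, replica):
        self.replica = replica
        self.delta = {}                           # block number -> bytes

    def read(self, blkno):
        # Serve the block from the delta if this clone ever wrote it,
        # otherwise fall back to the shared read-only base.
        return self.delta.get(blkno, self.replica.read(blkno))

    def write(self, blkno, data):
        # Redirect-on-write: the replica is never modified, so any number
        # of clones can share it without write amplification.
        self.delta[blkno] = data

# One replica, many clones, each diverging independently.
replica = Replica({0: b"boot".ljust(BLOCK_SIZE, b"\x00")})
clones = [LinkedClone(replica) for _ in range(1000)]
clones[0].write(0, b"mine".ljust(BLOCK_SIZE, b"\x00"))
assert clones[1].read(0).startswith(b"boot")      # unaffected by clone 0's write
```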

Another major difference is the creation pattern: snapshot-creation events tend to be periodic (backup/data-set-separation scenarios), while clone creation (at least for VDI use cases) tends to be bursty, meaning that each time clones are created from a base (template/replica), many of them are created at once.

This means that if you plot source-to-snapshot (or clone) creation over time as a graph, a typical snapshot graph will be a dense and long tree (see below), while the equivalent typical clone tree will be very shallow but with a big span-out factor (maybe it should be called a clone bush 😉 ). The following diagrams depict such graphs:

Typical snapshots tree

Typical clones tree
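
As a rough illustration of these two shapes (with made-up numbers, and assuming for simplicity that each snapshot is layered on the previous state), the following sketch measures both trees: a month of snapshots comes out deep and narrow, while one burst of clones off a template comes out shallow and very wide.

```python
# Illustrative only: compare the depth and span-out of a snapshot tree
# versus a clone "bush".  The trees below are invented examples.

def shape(tree):
    """Return (depth, max_span_out) of a {parent: [children]} tree."""
    def depth(node):
        return 1 + max((depth(child) for child in tree.get(node, [])), default=0)
    root = next(iter(tree))
    return depth(root), max(len(children) for children in tree.values())

# Daily snapshots, each layered on the previous state: a long, narrow chain.
snapshot_tree = {f"day-{i}": [f"day-{i + 1}"] for i in range(30)}
# One template and one burst of 500 clones: a shallow, very wide bush.
clone_bush = {"template": [f"clone-{i}" for i in range(500)]}

print(shape(snapshot_tree))   # (31, 1)   -> deep and narrow
print(shape(clone_bush))      # (2, 500)  -> shallow and wide
```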

Due to these differences, even though under the hood snapshots and (linked) clones may be implemented using the same technologies, they shouldn't be implemented in the same way: many (if not most) implementation assumptions that hold for snapshots are not valid for clones, and vice versa!

A very good example of such an assumption is the span-out level. Many snapshots are implemented as follows: the source has its own guaranteed space, and each snapshot has its own delta space. When a block in the source is modified, the old block is copied to the snapshot delta spaces (copy on write). This common technique is very efficient for the (primary) source and (secondary) snapshot scheme, but it also assumes that the span-out level is low – because the modified block has to be copied to each snapshot's delta space. Imagine what will happen if you have 1000 (snapshot-based) clones created from the same source!
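
Here is a toy sketch of that classic scheme, with invented sizes and in-memory structures, just to show how a single source write fans out into every snapshot's delta space:

```python
# Illustrative only: a primary source volume with classic copy-on-write
# snapshots.  One write to the source copies the old block into every
# snapshot delta that has not captured it yet (the span-out problem).

class SourceVolume:
    def __init__(self, blocks):
        self.blocks = dict(blocks)        # block number -> bytes
        self.snapshots = []

    def create_snapshot(self):
        snap = {}                         # delta: pre-modification copies
        self.snapshots.append(snap)
        return snap

    def write(self, blkno, data):
        old = self.blocks.get(blkno)
        # Copy-on-write: preserve the old contents for every snapshot that
        # has not already captured this block.  With N snapshot-based
        # "clones" of one source, a single write can trigger up to N copies.
        for snap in self.snapshots:
            snap.setdefault(blkno, old)
        self.blocks[blkno] = data

    def read_from_snapshot(self, snap, blkno):
        return snap.get(blkno, self.blocks.get(blkno))

vol = SourceVolume({7: b"v1"})
snaps = [vol.create_snapshot() for _ in range(1000)]
vol.write(7, b"v2")                       # b"v1" is copied into 1000 deltas
assert vol.read_from_snapshot(snaps[0], 7) == b"v1"
```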

If we go back to VMware's linked-clone case, the read-only replica enables VMware to generate many writable snapshots on top of a single source. The original snapshot mechanism cannot do that!

To sum this post up I want to claim that:

  1. Clones (even linked clones) are not snapshots.
  2. Most (if not all) storage systems do not implement the clone use case, only the snapshot use case.
  3. It is time for storage systems to implement clones.

Filed under VDI, Virtualization

A day in the life of a virtualized storage request

Storage Flow - VM to Disk Array

Server/desktop virtualization systems have so many advantages that in a few years it may be almost impossible even to imagine how one built data centers without them. However, nothing comes without a price. In this post I want to describe the complexity of virtualized storage flows. The following is a (simplified) virtualized storage request flow, starting from the application and ending at the physical storage device (disk or whatever). This flow matches a real flow within a VMware ESX host connected to a shared disk array via FC-SAN (depicted in the attached illustration). The equivalent flow for other storage architectures (iSCSI, NAS) and for other virtualization systems (Xen, Hyper-V) is similar (but not identical).

  1. An application within the VM issues a storage-related request, for example: READ a chunk of data located at offset X in file Y.
  2. The VM’s OS processes this request using its internal (kernel side) file system logic. In this layer the request’s parameters are used to find the actual allocation unit, check permissions, manage the data cache, etc. Assuming the data is not in the cache, a new READ request is formed to read (part of) the allocation unit from the underlying block layer.
  3. The block layer typically includes a caching layer, a queuing system, and even some storage virtualization features if logical volume managers (LVMs) are used (very common on Unix systems, less so on Windows systems). Assuming the block is not found in the block-level cache, it is mapped to the “real” device and offset (if an LVM is used) and then sent to the appropriate disk/device driver.
  4. The disk driver is typically a para-virtualized SCSI/IDE driver. This driver uses hypervisor-specific methods to efficiently “push” the request to the virtualized SCSI/IDE bus (not depicted in the illustration).
  5. The hypervisor passes the request to the virtual disk logic (VMDK in VMware). Such logic may be a trivial 1:1 mapping if the virtual disk is a simple “raw” disk. On the other hand, if the disk is a snapshot disk or a delta disk, it needs to use pretty complex copy-on-write (or redirect-on-write), thin provisioning and remapping logic. In the latter case another layer of mapping has to be used and the original offset is translated to the “physical” offset. Then, with or without translation, the request is sent to the shared storage management layer – VMFS in VMware's case. Note that data caching is problematic at this level: read caching is OK as long as you know your source is read-only or managed by this host, while write caching (“write behind”) is problematic due to crash consistency and is avoided in most cases.
  6. VMFS is in fact a simple clustered LVM with a file system personality. VMware uses this “file system” to implement a very simple file-based image repository (very elegant, IMHO). Once the request is processed by VMFS, it is mapped to an extent within a LUN (by default extents are 1MB data blocks). The READ request is therefore translated to the target LUN and an offset (LBA) within it. The translated request is queued into the target LUN's queue (here I simplify things a bit; a toy sketch of this whole translation chain appears after the list).
  7. The multipath driver virtualizes several “paths” to the destination LUN to implement “path” fault tolerance and “path” load balancing. In our case it examines the request and decides which path the request will be queued to, according to some predefined policies and polling mechanisms (this issue deserves a few books of its own, so let's leave it at this level). Each path is a physical and/or logical route in the SAN from the virtualization host's HBA to the target storage system (not depicted in the illustration).
  8. The HBA driver interfaces with the physical HBA device to queue the request.
  9. The request is sent via the SAN to the target disk array (port). Let's ignore the entire queuing system and the SAN-side virtualization systems – we have enough on our plate.
  10. The target disk array is itself a very complex system, with many virtualization mechanisms that hide its internal storage devices, share its resources and implement a wealth of advanced storage functionality/copy services, depending on the vendor and the specific model. Anyhow, somehow the request is processed, the data is read from the cache or from one or more physical storage devices, and a reply is sent back.
  11. The reply then walks the same via dolorosa, but in the other direction.
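
The following sketch strings together the address translations from the steps above. Every name, offset, and size in it is invented for illustration; real guest file systems, VMDK formats, VMFS, and multipath drivers are far more elaborate than these few lines.

```python
# Illustrative only: one guest READ walking through the (simplified) layers.
from itertools import cycle

SECTOR = 512
GUEST_FS_BLOCK = 4096         # hypothetical guest file system allocation unit
VMFS_EXTENT = 1 << 20         # 1MB VMFS block, the default mentioned in step 6

def guest_fs_map(file_offset, file_extent_start_lba):
    """Steps 2-3: file offset -> LBA on the guest's virtual disk."""
    return file_extent_start_lba + file_offset // SECTOR

def vmdk_map(guest_lba, delta_grains):
    """Step 5: snapshot/delta disk logic.  If the 64k grain holding this LBA
    was written after the snapshot was taken, read the delta backing file,
    otherwise the base (real sparse-grain remapping is waved away here)."""
    grain = guest_lba // (65536 // SECTOR)
    backing = "delta.vmdk" if grain in delta_grains else "base.vmdk"
    return backing, guest_lba * SECTOR        # byte offset in the backing file

def vmfs_map(file_offset, extent_table):
    """Step 6: offset inside the .vmdk file -> (LUN id, LBA on that LUN)."""
    lun, extent_start_lba = extent_table[file_offset // VMFS_EXTENT]
    return lun, extent_start_lba + (file_offset % VMFS_EXTENT) // SECTOR

# Step 7: a trivial round-robin multipath policy over two imaginary paths.
paths = cycle(["vmhba1:C0:T0:L5", "vmhba2:C0:T1:L5"])

# Walk a single READ through the stack (all numbers are invented).
guest_lba = guest_fs_map(file_offset=2 * GUEST_FS_BLOCK, file_extent_start_lba=128)
backing, byte_off = vmdk_map(guest_lba, delta_grains=set())
lun, lun_lba = vmfs_map(byte_off, extent_table={0: (5, 0), 1: (5, 4096)})
print(f"READ {backing} offset {byte_off} -> LUN {lun} LBA {lun_lba} via {next(paths)}")
```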

Notes:

  1. The above describes a Fibre-Channel deployment. There are many other deployments.
  2. All layers are greatly simplified. The reality is much more complex.

Here is a summary table:

| Entity | Layer | Mapping/Virtualization | Caching | Snapshots/Copy services | Allocation/Thin provisioning |
|---|---|---|---|---|---|
| Guest | File system | Yes | Yes (md + data) | Yes (limited) | Yes, fine granularity (~4k) |
| Guest | Block layer | No (LVM: extent mapping) | Yes | No (LVM: snapshots, mirroring, replication) | No (LVM: extent granularity ~1MB) |
| Guest | Disk driver | No | No | No | No |
| Hypervisor | Virtual disk logic | Yes (non-“raw” formats) | No | Yes (snapshots, deltas) | Yes (VMDK uses 64k) |
| Hypervisor | Shared storage logic (VMFS) | Yes (extent mapping) | No | No (but still possible) | Yes (extent based) |
| Hypervisor | Multipath | Yes (virtual device to several devices representing paths) | No | No | No |
| Hypervisor | HBA driver | No (short of NPIV, PCI-V) | No | No | No |
| SAN | ? | ? | ? | ? | ? |
| Disk Array | Disk array logic | Yes | Yes | Yes | Yes |

I don’t know what you think, but I am always amazed that it works.

If you really want to understand the real complexity, you can read one of the reference storage architecture documents. For example:

“NetApp and VMware vSphere Storage Best Practices”

A very good blog that has many posts about this subject:

blog.scottlowe.org

For example this post: presentation-on-vmware-vsphere-4-1-storage-features

And yellow-bricks.com,

For example: mythbusters-esxesxi-caching-io.

Filed under Storage architectures, Virtualization