Category Archives: Storage architectures

Cache In The Shadows

Disk array side read caches are much less effective than you might expect, due to a phenomenon a friend of mine calls “cache shadowing“.  This happens because many, if not most, IO-oriented applications have an application-level cache, and in addition you can frequently find at least one more level of cache (the OS block/file system cache) before a read request reaches the disk array. The application and OS levels enjoy several advantages over the disk array cache:

  1. They have knowledge about the application and therefore can have smarter (i.e. more efficient) cache algorithms, and
  2. The aggregate size of the memory used for caching by all clients may be much larger than the disk array cache, even if the disk array is equipped with a very large cache, and most important,
  3. The application-level cache and OS cache handle the read request before the disk array has a chance to see it. This allows them to exploit the best of the spatial locality and the temporal locality that all cache algorithms rely on (see Locality of reference).

The above points may lead to a situation where the read requests that do reach the disk array, after passing through the external (from the disk array’s point of view) cache levels, are very hard to cache. In other words, the application/server caches shadow the disk array cache (a nice metaphor, isn’t it?).  In this post I want to discuss how all-flash disk arrays affect this cache shadowing phenomenon, and to suggest situations where cache shadowing is less dominant.
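
To make the shadowing effect concrete, here is a minimal simulation sketch (my own illustration with made-up numbers, not taken from any specific product): two stacked LRU caches stand in for the host-side cache and the array-side read cache, and the workload is skewed so that a small hot set gets most of the accesses. The host cache absorbs almost all of the locality, so the array cache mostly sees cold, one-off misses that are very hard to cache.

```python
# Minimal sketch of cache shadowing: the host cache sits in front of the
# array cache and strips the workload of its temporal locality.
import random
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = self.misses = 0

    def access(self, block):
        """Return True on a hit; on a miss, admit the block (LRU eviction)."""
        if block in self.entries:
            self.entries.move_to_end(block)
            self.hits += 1
            return True
        self.misses += 1
        self.entries[block] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)
        return False

host_cache = LRUCache(capacity=10_000)    # application + OS caches (aggregated)
array_cache = LRUCache(capacity=50_000)   # disk array read cache (even 5x larger)

# Skewed (Zipf-like) workload: a small hot set gets most of the accesses.
for _ in range(1_000_000):
    block = int(random.paretovariate(1.2)) % 1_000_000
    if not host_cache.access(block):      # host-side miss ...
        array_cache.access(block)         # ... is the only request the array ever sees

for name, cache in (("host", host_cache), ("array", array_cache)):
    total = cache.hits + cache.misses
    print(f"{name} cache hit rate: {cache.hits / total:.1%}")
```

Even though the array cache is five times larger in this toy run, its hit rate stays close to zero – the upstream caches have already served every request that had any locality worth exploiting.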

First of all, all-flash disk arrays add another factor against the read cache – the flash media is so fast (i.e. has such low latency) that you don’t need a read cache! Remember that read caches were invented mainly to hide the latency of the (slow) HDD media. As the read latency of flash media may be as low as 50 usec (or lower), the benefit of hiding this latency is minimized, or even eliminated.

So is it the end of the read cache as we know it? Yes and no. Yes, because straightforward data caches are much less needed. No, because other types of read caches, such as content aware caches, are still effective.

Content aware caches are caches that cache blocks by their content and not by their (logical or physical) addresses. Such caches can be efficient when the disk array encounters a use case where the same content is read through a large number of addresses.

Sounds complex? Here is an example: let’s say the disk array stores a VMFS LUN with 50 Win7 VMs (full clones), and all VMs are booted in parallel (i.e. a “boot storm”). Most IOs during the boot process are read IOs (see “understanding how…”) and each VM reads its own set of (OS) files from its own set of locations (this is not the case with linked clones, but let’s put that aside for a moment). You may not be very surprised to learn that the content of these OS files is almost the same across all VMs. A normal address-based cache is not very efficient in such a use case, because the aggregate amount of data and the number of data block locations read during this boot storm may be very large, but a content aware cache ignores the addresses and considers only the content, which repeats itself across the VMs. In such a case the disk array’s content aware cache has an “unfair advantage” over the local servers’ caches (the Win7 VMs’ caches in this example) and therefore can be very effective.
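
As a rough sketch of the mechanism (an assumption about how such a cache could work, not a description of any specific array), suppose the array already maintains a dedup-style map from every (LUN, LBA) address to a fingerprint of the block’s content; the read cache is then keyed by fingerprint, so 50 VMs reading the same OS bits from 50 different address ranges all land on the same cache entry. The names dedup_map and backend_read are hypothetical hooks for this illustration.

```python
# Toy content aware read cache: blocks are cached by content fingerprint,
# not by address, so identical content at different addresses is cached once.
class ContentAwareCache:
    def __init__(self, dedup_map, backend_read):
        self.dedup_map = dedup_map        # assumed array metadata: (lun, lba) -> fingerprint
        self.backend_read = backend_read  # reads one block from the flash/HDD back end
        self.by_fingerprint = {}          # fingerprint -> cached block data

    def read(self, lun, lba):
        fp = self.dedup_map[(lun, lba)]   # address -> content identity
        data = self.by_fingerprint.get(fp)
        if data is None:                  # cold content: fetch it once from the media
            data = self.backend_read(lun, lba)
            self.by_fingerprint[fp] = data
        return data                       # every VM reading the same OS block hits here
```

During the boot storm above, the 50 full clones generate 50 different address streams, but only one copy of the shared OS content ever has to sit in this cache.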

Of course, such content aware caches are not very common in the current generation of disk arrays, but this is going to change in the next generation of disk arrays.



Filed under all-flash disk array, Enterprise Storage, Storage architectures

SSD Dedup and VDI

I found this nice Symantec blog post about the SSD+Dedup+VDI issues on the DCIG site. Basically I agree with its main claim that SSD+Dedup is a good match for VDI. On the other hand, I think that the 3 potential “pitfalls” mentioned in the post are probably relevant for a naive storage system, and much less so for an enterprise-level disk array. Here is why (the bulleted parts are citations from the original post):

  • Write I/O performance to SSDs is not nearly as good as read I/Os. SSD read I/O performance is measured in microseconds. So while SSD write I/O performance is still faster than writes to hard disk drives (HDDs), writes to SSDs will not deliver nearly the same performance boost as read I/Os plus write I/O performance on SSDs is known to degrade over time.
This claim is true mainly for non-enterprise SSDs. Enterprise-level SSDs suffer much less from write performance degradation, and thanks to their internal NVRAM, their write latency is as good as their read latency, if not better. Furthermore, most disk arrays have non-trivial logic and enough resources to handle these issues even if the SSDs cannot.
  • SSDs are still 10x the cost of HDDs. Even with the benefits provided by deduplication an organization may still not be able to justify completely replacing HDDs with SSDs which leads to a third problem.
There is no doubt that SSDs are at least 10x more expensive than HDDs in terms of $/GB. But when comparing the complete solution cost, the outcome is different. In many VDI systems the real main storage constraint is IOPS and not capacity. This means that an HDD based solution may need to over-provision the system capacity and/or use small disks so that there are enough (HDD) spindles to satisfy the IOPS requirements. In this case, the real game is IOPS/$, where SSDs win big time. Together with the dedup-oriented space reduction, the total solution’s cost may be very attractive (see the sizing sketch at the end of this post).
  • Using deduplication can result in fragmentation. As new data is ingested and deduplicated, data is placed further and further apart. While fragmentation may not matter when all data is stored on SSDs, if HDDs are still used as part of the solution, this can result in reads taking longer to complete.

Basically I agree, but again the disk array logic may mitigate at least some of the problem. Of course a 100% SSD solution is better (much better in some cases), but the problem is that such solutions are still very rare, if they exist at all.
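
Here is the back-of-the-envelope sizing sketch mentioned above (the workload numbers are hypothetical, not taken from the original post): when IOPS rather than capacity is the constraint, the HDD spindle count is dictated entirely by IOPS, while the dedup-reduced capacity keeps the SSD count low.

```python
# Rough VDI sizing sketch: how many drives satisfy both the IOPS target
# and the (post-dedup) capacity target?
def drives_needed(iops_target, capacity_tb, iops_per_drive, tb_per_drive):
    by_iops = -(-iops_target // iops_per_drive)      # ceiling division
    by_capacity = -(-capacity_tb // tb_per_drive)
    return max(by_iops, by_capacity)

# Hypothetical workload: 500 desktops x 20 IOPS each, 10 TB after dedup.
iops_target, capacity_tb = 10_000, 10

hdds = drives_needed(iops_target, capacity_tb, iops_per_drive=200, tb_per_drive=1)
ssds = drives_needed(iops_target, capacity_tb, iops_per_drive=20_000, tb_per_drive=1)

print(f"HDDs needed: {hdds}")   # 50 spindles, dictated purely by IOPS
print(f"SSDs needed: {ssds}")   # 10 drives, dictated purely by capacity
```

With numbers like these, the raw $/GB gap matters much less than it first appears, because the HDD solution is paying for roughly 40 TB of capacity it does not need.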


Filed under Enterprise Storage, ssd, Storage architectures, VDI, Virtualization

Storage dinosaurs – beware, here comes the SSD!

Magnetic spinning disks (hard drives – HDDs) have been the dominant primary storage device in the storage world for at least 50 years. This era is over! Within a few years (more or less) the SSD will replace the HDD as the primary storage device. Moreover, already in the near future we will see real SSD based storage systems (i.e. systems that use SSDs as the primary device and not as a cache/tier-zero device). When these systems become available, some very fundamental ground rules of storage and storage performance thinking will have to change. In this post I want to present several common conventions and areas that are heavily affected by “HDD thinking” and may change when SSD systems start to impact the storage market.

Storage devices are sloooooow!

Compared to any other major computer component, HDDs are slow. Their bandwidth is modest and their latency is disastrous (see also the sequential/random access section below). This makes the HDD, as the major primary storage device, the major performance bottleneck of many systems, and the major performance bottleneck of most storage systems. Many storage related and some non storage related sub-systems are under-optimized simply because the HDD bottleneck conceals other sub-systems’ performance problems.

The SSD effect:

SSDs are at least an order of magnitude faster than HDDs. They are still not comparable to RAM, but connect several enterprise level SSDs to your system, and you will have enough “juice” to hit other system component bottlenecks (memory bus, interface bus, interrupt handling, pure CPU issues, etc.). This changes the entire performance balance of many systems.

Random vs. Sequential

Due to its physical design, a typical HDD can efficiently access contiguous (sequential) areas on the media, while access to non-contiguous (random) areas is very inefficient. A typical HDD can stream about 100MB/sec (or more) of sequential data, but can access no more than about 200 different areas on the media per second, such that accessing random 4k blocks reduces the effective bandwidth to about 800KB/sec!

This bi-modal behavior has affected storage system “thinking” very deeply, to the point that almost every application that accesses storage resources is tuned to this behavior. This means that:

  • Most applications attempt to minimize random storage accesses and prefer sequential accesses if they can, and/or attempt to access approximately “close” areas.
  • Applications that need good random bandwidth attempt to access the storage using big blocks. This helps because accessing a random 512 bytes “costs” about the same as accessing 32KB of data: the data transfer time from/to an HDD is relatively small compared to the seek time (movement of the disk’s head) and the rotational latency (the time the disk has to wait until the disk platter(s) rotate such that the data block is under the head). See the rough model after this list.
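
A rough model of this bi-modal behavior (illustrative numbers: ~5 msec positioning time per IO, ~100MB/sec streaming bandwidth) shows both effects at once: small random IOs collapse the effective bandwidth to well under 1MB/sec, and a random 512-byte IO really does cost almost the same as a random 32KB IO.

```python
# Why random small-block IO is so costly on an HDD: the per-IO positioning
# time (seek + rotational latency) dwarfs the data transfer time.
def hdd_random_io(block_kb, positioning_ms=5.0, streaming_mb_s=100.0):
    transfer_ms = block_kb / (streaming_mb_s * 1024) * 1000   # time to move the data itself
    io_ms = positioning_ms + transfer_ms
    iops = 1000 / io_ms
    return iops, iops * block_kb / 1024                       # (IOPS, effective MB/s)

for block_kb in (0.5, 4, 32, 256):
    iops, mb_s = hdd_random_io(block_kb)
    print(f"{block_kb:>6} KB random IOs: {iops:6.0f} IOPS, {mb_s:6.2f} MB/s")
```

With 4KB blocks this gives roughly 200 IOPS and ~0.8MB/sec – the numbers quoted above – while the same drive streams 100MB/sec sequentially.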

The SSD effect:

The media used to build SSDs is mostly RAM or flash. Both provide very good random access latency and bandwidth, comparable to (but still lower than) the SSD’s sequential access latency and bandwidth. Flash media has other limitations (no direct rewrite ability) that force SSD designers to implement complex redirect-on-write schemes. In most cases a sequential write access is much easier to handle than a random write access, so the bi-modal behavior is somewhat retained. On the other hand, read operations are much less affected by this complexity, and enterprise level SSDs are built in a way that minimizes the sequential/random access performance gap. The sequential write performance of a typical desktop SSD is about an order of magnitude better than its random write performance. For enterprise level SSDs, the gap may narrow down to a factor of about 2, and in many cases the random performance is good enough (i.e. it is not the performance bottleneck of the system). More specifically, random access patterns using small blocks are not performance horrors anymore.

Storage related data structures

When applications need to access data on HDDs they tend to organize it in a way that is optimized for the random/sequential bi-modality. For example, data chunks that have a good chance of being accessed together, or at least during a short period of time, are stored close to each other (or in storage talk – identify “locality of reference” and transform temporal locality into spatial locality). Applications also tend to use data structures that are optimized for such locality of reference (such as B-Trees) and avoid data structures that are not (such as hash tables). Such data structures by themselves may introduce additional overheads for random/sparse access patterns, creating a feedback loop in which the motivation to use sequential accesses gets bigger and bigger.

The SSD effect:

Data structures that are meant to exploit the data’s locality of reference still work for SSDs. But as the sequential/random access gap is much smaller, such data structures may cease to be the best structures for many applications, as there are other (application dependent) issues that should be the focus of the data structure optimizations instead of the common storage locality of reference. For example, sparse data structures such as hash tables may be much more applicable for some use cases than the currently used data structures.
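
As a small toy illustration of that shift (my own example, not from any particular system): a sorted, B-tree-like layout keeps a key range in adjacent slots, which is exactly what an HDD wants for a range scan, while a hash-based layout scatters the same range across the whole address space – painful on an HDD, mostly harmless on an SSD.

```python
# Sorted layout vs. hashed layout for the same 100-key range scan.
import hashlib

keys = [f"user:{i:04d}" for i in range(1_000)]
sorted_slot = {k: i for i, k in enumerate(sorted(keys))}    # B-tree-like ordering

def hash_slot(key, n_slots=1_000):
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % n_slots

scan = [k for k in sorted(keys) if "user:0100" <= k <= "user:0199"]
print("sorted layout slots:", sorted_slot[scan[0]], "to", sorted_slot[scan[-1]])   # 100 to 199, contiguous
print("hashed layout slots:", sorted(hash_slot(k) for k in scan)[:5], "...")       # scattered all over
```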

Read and write caches

HDD random access latencies are huge compared to most other computer sub-systems. For example, a typical round-trip time of a packet in a modern 1Gb Ethernet LAN is about 100 usec. The typical latency of a random 4k HDD IO (read or write) operation is about 4-10 msec (i.e. 40-100 times slower). Read and write caches attempt to reduce, or at least hide, some of the HDD random IO latency:

  • Read caches keep a copy of the frequently accessed data blocks in some faster media, most commonly RAM. During the read IO flow, the requested read (offset) is looked up in the read cache, and if it is found the slow IO operation is completely avoided.
  • Write caches keep the last written data in much faster media (e.g. RAM/NVRAM) and write it back (“destage”) to the HDD layer in the background, after the user write is acknowledged. This “hides” most of the write latency and lets the system optimize the write (“destage”) bandwidth by applying write-coalescing (also known as “write combining”) and write-reordering (e.g. the “elevator algorithm”) techniques.

Of course the effect of both caches is limited:

  • Read caches are effective only if the user data access pattern has enough locality of reference and the total dataset is small enough.
  • Write caches are effective only when the destage bandwidth is higher than the user data write bandwidth. For most HDD systems this is not the case, so write caches are good for handling short spikes of writes, but when the cache buffer fills up, the user write latency drops back to the HDD write latency (or to be more exact, to the destage latency). See the sketch after this list.
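
The write cache behavior described above can be sketched in a few lines (a minimal illustration, not any specific product’s design): writes are acknowledged as soon as they land in an NVRAM-like buffer and a background thread destages them to the slow media; while the buffer has room the user sees only the buffering latency, and once it fills up every new write has to wait for a destage slot.

```python
# Minimal write-behind (write-back) cache sketch with background destaging.
import collections, threading, time

class WriteBehindCache:
    def __init__(self, capacity, destage_fn):
        self.buffer = collections.deque()
        self.capacity = capacity
        self.destage_fn = destage_fn              # writes one block to the HDD/SSD layer
        self.cond = threading.Condition()
        threading.Thread(target=self._destager, daemon=True).start()

    def write(self, block, data):
        """Acknowledge as soon as the data is buffered (fast path)."""
        with self.cond:
            while len(self.buffer) >= self.capacity:   # buffer full: latency drops to destage latency
                self.cond.wait()
            self.buffer.append((block, data))
            self.cond.notify_all()

    def _destager(self):
        while True:
            with self.cond:
                while not self.buffer:
                    self.cond.wait()
                block, data = self.buffer.popleft()
                self.cond.notify_all()
            self.destage_fn(block, data)               # slow media write, off the user's path

def slow_media_write(block, data):
    time.sleep(0.005)                                  # ~5 msec HDD-like destage

cache = WriteBehindCache(capacity=1024, destage_fn=slow_media_write)
start = time.perf_counter()
cache.write(42, b"hello")                              # returns immediately while the buffer has room
print(f"user-visible write latency: {(time.perf_counter() - start) * 1e6:.0f} usec")
```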

The SSD effect:

For some SSD based systems the traditional read cache should be much less important compared to HDD systems. The read operation latency of many SSDs is so low (about 50 usec) that it ceases to be the dominant part of the system’s read operation latency, so the reasoning behind the read cache is much weaker.

Regarding write caches, most SSDs have internal (battery or otherwise backed up) write caches, and in addition the base write latency is much lower than the HDD write latency, making the write cache much less important too. Still, as the SSD’s write latency is relatively high compared to its read latency, a write-behind buffer can be used to hide this write latency. Furthermore, unlike HDD systems, it should be relatively easy to build a system such that the destage bandwidth is high enough to hide the media write latency from the user even during very long full-bandwidth writes.

Logging

In the early 90’s Mendel Rosenblum and J. K. Ousterhout published a very important article named “The Design and Implementation of a Log-Structured File System” where, in short, they claimed the following: read latency can be effectively handled by read caches, which leaves write latency and throughput as the major problem for many storage systems. The article suggested the following: instead of using a traditional mapping structure (such as B-Trees) to map logical locations to physical locations, use a DB-oriented technique called logging. When logging is used, each new user write operation writes its data to a contiguous buffer on the disk, ignoring the existing location of the user data on the physical media. This replaces the common write-in-place technique with a relocate-on-write technique. The motivation is to exploit the relatively fast HDD sequential write (at the expense of read efficiency, complexity and defragmentation problems). This technique has many advantages over traditional write-in-place schemes, such as better consistency, fast recovery, RAID optimizations (i.e. the ability to write only full RAID stripes), snapshot implementations, etc. (It also has some major drawbacks.)
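
In its simplest form, the logging idea boils down to an append-only log plus a mapping table (a minimal relocate-on-write sketch in the spirit of LFS, not the actual implementation):

```python
# Relocate-on-write in miniature: every write appends to the log's tail and
# updates a logical-to-physical mapping, instead of rewriting the block in place.
class LogStructuredStore:
    def __init__(self):
        self.log = []          # the append-only log (physical positions 0, 1, 2, ...)
        self.mapping = {}      # logical block address -> position in the log

    def write(self, lba, data):
        self.log.append(data)                  # always a sequential append
        self.mapping[lba] = len(self.log) - 1  # the old location becomes garbage for later cleaning

    def read(self, lba):
        return self.log[self.mapping[lba]]     # one extra lookup compared to write-in-place

store = LogStructuredStore()
store.write(7, b"v1")
store.write(7, b"v2")          # same LBA, new physical location - no in-place rewrite
assert store.read(7) == b"v2"
```

The price is visible even in this toy: reads go through an extra mapping lookup, logically adjacent blocks drift apart physically (fragmentation), and someone has to clean the garbage left behind by overwrites.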

The SSD effect:

The original motivation for logging is not valid anymore: random write performance is not as big a problem for SSDs as it is for HDDs, so the additional complexity may not be worth the performance gain. Still, logging techniques may be relevant for SSD based systems in the following cases:

  • Storage systems that directly manage flash components may use logging to also solve several flash problems, most importantly the lack of direct rewrite support and the wear leveling problem.
  • Storage systems may use logging to implement better RAID schemes, and as a base for advanced functionalities (snapshots, replication, etc.)


Filed under Enterprise Storage, ssd, Storage architectures

A day in the life of a virtualized storage request

(Illustration: Storage Flow – VM to Disk Array)

Server/Desktop virtualization systems have so many advantages that in a few years it may be almost impossible even to imagine how one built data centers without them. However, nothing comes without a price. In this post I want to describe the complexity of virtualized storage flows. Following is a (simplified) virtualized storage request flow, starting from the application and ending in the physical storage device (disk or whatever). This flow matches a real flow within a VMWare ESX host connected to a shared disk array via FC-SAN (depicted in the attached illustration); a toy sketch of the resulting address translations appears right after the numbered list. The equivalent flow for other storage architectures (iSCSI, NAS) and for other virtualization systems (Xen, Hyper-V) is similar (but not identical).

  1. An application within the VM issues a storage related request, for example: READ a chunk of data located at offset X in file Y.
  2. The VM’s OS processes this request using its internal (kernel side) file system logic. In this layer the request’s parameters are used to find the actual allocation unit, check permissions, manage the data cache, etc. Assuming the data is not in the cache, a new READ request is formed to read (part of) the allocation unit from the underlying block layer.
  3. The block layer typically includes a caching layer, a queuing system and even some storage virtualization features if logical volume managers (LVM) are used (very common for Unix systems, less so for Windows systems). Assuming the block is not found in the block level cache, it is mapped to the “real” device and offset (if LVM is used) and then it is sent to the appropriate disk/device driver.
  4. The disk driver is typically a para-virtualized SCSI/IDE driver. This driver uses hypervisor specific methods to efficiently “push” the request to the virtualized SCSI/IDE bus (not depicted in the illustration).
  5. The hypervisor passes the request to the virtual disk logic (VMDK in VMWare). Such logic may be a trivial 1:1 mapping if the virtual disk is a simple “raw” disk. On the other hand, if the disk is a snapshot disk or a delta disk, it needs to use pretty complex copy-on-write (or redirect-on-write), thin provisioning and remapping logic. In the latter case another layer of mapping has to be used and the original offset is translated to the “physical” offset. Then, with or without translation, the request is sent to the shared storage management layer – the VMFS in VMWare’s case. Note that data caching is problematic at this level. Read caching is OK as long as you know your source is read-only or managed by this host. Write caching (“write behind”) is problematic due to crash consistency and is avoided in most cases.
  6. VMFS is in fact a simple clustered LVM with a file system personality. VMWare is using this “file system” to implement a very simple file based image repository (very elegant, IMHO). Once the request is processed by VMFS, it is mapped to an extent within a LUN (by default extents are 1MB data blocks). The READ request is therefore translated to the target LUN and an offset (LBA) within it. The translated request is queued into the target LUN queue (here I simplify things a bit).
  7. The multipath driver virtualizes several “paths” to the destination LUN to implement “path” fault tolerance and “path” load balancing. In our case it examines the request and decides to which path this request is going to be queued, according to some predefined policies and polling mechanisms (this issue deserves a few books of its own, so let’s leave it at this level). Each path is a physical and/or logical route in the SAN from the virtualization host’s HBA to the target storage system (not depicted in the illustration).
  8. The HBA driver interfaces with the physical HBA device to queue the request.
  9. The request is sent via the SAN to the target disk array (port). Let’s ignore the entire queuing system and SAN-side virtualization systems – we have enough on our plate.
  10. The target disk array is by itself a very complex system with many virtualization mechanisms to hide its internal storage devices, share its resources and implement a wealth of advanced storage functionalities/copy services, depending on the vendor and the specific model. Anyhow, somehow the request is processed, the data is read from the cache or from a physical storage device (or devices), and a reply is sent back.
  11. The reply travels the same via dolorosa, but in the other direction.
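
To make the layering above concrete, here is the toy sketch mentioned earlier (my own simplification with made-up mapping tables, not VMWare code): each layer simply translates one (device, offset) pair into the next, until the request becomes a (LUN, LBA) pair that can be sent over the SAN.

```python
# Toy chain of address translations for a single guest read.
GUEST_FS_EXTENTS = {("Y.dat", 0): ("guest_disk", 8_192)}       # guest FS: file offset -> guest LBA (bytes)
VMDK_GRAINS      = {("guest_disk", 8_192): ("vmdk_1", 72_704)} # virtual disk logic: guest LBA -> VMDK offset
VMFS_EXTENTS     = {("vmdk_1", 0): ("LUN_5", 1_048_576)}       # VMFS: 1MB VMDK extent -> base offset in a LUN

def translate(file_name, file_offset):
    guest_dev, guest_lba = GUEST_FS_EXTENTS[(file_name, file_offset)]   # guest file system + block layer
    vmdk, vmdk_off = VMDK_GRAINS[(guest_dev, guest_lba)]                # hypervisor virtual disk logic
    extent_base = (vmdk_off // 1_048_576) * 1_048_576                   # which 1MB VMFS extent?
    lun, lun_extent_base = VMFS_EXTENTS[(vmdk, extent_base)]
    return lun, lun_extent_base + (vmdk_off - extent_base)              # final (LUN, LBA) queued to the SAN

print(translate("Y.dat", 0))   # ('LUN_5', 1121280)
```

The real stack adds caching, queuing, copy-on-write and multipathing at almost every one of these hops, which is exactly why the flow above takes eleven steps to describe.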

Notes:

  1. The above describes a Fibre-Channel deployment. There are many other deployments.
  2. All layers are greatly simplified. The reality is much more complex.

Here is a summary table:

| Entity | Layer | Mapping/Virtualization | Caching | Snapshots/Copy services | Allocation/Thin provisioning |
| --- | --- | --- | --- | --- | --- |
| Guest | File system | Yes | Yes (md + data) | Yes (limited) | Yes, fine granularity (~4k) |
| Guest | Block layer | No (LVM: extent mapping) | Yes | No (LVM: snapshot, mirroring, replication) | No (LVM: extent granularity ~1MB) |
| Guest | Disk driver | No | No | No | No |
| Hypervisor | Virtual disk logic | Yes (non-“raw” formats) | No | Yes (snapshots, deltas) | Yes (vmdk uses 64k) |
| Hypervisor | Shared storage logic (VMFS) | Yes (extent mapping) | No | No (but still possible) | Yes (extent based) |
| Hypervisor | Multipath | Yes (virtual device to several devices representing paths) | No | No | No |
| Hypervisor | HBA driver | No (short of NPIV, PCI-V) | No | No | No |
| SAN | ? | ? | ? | ? | ? |
| Disk array | Disk array logic | Yes | Yes | Yes | Yes |

I don’t know what you think, but I am always amazed that it works.

If you really want to understand the real complexity, you can read one of the reference storage architecture documents. For example:

“NetApp and VMware vSphere Storage Best Practices”

A very good blog that has many posts about this subject:

blog.scottlowe.org

For example this post: presentation-on-vmware-vsphere-4-1-storage-features

And yellow-bricks.com,

For example: mythbusters-esxesxi-caching-io.


Filed under Storage architectures, Virtualization