Storage dinosaurs – beware, here comes the SSD!

Magnetic spinning disks (hard drives – HDDs) have been the dominant primary storage device in the storage world for at least 50 years. This era is over! Within a few years (more or less) the SSD will replace the HDD as the primary storage device. Moreover, in the near future we will see real SSD-based storage systems (i.e. systems that use SSDs as the primary device and not as a cache/tier-zero device). When these systems become available, some very fundamental ground rules of storage and storage performance thinking will have to change. In this post I want to present several common conventions and areas that are heavily affected by “HDD thinking” and may change when SSD systems start to impact the storage market.

Storage devices are sloooooow!

Compared to any other major computer component, HDDs are slow. Their bandwidth is modest and their latency is disastrous (see also the sequential/random access section below). This makes the HDD, as the primary storage device, the major performance bottleneck of many systems, and the major performance bottleneck of most storage systems. Many storage-related and some non-storage-related sub-systems are under-optimized simply because the HDD bottlenecks conceal other sub-systems’ performance problems.

The SSD effect:

SSDs are at least an order of magnitude faster than HDDs. They are still not comparable to RAM, but connect several enterprise-level SSDs to your system, and you will have enough “juice” to hit other system component bottlenecks (memory bus, interface bus, interrupt handling, pure CPU issues, etc.). This changes the entire performance balance of many systems.

Random vs. Sequential

Due to its physical design, a typical HDD can efficiently access contiguous (sequential) areas on the media, while access to non-contiguous (random) areas is very inefficient. A typical HDD can stream about 100MB/sec (or more) of contiguous data, but can access no more than about 200 different areas on the media per second, so accessing random 4k blocks reduces the resulting bandwidth to about 800KB/sec!

This bi-modal behavior has shaped storage system “thinking” so deeply that almost every application that accesses storage resources is tuned to it. This means that:

  • Most applications attempt to minimize random storage accesses and prefer sequential accesses if they can, and/or attempt to access approximately “close” areas.
  • Applications that need good random bandwidth attempt to access the storage using big blocks. This helps because accessing a random 512 bytes “costs” about the same as accessing 32KB of data: the data transfer time from/to an HDD is relatively small compared to the seek time (movement of the disk’s head) and the rotational latency (the time the disk has to wait until the disk platter(s) rotate such that the data block is under the head). A small model of this trade-off appears right after this list.
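
To make the trade-off concrete, here is a minimal back-of-the-envelope model in Python. It assumes the ballpark figures used above (about 200 positionings per second and ~100MB/s streaming); real disks vary, so treat it as a sketch, not a spec.

```python
# Effective HDD bandwidth for random access as a function of block size,
# using the ballpark figures above (~200 positionings/sec, ~100 MB/s streaming).
POSITION_TIME = 1.0 / 200            # ~5 msec of seek + rotational latency per access
STREAM_BW = 100 * 1024**2            # ~100 MB/s sequential transfer rate (bytes/sec)

for kb in (0.5, 4, 32, 256, 1024):
    block = int(kb * 1024)
    io_time = POSITION_TIME + block / STREAM_BW     # position once, then stream
    bandwidth = block / io_time
    print(f"{kb:7.1f} KB blocks -> {bandwidth / 1024**2:6.2f} MB/s effective")
```

With these figures, 4k random accesses land at roughly 0.8MB/s, while 1MB accesses already recover most of the streaming bandwidth.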

The SSD effect:

The media used to build SSDs is mostly RAM or flash. Both provide random access latency and bandwidth that are comparable to (though still lower than) the SSD’s sequential access latency and bandwidth. Flash media has other limitations (no direct rewrite ability) that force SSD designers to implement complex redirect-on-write schemes. In most cases, sequential write access is much easier to handle than random write access, so the bi-modal behavior is somewhat retained. On the other hand, read operations are much less affected by this complexity, and enterprise-level SSDs are built in a way that minimizes the sequential/random access performance gap. The sequential write performance of a typical desktop SSD is about an order of magnitude better than its random write performance. For enterprise-level SSDs, the gap may narrow to a factor of about 2, and in many cases the random performance is good enough (i.e. it is not the performance bottleneck of the system). More specifically, random access patterns using small blocks are not performance horrors anymore.

Storage related data structures

When applications need to access data on HDDs, they tend to organize it in a way that is optimized for the random/sequential bi-modality. For example, data chunks that have a good chance of being accessed together, or at least during a short period of time, are stored close to each other (or in storage talk – identify “locality of reference” and transform temporal locality into spatial locality). Applications also tend to use data structures that are optimized for such locality of reference (such as B-Trees) and avoid data structures that are not (such as hash tables). Such data structures may themselves introduce additional overheads for random/sparse access patterns, and thereby create a self-reinforcing circle in which the motivation to use sequential accesses grows bigger and bigger.

The SSD effect:

Data structures that are meant to exploit the data’s locality of reference still work for SSDs. But as the sequential/random access gap is much smaller, such data structures may cease to be the best choice for many applications, as there are other (application-dependent) issues that should be the focus of the data structure optimizations instead of the usual storage locality of reference. For example, sparse data structures such as hash tables may be much more applicable for some use cases than the currently used data structures.
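
As a hedged illustration of the point (the layout, sizes and names below are invented for the example), a flat on-disk hash index that pays one random 4k read per lookup is exactly the kind of structure that is painful on an HDD but perfectly reasonable on an SSD:

```python
import os, hashlib

# Toy on-disk hash index: one fixed-size 4k bucket per slot, one random read
# per lookup, no locality of reference at all. On an HDD every lookup pays a
# full seek (~5-10 msec); on an SSD it is a single ~50 usec page read.
BUCKET_SIZE = 4096
NUM_SLOTS = 1 << 20                      # 1M buckets -> a 4GB sparse index file

def bucket_offset(key: bytes) -> int:
    h = int.from_bytes(hashlib.sha1(key).digest()[:8], "big")
    return (h % NUM_SLOTS) * BUCKET_SIZE

def read_bucket(fd: int, key: bytes) -> bytes:
    # A single random, 4k-aligned read - cheap on SSD, expensive on HDD.
    return os.pread(fd, BUCKET_SIZE, bucket_offset(key))
```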

Read and write caches

HDD random access latencies are huge compared to most other computer sub-systems. For example, a typical round-trip time of a packet in a modern 1Gb Ethernet LAN is about 100 usec. The typical latency of a random 4k HDD IO (read or write) operation is about 4-10 msec (i.e. 40-100 times slower). Read and write caches attempt to reduce, or at least hide, some of the HDD random IO latency:

  • Read caches keep a copy of the frequently accessed data blocks in some faster media, most commonly RAM. During the read IO flow, the requested read (offset) is looked up in the read cache, and if it is found the slow IO operation is completely avoided.
  • Write caches keep recently written data in much faster media (e.g. RAM/NVRAM) and write it back (“destage”) to the HDD layer in the background, after the user write is acknowledged. This “hides” most of the write latency and lets the system optimize the destage bandwidth by applying write-coalescing (also known as “write combining”) and write-reordering (e.g. “elevator algorithm”) techniques. A toy sketch of both mechanisms follows the next list.

Of course the effect of both caches is limited:

  • Read caches are effective only if the user data access pattern has enough locality of reference and the total dataset is small enough.
  • Write caches are effective only when the destage bandwidth is higher than the user data write bandwidth. For most HDD systems this is not the case, so write caches are good for handling short spikes of writes, but when the cache buffer fills up, the user write latency falls back to the HDD write latency (or to be more exact, to the destage latency).
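
To make the two mechanisms concrete, here is the toy sketch promised above: a minimal LRU read cache in front of a write-behind buffer. The backend object, block granularity and thresholds are assumptions for illustration only.

```python
from collections import OrderedDict

class CachedVolume:
    """Toy read cache + write-behind buffer in front of a slow block backend.
    The backend is assumed to expose read(lba) and write(lba, data)."""

    def __init__(self, backend, cache_blocks=1024, destage_batch=64):
        self.backend = backend
        self.read_cache = OrderedDict()          # LRU: lba -> data
        self.cache_blocks = cache_blocks
        self.write_buffer = {}                   # dirty blocks awaiting destage
        self.destage_batch = destage_batch

    def read(self, lba):
        if lba in self.write_buffer:             # freshest copy is the dirty one
            return self.write_buffer[lba]
        if lba in self.read_cache:               # cache hit: slow IO avoided
            self.read_cache.move_to_end(lba)
            return self.read_cache[lba]
        data = self.backend.read(lba)            # cache miss: pay the media latency
        self._cache(lba, data)
        return data

    def write(self, lba, data):
        self.write_buffer[lba] = data            # acknowledge immediately ("write behind")
        self._cache(lba, data)
        if len(self.write_buffer) >= self.destage_batch:
            self.destage()

    def destage(self):
        # Write dirty blocks back in ascending LBA order (elevator-style reordering);
        # adjacent blocks could also be coalesced into larger backend writes.
        for lba in sorted(self.write_buffer):
            self.backend.write(lba, self.write_buffer[lba])
        self.write_buffer.clear()

    def _cache(self, lba, data):
        self.read_cache[lba] = data
        self.read_cache.move_to_end(lba)
        if len(self.read_cache) > self.cache_blocks:
            self.read_cache.popitem(last=False)  # evict least recently used
```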

The SSD effect:

For some SSD-based systems the traditional read cache should be much less important compared with HDD systems. The read operation latency of many SSDs is so low (about 50 usec) that it ceases to be the dominant part of the system’s read operation latency, so the reasoning behind the read cache is much weaker.

Regarding write caches, most SSDs have internal (battery or otherwise backed-up) write caches, and in addition the base write latency is much lower than the HDD write latency, making the write cache much less important too. Still, as the SSD’s write latency is relatively high compared to its read latency, a write-behind buffer can be used to hide this write latency. Furthermore, unlike HDD systems, it should be relatively easy to build a system in which the destage bandwidth is high enough to hide the media write latency from the user even during very long full-bandwidth writes.

Logging

In the early 90’s, Mendel Rosenblum and J. K. Ousterhout published a very important paper named “The Design and Implementation of a Log-Structured File System”, where in short they claimed the following: read latency can be effectively handled by read caches. This leaves the write latency and throughput as the major problem for many storage systems. The paper suggested the following: instead of using a traditional mapping structure (such as a B-Tree) to map logical locations to physical locations, use a DB-oriented technique called logging. When logging is used, each new user write operation writes its data to a contiguous buffer on the disk, ignoring the existing location of the user data on the physical media. This replaces the common write-in-place technique with a relocate-on-write technique. The motivation is to exploit the relatively fast HDD sequential write (at the expense of read efficiency, complexity and defragmentation problems). This technique has many advantages over traditional write-in-place schemes, such as better consistency, fast recovery, RAID optimizations (i.e. the ability to write only full RAID stripes), snapshot implementations, etc. (It also has some major drawbacks.)
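
Here is a minimal sketch of the core idea: an append-only log plus an in-memory mapping table. The 4k block size and file format are assumptions, and checkpointing, cleaning and crash recovery are deliberately ignored.

```python
import os

class ToyLog:
    """Toy relocate-on-write store: every write is appended to the tail of a log
    and an in-memory table maps each logical block to its latest log offset."""

    BLOCK = 4096

    def __init__(self, path):
        self.fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
        self.tail = 0
        self.map = {}                         # logical block number -> offset in the log

    def write(self, lbn, data):
        assert len(data) == self.BLOCK
        os.pwrite(self.fd, data, self.tail)   # always a sequential append
        self.map[lbn] = self.tail             # the old copy becomes garbage for a cleaner
        self.tail += self.BLOCK

    def read(self, lbn):
        off = self.map.get(lbn)
        if off is None:
            return b"\0" * self.BLOCK         # block was never written
        return os.pread(self.fd, self.BLOCK, off)
```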

The SSD effect:

The original motivation for logging is not valid anymore: random write performance is not as big a problem for SSDs as it is for HDDs, so the additional complexity may not be worth the performance gain. Still, logging techniques may be relevant for SSD-based systems in the following cases:

  • Storage systems that directly manage flash components may use logging to solve (also) several flash problems, most importantly the lack of direct re-write support and the wear-leveling problem.
  • Storage systems may use logging to implement better RAID schemes, and as a base for advanced functionalities (snapshots, replication, etc.).

SSD Benchmarking and Performance Tuning: pitfalls and recommendations

SSDs are complex devices (see this post). Even though most SSDs have HDD-like interfaces and can be used as an HDD replacement, they are very different from HDDs, and there are some issues and pitfalls you need to be aware of when you benchmark an SSD. In this post I want to cover some of these issues. Some graphs of real SSD benchmarks are attached to depict the described issues.

I. SSD (write) performance is not constant over time

Most SSDs change their write performance over time. This happens mainly due to the internal data mapping and allocation schemes. When the disk is empty (i.e. it is trimmed or secure-erased), the placement logic has little difficulty finding a place for new data. After a while the disk fills up and the placement logic may need to do some cleaning work to make room for the data.

[Figure: GSkill FALCON write performance @ 16k, spans from 80G to 120G]

The actual sustained write performance is affected by the writing pattern (random or sequential), write block sizes, cleaning/allocation logic, and most importantly the write span (the relative area of the disk that the benchmark uses). In general, most (non enterprise-level) SSDs perform better as you write on a smaller span. The graph to the right demonstrates how the write performance degrades over time and stabilizes on a different baseline for each span. The disk is a ~120G disk, and the tested spans are 80G to 119G.
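
Below is a hedged sketch of such a sustained-write test on Linux. The device path is a placeholder, the run destroys all data on the device, and a dedicated tool such as fio is usually the better choice; the sketch only shows the shape of the measurement (random, aligned 16k writes confined to a configurable span, with periodic throughput reports).

```python
import os, mmap, time, random

DEV = "/dev/sdX"                 # placeholder test device - all data on it is destroyed!
BLOCK = 16 * 1024                # 16k writes, as in the graphs
SPAN = 80 * 1024**3              # write span to exercise (e.g. 80G of a ~120G disk)
DURATION = 2 * 60 * 60           # run long enough to reach the sustained baseline
REPORT_EVERY = 60                # print throughput once a minute

buf = mmap.mmap(-1, BLOCK)       # mmap gives a page-aligned buffer, as O_DIRECT requires
buf.write(os.urandom(BLOCK))

fd = os.open(DEV, os.O_WRONLY | os.O_DIRECT)
blocks_in_span = SPAN // BLOCK

start = last = time.time()
written = 0
while time.time() - start < DURATION:
    offset = random.randrange(blocks_in_span) * BLOCK   # random, 16k-aligned offset
    os.pwrite(fd, buf, offset)
    written += BLOCK
    now = time.time()
    if now - last >= REPORT_EVERY:
        print(f"{now - start:7.0f}s  {written / (now - last) / 1e6:7.1f} MB/s")
        written, last = 0, now
os.close(fd)
```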

II. SSD’s read and write performance is asymmetric

Basically, HDD’s read and write performance is symmetric. This is not the case for SSDs mainly due to the following reasons:

  1. The flash media read/write performance is asymmetric – reads are much faster than writes (~50 usec for a single page read operation vs. ~800 usec for a single page write operation, on average).
  2. The flash media does not support a re-write operation. You have to erase an entire block (an area of ~128 pages) before you can rewrite a single page. As the erase operation is relatively slow (~2 msec) and using it requires a costly read/modify/write (RMW) operation, most SSDs avoid such RMW operations by using large write-behind buffers, write combining, and complex mapping, allocation and cleaning logic (a rough cost estimate of a naive RMW appears after this list).
  3. The internal SSD write and read logic is asymmetric by design. For example, due to the internal mapping layer, the read logic only uses the mapping meta-data, while the write may need (and in most cases has) to change the mapping.
  4. Due to the placement/cleaning logic, a single user write may require several internal writes to complete (this is the famous “write amplification”).
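
To see why in-place rewrites are avoided, here is the rough cost estimate promised above, using the ballpark per-operation times quoted in the list (actual figures vary widely between flash generations and devices):

```python
# Rough cost of naively rewriting a single page in place, with no write-behind
# buffer, using the ballpark numbers above (hypothetical but typical figures).
PAGES_PER_BLOCK = 128
T_READ = 50e-6        # ~50 usec per page read
T_PROG = 800e-6       # ~800 usec per page program
T_ERASE = 2e-3        # ~2 msec per block erase

# Read the other valid pages out, erase the block, then re-program all pages.
rmw = (PAGES_PER_BLOCK - 1) * T_READ + T_ERASE + PAGES_PER_BLOCK * T_PROG
print(f"naive RMW of one page: ~{rmw * 1e3:.0f} msec "
      f"vs ~{T_PROG * 1e3:.1f} msec for a buffered page program")
# -> roughly 111 msec vs 0.8 msec, which is why SSDs use write-behind buffers,
#    write combining and relocate-on-write instead of in-place RMW.
```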
[Figures: GSkill read test and GSkill write test]

Even worse, the read/write mix is even more complex due to many internal read/write arbitration issues (that are beyond the scope of this post). Still, as many applications issue reads and writes at the same time, the read/write mix patterns may be very important, sometimes much more important than the pure read or pure write patterns.

III. SSD’s performance is stateful

In addition to the data placement problems mentioned above, the SSD’s complex internal data mapping can affect the performance results in many other ways. For example, the sequential write performance of a disk that is filled with random writes may be different compared to the exact same benchmark done on a disk that is filled with sequential writes. In fact, some SSDs’ sequential write performance may be reduced to their random write performance if the disk is filled with random writes! Another small issue is that reading an unmapped block (i.e. a block that was never written or that was trimmed) is different from reading a mapped one.

IV. SSDs are parallel devices

Unlike HDDs, which have a single moving arm and therefore can only serve one request at a time, most SSDs have multiple parallel data channels (4-10 channels in most desktop-oriented SSDs). This means that in order to reach the SSD’s maximal performance you may need to queue several requests on the disk queue. Note that the SSD’s firmware may utilize its internal channels in many ways, so it is hard to predict the results of queuing more requests. For many SSDs, it is enough to use a relatively small number of parallel/in-flight requests (4-8 requests).
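
Here is a hedged sketch of a queue-depth sweep that emulates queue depth with threads, each blocked in a direct random read. The device path is a placeholder; a benchmark tool with native asynchronous IO (e.g. fio with libaio) will give more accurate numbers.

```python
import os, mmap, time, random, threading

DEV = "/dev/sdX"                 # placeholder device to read from
BLOCK = 4096
SPAN = 32 * 1024**3              # area of the disk to hit with random reads
SECONDS = 30                     # per queue-depth measurement

def reader(fd, stop, counts, idx):
    rng = random.Random()
    buf = mmap.mmap(-1, BLOCK)                        # page-aligned buffer for O_DIRECT
    blocks = SPAN // BLOCK
    done = 0
    while not stop.is_set():
        os.preadv(fd, [buf], rng.randrange(blocks) * BLOCK)
        done += 1
    counts[idx] = done

for qd in (1, 2, 4, 8, 16, 32):                       # emulate queue depth with threads
    fd = os.open(DEV, os.O_RDONLY | os.O_DIRECT)
    stop, counts = threading.Event(), [0] * qd
    threads = [threading.Thread(target=reader, args=(fd, stop, counts, i))
               for i in range(qd)]
    for t in threads:
        t.start()
    time.sleep(SECONDS)
    stop.set()
    for t in threads:
        t.join()
    os.close(fd)
    print(f"QD={qd:2d}  ~{sum(counts) / SECONDS:8.0f} IOPS")
```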

V. SSDs may suffer from HDD oriented optimizations

The entire storage stack in your computer/server is HDD-oriented. This means that many related mechanisms, and even hardware devices, are tuned for HDDs. You have to ensure that these optimizations do not harm SSD performance. For example, most OSes do read-ahead/write coalescing, and/or reorder writes to minimize the HDD’s arm movements. Most of these optimizations are not relevant to SSDs, and in some cases can even reduce the SSD’s performance.

In addition, most HDDs have 512-byte sectors, while most SSDs have 4k sectors. You have to ensure that the benchmark tool and the OS know to send properly aligned (offset and size) requests.
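
A trivial sketch of the alignment rule (the 4k figure is the assumption discussed above; check your device's actual native block size):

```python
NATIVE_BLOCK = 4096              # assumed SSD block/page size

def is_aligned(offset: int, size: int, block: int = NATIVE_BLOCK) -> bool:
    """True if a request starts and ends on the device's native block boundary."""
    return offset % block == 0 and size % block == 0

print(is_aligned(512, 4096))     # False: 512-byte alignment forces internal RMW work
print(is_aligned(8192, 16384))   # True: 4k-aligned offset and size
```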

Another issue is RAID controllers. Most of them are tuned to HDD performance and behavior. In many cases they may be a performance bottleneck.

Benchmarking/tuning recommendations

  1. Ensure that your SSD’s firmware is up-to-date.
  2. Ensure that your benchmark tool, OS and hardware are suited for SSD benchmarking:
    1. Make sure AHCI is ON (probably requires BIOS configuration). Without it the OS would not be able to queue several requests at once (i.e. use NCQ).
    2. Make sure that the disk’s queue depth is at least 32.
    3. For Linux (and other OSes) make sure the IO scheduler is off (change it to “noop”).
    4. Use direct IO to avoid caching effects (both read and write caching).
    5. Most SSDs are 4k block devices. Keep your benchmark’s load aligned (for example, IOMeter’s load is 512-byte aligned by default).
    6. The SATA/SAS controller may affect the result. Make sure it is not a bottleneck (this is especially important for multiple disks benchmarks).
    7. Avoid RAID controllers (unless it is critical for your application).
  3. Always trim or secure erase the entire disk (!) before the test.
  4. After the trim, fill the disk with random 4k data, even if you want to benchmark reads.
  5. Ensure that the benchmark duration is long enough to identify write logic changes. As a rule, I would test a disk for at least a couple of hours (not including the initial trim/fill phases).
  6. Adjust the used space span to your needs. In most cases you have to balance capacity vs. performance.
  7. Adjust the benchmark parallelism to suit your load.
  8. Remember to test mixed read/write patterns.

General recommendations:

  1. If you benchmark the disk to evaluate it for a specific use case, try to understand which pattern or patterns are relevant for your use case and focus on them. Try to understand your relevant block sizes, alignments, randomness, read vs. write ratio, parallelism, etc.
  2. SSD performance may be very different from model to model, even for models from the same manufacturer and/or using the same internal controller.

An end note: the attached graphs are just for demonstration. This is not a GSkill-related post and it is not a disk review – the demonstrated behavior is common to many SSDs.


SSD’s internal complexity – a challenge or an opportunity?

SSDs are much more complex than the magnetic disks they are meant to replace. By complex I mean that they must implement much more complex internal logic. This is required to overcome the fundamental problems of the flash media:

  • Flash media does not support the simple read block/rewrite block semantics that magnetic disks provide. Instead it uses read page, program (write) page – with no rewrite – and erase block (block >> page) semantics. Pretty complex mapping and buffering techniques must be used to simulate read/rewrite semantics.
  • A flash page can be modified (i.e. erased and programmed) only a limited number of times. This requires “wear leveling” logic that attempts to spread the modifications over the entire block/page set.
  • Flash media suffers from various sources of data-related errors (including errors due to read operations, writes to physically close locations, etc.). Advanced recovery mechanisms must be implemented to overcome these errors.

As always, nothing comes without a price. The cost of this complexity is:

  • It consumes more resources.
  • It requires much more effort to develop and test SSDs, and accordingly more time to stabilize/productize them.
  • It makes SSD performance much more complex to analyze and predict.

On the other hand, the fact that SSDs have much more resources than a common magnetic disk, and that SSDs are geared with advanced data structures and mechanisms, also makes them an enabler for new advanced functions and features. Indeed, you can already see features such as encryption and compression implemented within desktop-class SSDs (e.g. Intel’s 320 SSD).

These features are only the start. I believe that SSDs can implement “hardware” assists to offload some of the storage feature implementation load, and that such assists may be required to build the next-gen SSD-based storage systems.

For example, because a single magnetic disk (HDD) provides only (up to) about 250 IOPS, HDD-based RAID systems implement the RAID logic above the disks, namely in a “RAID controller”. But a single (enterprise-level) SSD provides tens of thousands of IOPS, making such an architecture much less reasonable. Sixteen 50K IOPS (4k block) SSDs in a RAID-6 system require the RAID controller to process 50K * 16 * 4KB = 3.2 GB/s of data (a mix of reads, writes and XORs). These numbers are far above the performance envelope of current RAID controllers (see for example this post: Accelerating System Performance with SSD RAID Arrays).
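
The same arithmetic as a tiny sketch; the per-disk IOPS figure, disk count and block size are the assumptions from the example above:

```python
# Back-of-the-envelope load on the RAID controller for an all-SSD RAID group
# (hypothetical but plausible figures, matching the example in the text).
iops_per_ssd = 50_000            # 4k random IOPS per enterprise-level SSD
num_ssds = 16
block_bytes = 4 * 1024

controller_bytes_per_sec = iops_per_ssd * num_ssds * block_bytes
print(f"{controller_bytes_per_sec / 1e9:.1f} GB/s through the controller")
# -> ~3.3 GB/s (the post rounds to 3.2 GB/s using decimal kilobytes)
```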

If the RAID controller could leverage the SSDs to do most of the data crunching, it would need to be “only” a flow manager and coordinator. The same logic can be relevant for other storage functions too: snapshots, data transfers, copy-on-write, compare-and-write, etc. There are also some downsides: such an architecture is much more complex than the current layering architecture and may need new disk/bus-side abilities, such as disk-to-disk transfers and others. Furthermore, such SSD-side assists must be standardized so that various RAID controller vendors can use them. Still, I believe that this architecture is much more cost-effective and scalable.

To sum this up, I don’t know if SSDs with storage function assists will ever exist. I hope they will, because I think it is the correct and economical way to go.


SSD in Enterprise Storage

SSD-only enterprise storage systems are still uncommon. Here is one of the first materials I found that discusses the different models of using SSDs in enterprise storage. I liked this presentation as it covers the most important issues on this subject. All issues are handled at a highlights level, but that’s a nice start.

Here are the four categories mentioned:

  • Server attached PCI flash
  • Array with flash cache
  • Array with flash tier
  • All flash LUN or array

Here is the link to the full presentation (SNIA education, by Matt Kixmoeller):

Leveraging flash memory in enterprise storage


A day in the life of virtualized storage request

[Figure: Storage flow – VM to disk array]

Server/Desktop virtualization systems have so many advantages that in a few years it may be almost impossible to imagine how one built data centers without them. However, nothing comes without a price. In this post I want to describe the complexity of virtualized storage flows. Following is a (simplified) virtualized storage request flow, starting from the application and ending at the physical storage device (disk or whatever). This flow matches a real flow within a VMware ESX host connected to a shared disk array via FC-SAN (depicted in the attached illustration). The equivalent flow for other storage architectures (iSCSI, NAS) and for other virtualization systems (Xen, Hyper-V) is similar (but not identical).

  1. An application within the VM issues a storage-related request, for example: READ a chunk of data located at offset X in file Y.
  2. The VM’s OS processes this request using its internal (kernel side) file system logic. In this layer the request’s parameters are used to find the actual allocation unit, check permissions, manage the data cache, etc. Assuming the data is not in the cache, a new READ request is formed to read (part of) the allocation unit from the underlying block layer.
  3. The block layer typically includes a caching layer, a queuing system and even some storage virtualization features if logical volume managers (LVMs) are used (very common for Unix systems, less so for Windows systems). Assuming the block is not found in the block-level cache, it is mapped to the “real” device and offset (if LVM is used) and then sent to the appropriate disk/device driver.
  4. The disk driver is typically a para-virtualized SCSI/IDE driver. This driver uses hypervisor specific methods to efficiently “push” the request to the virtualized SCSI/IDE bus (not depicted in the illustration).
  5. The hypervisor passes the request to the virtual disk logic (VMDK in VMware). Such logic may be a trivial 1:1 mapping if the virtual disk is a simple “raw” disk. On the other hand, if the disk is a snapshot disk or a delta disk, it needs to use pretty complex copy-on-write (or redirect-on-write), thin provisioning and remapping logic. In the latter case another layer of mapping has to be used, and the original offset is translated to the “physical” offset. Then, with or without translation, the request is sent to the shared storage management layer – VMFS in VMware’s case. Note that data caching is problematic at this level. Read caching is OK as long as you know your source is read-only or managed by this host. Write caching (“write behind”) is problematic due to crash consistency and is avoided in most cases.
  6. VMFS is in fact a simple clustered LVM with a file system personality. VMware uses this “file system” to implement a very simple file-based image repository (very elegant, IMHO). Once the request is processed by VMFS, it is mapped to an extent within a LUN (by default extents are 1MB data blocks). The READ request is therefore translated to the target LUN and an offset (LBA) within it. The translated request is queued into the target LUN queue (here I simplify things a bit).
  7. The multipath driver virtualizes several “paths” to the destination LUN to implement “path” fault tolerance and “path” load balancing. In our case it examines the request and decides to which path the request is going to be queued, according to some predefined policies and polling mechanisms (this issue deserves a few books of its own, so let’s leave it at this level). Each path is a physical and/or logical route in the SAN from the virtualization host’s HBA to the target storage system (not depicted in the illustration).
  8. The HBA driver interfaces with the physical HBA device to queue the request.
  9. The request is sent via the SAN to the target disk array (port). Let’s ignore the entire queuing systems and the SAN-side virtualization systems – we have enough on our plate.
  10. The target disk array is by itself a very complex system with many virtualization mechanisms that hide its internal storage devices, share its resources and implement a wealth of advanced storage functionalities/copy services, depending on the vendor and the specific model. Anyhow, somehow the request is processed, the data is read from the cache or from one or more physical storage devices, and then a reply is sent.
  11. The reply travels the same via dolorosa, but in the other direction.

Notes:

  1. The above describes a Fibre-Channel deployment. There are many other deployments.
  2. All layers are greatly simplified. The reality is much more complex.

Here is a summary table:

| Entity | Layer | Mapping/Virtualization | Caching | Snapshots/Copy services | Allocation/Thin provisioning |
|---|---|---|---|---|---|
| Guest | File system | Yes | Yes (md + data) | Yes (limited) | Yes, fine granularity (~4k) |
| | Block layer | No (LVM: extent mapping) | Yes | No (LVM: snapshot, mirroring, replication) | No (LVM: extent granularity ~1MB) |
| | Disk driver | No | No | No | No |
| Hypervisor | Virtual disk logic | Yes (non-“raw” formats) | No | Yes (snapshots, deltas) | Yes (vmdk uses 64k) |
| | Shared storage logic (VMFS) | Yes (extent mapping) | No | No (but still possible) | Yes (extent based) |
| | Multipath | Yes (virtual device to several devices representing paths) | No | No | No |
| | HBA driver | No (short of NPIV, PCI-V) | No | No | No |
| SAN | ? | ? | ? | ? | ? |
| Disk array | Disk array logic | Yes | Yes | Yes | Yes |

I don’t know what you think, but I am always amazed that it works.

If you really want to understand the real complexity, you can read one of the reference storage architecture documents. For example:

“NetApp and VMware vSphere Storage Best Practices”

A very good blog that has many posts about this subject:

blog.scottlowe.org

For example this post: presentation-on-vmware-vsphere-4-1-storage-features

And yellow-bricks.com,

For example: mythbusters-esxesxi-caching-io.


Different types of SSDs

I have seen the word SSD (Solid State Disk/Device) used as a generic term, but also as a product type. This may be confusing. Here is my attempt to sort things out:

SSD by technology:

RAM based: mostly battery-backed RAM. Excellent performance (100k IOPS and above), very expensive, relatively small capacity. No endurance issues (unless backed up by flash). Used mainly as acceleration devices. Examples: http://www.ramsan.com/products/4, http://www.ddrdrive.com/

MLC based: NAND flash multi-level cell (each cell holds at least 2 bits). The least expensive of all SSD technologies (~$2.5 per GB); endurance is low (~3 thousand write cycles per block, or less); random performance is fair (typically a few thousand sustained IOPS); good streaming (100MB/s and above). Capacity is typically 100-400 GB. Used mainly as a mobile/desktop/low-end server disk replacement. Examples: Intel X25-M, http://ssd-reviews.com/ (MLC entries)

SLC based: NAND flash single-level cell. Relatively expensive (~$10 per GB), good performance (tens of thousands of IOPS), small capacities (32-200GB), good endurance (~70-100K writes per block). Used mainly for servers and tier-zero disks in disk arrays. Examples: Intel X25-E.

Enterprise-level MLC: similar to regular MLC but with better data protection/error correction mechanisms. In most aspects (price, performance, endurance) they are almost as good as SLC (but typically not quite). Such devices are relatively new on the market, but I believe they will become the new enterprise-level SSD standard. Examples: STEC MACH 16 MLC SSD, Fusion-IO DUO MLC.

Others: less relevant – NOR based (very expensive), hybrid (HDD + SSD, NOR + NAND, MLC + SLC).

SSD by form factor:

PCI-e: typically performance-oriented SSDs. Very expensive (~$20/GB), excellent performance. Main uses: local server cache/accelerator. Examples: Fusion-io.

SATA: typically desktop/mobile oriented SSDs, MLC based (see above).

SAS: typically server and disk array oriented SSDs, SLC and/or enterprise-level MLC based (see above).

External (SAN): typically SAN acceleration devices, RAM and/or SLC based. Examples: Violin Memory, Texas Memory

SSDs by use case:

Mobile/desktop disk replacement: typically low-power, small form factor (1.8″ or 2.5″) SATA MLC devices.

Server acceleration: typically PCI-e devices (see above).

Server disk replacement: typically enterprise level MLC or SLC SAS devices.

Cache/tier-zero disks: typically SLC SAS devices; enterprise-level MLC is expected to become relevant too.

SAN accelerations: typically proprietary MLC/SLC/RAM combinations.

A very good site with a lot of SSD related information:

http://www.storagesearch.com/ssd-analysts.html
