Category Archives: ssd

Are SSD based arrays a bad idea?

In the not-so-new post Why SSD-based arrays are a bad idea, Robin Harris wonders what the right form factor for flash-based disk arrays is. My take on this subject is that form factor matters, but the right form factor is derived from your main design target. If, for example, your main target is latency, then you are better off using DRAM/PCIe interfaces, as the SAS/SATA interface introduces some latency and limits your control over the IO path. This applies to tier 0 systems. If you are more in the tier 1 area, where some latency penalty can be traded for lower cost and enterprise-level functionalities such as dedup, snapshots and replication, then SSD is probably your best choice. Why is that? Let’s go over Robin’s arguments one by one:

  • Latency – not that important; the extra ~100 usec is not significant for most use cases.
  • Bandwidth – even though the SSD is not the best form factor for draining all the juice out of the flash chips, this is meaningless for most tier 1 systems because the bottleneck is not at the SSD level! As the array becomes smarter, most bottlenecks shift to the compute elements, memory buses, and IO buses in the system.
  • Reliability – here I have to disagree with Robin. It is true that DIMMs are more reliable than disks, and probably more reliable than SSDs too, but Robin assumes that a system that uses flash chips instead of SSDs is more reliable. This is not necessarily true! Flash chips do not handle many (if not most) flash-related issues at the chip level. They rely on an external controller and other components to perform critical tasks such as wear leveling, bad block handling and media scrubbing. Moreover, implementing such a controller by yourself assumes that you are smarter than the SSD manufacturers and/or can produce some gains out of your low-level control. Personally, I doubt both assumptions. Anyhow, once you start implementing a flash controller you run into the same problems as the SSD designers do, and you end up at the same level of reliability. There is one small point where Robin may be right – if you don’t work with SSDs you can bypass the local SCSI stack. But even that is questionable, because not everything in the SCSI stack is wrong…
  • Flexibility – as an SSD-based disk array designer I can tell you that the SSD nature of the devices didn’t cause us as many problems as you might assume, even though we designed everything from scratch. This is because flash is still a block-access medium, and that’s what really counts.
  • Cost – as I already wrote, flash chips require flash controllers and other resources (RAM, compute), so the comparison Robin made to DRAM is not really an apples-to-apples comparison. That said, it is possible to reduce the cost of the flash control sub-system, but as enterprise-level SSDs become a commodity, economies of scale work against such an approach.

In fact, I am willing to claim that even for tier 0 systems it is not obvious that a flash-chip/PCIe-based design is better than an SSD-based design, because once you start making your system smarter and implementing advanced functions, the device latency starts to be insignificant. If you need a very “raw” performance box, a flash-chip design/PCIe card may be the better choice, but then a server-local PCIe card will be even better…
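
To put the ~100 usec interface overhead in perspective, here is a rough latency budget in Python. Every figure in it is an illustrative assumption, not a measurement:

```python
# Back-of-the-envelope read latency budget for a "smart" tier 1 array.
# All figures are assumed values in microseconds, chosen only for illustration.
network_rtt = 100       # host <-> array round trip
array_software = 250    # metadata lookup, dedup/snapshot logic, locking, queuing
flash_media = 50        # raw flash page read inside the SSD
ssd_interface = 100     # the extra latency attributed to the SAS/SATA path

with_ssd = network_rtt + array_software + flash_media + ssd_interface
without_ssd = network_rtt + array_software + flash_media
print(f"SSD-based array: ~{with_ssd} usec, chip/PCIe design: ~{without_ssd} usec")
# -> ~500 usec vs. ~400 usec: once the array does real work per IO, the interface
#    overhead is one contributor among several rather than the dominant term.
```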



Filed under all-flash disk array, Enterprise Storage, ssd

SSD Dedup and VDI

I found this nice Symantec blog about the SSD+Dedup+VDI issues on the DCIG site. Basically I agree with its main claim that SSD+Dedup is a good match for VDI. On the other hand, I think that the three potential “pitfalls” mentioned in the post are probably relevant for a naive storage system, and much less so for an enterprise-level disk array. Here is why (the quoted bullets are citations from the original post):

  • Write I/O performance to SSDs is not nearly as good as read I/Os. SSD read I/O performance is measured in microseconds. So while SSD write I/O performance is still faster than writes to hard disk drives (HDDs), writes to SSDs will not deliver nearly the same performance boost as read I/Os plus write I/O performance on SSDs is known to degrade over time.
This claim is true only for non-enterprise-level SSDs. Enterprise-level SSDs suffer much less from write performance degradation and, thanks to their internal NVRAM, their write latency is as good as their read latency, if not better. Furthermore, most disk arrays have non-trivial logic and enough resources to handle these issues even if the SSDs cannot.
  • SSDs are still 10x the cost of HDDs. Even with the benefits provided by deduplication an organization may still not be able to justify completely replacing HDDs with SSDs which leads to a third problem.
There is no doubt that SSDs are at least 10x more expensive than HDDs in terms of $/GB. But when comparing the cost of the complete solution, the outcome is different. In many VDI systems the real storage constraint is IOPS, not capacity. This means that an HDD-based solution may need to over-provision the system capacity and/or use small disks, just so there are enough (HDD) spindles to satisfy the IOPS requirements. In this case the real game is IOPS/$, where SSDs win big time. Together with the dedup-driven space reduction, the total solution’s cost may be very attractive (see the sizing sketch after this list).
  • Using deduplication can result in fragmentation. As new data is ingested and deduplicated, data is placed further and further apart. While fragmentation may not matter when all data is stored on SSDs, if HDDs are still used as part of the solution, this can result in reads taking longer to complete.

Basically I agree, but again, the disk array logic may mitigate at least some of the problem. Of course a 100% SSD solution is better (much better in some cases), but the problem is that such solutions are still very rare, if available at all.
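
To make the IOPS/$ versus $/GB point concrete, here is a minimal sizing sketch in Python. Every price and performance figure below is an assumption chosen for illustration, not a quote or a measurement, and the conclusion depends entirely on them:

```python
import math

# Naive VDI storage sizing: how many devices are needed to satisfy the IOPS
# demand vs. the capacity demand? All numbers are illustrative assumptions.
desktops = 1000
peak_iops_per_desktop = 30      # e.g. boot/login storms
gb_per_desktop = 20             # raw image + user data
dedup_ratio = 5                 # assumed dedup/clone space reduction on the SSD tier

required_iops = desktops * peak_iops_per_desktop      # 30,000 IOPS
hdd_required_gb = desktops * gb_per_desktop           # 20,000 GB (no dedup assumed)
ssd_required_gb = hdd_required_gb / dedup_ratio       # 4,000 GB

def devices_needed(req_iops, req_gb, dev_iops, dev_gb):
    # The larger of the two constraints (IOPS or capacity) decides the count.
    return max(math.ceil(req_iops / dev_iops), math.ceil(req_gb / dev_gb))

hdds = devices_needed(required_iops, hdd_required_gb, dev_iops=200, dev_gb=600)
ssds = devices_needed(required_iops, ssd_required_gb, dev_iops=20000, dev_gb=400)

print(f"HDD solution: {hdds} spindles (~${hdds * 300:,})")   # IOPS-bound: 150 spindles
print(f"SSD solution: {ssds} devices  (~${ssds * 2000:,})")  # capacity-bound: 10 devices
```

With these (made-up) numbers the HDD count is dictated by IOPS while the SSD count is dictated by the deduplicated capacity, and the SSD solution comes out cheaper despite the much higher $/GB.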


Filed under Enterprise Storage, ssd, Storage architectures, VDI, Virtualization

Storage dinosaurs – beware, here comes the SSD!

Magnetic spinning disks (hard drives – HDDs) have been the dominant primary storage device in the storage world for at least 50 years. This era is over! Within a few years (more or less), the SSD will replace the HDD as the primary storage device. Moreover, already in the near future we will see real SSD-based storage systems (i.e. systems that use SSDs as a primary device and not as a cache/tier-zero device). When these systems become available, some very fundamental ground rules of storage and storage performance thinking will have to change. In this post I want to present several common conventions and areas that are heavily affected by “HDD thinking” and may change when SSD systems start to impact the storage market.

Storage devices are sloooooow!

Compared to any other major computer component, HDDs are slow. Their bandwidth is modest and their latency is disastrous (see also the sequential/random access section below). This makes the HDD, as the primary storage device, the major performance bottleneck of many systems, and the major performance bottleneck of most storage systems. Many storage-related and some non-storage-related sub-systems are under-optimized simply because the HDD bottlenecks conceal the other sub-systems’ performance problems.

The SSD effect:

SSDs are at least an order of magnitude faster than HDDs. They are still not comparable to RAM, but connect several enterprise-level SSDs to your system and you will have enough “juice” to hit other system component bottlenecks (memory bus, interface bus, interrupt handling, pure CPU issues, etc.). This changes the entire performance balance of many systems.

Random vs. Sequential

Due to its physical design, a typical HDD can efficiently access contiguous (sequential) areas on the media, while access to non-contiguous (random) areas is very inefficient. A typical HDD can stream about 100MB/sec (or more) of sequential data, but can access no more than about 200 different areas on the media per second, so accessing random 4k blocks reduces the achieved bandwidth to about 800KB/sec!

This bi-modal behavior has affected storage system “thinking” very deeply, to the point that almost every application that accesses storage resources is tuned to it. This means that:

  • Most applications attempt to minimize random storage accesses and prefer sequential accesses if they can, and/or attempt to access approximately “close” areas.
  • Applications that need good random bandwidth attempt to access the storage using big blocks. This helps because accessing a random 512 bytes “costs” about the same as accessing 32KB of data: the data transfer time from/to an HDD is small compared to the seek time (the movement of the disk’s head) and the rotational latency (the time the disk has to wait until the platter rotates so that the data block is under the head). The worked numbers after this list illustrate the point.
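
Here is a small back-of-the-envelope calculation of that bi-modal behavior. The drive parameters are assumed, typical values rather than any specific product’s spec:

```python
# Effective HDD throughput for random IO at different block sizes.
seek_ms = 3.0          # assumed average seek time
rotational_ms = 2.0    # assumed average rotational latency (half a revolution)
stream_mb_s = 100.0    # assumed sequential (streaming) bandwidth

for block_kb in (0.5, 4, 32, 256, 1024):
    transfer_ms = (block_kb / 1024) / stream_mb_s * 1000
    io_ms = seek_ms + rotational_ms + transfer_ms
    iops = 1000 / io_ms
    mb_s = iops * block_kb / 1024
    print(f"{block_kb:>6} KB random: ~{iops:5.0f} IOPS, ~{mb_s:6.2f} MB/s")
# With these assumptions, 4 KB random IO lands at ~200 IOPS and ~0.8 MB/s (the
# figures quoted above), while a 512-byte and a 32 KB access cost almost the
# same time per IO - hence the "use bigger blocks" rule of thumb.
```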

The SSD effect:

The media used to build SSDs is mostly RAM or flash. Both provide very good random access latency and bandwidth, comparable to (though still lower than) the SSD’s sequential access latency and bandwidth. Flash media has other limitations (no direct rewrite ability) that force SSD designers to implement complex redirect-on-write schemes. In most cases a sequential write access is much easier to handle than a random write access, so the bi-modal behavior is somewhat retained. On the other hand, read operations are much less affected by this complexity, and enterprise-level SSDs are built in a way that minimizes the sequential/random access performance gap. The sequential write performance of a typical desktop SSD is about an order of magnitude better than its random write performance. For enterprise-level SSDs the gap may narrow to a factor of about 2, and in many cases the random performance is good enough (i.e. it is not the performance bottleneck of the system). More specifically, random access patterns using small blocks are not performance horrors anymore.

Storage related data structures

When applications need to access data on HDDs, they tend to organize it in a way that is optimized for the random/sequential bi-modality. For example, data chunks that have a good chance of being accessed together, or at least within a short period of time, are stored close to each other (or in storage talk – identify the “locality of reference” and transform temporal locality into spatial locality). Applications also tend to use data structures that are optimized for such locality of reference (such as B-Trees) and avoid data structures that are not (such as hash tables). Such data structures may themselves introduce additional overheads for random/sparse access patterns, creating a self-reinforcing loop in which the motivation to use sequential accesses keeps growing.

The SSD effect:

Data structures that are meant to exploit the data’s locality of reference still work on SSDs. But as the sequential/random access gap is much smaller, such data structures may cease to be the best choice for many applications, since other (application-dependent) concerns should become the focus of the data structure optimization instead of the usual storage locality of reference. For example, sparse data structures such as hash tables may be much more applicable for some use cases than the currently used data structures.
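
As a toy illustration of the kind of structure that becomes reasonable once random reads are cheap, here is a minimal hash-indexed key-value store: values are appended to a data file in arrival order and an in-memory hash table maps each key straight to its offset, so every lookup is a single random read with no attempt at spatial locality. This is my own illustrative sketch, not something taken from the SSD literature:

```python
import os

class HashIndexedStore:
    """Append-only value log plus an in-memory hash index.

    Every get() is one random read at an arbitrary offset - painful on an HDD,
    perfectly fine on an SSD. (The index lives only in memory; a real store
    would persist it as well.)
    """

    def __init__(self, path):
        self.index = {}                 # key -> (offset, length)
        self.f = open(path, "a+b")

    def put(self, key, value: bytes):
        self.f.seek(0, os.SEEK_END)
        offset = self.f.tell()
        self.f.write(value)             # writes always append
        self.index[key] = (offset, len(value))

    def get(self, key) -> bytes:
        offset, length = self.index[key]
        self.f.seek(offset)             # random seek: cheap on flash
        return self.f.read(length)

store = HashIndexedStore("/tmp/kv.log")
store.put("a", b"hello")
store.put("b", b"world")
print(store.get("a"))                   # b'hello'
```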

Read and write caches

HDD random access latencies are huge compared to most other computer sub-systems. For example, a typical round-trip time of a packet in a modern 1Gb Ethernet LAN is about 100 usec. The typical latency of a random 4k HDD IO (read or write) operation is about 4-10 msec (i.e. 40-100 times slower). Read and write caches attempt to reduce, or at least hide, some of the HDD random IO latency:

  • Read caches keep a copy of the frequently accessed data blocks in some faster media, most commonly RAM. During the read IO flow, the requested read (offset) is looked up in the read cache, and if it is found the slow IO operation is avoided entirely.
  • Write caches keep the last written data in a much faster media (e.g. RAM/NVRAM) and write it back (“destage”) to the HDD layer in the background, after the user write is acknowledged. This “hides” most of the write latency and lets the system optimize the destage bandwidth by applying write-coalescing (also known as “write combining”) and write-reordering (e.g. the “elevator algorithm”) techniques.

Of course the effect of both caches is limited:

  • Read caches are effective only if the user’s access pattern has enough locality of reference and the total dataset is small enough.
  • Write caches are effective only when the destage bandwidth is higher than the user’s write bandwidth. For most HDD systems this is not the case, so write caches are good for absorbing short bursts of writes, but when the cache buffer fills up, the user write latency drops back to the HDD write latency (or, to be more exact, to the destage latency). A minimal sketch of both mechanisms follows.
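
Here is a minimal sketch of the two mechanisms, assuming a hypothetical `backend` object with slow read()/write() calls standing in for the HDD layer; it is meant only to make the read-hit and destage flows concrete, not to model a real array:

```python
from collections import OrderedDict, deque

class CachedVolume:
    """Toy LRU read cache plus write-behind buffer in front of a slow backend.

    `backend` is any object with read(block) -> bytes and write(block, data);
    it is a hypothetical stand-in for the HDD layer.
    """

    def __init__(self, backend, cache_blocks=1024, dirty_limit=256):
        self.backend = backend
        self.read_cache = OrderedDict()    # block -> data, kept in LRU order
        self.cache_blocks = cache_blocks
        self.dirty = deque()               # pending (block, data) destage work
        self.dirty_limit = dirty_limit

    def read(self, block):
        if block in self.read_cache:       # cache hit: the slow IO is avoided entirely
            self.read_cache.move_to_end(block)
            return self.read_cache[block]
        data = self.backend.read(block)    # cache miss: pay the HDD latency
        self._insert(block, data)
        return data

    def write(self, block, data):
        self._insert(block, data)
        self.dirty.append((block, data))   # acknowledge now, destage later
        if len(self.dirty) >= self.dirty_limit:
            self.destage()                 # buffer full: writes feel slow again

    def destage(self):
        while self.dirty:                  # background write-back to the backend
            block, data = self.dirty.popleft()
            self.backend.write(block, data)

    def _insert(self, block, data):
        self.read_cache[block] = data
        self.read_cache.move_to_end(block)
        if len(self.read_cache) > self.cache_blocks:
            self.read_cache.popitem(last=False)   # evict the least recently used block
```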

The SSD effect:

For some SSD-based systems the traditional read cache should be much less important than it is in HDD systems. The read operation latency of many SSDs is so low (about 50 usec) that it ceases to be the dominant part of the system’s read operation latency, so the reasoning behind the read cache is much weaker.

Regarding write caches, most SSDs have internal (battery or otherwise backed-up) write caches, and in addition the base write latency is much lower than the HDD write latency, making the write cache much less important too. Still, as the SSD’s write latency is relatively high compared to its read latency, a write-behind buffer can be used to hide it. Furthermore, unlike in HDD systems, it should be relatively easy to build a system whose destage bandwidth is high enough to hide the media write latency from the user even during very long full-bandwidth writes.

Logging

In the early 90’s, Mendel Rosenblum and J. K. Ousterhout published a very important article named “The Design and Implementation of a Log-Structured File System”, in which, in short, they claimed the following: read latency can be handled effectively by read caches, which leaves write latency and throughput as the major problem for many storage systems. The article suggested that instead of using a traditional mapping structure, such as a B-Tree, to map logical locations to physical locations and writing in place, a DB-oriented technique called logging should be used. With logging, each new user write appends its data to a contiguous buffer on the disk, ignoring the existing location of the user data on the physical media. This replaces the common write-in-place technique with a relocate-on-write technique. The motivation is to exploit the relatively fast HDD sequential write (at the expense of read efficiency, complexity and defragmentation problems). The technique has many advantages over traditional write-in-place schemes, such as better consistency, fast recovery, RAID optimizations (i.e. the ability to write only full RAID stripes), easier snapshot implementations, etc. (It also has some major drawbacks.)
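
A minimal sketch of the relocate-on-write idea, using a plain file as the “disk” and an in-memory dict as the mapping table, purely for illustration:

```python
class LogStructuredStore:
    """Toy relocate-on-write store: every write is appended at the log tail and a
    mapping table records where the latest copy of each logical block lives."""

    BLOCK = 4096

    def __init__(self, path):
        self.log = open(path, "w+b")
        self.tail = 0             # next free physical offset in the log
        self.mapping = {}         # logical block number -> physical offset

    def write_block(self, lbn, data: bytes):
        assert len(data) == self.BLOCK
        self.log.seek(self.tail)
        self.log.write(data)      # always a sequential write, regardless of lbn
        self.mapping[lbn] = self.tail   # the old copy (if any) becomes garbage
        self.tail += self.BLOCK

    def read_block(self, lbn) -> bytes:
        self.log.seek(self.mapping[lbn])
        return self.log.read(self.BLOCK)

store = LogStructuredStore("/tmp/log.dat")
store.write_block(7, b"a" * 4096)
store.write_block(7, b"b" * 4096)    # a rewrite relocates, it does not overwrite
print(store.read_block(7)[:1])       # b'b'
```

A real implementation also needs a cleaner to reclaim the stale copies left behind – which is exactly the kind of drawback hinted at above.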

The SSD effect:

The original motivation for logging is not valid anymore: random write performance is not as big a problem for SSDs as it is for HDDs, so the additional complexity may not be worth the performance gain. Still, logging techniques may be relevant for SSD-based systems in the following cases:

  • Storage systems that manage flash components directly may use logging to solve (also) several flash problems, most importantly the lack of a direct re-write operation and the wear-leveling problem.
  • Storage systems may use logging to implement better RAID schemes, and as a base for advanced functionalities (snapshots, replication, etc.).


Filed under Enterprise Storage, ssd, Storage architectures

SSD Benchmarking and Performance Tuning: pitfalls and recommendations

SSDs are complex devices (see this post). Even though most SSDs have HDD-like interfaces and can be used as HDD replacements, they are very different from HDDs, and there are some issues and pitfalls you need to be aware of when you benchmark an SSD. In this post I want to cover some of these issues. Some graphs of real SSD benchmarks are attached to illustrate them.

I. SSD (write) performance is not constant over time

Most SSDs change their write performance over time. This happens mainly due to the internal data mapping and allocation schemes. When the disk is empty (i.e. it is trimmed or secure-erased), the placement logic has little difficulty finding a place for new data. After a while the disk fills up, and the placement logic may need to do some cleaning work to make room for the data.

[Figure: GSkill FALCON write performance @ 16k, span from 80G to 120G]

The actual sustained write performance is affected by the write pattern (random or sequential), the write block sizes, the cleaning/allocation logic, and most importantly the write span (the relative area of the disk that the benchmark uses). In general, most (non-enterprise-level) SSDs perform better when you write over a smaller span. The graph above demonstrates how the write performance degrades over time and stabilizes at a different baseline for each span. The disk is a ~120G disk, and the tested spans are 80G to 119G.
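
A simple way to catch this transition is to bucket the measured write throughput over elapsed time instead of reporting a single average. The sketch below assumes a pre-opened file descriptor to the device under test; opening it with direct IO and aligning the buffer (see the recommendations section below) are omitted here for brevity:

```python
import os, time

def write_throughput_timeline(fd, span_bytes, block=16 * 1024,
                              bucket_s=10, duration_s=3600):
    """Sequentially rewrite `span_bytes` of the device over and over for
    `duration_s` seconds, printing MB/s per `bucket_s`-second bucket so the
    fresh-out-of-the-box to steady-state transition becomes visible."""
    buf = b"\xa5" * block
    start = bucket_start = time.time()
    written = offset = 0
    while time.time() - start < duration_s:
        os.pwrite(fd, buf, offset)
        written += block
        offset = (offset + block) % span_bytes      # stay inside the tested span
        now = time.time()
        if now - bucket_start >= bucket_s:
            print(f"{now - start:7.0f}s  {written / (now - bucket_start) / 1e6:8.1f} MB/s")
            bucket_start, written = now, 0
```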

II. SSD’s read and write performance is asymmetric

Basically, an HDD’s read and write performance is symmetric. This is not the case for SSDs, mainly for the following reasons:

  1. The flash media’s read and write performance is asymmetric – reads are much faster than writes (~50 usec for a single page read vs. ~800 usec for a single page write, on average).
  2. The flash media does not support a re-write operation. You have to erase an entire block (an area of ~128 pages) before you can rewrite a single page. As the erase operation is relatively slow (~2 msec) and using it requires a costly read/modify/write (RMW) cycle, most SSDs avoid such RMW operations by using large write-behind buffers, write combining, and complex mapping, allocation and cleaning logic (a rough cost estimate of the naive RMW path appears after the figures below).
  3. The SSD’s internal write and read logic is asymmetric by design. For example, due to the internal mapping layer, the read logic only uses the mapping metadata, while a write may need (and in most cases has) to change the mapping.
  4. Due to the placement/cleaning logic, a single user write may require several internal writes to complete (this is the famous “write amplification”).
[Figure: GSkill read test]

[Figure: GSkill write test]

Even worse, read/write combinations (mixes) are even more complex due to many internal read/write arbitration issues (which are beyond the scope of this post). Still, as many applications do reads and writes at the same time, the read/write mix patterns may be very important, sometimes much more important than the pure read or pure write patterns.
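
To see why SSDs go to such lengths to avoid the naive path, here is a rough estimate of what an in-place 4k rewrite would cost if the drive had to do an erase-block read/modify/write for it, using the illustrative flash timings quoted in the list above:

```python
# Cost of rewriting one 4k page via a naive block-level read/modify/write,
# using the assumed flash timings from the list above.
pages_per_block = 128
page_read_us = 50
page_write_us = 800
block_erase_us = 2000

naive_rmw_us = (pages_per_block * page_read_us     # read the whole block out
                + block_erase_us                    # erase it
                + pages_per_block * page_write_us)  # program every page back
print(f"naive in-place 4k rewrite: ~{naive_rmw_us / 1000:.0f} msec")   # ~111 msec
print(f"redirected 4k write:       ~{page_write_us / 1000:.1f} msec")  # ~0.8 msec
# Roughly two orders of magnitude apart - hence the buffering, mapping and
# cleaning logic, and hence the write-side complexity described above.
```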

III. SSD’s performance is stateful

In addition to the data placement issues mentioned above, the SSD’s complex internal data mapping can affect the performance results in many other ways. For example, the sequential write performance of a disk that was filled with random writes may differ from the exact same benchmark run on a disk that was filled with sequential writes. In fact, some SSDs’ sequential write performance may be reduced to their random write performance if the disk was filled with random writes! Another small issue is that reading an unmapped block (i.e. a block that was never written, or that was trimmed) is different from reading a mapped one.

IV. SSDs are parallel devices

Unlike HDDs, which have a single moving arm and therefore can serve only one request at a time, most SSDs have multiple parallel data channels (4-10 channels in most desktop-oriented SSDs). This means that in order to reach the SSD’s maximal performance you may need to queue several requests in the disk queue. Note that the SSD’s firmware may utilize its internal channels in many ways, so it is hard to predict what the result of queuing more requests will be. For many SSDs it is enough to use a relatively small number of parallel/in-flight requests (4-8 requests).

V. SSDs may suffer from HDD oriented optimizations

The entire storage stack in your computer/server is HDD-oriented. This means that many related mechanisms, and even hardware devices, are tuned for HDDs. You have to ensure that these optimizations do not harm SSD performance. For example, most OSes do read-ahead/write coalescing and/or reorder writes to minimize the HDD’s arm movements. Most of these optimizations are not relevant to SSDs, and in some cases they can even reduce the SSD’s performance.

In addition, most HDDs have 512-byte sectors, while most SSDs have 4k sectors. You have to ensure that the benchmark tool and the OS know to send properly aligned (offset and size) requests.

Another issue is RAID controllers. Most of them are tuned for HDD performance and behavior, and in many cases they may become a performance bottleneck.

Benchmarking/tuning recommendations

  1. Ensure that your SSD’s firmware is up-to-date.
  2. Ensure that your benchmark tool, OS and hardware are suited for SSD benchmarking:
    1. Make sure AHCI is ON (probably requires BIOS configuration). Without it the OS would not be able to queue several requests at once (i.e. use NCQ).
    2. Make sure that the disk’s queue depth is at least 32.
    3. For Linux (and other OSes) make sure the IO scheduler is off (change it to “noop”).
    4. Use direct IO to avoid caching effects (both read and write caching).
    5. Most SSDs are 4k block devices. Keep your benchmark’s load aligned (for example, IOMeter’s load is 512-byte aligned by default); see the direct-IO/alignment sketch after this list.
    6. The SATA/SAS controller may affect the result. Make sure it is not a bottleneck (this is especially important for multiple disks benchmarks).
    7. Avoid RAID controllers (unless it is critical for your application).
  3. Always trim or secure erase the entire disk (!) before the test.
  4. After the trim, fill the disk with random 4k data, even if you want to benchmark reads.
  5. Ensure that the benchmark duration is long enough to identify write-logic changes. As a rule, I would test a disk for at least a couple of hours (not including the initial trim/fill phases).
  6. Adjust the used space span to your needs. In most cases you have to balance capacity vs. performance.
  7. Adjust the benchmark parallelism to suit your load.
  8. Remember to test mixed read/write patterns.
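
As a minimal illustration of recommendations 4 and 5, here is what “direct and aligned” looks like at the lowest level on Linux. The device path is a placeholder, and a real benchmark would add parallelism (recommendation 7) on top of this; in practice a dedicated tool such as fio or IOMeter handles all of it for you:

```python
import mmap, os, random, time

def random_read_benchmark(device, block=4096, span_gb=80, ios=100_000):
    """Aligned, page-cache-bypassing 4k random reads (Linux only).

    `device` (e.g. "/dev/sdX") is a placeholder. O_DIRECT requires the buffer,
    the offset and the size to all be block-aligned, hence the mmap-backed
    buffer and the block-granular offsets.
    """
    fd = os.open(device, os.O_RDONLY | os.O_DIRECT)
    f = os.fdopen(fd, "rb", buffering=0)
    buf = mmap.mmap(-1, block)                  # anonymous mapping => page-aligned buffer
    span_blocks = span_gb * (1 << 30) // block
    start = time.time()
    for _ in range(ios):
        f.seek(random.randrange(span_blocks) * block)   # offsets stay 4k-aligned
        f.readinto(buf)                         # synchronous, i.e. queue depth 1
    elapsed = time.time() - start
    print(f"{ios / elapsed:,.0f} IOPS at queue depth 1")
    f.close()
```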

General recommendations:

  1. If you benchmark the disk to evaluate it for a specific use case, try to understand which pattern or patterns are relevant for that use case and focus on them. Try to understand your relevant block sizes, alignments, randomness, read vs. write ratio, parallelism, etc.
  2. SSD performance may differ greatly from model to model, even between models from the same manufacturer and/or models using the same internal controller.

An end note: the attached graphs are just for demonstration. This is not a GSkill-related post and it is not a disk review – the demonstrated behavior is common to many SSDs.



Filed under ssd

SSD’s internal complexity – a challenge or an opportunity?

SSDs are much more complex than the magnetic disks they are meant to replace. By complex I mean that they use much more complex logic. This is required to overcome the fundamental problems of the flash media:

  • Flash media does not support the simple read block/rewrite block semantics that magnetic disks do. Instead it offers read page, write (program) page – but no rewrite – and erase block semantics (a block is much larger than a page). Fairly complex mapping and buffering techniques must be used to simulate read/rewrite semantics (a minimal mapping sketch appears after this list).
  • A flash page can be modified (== erased and programmed) only a limited number of times. This requires “wear leveling” logic that attempts to spread the modifications over the entire set of blocks/pages.
  • Flash media suffers from various sources of data errors (including errors caused by read operations, by writes to physically close locations, etc.). Advanced recovery mechanisms must be implemented to overcome these errors.
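
Here is a deliberately minimal sketch of that mapping layer – a flash translation layer – which simulates rewrite semantics on top of program/erase-only media. Wear leveling, error handling and realistic garbage-collection policies are all omitted; this is an illustration of the idea, not any vendor’s design:

```python
class ToyFTL:
    """Toy flash translation layer: flash pages can only be programmed after an
    erase, so every logical overwrite is redirected to a fresh physical page and
    stale copies are reclaimed later by garbage collection."""

    PAGES_PER_BLOCK = 128

    def __init__(self, num_blocks):
        self.flash = [[None] * self.PAGES_PER_BLOCK for _ in range(num_blocks)]
        self.free_blocks = list(range(num_blocks))    # fully erased blocks
        self.l2p = {}        # logical page number -> (block, page)
        self.p2l = {}        # live physical page  -> logical page number
        self.open_block = self.free_blocks.pop()      # block currently being filled
        self.next_page = 0

    def write(self, lpn, data):
        old = self.l2p.get(lpn)
        if old is not None:
            self.p2l.pop(old, None)          # the old copy goes stale, it is never rewritten
        block, page = self._allocate_page()
        self.flash[block][page] = data       # "program" a fresh page
        self.l2p[lpn] = (block, page)
        self.p2l[(block, page)] = lpn

    def read(self, lpn):
        block, page = self.l2p[lpn]          # reads only consult the mapping table
        return self.flash[block][page]

    def _allocate_page(self):
        if self.next_page == self.PAGES_PER_BLOCK:     # the open block is full
            self.open_block = self.free_blocks.pop()   # switch to an erased block
            self.next_page = 0
            if not self.free_blocks:                   # running low: reclaim space
                self._garbage_collect()
        page = self.next_page
        self.next_page += 1
        return self.open_block, page

    def _garbage_collect(self):
        # Pick the block with the fewest live pages, relocate them, then "erase" it.
        # (Toy assumption: the victim holds at least one stale page; real SSDs
        #  guarantee this by over-provisioning physical space.)
        candidates = [b for b in range(len(self.flash)) if b != self.open_block]
        victim = min(candidates,
                     key=lambda b: sum((b, p) in self.p2l
                                       for p in range(self.PAGES_PER_BLOCK)))
        for p in range(self.PAGES_PER_BLOCK):
            lpn = self.p2l.pop((victim, p), None)
            if lpn is not None:              # relocations are writes the host never issued
                self.write(lpn, self.flash[victim][p])
        self.flash[victim] = [None] * self.PAGES_PER_BLOCK    # "erase" the block
        self.free_blocks.append(victim)
```

Note how every relocation in _garbage_collect is an internal write the host never issued – this is the root of the write amplification and the performance statefulness discussed in the benchmarking post above.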

As always, nothing comes without a price. The cost of this complexity is:

  • It consumes more resources.
  • It requires much more effort to develop and test SSDs, and accordingly more time to stabilize/productize them.
  • It makes SSD performance much more complex to analyze and predict.

On the other hand, the fact that SSDs have many more resources than a common magnetic disk, and that they are geared with advanced data structures and mechanisms, also makes them an enabler for new advanced functions and features. Indeed, you can already see features such as encryption and compression implemented within desktop-class SSDs (e.g. Intel’s 320 SSD).

These features are only the start. I believe that SSDs can implement “hardware” assists to offload some of the storage feature implementation load, and that such assists may be required to build the next generation of SSD-based storage systems.

For example, because a single magnetic disk (HDD) provides only (up to) ~250 IOPS, HDD-based RAID systems implement the RAID logic above the disks, namely in a “RAID controller”. But a single (enterprise-level) SSD provides tens of thousands of IOPS, making such an architecture much less reasonable. Sixteen 50K IOPS (4k block) SSDs in a RAID-6 system require the RAID controller to process 50K * 16 * 4KB = 3.2 GB/s of data (a mix of reads, writes and XORs). These numbers are far above the performance envelope of current RAID controllers (see for example this post: Accelerating System Performance with SSD RAID Arrays).
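
The arithmetic, spelled out with the same illustrative figures used above:

```python
# Data the RAID controller must move and crunch if it sits above the devices.
ssd_gb_s = 50_000 * 16 * 4 / 1e6    # IOPS * devices * KB per IO  -> ~3.2 GB/s
hdd_gb_s = 250 * 16 * 4 / 1e6       # the same stack over HDDs    -> ~0.016 GB/s
print(f"SSD case: ~{ssd_gb_s:.1f} GB/s, HDD case: ~{hdd_gb_s * 1000:.0f} MB/s")
```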

If the RAID controller could leverage the SSDs to do most of the data crunching, it would only need to be a flow manager and coordinator. The same logic is relevant for other storage functions too: snapshots, data transfers, copy-on-write, compare-and-write, etc. There are also some downsides: such an architecture is much more complex than the current layering, and it may need new disk/bus-side abilities such as disk-to-disk transfers and others. Furthermore, such SSD-side assists must be standardized to enable different RAID controller vendors to use them. Still, I believe this architecture is much more cost-effective and scalable.
To sum up, I don’t know if SSDs with storage function assists will ever exist. I hope they will, because I think it is the correct and economical way to go.


Filed under Enterprise Storage, ssd

SSD in Enterprise Storage

SSD-only enterprise storage systems are still uncommon. Here is one of the first materials I found discussing the different models of using SSDs in enterprise storage. I liked this presentation as it covers the most important issues on the subject. All issues are handled at a highlights-only level, but that’s a nice start.

Here are the four categories mentioned:

  • Server attached PCI flash
  • Array with flash cache
  • Array with flash tier
  • All flash LUN or array

Here is the link to the full presentation (SNIA education, by Matt Kixmoeller):

Leveraging flash memory in enterprise storage


Filed under Enterprise Storage, ssd

Different types of SSDs

I have seen the term SSD (Solid State Disk/Device) used both as a generic term and as a product type, which may be confusing. Here is my attempt to sort things out:

SSD by technology:

RAM based: mostly battery-backed RAM. Excellent performance (100K IOPS and above), very expensive, relatively small capacity. No endurance issues (unless backed up by flash). Used mainly in acceleration devices. Examples: http://www.ramsan.com/products/4, http://www.ddrdrive.com/

MLC based: NAND flash, multi-level cell (each cell holds at least 2 bits). The least expensive of all SSD technologies (~$2.5 per GB); endurance is low (~3 thousand program/erase cycles per block, or less), random performance is fair (typically a few thousand sustained IOPS), streaming is good (100MB/s and above). Capacity is typically 100-400 GB. Used mainly as a mobile/desktop/low-end server disk replacement. Examples: Intel X25-M, http://ssd-reviews.com/ (MLC entries)

SLC based: NAND flash, single-level cell. Relatively expensive (~$10 per GB), good performance (tens of thousands of IOPS), small capacities (32-200GB), good endurance (~70-100K writes per block). Used mainly for servers and tier-zero disks in disk arrays. Examples: Intel X25-E.

Enterprise level MLC: similar to regular MLC but with better data protection/error correction mechanisms. In most aspects (price, performance, endurance) they are almost as good as SLC (though typically not quite). Such devices are relatively new on the market, but I believe they will become the new enterprise-level SSD standard. Examples: STEC MACH 16 MLC SSD, Fusion-IO DUO MLC.

Others: less relevant – NOR based (very expensive), hybrid (HDD + SSD, NOR + NAND, MLC + SLC).

SSD by form factor:

PCI-e: typically performance-oriented SSDs. Very expensive (~$20/GB), excellent performance. Main use: local server cache/accelerator. Examples: Fusion-io.

SATA: typically desktop/mobile oriented SSDs, MLC based (see above).

SAS: typically servers, disk arrays oriented SSDs, SLC and/or enterprise level MLC (see above).

External (SAN): typically SAN acceleration devices, RAM and/or SLC based. Examples: Violin Memory, Texas Memory

SSDs by use case:

Mobile/desktop disk replacement: typically low power small form factor (1.8″ or 2.5″) SATA MLC devices.

Server acceleration: typically PCI-e devices (see above).

Server disk replacement: typically enterprise level MLC or SLC SAS devices.

Cache/Tier zero disks: typically SLC SAS devices, enterprise level MLC is expected to be relevant too.

SAN accelerations: typically proprietary MLC/SLC/RAM combinations.

A very good site with a lot of SSD related information:

http://www.storagesearch.com/ssd-analysts.html


Filed under ssd