Magnetic spinning disks (hard disk drives, HDDs) have been the dominant primary storage device in the storage world for at least 50 years. This era is over! Within a few years (more or less) the SSD will replace the HDD as the primary storage device. Moreover, in the near future we will already see real SSD-based storage systems (i.e. systems that use SSDs as the primary device and not as a cache/tier-zero device). When these systems become available, some very fundamental ground rules of storage and storage performance thinking will have to change. In this post I want to present several common conventions and areas that are heavily affected by “HDD thinking” and may change when SSD systems start to impact the storage market.
Storage devices are sloooooow!
Compared to any other major computer component, HDDs are slow. Their bandwidth is modest and their latency is disastrous (see also the sequential/random access section below). As the main primary storage device, the HDD is therefore the major performance bottleneck of many systems, and of most storage systems. Many storage-related and some non-storage-related sub-systems are under-optimized simply because the HDD bottleneck conceals their performance problems.
The SSD effect:
SSDs are at least an order of magnitude faster than HDDs. They are still not comparable to RAM, but connect several enterprise-level SSDs to your system and you will have enough “juice” to hit other system component bottlenecks (memory bus, interface bus, interrupt handling, pure CPU issues, etc.). This changes the performance balance of many systems.
Random vs. Sequential
Due to its physical design, a typical HDD can efficiently access contiguous (sequential) areas on the media, while access to non-successive (random) areas is very inefficient. A typical HDD can stream about 100MB/sec (or more) of successive data, but can access no more than about 200 different areas on the media per second, so accessing random 4k blocks reduces the resulting bandwidth to about 800KB/sec!
This bi-modal behavior has affected storage system “thinking” so deeply that almost every application that accesses storage resources is tuned to it. This means that:
- Most applications attempt to minimize the random storage accesses and prefer sequential accesses if they can, and/or attempt to access approximately “close” areas.
- Applications that need good random bandwidth attempt to access the storage using big blocks. This helps because accessing a random 512 bytes “costs” about the same as accessing 32KB of data: the data transfer time from/to an HDD is small compared to the seek time (movement of the disk’s head) and the rotational latency (the time the disk has to wait until the platter(s) rotate such that the data block is under the head).
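The arithmetic behind both points can be sketched with a very rough cost model. The numbers are assumptions derived from the figures above: about 5 ms of positioning time per random access (i.e. about 200 accesses per second) and a 100MB/sec streaming rate. The function names are made up for this illustration, and a real disk is of course less uniform than this.

```python
def hdd_access_time_ms(block_bytes, positioning_ms=5.0, stream_mb_per_s=100.0):
    """Time for one random access: positioning (seek + rotation) plus transfer.

    positioning_ms and stream_mb_per_s are assumed, illustrative values.
    """
    transfer_ms = block_bytes / (stream_mb_per_s * 1_000_000) * 1000
    return positioning_ms + transfer_ms


def random_bandwidth_kb_per_s(block_bytes, positioning_ms=5.0, stream_mb_per_s=100.0):
    """Effective bandwidth when every access lands on a random location."""
    access_ms = hdd_access_time_ms(block_bytes, positioning_ms, stream_mb_per_s)
    accesses_per_s = 1000 / access_ms
    return accesses_per_s * block_bytes / 1000


# Compare small and large blocks under the same positioning cost.
for size in (512, 4096, 32 * 1024, 1024 * 1024):
    print(f"{size:>8} B: {hdd_access_time_ms(size):6.2f} ms/access, "
          f"{random_bandwidth_kb_per_s(size):10.1f} KB/s")
```

Under these assumptions a random 32KB access takes only a few percent longer than a random 512-byte access, and random 4k accesses deliver roughly 800KB/sec, which is why big blocks are the cheap way to buy random bandwidth on an HDD.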
The SSD effect:
The media used to build SSDs is mostly RAM or flash. Both provide random access performance (latency and bandwidth) that is comparable to, though still lower than, their sequential access performance. Flash media has other limitations (no direct rewrite ability) that force SSD designers to implement complex redirect-on-write schemes. In most cases, sequential write access is much easier to handle than random write access, so the bi-modal behavior is somewhat retained. On the other hand, read operations are much less affected by this complexity, and enterprise-level SSDs are built in a way that minimizes the sequential/random access performance gap. The sequential write performance of a typical desktop SSD is about an order of magnitude better than its random write performance. For enterprise-level SSDs, the gap may narrow to a factor of about 2, and in many cases the random performance is good enough (i.e. it is not the performance bottleneck of the system). In particular, random access patterns using small blocks are no longer performance horrors.
Storage related data structures
When applications need access to data on HDDs, they tend to organize it in a way that is optimized for the random/sequential bi-modality. For example, data chunks that have a good chance of being accessed together, or at least within a short period of time, are stored close to each other (or in storage talk: identify “locality of reference” and transform temporal locality into spatial locality). Applications also tend to use data structures that are optimized for such locality of reference (such as B-Trees) and avoid data structures that are not (such as hash tables). Such data structures may themselves introduce additional overhead for random/sparse access patterns, creating a self-reinforcing cycle in which the motivation to use sequential accesses keeps growing.
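To make the difference concrete, here is a toy comparison of a sorted (B-Tree-like) placement and a hashed placement for a group of keys that are accessed together. It is only a sketch: the slot model, the `count_seeks` helper and the sizes are assumptions made for this illustration, and a real B-Tree or hash table is far more involved.

```python
import hashlib


def sorted_slot(key, num_slots):
    """Sorted layout: neighbouring keys land in neighbouring slots."""
    return key % num_slots


def hashed_slot(key, num_slots):
    """Hashed layout: a stable hash scatters neighbouring keys over the media."""
    digest = hashlib.sha256(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_slots


def count_seeks(slots):
    """Count head movements: any access that is not to the next slot is a seek."""
    seeks = 1  # the first access always needs positioning
    for prev, cur in zip(slots, slots[1:]):
        if cur != prev + 1:
            seeks += 1
    return seeks


NUM_SLOTS = 1_000_000          # assumed media size, in record slots
key_range = range(5000, 5100)  # 100 keys that are accessed together

sorted_seeks = count_seeks([sorted_slot(k, NUM_SLOTS) for k in key_range])
# Even when the hashed slots are visited in ascending order they are rarely adjacent.
hashed_seeks = count_seeks(sorted(hashed_slot(k, NUM_SLOTS) for k in key_range))
print(f"sorted layout: {sorted_seeks} seek(s), hashed layout: {hashed_seeks} seek(s)")
```

The sorted layout reads the whole group with a single positioning operation, while the hashed layout needs close to one per key. On an HDD each of those costs milliseconds, which is the pressure that pushes applications toward locality-preserving structures.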
The SSD effect:
Data structures that are meant to exploit the data’s locality of reference still work for SSDs. But as the sequential/random access gap is much smaller, such data structures may cease to be the best choice for many applications, because other (application-dependent) issues may deserve the focus of data structure optimization more than storage locality of reference does. For example, sparse data structures such as hash tables may be a much better fit for some use cases than the data structures currently in use.
Read and write caches
HDD random access latencies are huge compared to most other computer sub-systems. For example, a typical round-trip time of a packet in a modern 1Gb Ethernet LAN is about 100 usec. The typical latency of a random 4k HDD IO (read or write) operation is about 4-10 msec (i.e. 40-100 times slower). Read and write caches attempt to reduce or at least hide some of the HDD random IO latency:
- Read caches keep a copy of the frequently accessed data blocks on some faster media, most commonly RAM. During the read IO flow, the requested read (offset) is looked up in the read cache, and if it is found the slow IO operation is completely avoided.
- Write caches keep the most recently written data in much faster media (e.g. RAM/NVRAM) and write it back (“destage”) to the HDD layer in the background, after the user write is acknowledged. This “hides” most of the write latency and lets the system optimize the write (“destage”) bandwidth by applying write-coalescing (also known as “write combining”) and write-reordering (e.g. the “elevator algorithm”) techniques.
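Both mechanisms can be sketched in a few lines. This is a minimal illustration and not how any particular storage system implements them: the class names are hypothetical, the “disk” is a plain dictionary, LRU is just one possible eviction policy, and the two caches are kept independent for simplicity (a real system has to keep them coherent).

```python
from collections import OrderedDict


class ReadCache:
    """Toy LRU read cache sitting in front of a slow disk."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # block number -> cached data

    def read(self, block, disk):
        if block in self.blocks:
            self.blocks.move_to_end(block)   # mark as most recently used
            return self.blocks[block], True  # hit: the slow IO is avoided
        data = disk[block]                   # miss: pay the HDD latency
        self.blocks[block] = data
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)  # evict the least recently used
        return data, False


class WriteBackCache:
    """Toy write-back cache: acknowledge from memory, destage later."""

    def __init__(self):
        self.dirty = {}  # block number -> latest data written to that block

    def write(self, block, data):
        # Write coalescing: rewriting a dirty block replaces the pending
        # data, so only the last version ever reaches the disk.
        self.dirty[block] = data
        return "ack"  # acknowledged before any HDD IO happens

    def destage(self, disk):
        # Elevator-style reordering: flush in ascending block order so the
        # head sweeps across the media in one direction.
        flushed = []
        for block in sorted(self.dirty):
            disk[block] = self.dirty[block]
            flushed.append(block)
        self.dirty.clear()
        return flushed


disk = {}
write_cache = WriteBackCache()
for block, data in [(90, "a"), (7, "b"), (90, "c"), (42, "d"), (7, "e")]:
    write_cache.write(block, data)
order = write_cache.destage(disk)  # five user writes become three ordered disk writes

read_cache = ReadCache(capacity=2)
first = read_cache.read(7, disk)   # goes to the disk
second = read_cache.read(7, disk)  # served from the cache
print(order, disk, first, second)
```

Here five user writes are acknowledged immediately and later reach the disk as three writes in ascending block order, and the second read of the same block never touches the disk.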
Of course the effect of both caches is limited:
- Read caches are effective only if the user data access pattern has enough locality of reference and the total dataset is small enough.
- Write caches are effective only when the destage bandwidth is higher than the user data write bandwidth. For most HDD systems this is not the case, so write caches are good at handling short spikes of writes, but when the cache buffer fills up, the user write latency falls back to the HDD write latency (or to be more exact, to the destage latency).
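These two limits can be put into simple formulas. The sketch below is a back-of-the-envelope model, not a measurement: the function names and all the numbers in the example (a 1 usec cache hit, an 8 msec HDD access, a 4GB write cache, 200MB/sec of user writes against 40MB/sec of destage bandwidth) are assumptions chosen only for illustration.

```python
def effective_read_latency_us(hit_ratio, cache_us, media_us):
    """Average read latency for a given read cache hit ratio."""
    return hit_ratio * cache_us + (1.0 - hit_ratio) * media_us


def seconds_until_cache_full(cache_mb, write_mb_per_s, destage_mb_per_s):
    """How long a write cache can absorb a sustained write stream.

    Returns None when destaging keeps up, i.e. the cache never fills.
    """
    surplus = write_mb_per_s - destage_mb_per_s
    if surplus <= 0:
        return None
    return cache_mb / surplus


# Read side: even a good hit ratio leaves the misses dominating the average.
for hit_ratio in (0.5, 0.9, 0.99):
    latency = effective_read_latency_us(hit_ratio, cache_us=1, media_us=8000)
    print(f"hit ratio {hit_ratio:.2f}: {latency:8.1f} usec on average")

# Write side: the cache only buys time while writes outrun the destage rate.
fill_time = seconds_until_cache_full(cache_mb=4096, write_mb_per_s=200, destage_mb_per_s=40)
print(f"write cache full after {fill_time:.1f} sec")
```

With these assumed numbers, a 90% hit ratio still leaves an average read latency of roughly 800 usec, and the write cache absorbs the burst for less than half a minute before users see the destage latency.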
The SSD effect:
For some SSD-based systems the traditional read cache should be much less important than in HDD systems. The read latency of many SSDs is so low (about 50 usec) that it ceases to be the dominant part of the system’s read operation latency, so the reasoning behind the read cache is much weaker.
Regarding write caches, most SSDs have internal write caches (battery-backed or otherwise protected), and in addition the base write latency is much lower than the HDD write latency, which makes the write cache much less important too. Still, as the SSD’s write latency is relatively high compared to its read latency, a write-behind buffer can be used to hide it. Furthermore, unlike in HDD systems, it should be relatively easy to build a system whose destage bandwidth is high enough to hide the media write latency from the user even during very long full-bandwidth writes.
In the early 90s Mendel Rosenblum and J. K. Ousterhout published a very important article named “The Design and Implementation of a Log-Structured File System”, where in short they claimed the following: read latency can be effectively handled by read caches. This leaves write latency and throughput as the major problem for many storage systems. The article suggested the following: instead of having a traditional mapping structure (such as B-Trees) to map logical locations to physical locations, use a DB-oriented technique called logging. When logging is used, each new user write operation writes its data to a contiguous buffer on the disk, ignoring the existing location of the user data on the physical media. This replaces the common write-in-place technique with a relocate-on-write technique. The motivation is to exploit the relatively fast HDD sequential write (at the expense of read efficiency, complexity and defragmentation problems). This technique has many advantages over traditional write-in-place schemes, such as better consistency, fast recovery, RAID optimizations (i.e. the ability to write only full RAID stripes), snapshot implementations, etc. (It also has some major drawbacks.)
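The relocate-on-write idea can be sketched as follows. This is a heavily simplified illustration and not the design from the article: the class and method names are hypothetical, the “media” is a Python list, and the logical-to-physical map lives only in memory, whereas a real log-structured file system also has to persist its mapping and clean (defragment) the log.

```python
class LogStructuredStore:
    """Toy relocate-on-write store: every write is appended to the log."""

    def __init__(self):
        self.log = []    # physical media, written strictly sequentially
        self.index = {}  # logical block -> log position of its latest version

    def write(self, logical_block, data):
        # Ignore where the old version lives; append and repoint the index.
        self.index[logical_block] = len(self.log)
        self.log.append((logical_block, data))

    def read(self, logical_block):
        # Reads pay for the indirection and may land anywhere in the log.
        return self.log[self.index[logical_block]][1]

    def garbage_ratio(self):
        """Fraction of log entries that are stale and need cleaning."""
        if not self.log:
            return 0.0
        return 1.0 - len(self.index) / len(self.log)

    def recover_index(self):
        """Rebuild the mapping by replaying the log from start to end."""
        rebuilt = {}
        for position, (logical_block, _) in enumerate(self.log):
            rebuilt[logical_block] = position
        return rebuilt


store = LogStructuredStore()
for logical_block, data in [(500, "v1"), (3, "x"), (500, "v2"), (77, "y")]:
    store.write(logical_block, data)  # random logical writes, sequential media writes

print(store.read(500), store.garbage_ratio(), store.recover_index() == store.index)
```

The random logical writes to blocks 500, 3 and 77 all become sequential appends, the overwritten version of block 500 turns into garbage that something has to clean up later, and the mapping can be rebuilt by replaying the log, which hints at both the recovery advantage and the defragmentation drawback mentioned above.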
The SSD effect:
The original motivation for logging is no longer valid: random write performance is not as big a problem for SSDs as it is for HDDs, so the additional complexity may not be worth the performance gain. Still, logging techniques may be relevant for SSD-based systems in the following cases:
- Storage systems that directly manage flash components may also use logging to solve several flash problems, most importantly the lack of direct rewrite support and the wear leveling problem.
- Storage systems may use logging to implement better RAID schemes, and as a base for advanced functionality (snapshots, replication, etc.).