NVMe vs AHCI: Another Win for PCIe

Improving performance is never just about hardware. Faster hardware can only take us up to the limits of the software, and ultimately more efficient software is needed to take full advantage of it. This applies to SSDs as well. With PCIe the potential bandwidth increases dramatically, and to take full advantage of the faster physical interface we need a software interface that is optimized specifically for SSDs and PCIe.

AHCI (Advanced Host Controller Interface) dates back to 2004 and was designed with hard drives in mind. While that doesn't rule out SSDs, AHCI is optimized for high-latency rotating media rather than low-latency non-volatile storage. As a result, AHCI can't take full advantage of SSDs, and since the future is in non-volatile storage (like NAND and MRAM), the industry had to develop a software interface that removes the limits of AHCI.

The result is NVMe, short for Non-Volatile Memory Express. It was developed by an industry consortium with over 80 members, and the development was directed by giants like Intel, Samsung, and LSI. NVMe is built specifically for SSDs and PCIe, and as software interfaces usually live for at least a decade before being replaced, NVMe was designed to meet the industry's needs as we move to future memory technologies (we'll likely see RRAM and MRAM enter the storage market before 2020).

                      NVMe                                      AHCI
Latency               2.8 µs                                    6.0 µs
Maximum Queue Depth   Up to 64K queues with 64K commands each   Up to 1 queue with 32 commands each
Multicore Support     Yes                                       Limited
4KB Efficiency        One 64B fetch                             Two serialized host DRAM fetches required

Source: Intel

The biggest advantage of NVMe is its lower latency. This is mostly due to a streamlined storage stack and the fact that NVMe requires no register reads to issue a command. AHCI requires four uncacheable register reads per command, which results in ~2.5 µs of additional latency.
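To put the latency figures in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above (2.8 µs vs 6.0 µs, and four register reads costing ~2.5 µs in total). The function names are made up for illustration, and it assumes that the four reads cost the same and that the ~2.5 µs penalty is part of the 6.0 µs AHCI figure, neither of which the source states outright.

```python
# Figures quoted in the article (attributed to Intel); treated here as given.
NVME_LATENCY_US = 2.8
AHCI_LATENCY_US = 6.0
AHCI_REGISTER_READS_PER_COMMAND = 4
AHCI_REGISTER_READ_PENALTY_US = 2.5  # approximate total for the four reads


def latency_gap_us() -> float:
    """Difference between the quoted AHCI and NVMe latencies."""
    return AHCI_LATENCY_US - NVME_LATENCY_US


def cost_per_register_read_us() -> float:
    """Average cost of one register read, assuming all four cost the same."""
    return AHCI_REGISTER_READ_PENALTY_US / AHCI_REGISTER_READS_PER_COMMAND


def register_read_share_of_gap() -> float:
    """Fraction of the latency gap explained by the register reads.

    Assumes the ~2.5 us penalty is included in the 6.0 us AHCI figure.
    """
    return AHCI_REGISTER_READ_PENALTY_US / latency_gap_us()


if __name__ == "__main__":
    print(f"Latency gap:            {latency_gap_us():.1f} us")
    print(f"Cost per register read: {cost_per_register_read_us():.3f} us")
    print(f"Share of gap from reads: {register_read_share_of_gap():.0%}")
```

Under those assumptions the gap is about 3.2 µs, each register read costs roughly 0.6 µs, and the reads alone would explain a bit over three quarters of the difference, with the rest coming from the leaner storage stack.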

Another important improvement is support for multiple queues and higher queue depths. Multiple queues ensure that the CPU can be used to its full potential and that IOPS is not bottlenecked by the limits of a single core.
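The queue limits from the table can be turned into a small Python sketch to show why this matters for multicore systems. The `QueueLimits` class and the round-robin `queue_for_core` mapping are hypothetical and only for illustration; real drivers decide how cores map to queues, and "64K" is taken as 64 × 1024 here, following the table rather than the specification's exact limits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueLimits:
    """Queue limits of a storage interface, as quoted in the table above."""
    name: str
    max_queues: int
    commands_per_queue: int

    def max_outstanding(self) -> int:
        """Upper bound on commands that can be in flight at once."""
        return self.max_queues * self.commands_per_queue

    def queue_for_core(self, core_id: int) -> int:
        """Hypothetical round-robin mapping of a CPU core to a queue."""
        return core_id % self.max_queues


AHCI = QueueLimits("AHCI", max_queues=1, commands_per_queue=32)
NVME = QueueLimits("NVMe", max_queues=64 * 1024, commands_per_queue=64 * 1024)

if __name__ == "__main__":
    for iface in (AHCI, NVME):
        mapping = [iface.queue_for_core(core) for core in range(4)]
        print(f"{iface.name}: up to {iface.max_outstanding():,} commands in flight, "
              f"cores 0-3 use queues {mapping}")
```

With AHCI every core ends up on the same single queue, so the cores have to coordinate access to it. With NVMe each core can be given a queue of its own, which is the multicore benefit the table refers to.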

Source: Microsoft

Obviously enterprise is the biggest beneficiary of NVMe because its workloads are so much heavier and SATA/AHCI can't provide the necessary performance. The client market benefits from NVMe too, just not as much. As I explained on the previous page, even moderate improvements in performance result in increased battery life, and that's what NVMe will offer. Thanks to lower latency, the disk usage time will decrease, which results in more time spent at idle and thus increased battery life. There can also be corner cases where the better queue support helps with performance.

Source: Intel

With future non-volatile memory technologies and NVMe, the overall latency can be cut to one fifth of the current ~100 µs, and that's an improvement that will be noticeable in everyday client usage too. Currently I don't think any of the client PCIe SSDs support NVMe (enterprise has been faster at adopting NVMe), but the SF-3700 will once it's released later this year. Driver support for both Windows and Linux exists already, so it's now up to SSD OEMs to release compatible SSDs.


131 Comments


  • SirKnobsworth - Thursday, March 13, 2014 - link

    Thunderbolt 2 is really PCIe x4 + DisplayPort in disguise, and you don't need DisplayPort to your SSD.
  • MrSpadge - Thursday, March 13, 2014 - link

    Couldn't you build a nice M.2 to SATAe adapter in a 2.5" form factor and thereby reuse your existing M.2 designs for SATAe?
  • Kristian Vättö - Thursday, March 13, 2014 - link

    Technically yes, but the problem is that M.2 is shaped differently. You could certainly fit a small M.2 drive with only a few NAND packages in there, but the longer, faster ones don't really fit inside 2.5".
  • Kevin G - Thursday, March 13, 2014 - link

    "At 24 frames per second, uncompressed 4K video (3840x2160, 12-bit RGB color) requires about 450MB/s of bandwidth, which is still (barely) within the limits of SATA 6Gbps."

    This is incorrect:

    3840 * 2160 * 12 bit per channel * 3 channels / 8 bits per byte * 24 fps ~ 896 MByte/s

    And that figure is with good byte packing. For raw recording, the algorithm may pack the 12 bits into two bytes for speed purposes, meaning you'd need about 1.2 Gbyte/s of bandwidth. Jumping to 4096 x 2160 resolution at 12 bit color and 30 fps, the bandwidth need grows to about 1.6 Gbyte/s.

    The other thing worth noting is that uncompressed recording is going to take a lot of storage. A modern phone recording at the highest quality settings with 64 GB of storage would last less than 40 seconds before running out.
  • Kristian Vättö - Thursday, March 13, 2014 - link

    Oh, you're absolutely right. I used the below calculator to calculate the bandwidth but accidentally left "interlaced" box ticked, which screwed up the results. Thanks for the heads up, fixing...
  • Kristian Vättö - Thursday, March 13, 2014 - link

    And the calculator... http://web.forret.com/tools/video_fps.asp?width=38...
  • JarredWalton - Thursday, March 13, 2014 - link

    Aren't there *four* channels, though? RGB and Alpha? Or is Alpha not used with 12-bit?
  • Kevin G - Thursday, March 13, 2014 - link

    No real way to record with an Alpha channel value to my knowledge. Cameras and scanners etc all presume a flattened image as if everything were solid. The only exception to this would be direct frame buffer capture from video memory which can independently process an Alpha channel.

    Input media would generally be 36 bit. During the editing phase an Alpha channel can be added as part of compositing pipeline bringing the total bit depth to 48 bit. Final rendering can be done to a 48 bit RGBA file. Display output on screen will be reduced to 36 bit due to compositing for the frame buffer.
  • Nightraptor - Thursday, March 13, 2014 - link

    When I saw the daughterboard Asus provided my instant thought was actually using this (in PCIe 3.0 form) to somehow provide the option to add an external GPU to a tablet. I may be the outlier, but my dream would be to have an 11.6" 16:10 1920 x 1200 tablet with the ability to connect a keyboard dock to function as a laptop, or another dock with a discrete graphics card to function as a desktop for occasional gaming (1080p at high settings would be all I'd ask for, so PCIe 3.0 x4 should be sufficient). If you could somehow get a SATAe cable on a tablet I think this would do it.
  • vladman - Thursday, March 13, 2014 - link

    If you want speed from storage, get a nice Areca PCIe RAID controller, attach 4 or more fast SSDs, do RAID 0, and you've got anywhere from 1.7 to 2GB/s of data transfer. Done deal.
