So, storing data persistently takes about a million times longer than writing to main memory. What happens when main memory is inherently persistent? Something wonderful.
Although 20 years of open-source software has revolutionized the way we use computers, the hardware itself has seen practically no deep architectural change. Servers, desktops and blades are still built from CPUs, RAM, hard disks and NICs, plus monitors and keyboards. Not that that's a problem—we've all enjoyed more than a thousand times better performance and capacity, lower energy use and lower prices in that time frame.
But the next orders-of-magnitude gains in performance will arrive not in the next 20 years, but in the next couple of years. And the hardware changes will not merely make our machinery faster; they will usher in radically different approaches to our programming paradigms, our device interaction and the operating system itself.
After 50-some years, what has been considered a fundamental architectural barrier, RAM volatility, is about to disappear.
Right now, in every PC, mainframe, tablet and smartphone, the handful of gigabytes of SRAM, DRAM and DDR/DDR2/DDR3 SDRAM is volatile. Turn off the power, and all the program text segments, all the computed data structures, all the transient user content and all the operating system state are lost. Any data to be referenced in the future must be persisted to “hard disk”.
But the difference between RAM and disks is vast. Writes to RAM take nanoseconds; writes to disk take milliseconds—a million times slower. In contrast to RAM's random access, disk access is largely serial, and disk writes are efficient only with sizable buffers. And disks are notoriously less reliable.
The difference is so vast that we architect our software systems around it. We put filesystems on disks and transient data in memory. Our programming paradigm treats running applications and system state as transient, and there's no language designed to deal with persistence intrinsically.
So, we have architected several ways to get around this: caches, buffers, flushing I/O patterns, APIs and so on, all manifestly alerting us that persistence is costly and dangerous. For reliability, writes to disk must have a variety of exception handlers. We write extra code to compress writes; we don't write complex data structures to disk; rather, we write transaction logs and so on. In traditional UNIX/Linux, we don't even write general data to disk when we want to—it gets written back every 30 seconds or so when the OS feels like it. We write only file metadata to disk synchronously.
In terms of architectural fundamentals, think about the difference between open()/write()/close(), with all its inherent buffering and explicit error handling, and a statement like int x = 10;. We base our language and API designs on this dichotomy—RAM is fast and reliable but transient; disk writes are slow and error-prone but persistent.
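To make the dichotomy concrete, here is a minimal sketch of what persisting a single integer costs today compared to keeping it in RAM (the filename x.dat is purely illustrative): every persistent step is a system call with its own failure mode.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        int x = 10;                         /* RAM: one statement, gone at power-off */

        /* Disk: a chain of calls, each of which can fail and must be checked. */
        int fd = open("x.dat", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); exit(1); }

        if (write(fd, &x, sizeof(x)) != (ssize_t)sizeof(x)) {
            perror("write");                /* may be short, buffered or rejected */
            exit(1);
        }
        if (fsync(fd) < 0) {                /* force the data out of the page cache */
            perror("fsync");
            exit(1);
        }
        if (close(fd) < 0) { perror("close"); exit(1); }
        return 0;
    }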
For the past decade, several technologies have been in development that provide the read and write latency and reliability of RAM with the transaction-level persistence of disks.
Here, I'll call this technology NVM for non-volatile memory. RAM, of course, means random access memory, which doesn't even mention its lack of persistence. Newer products use their own acronyms based on their particular technology. Some use the acronym NVRAM, but I don't need to belabor the random-access aspect, do I? Presumably, generations to come will just call it “memory”—after all, memory by definition is something that is persistent.
Now, global companies like IBM, Toshiba, HP and Samsung, and memory specialists like Everspin, Crocus and Hynix, are all building and shipping NVM products, primarily for embedded systems markets. Industries such as automotive and aerospace require very reliable persistence in small form factors. And makers of cameras, phones, RAID controllers, network routers and other gear are using NVM under the covers.
NVM technologies include:
Magneto-Resistive RAM: stores bits as magnetic moments in ferromagnetic areas built into transistor arrays. Unlike a hard drive, this is a random-access, read/write magnetic medium.
Spin-Transfer Torque RAM: stores bits as spin-aligned electrons rather than magnetically aligned atoms. Right now, this technology leads with the shortest write latencies.
Phase-Change RAM: stores bits by changing the crystalline structure of a germanium/antimony/tellurium alloy. Fortunately, that alloy crystal can be embedded in a small transistor and can change phase in tens of nanoseconds.
Programmable Metallization RAM: stores bits by switching the ionization of atoms in an electrolyte between two nano-electrodes.
Resistive RAM (memristors): stores bits by changing the conductive properties of a dielectric cell.
These new chips are all nanosecond-fast for writes and reads without any buffers, and they can store data without any active power supply. But how does that persistence compare to disk drive reliability? In fact, storage to NVM chips is almost as reliable as disk storage now, and it will become more reliable in ways that disks are unlikely to achieve. Current disks offer an endurance of about 10^15 read/write cycles; current SRAM/DRAM can handle 10^16 cycles—without persistence. Right now, NVM technologies offer about 10^14–10^15 cycles, but they should be able to hit 10^16–10^17 cycles and more. That's several decades of storage stability, matching and exceeding hard drives.
And NVMs should just get better over time. If you've ever seen the “Shouting in the Datacenter” video, you know just how error-prone multi-platter spinning disks, with their tiny armatures frantically swinging back and forth, can be. Storing data in magnetic moments or electron spins inside solid-state cells will be dramatically more reliable.
But, isn't Flash SSD a pretty good answer, available now? Yes, but it is a temporary technology: it brings some benefits now but is slated for quick obsolescence. Flash is better than a disk drive, but production Flash NAND gates are slow. Fast writes are possible only with a large RAM write cache built into the Flash memory cards. Worse, the gates are good for only 10^5–10^6 cycles—each electronic write damages the cell, so large Flash devices need additional circuitry for leveling writes across less-used cell areas.
Because of these properties and odd buffering policies, many application use cases simply don't get any performance benefit from writing to Flash. Flash manufacturers know this—in fact, Flash is marketed as an SSD (a solid-state “drive”), telling you that it is not competing with RAM.
But this is not a bash-Flash article; I'd be just as happy with Flash-based gigabytes of persistence, as long as it is RAM-fast and reliable. Whether the future is Flash or MR-RAM, STT-RAM, PC-RAM, PM-RAM, R-RAM or any of the other possibilities—Nanotube-RAM, SONOS, Racetrack memory and so on—the key feature is memory that is uniformly ultra-low-latency reads and writes, random-access, high-bandwidth, long-term persistent and capable of large-scale, cheap production.
(As an aside, battery-backed RAM could be just as good as NVM, if batteries didn't corrode, fail, require recharging or catch fire. Battery technology is so poor today that it is simply another point of failure, not a truly safe alternative for long-term storage.)
Although current shipping quantities of NVM are merely (!) millions of units with capacities of just megabytes, all of these manufacturers are committed to continuing the scale-up to gigabytes. Some of these larger units will start shipping in 2014/2015.
This is a big industry, with many creative and agile startups pushing the envelope and the leading IT manufacturers shipping their own products. The production technologies are variants of existing fabrication processes. But let's move past arguing the pros and cons of the different implementation technologies. I've no idea which technology or which company will win out, but the breadth and commitment of the industry is such that the NVM future will arrive sooner rather than later. My motivation here is to talk about some of the implications of this hardware future and to help the software community think about what we'd like to do with it.
As mentioned, current uses of NVM are mainly in the industrial sector, but it will soon become more and more visible in existing appliances, such as network routers and other communications equipment. And we'll quickly see better hybrid drives and RAID devices. Hybrid drives and RAID controllers, which currently use Flash or battery-backed RAM to provide a persistent cache for a traditional hard drive, will gain the much lower latencies and higher reliability of NVM. The benefit here is that our software architectures will use these devices with little or no change. That will help drive the market for more such devices, which in turn will make the manufacturing processes scale up more and more. But this is just more of the faster and better world that we expect from technology.
Our challenge is to think about Linux systems in general and a near-future world of NVM computer architecture where persistent data writes are RAM-fast. What will we do when we can have 8GB (and more) main-memory NVM coupled with a modern CPU?
The interesting changes will come with GB-sized NVM add-in cards. You can imagine a new BIOS recognizing that card and making it available as a distinct storage device to the Linux kernel. With that, it should be pretty trivial to port our standard filesystems, like ext2/3/4, xfs or btrfs. And they'll work fine and provide dramatic performance advantages.
But NVM is different from disks. With true random access, the data-placement algorithms designed for rotating disk drives are outdated. With RAM-like write latencies, we can eliminate most of the buffering and waits for error reports. And we can optimize file access via memory mapping to virtually eliminate any API or OS buffering.
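As a sketch of that last point, here is roughly what memory-mapped access to a file on an NVM-backed filesystem could look like. The mountpoint /mnt/nvm and the filename are assumptions, and the msync() call is today's belt-and-braces step that true NVM would make nearly free.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 4096;
        int fd = open("/mnt/nvm/state.bin", O_RDWR | O_CREAT, 0644);
        if (fd < 0 || ftruncate(fd, len) < 0) { perror("open/ftruncate"); exit(1); }

        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); exit(1); }

        strcpy(p, "updated in place");      /* an ordinary store, no write() call */
        if (msync(p, len, MS_SYNC) < 0)     /* flush to media; near-free on NVM */
            perror("msync");

        munmap(p, len);
        close(fd);
        return 0;
    }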
This new “NVMfs” probably should still keep a transaction journal for metadata, not to queue up writes that haven't reached the disk, but to cover CPU transactions and cache write-backs that may get interrupted. There's still a need to take care of this, but RAM-speed write latencies mean the kernel can do much more “filesystem” activity directly and not have to wait for interrupt callbacks. Happily, all of this means the practical elimination of filesystem maintenance, such as defragmenting, optimizing file placement or lengthy boot-time repair of metadata corruption.
With this, many I/O-intensive applications and databases like MySQL will see orders-of-magnitude performance improvements. For messaging systems like ActiveMQ, no longer will there be a trade-off between unreliable and guaranteed messaging. An application using an NVM-optimized SQLite will be awesome. And distributed memory caches like Memcached won't have to skimp on persistence features.
Meanwhile, encryption and compression can still be important features of an NVMfs. With hard drives, compression and encryption come at little cost, because fast CPUs and RAM easily keep up with slow disk write latencies. With an NVMfs, the performance difference between file data stored plain and file data stored encrypted will be obvious, though both will still be faster than data stored on hard disks.
In the end, NVM-based filesystems probably will mean that all notebooks, desktop PCs and commodity servers will be totally solid-state systems by the end of the decade. We then can optimize our OSes to write-back memory pages to networked disk storage, not for persistence, but for distributed access and disaster recovery. Hard drives, which still will have an orders-of-magnitude advantage in total storage capacity for years to come, will be relegated to the data center and cloud where they can be cared for properly.
But if we can store filesystem data in NVM, we can store application data there too. One simple model would be for applications to ask to map in NVM regions, as is done now with mmap'd files. Of course, NVM regions need no backing disk store—they are inherently resilient. Today, many performance-sensitive applications skip persistence entirely, not even keeping a transaction log. NVM means that all sorts of applications can have persistent semantics, using complex data structures in their ordinary programmatic idioms, without even a foo.save() required.
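Here is a minimal sketch of that model, assuming the kernel exposes a byte-addressable NVM region as a mappable device. The path /dev/nvm0 is hypothetical, and a real interface would also need the naming and permission conventions raised in the questions below.

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    struct counters {
        unsigned long requests;
        unsigned long errors;
    };

    int main(void)
    {
        int fd = open("/dev/nvm0", O_RDWR);   /* hypothetical NVM region device */
        if (fd < 0) { perror("open"); exit(1); }

        struct counters *c = mmap(NULL, sizeof(*c), PROT_READ | PROT_WRITE,
                                  MAP_SHARED, fd, 0);
        if (c == MAP_FAILED) { perror("mmap"); exit(1); }

        c->requests++;                        /* no save step; the value simply
                                                 is still there on the next run */
        printf("requests so far: %lu\n", c->requests);

        munmap(c, sizeof(*c));
        close(fd);
        return 0;
    }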
How should these NVM areas be named? How should they be secured?
Whereas current programming paradigms assume that variables are transient ipso facto, future languages or extensions may allow different semantics for different data. Rather than naming data regions and using some sort of brk() or mmap() call, certain language keywords or data-naming rules would enable automatic re-mapping to persisted NVM data. (Ironically, the “volatile” keyword in C/C++ may be required for transient data!)
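One way the data-naming idea could be approximated with today's toolchain is sketched below. GCC's section attribute is real; a loader that maps the .nvmdata section onto NVM, so those variables survive reboot, is the assumption here.

    #include <stdio.h>

    /* Assumption: a future loader maps the .nvmdata section onto NVM, making
     * anything placed there persistent. The attribute itself is standard GCC. */
    __attribute__((section(".nvmdata")))
    static unsigned long boot_count = 0;

    static int scratch;                 /* ordinary, transient data */

    int main(void)
    {
        boot_count++;                   /* would survive reboots if NVM-backed */
        scratch = 0;
        printf("boots so far: %lu\n", boot_count);
        return scratch;
    }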
For decades, virtual machine-based languages, such as Smalltalk and various Lisps, have needed cumbersome “save world” commands to write out all the in-memory data structures and class or function definitions to disk. In an NVM world, we don't need a separate command for this—all VM use of NVM is persistent. A virtual machine world can be dynamic, fast and long-lived. Modern VM languages like Java, as well as dynamic scripting languages like Ruby, Python and Tcl, could let an application or system keep all of its active data structures in NVM with no need for laborious serialization on and off slow disks.
Perhaps functional languages, such as Erlang and Haskell, with their immutable value design, could take the best advantage of NVM. Their clean, mathematical philosophy has never much liked the “side effect” of storage. Now they may be able to support persistence as a virtually free feature.
Of course, with automatic and pervasive persistence comes the problem of transactional support. Although the data may be persistent, there is no guarantee that the data you're reading was all written out as a single, correct transaction. Consider setting an array, map or string to a new value. If some of the (persistent) stores or cache write-backs are interrupted, the array/map/string may be only half set correctly. Naively reconnecting to that data segment won't tell you which portion of the data is correct. To help with this, applications could ask the kernel to set restore points for their data. A more sophisticated solution would be for VMs to provide software transaction semantics, or for NVM hardware transactional memory to ensure atomicity and consistency.
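The restore-point idea can be illustrated in plain C. This is only an illustrative software sketch, assuming the whole structure lives in NVM; a real implementation would also have to order the stores and flush CPU caches, as discussed above.

    #include <stdio.h>

    struct account { long balance_a, balance_b; };

    struct persistent_area {            /* imagine this entire struct in NVM */
        struct account live;
        struct account shadow;
        int in_transaction;             /* set before mutating, cleared after */
    };

    static void transfer(struct persistent_area *p, long amount)
    {
        p->shadow = p->live;            /* the restore point */
        p->in_transaction = 1;
        /* A power loss anywhere in here leaves in_transaction set. */
        p->live.balance_a -= amount;
        p->live.balance_b += amount;
        p->in_transaction = 0;          /* commit */
    }

    static void recover(struct persistent_area *p)
    {
        if (p->in_transaction) {        /* interrupted: roll back to the shadow */
            p->live = p->shadow;
            p->in_transaction = 0;
        }
    }

    int main(void)
    {
        static struct persistent_area p = { .live = { 100, 0 } };
        recover(&p);                    /* would run after every restart */
        transfer(&p, 25);
        printf("a=%ld b=%ld\n", p.live.balance_a, p.live.balance_b);
        return 0;
    }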
As far as ACID (atomicity, consistency, isolation and durability) goes, NVM could eliminate worrying about “D”. But will this NVM data be truly resilient? Won't bullet-proof resilience as needed for financial transactions still exact enormous penalties? Probably not. If NVM itself isn't good enough, redundant boards should provide the needed availability. Better yet, HA systems with primary and secondary NVM still will work thousands of times faster than current HA disk-based systems.
Now that you're getting used to the idea of “RAM” being “NV”, let's go all the way down the stack to the operating system itself. What advantage could the Linux kernel take of NVM? It's not just that a Linux NVM system could boot in fractions of a second, but that having some (or most?) kernel state persisted at practically no extra cost in time opens up many interesting possibilities. The bootloader still can execute hardware power-on self-tests, but there's very little extra work required to get the kernel running when much of the kernel state and instruction space is magically still available.
During a transition period, when DRAM and NVM coexist in a system, the kernel process table could be modified to note which processes are running wholly in NVM. On reboot, the kernel process table (also in NVM) could ignore DRAM-based processes, while letting NVM processes get going as soon as system devices are initialized. And, as mentioned, the kernel could help with application data restore points.
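As a purely hypothetical sketch of that transition-period bookkeeping, the fragment below marks which entries of an NVM-resident process table can be resumed after a reboot; none of these structures or names exist in the Linux kernel.

    #include <stdbool.h>
    #include <stdio.h>

    struct nvm_proc_entry {             /* one entry in an NVM-resident process table */
        int  pid;
        bool wholly_in_nvm;             /* text, stack and heap all live in NVM */
        bool resumable;
    };

    /* After a reboot, DRAM-based processes are gone and must be skipped, while
     * NVM-resident ones can be marked resumable once devices are initialized. */
    static void mark_resumable_after_reboot(struct nvm_proc_entry *table, int n)
    {
        for (int i = 0; i < n; i++)
            table[i].resumable = table[i].wholly_in_nvm;
    }

    int main(void)
    {
        struct nvm_proc_entry table[] = {
            { .pid = 101, .wholly_in_nvm = true  },
            { .pid = 102, .wholly_in_nvm = false },  /* DRAM-based: gone after reboot */
        };
        mark_resumable_after_reboot(table, 2);
        for (int i = 0; i < 2; i++)
            printf("pid %d resumable: %d\n", table[i].pid, table[i].resumable);
        return 0;
    }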
An NVM kernel also may help with managing devices. RAM-fast data persistence could enable the kernel to remember the state of attached hardware to a far greater degree (or, to be fair, the devices may have their own NVM to help out).
NVM is coming. Without much work, it will provide an enormous benefit to applications and use cases where storage performance is a limiting factor. I've tried to outline some of the more revolutionary ways we can take advantage of the technology. Because RAM volatility has been a fundamental assumption of our computing architecture, it is hard to envision what an NVM future could look like. What might the design principles, kernel semantics and language designs have been in an alternate computing world where NVM was invented in, say, the 1960s? More radical notions could work in theory, but there may be no easy migration path from where we are today. It will be up to the global community to figure out the answers, with open source and Linux leading the way.