The kernel already tries to recognize when a CPU goes idle and move it into a power-saving state. In fact, several power-saving states are available, depending on how quickly the system will need to wake the CPU up again later. A deeper sleep means greater power savings but slower wakeup.
The menu governor uses various heuristics to guess how long a CPU is likely to remain idle and, thus, how deep an idle state to put it in. However, as Rafael J. Wysocki pointed out recently, the existing menu governor was poorly designed, with a somewhat irrational decision-making process, even to the point of trying to trigger impossible actions.
So, he wanted to rewrite it. Unfortunately, a straight rewrite didn't seem entirely feasible. For certain workloads, optimizing the interactions with the menu governor is a first-class way to speed things up. And for any projects that depended on that, a replacement might slow things down until new optimizations could be figured out.
Rafael's idea, in light of this, was, at least for a while, to have two governors available side by side: the original menu governor and a new one, called the Timer Events Oriented (TEO) governor. For users who either didn't care or didn't generally need to optimize CPU idling, the TEO governor hopefully would provide a better and more predictable experience. Users who needed a slower transition could still rely on the existing menu governor.
Rafael described the TEO governor's new heuristics, saying, "it tries to correlate the measured idle duration values with the available idle states and use that information to pick up the idle state that is most likely to 'match' the upcoming CPU idle interval." He added that the new code avoided using several data points, like the number of processes waiting for input, because those data points simply weren't relevant to the problem.
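For a rough sense of how such a correlation might work, here's a minimal sketch in plain C of the general approach: pick the deepest state whose target residency fits the time until the next timer event, then demote to a shallower state if recent measured wakeups mostly arrived earlier than predicted. The structures, history size and demotion rule here are hypothetical illustrations, not the actual teo.c code.

```c
/*
 * Illustrative sketch of TEO-style idle-state selection; the data
 * structures and thresholds are hypothetical, not the real governor.
 */
struct idle_state {
	unsigned int target_residency_us;  /* minimum sleep worth entering */
	unsigned int exit_latency_us;      /* time needed to wake back up */
};

#define HISTORY 8

static int pick_idle_state(const struct idle_state *states, int nr_states,
			   unsigned int time_to_next_timer_us,
			   const unsigned int recent_idle_us[HISTORY])
{
	int candidate = 0;
	int i, early_wakeups = 0;

	/* Deepest state whose residency fits the expected sleep length. */
	for (i = 0; i < nr_states; i++)
		if (states[i].target_residency_us <= time_to_next_timer_us)
			candidate = i;

	/* Count recent sleeps that ended before the timer predicted. */
	for (i = 0; i < HISTORY; i++)
		if (recent_idle_us[i] < states[candidate].target_residency_us)
			early_wakeups++;

	/* Mostly early wakeups: a shallower state is a better bet. */
	if (early_wakeups > HISTORY / 2 && candidate > 0)
		candidate--;

	return candidate;
}
```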
Several folks, like Doug Smythies and Giovanni Gherdovich, eagerly replied with benchmarks comparing the menu governor with the TEO governor. In some cases, these showed similar speeds between the two; in others, the TEO governor appeared to perform much better than the menu governor.
In fact, maybe it was too much better! Some of the speed increases seemed to indicate to Rafael that the heuristics were perhaps too aggressive. The goal, after all, wasn't speed alone, but also power conservation. After seeing some of the benchmark results, Rafael said he'd tweak the code to be more energy efficient and see how much that would slow things down.
And so, development continued. Something like the menu governor always will be somewhat astrological, like many other aspects of resource allocation, simply because different workloads have different needs, and no one really knows what workloads are the common case. But at least for the TEO governor, there seems to be no real controversy, and Rafael's planned dual-governor situation seems like it has a good chance of adoption.
The printk() function is a subject of much ongoing consternation among kernel developers. Ostensibly, it's just an output routine for sending text to the console. But unlike a regular print routine, printk() has to be able to work even under extreme conditions, like when something horrible is going on and the system needs to utter a few last clues as it breathes its final breath.
It's a heroic function. And like most heroes, it has a lot of inner problems that need to be worked out over the course of many adventures. One of the entities sent down to battle those inner demons has been John Ogness, who posted a bunch of patches.
One of the problems with printk() is that it uses a global lock to protect its buffer. But this means any parts of the kernel that can't tolerate locks can't use printk(). Non-maskable interrupts (NMIs) and recursive contexts are two areas that have to defer printk() usage until execution context returns to normal space. If the kernel dies before that happens, it simply won't be able to say anything about what went wrong.
There were other problems—lots! Because of deferred execution, sometimes the buffer could grow really big and take a long time to empty out, making execution time hard to predict for any code that disliked uncertainty. Also, the timestamps could be wildly inaccurate for the same reason, making debugging efforts more annoying.
John wanted to address all this by re-implementing printk() to no longer require a lock. With analysis help from people like Peter Zijlstra, John had come up with an implementation that even could work deep in NMI context and anywhere else that couldn't tolerate waiting.
Additionally, instead of timestamps being assigned when the buffer was finally flushed, John's code captured them when the message was generated, for a much more accurate debugging process.
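To picture the general technique, here's a heavily simplified userspace sketch of the lockless-reservation idea: a writer claims a ring-buffer slot with an atomic compare-and-swap rather than by taking a lock, and timestamps the message at write time rather than when the buffer is drained. This is only an illustration of the concept; John's actual printk ring buffer is far more sophisticated.

```c
/* Simplified sketch of lockless slot reservation, not John's code. */
#include <stdatomic.h>
#include <string.h>
#include <time.h>

#define SLOTS   256
#define MSG_LEN 128

struct log_entry {
	struct timespec ts;      /* captured when the message is made */
	char text[MSG_LEN];
};

static struct log_entry ring[SLOTS];
static atomic_ulong head;        /* next slot to claim */

static void log_msg(const char *text)
{
	unsigned long seq = atomic_load(&head);

	/* Claim a slot without taking any lock. */
	while (!atomic_compare_exchange_weak(&head, &seq, seq + 1))
		;       /* seq was refreshed; retry with the new value */

	/* Timestamp at write time, not when the buffer is drained. */
	clock_gettime(CLOCK_MONOTONIC, &ring[seq % SLOTS].ts);
	strncpy(ring[seq % SLOTS].text, text, MSG_LEN - 1);
	ring[seq % SLOTS].text[MSG_LEN - 1] = '\0';
}
```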
His code also introduced a new idea—the possibility of an emergency situation, so that a given printk() invocation could bypass the entire buffer and write its message to the console immediately. Thus, hopefully, even the shortest of final breaths could be used to reveal the villain's identity.
Sergey Senozhatsky had an existential question: if the new printk() was going to be preemptible in order to tolerate execution in any context, then what would stop a crash from interrupting printk() in order to die?
John offered a technical explanation, which seemed to indicate that "panic() can write immediately to the guaranteed NMI-safe write_atomic console without having to first do anything with other CPUs (IPIs, NMIs, waiting, whatever) and without ignoring locks."
Specifically, John felt that his introduction of emergency printk() messages would handle the problem of messages failing to get out in time. And as he put it, "As long as all critical messages are printed directly and immediately to an emergency console, why is it a problem if the informational messages to consoles are sometimes delayed or lost?"
At some point, it came out that although John's reimplementation was intended to improve printk() in general, he said, "Really the big design change I make with my printk-kthread is that it is only for non-critical messages. For anything critical, users should rely on an emergency console."
The conversation did not go on very long, but it does seem as though John's new printk() implementation may end up being controversial. It eliminates some of the delays associated with the existing implementation, but only by relegating those delays to messages it regards as less important. I would guess it'll turn out to be hard to tell which messages are really more important than others.
Persistent memory is still sort of a specialty item in Linux—RAM that retains its state across boots. Dave Hansen recently remarked that it was a sorry state of affairs that user applications couldn't simply use persistent memory by default. They had to be specially coded to recognize and take advantage of it. Dave wanted the system to treat persistent memory as just regular old memory.
His solution was to write a new driver that would act as a conduit between the kernel and any available persistent memory devices, managing them like any other RAM chip on the system.
Jeff Moyer was skeptical. He pointed out that in 2018, Intel had announced a memory mode for its Optane non-volatile memory. Memory mode would allow the system to access persistent memory as regular memory, apparently exactly what Dave was talking about.
But Keith Busch pointed out that Optane's memory mode was architecture-specific, tied to Intel's Optane hardware, while Dave's code was generic, for any devices containing persistent memory.
Jeff accepted the correction, but he still pointed out that persistent memory was necessarily slower than regular RAM. If the goal of Dave's patch was to make persistent memory available to user code without modifying that code, then how would the kernel decide whether to give fast RAM or slow persistent memory to the user software? That, he said, seemed to be a crucial question.
Keith replied that faster RAM would generally be given preference over the slower persistent memory. The goal was to have the slower memory available if needed.
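Under Dave's scheme, the hotplugged persistent memory would appear to userspace as just another, slower NUMA node, so ordinary allocations keep landing on fast DRAM while applications can opt in to the slow tier explicitly. Here's a minimal userspace sketch using the standard mbind() NUMA API; the node number is a hypothetical, assuming a system where the PMEM range shows up as node 1.

```c
/*
 * Minimal sketch: bind a large, latency-tolerant buffer to a
 * hypothetical PMEM NUMA node. Build with -lnuma for mbind().
 */
#include <numaif.h>      /* mbind(), MPOL_BIND */
#include <sys/mman.h>
#include <stdio.h>

#define PMEM_NODE 1      /* hypothetical node number for the PMEM tier */

int main(void)
{
	size_t len = 1UL << 30;                    /* 1GB buffer */
	unsigned long nodemask = 1UL << PMEM_NODE;
	void *buf;

	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return 1;

	/* Ask the kernel to place these pages on the slow PMEM node. */
	if (mbind(buf, len, MPOL_BIND, &nodemask, 8 * sizeof(nodemask), 0))
		perror("mbind");

	return 0;
}
```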
Dave also remarked that Intel's memory mode was wonderful! He had no criticism of it, and he said there were plenty of advantages to using memory mode instead of his patches. But he also felt that the two approaches were essentially complementary, and they could be used side by side on systems that supported memory mode.
He also added:
Here are a few reasons you might want this instead of memory mode:
1. Memory mode is all-or-nothing. Either 100% of your persistent memory is used for memory mode, or nothing is. With this set, you can (theoretically) have very granular (128MB) assignment of PMEM to either volatile or persistent uses. We have a few practical matters to fix to get us down to that 128MB value, but we can get there.
2. The capacity of memory mode is the size of your persistent memory. DRAM capacity is "lost" because it is used for cache. With this, you get PMEM+DRAM capacity for memory.
3. DRAM acts as a cache with memory mode, and caches can lead to unpredictable latencies. Since memory mode is all-or-nothing, your entire memory space is exposed to these unpredictable latencies. This solution lets you guarantee DRAM latencies if you need them.
4. The new "tier" of memory is exposed to software. That means that you can build tiered applications or infrastructure. A cloud provider could sell cheaper VMs that use more PMEM and more expensive ones that use DRAM. That's impossible with memory mode.
The discussion petered out inconclusively, but something like this patch inevitably will go into the kernel. System resources are becoming very diverse these days, and hooking up a bunch of wonky hardware while expecting reasonable behavior is starting to look more and more realistic. It all seems to be leading toward a more open-sourcey take on the Internet of Things: a world where your phone and your laptop and your car and the chip in your head are all parts of a single general-purpose Linux system that hotplugs and unplugs elements based on availability in the moment, rather than the specific proprietary concepts of the companies selling the products.
Joel Fernandes submitted a module to export kernel headers through the /proc directory to make it easier for users to extend the kernel without necessarily having the source tree available. He said:
On Android and embedded systems, it is common to switch kernels but not have kernel headers available on the filesystem. Raw kernel headers also cannot be copied into the filesystem like they can be on other distros, due to licensing and other issues. There's no linux-headers package on Android. Further, once a different kernel is booted, any headers stored on the filesystem will no longer be useful. By storing the headers as a compressed archive within the kernel, we can avoid these issues that have been a hindrance for a long time.
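For a rough sense of the mechanism, here's a stripped-down sketch of a module exposing an embedded archive read-only through procfs. The blob here is a stand-in for the compressed header archive that kbuild would generate, and the classic file_operations-based proc API is assumed; Joel's actual patch is considerably more involved.

```c
/* Stripped-down sketch of the mechanism, not Joel's actual patch. */
#include <linux/module.h>
#include <linux/proc_fs.h>
#include <linux/fs.h>

/* Stand-in data; the real archive would be generated at build time. */
static const char kheaders_blob[] = "(compressed tar data)";

static ssize_t kheaders_read(struct file *file, char __user *buf,
			     size_t count, loff_t *ppos)
{
	return simple_read_from_buffer(buf, count, ppos, kheaders_blob,
				       sizeof(kheaders_blob));
}

static const struct file_operations kheaders_fops = {
	.owner = THIS_MODULE,
	.read  = kheaders_read,
};

static int __init kheaders_init(void)
{
	if (!proc_create("kheaders.tar.xz", 0444, NULL, &kheaders_fops))
		return -ENOMEM;
	return 0;
}

static void __exit kheaders_exit(void)
{
	remove_proc_entry("kheaders.tar.xz", NULL);
}

module_init(kheaders_init);
module_exit(kheaders_exit);
MODULE_LICENSE("GPL");
```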
Christoph Hellwig was unequivocal, saying, "This seems like a pretty horrible idea and waste of kernel memory. Just add support to kbuild to store a compressed archive in initramfs and unpack it in the right place."
But Greg Kroah-Hartman replied, "It's only a waste if you want it to be a waste—i.e., if you load the kernel module." And he pointed out that there was precedent for doing something like Joel's idea in the /proc/config.gz availability of the kernel configuration.
Meanwhile, Daniel Colascione was doing a little jig, saying that Joel's feature would make it much easier for him to play around with Berkeley Packet Filter. He suggested exporting the entire source tree, instead of just the kernel headers. But Joel said this would be too large to store in memory.
H. Peter Anvin, while affirming the value of exporting the kernel headers, had some concerns about the right way to go about it. In particular, he said, "I see literally *no* problem, social or technical, you are solving by actually making it a kernel ELF object."
Instead, H. Peter thought the whole project could be reduced to a simple mountable filesystem containing the header files.
There was a bit of a technical back and forth before the discussion petered out. It's clear that something along the lines of Joel's idea would be useful to various people, although the exact scope and implementation seem to be completely up in the air.
I'm very happy to celebrate Linux Journal's 25th anniversary. 1994 was a great year for Linux, with friends trading Slackware disks, developers experimenting with windowing systems and the new Mosaic graphics-based web browser, and everyone speculating on what Microsoft might do to try to bring the whole thing down. I had recently bought a book called UNIX System V, for lack of any Linux-specific books on the market, and I remember debating with myself over which tool to learn: perl or awk.
Amid all of that, an actual print magazine seemed to come out of nowhere that was all about Linux—filled with advice, analysis and even an interview with Linus Torvalds. My eyes were very big as I went over it page by page. It was like discovering someone who loved Linux and open source the way I did. Someone who had a lot to say and didn't mind if anyone listened.
A few years later, I was one of the people writing articles for Linux Journal, and I've been very proud to help out and contribute ever since. I always remembered how I felt opening that first issue, way back when.
So, happy anniversary, Linux Journal!
Note: if you're mentioned in this article and want to send a response, please send a message with your response text to ljeditor@linuxjournal.com and we'll run it in the next Letters section and post it on the website as an addendum to the original article.