diff -u

What's New in Kernel Development by Zack Brown

Chasing Archives

Kernel development is truly impossible to keep track of. The main mailing list alone is vast beyond belief. Then there are all the side lists and IRC channels, not to mention all the corporate mailing lists dedicated to kernel development that never see the light of day. In some ways, kernel development has become fundamentally mysterious.

Once in a while, some lunatic decides to try to reach back into the past and study as much of the corpus of kernel discussion as he or she can find. One such person is Joey Pabalinas, who recently wanted to gather everything together in Maildir format, so he could do searches, calculate statistics, generate pseudo-hacker AI bots and whatnot.

He couldn't find any existing giant corpus, so he tried to create his own by piecing together mail archived on various sites. It turned out to be more than a million separate files, which was too much to host on either GitHub or GitLab. He asked the Linux kernel mailing list for suggestions on better hosting options, although he acknowledged, "It's possible I'm the only weirdo who finds this kind of thing useful, but I figured I should share it just in case I'm not."

Joe Perches suggested plumbing the archives at kernel.org/lore.html, which go back decades. But Joey said he'd tried that, and he'd found it all but impossible to convert those archives to the Maildir format he wanted. Instead, he'd spent the previous several weeks scraping the lkml.org archive and scripting his own conversion routines.

Konstantin Ryabitsev remarked:

The maildir format is kind of terrible for LKML, because having millions of messages in a single directory is very hard on the underlying FS. If you break it up into multiple folders, then it becomes difficult to search. This is the main reason why we have chosen to go with the public-inbox format, which solves both of these problems and allows for a very efficient archive updating and replication using git.

Meanwhile, Jasper Spaans raised his eyebrows at Joey's statement that he'd gotten more than a million separate files by scraping lkml.org. Jasper said:

First of all, there are more than 3M messages stored in the lkml.org database, so I guess you've missed some messages or something is really broken. Besides, unless you figured out how to get to the raw data, you've just scraped a rendering which discards stuff like pgp signatures etc and has very incomplete headers. Unless you don't care for those of course.

Jasper added that he'd also been working on extracting Maildir-type data out of the lore website, and he sent Joey the code he'd been using to do that.

Eric Wong also sent Joey a script he'd been using to convert slrn threaded Usenet repositories to Maildir, although, like others, he recommended against putting millions (and millions) of files into a single directory.

The discussion wasn't headed anywhere; it was just various people sharing knowledge and making judgment calls.

Once upon a time, and a very long time ago it was, I wanted to get a hold of the earliest archives of Linux kernel development discussions. I asked everyone where I could find them, and one of the developers replied that he had a lot of that stuff mixed up in his mail archives, along with all manner of other email messages. I wrote back and eagerly told him I'd love to get my hands on it. He wrote back again, explaining that there was just no way he could take the time to extract the private stuff from the public stuff. And, that was the end of that. I've always wondered why he responded to my initial email in the first place, if he was just going to say no at the end. And, that's the tale of how I came this close to writing up summaries of the very earliest Linux developments.

Considering Fresh C Extensions

Matthew Wilcox recently realized there might be value in depending on C extensions provided by the Plan 9 variant of the C programming language. All it would require is using the -fplan9-extensions command-line argument when compiling the kernel. As Matthew pointed out, Plan 9 extensions have been supported in GCC since version 4.6, which is the minimum version supported by the kernel. So theoretically, there would be no conflict.

Nick Desaulniers felt that adding -f compiler flags to any project would always need careful consideration. Depending on what the extensions are needed for, they could be either helpful or downright dangerous.

In the current case, Matthew wanted to use the Plan 9 extensions to shave precious bytes off a cyclic memory allocation that needed to store a reference to the "next" value. Using the extensions, Matthew said, he could embed the "next" value without breaking various existing function calls.
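
Neither Matthew's code nor the full thread appears here, but a minimal, hypothetical sketch shows the kind of thing -fplan9-extensions permits. The type names and the list_add() helper below are invented for illustration:

    struct list_head {
        struct list_head *next, *prev;
    };

    struct cache_entry {
        struct list_head;       /* unnamed field by tag: the "ms" part */
        unsigned long data;
    };

    void list_add(struct list_head *entry, struct list_head *head);

    void cache_insert(struct cache_entry *ce, struct list_head *head)
    {
        /* With gcc -fplan9-extensions, ce is silently converted to a
           pointer to its unnamed struct list_head field, so existing
           list functions accept it with no extra member name or cast. */
        list_add(ce, head);
    }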

Nick also suggested making any such extension dependencies optional, so that other compilers would continue to be able to compile the kernel.

It looked as though there would be some back and forth on the right way to proceed, but Linus Torvalds immediately jumped in to veto the entire concept, saying:

Please don't.

The subset of the plan9 extensions that are called just "ms" extensions is fine. That's a reasonable thing, and is a very natural expansion of the unnamed structures we already have—namely being able to pre-declare that unnamed structure/union.

But the full plan9 extensions are nasty, and makes it much too easy to write "convenient" code that is really hard to read as an outsider because of how the types are silently converted.

And I think what you want is explicitly that silent conversion.

So no. Don't do it. Use a macro or an inline function that makes the conversion explicit so that it's shown when grepping.

The "one extra argument" is not a strong argument for something that simply isn't that common. The upsides of a nonstandard feature like that needs to be pretty compelling.

We've used various gcc extensions since day #1 ("inline" being perhaps the biggest one that took _forever_ to become standard C), but these things need to have very strong arguments.

"One extra argument" that could be hidden by a macro or a helper inline is simply not a strong argument.

Nick was sympathetic to this point, and said:

Implicit conversions are the most pointed to "defect" in languages like JavaScript. I understand why they're convenient, but the resulting surprising bugs don't outweigh their cost (again, my opinion), and developers frequently can't keep these implicit conversion rules straight (whether we're talking about JavaScript or C).

There was no real conversation after Linus' denunciation. But the question of which compiler extensions to use, and which to avoid, is very interesting, especially for a project like Linux that's used everywhere and runs everything. Linus' point that the full Plan 9 extensions invite hard-to-read code is well taken, but it's also relevant that other compilers and other build environments need to remain able to build the kernel.

Handling Complex Memory Situations

Jérôme Glisse felt that the time had come for the Linux kernel to seriously address the issue of having many different types of memory installed on a single running system. There was main system memory and device-specific memory, along with hierarchies governing which memory to use at which time and under which circumstances. This complicated new situation, Jérôme said, was actually now the norm, and it should be treated as such.

The physical connections between the various CPUs and devices and RAM chips—that is, the bus topology—also was relevant, because it could influence the various speeds of each of those components.

Jérôme wanted to be clear that his proposal went beyond existing efforts to handle heterogeneous RAM. He wanted to take account of the wide range of hardware and its topological relationships, to eke out the absolute highest performance from a given system. He said:

One of the reasons for radical change is the advance of accelerator like GPU or FPGA means that CPU is no longer the only piece where computation happens. It is becoming more and more common for an application to use a mix and match of different accelerator to perform its computation. So we can no longer satisfy our self with a CPU centric and flat view of a system like NUMA and NUMA distance.

He posted some patches to accomplish several different things. First, he wanted to expose the bus topology and memory variety to userspace as a clear API, so that both the kernel and user applications could make the best possible use of the particular hardware configuration on a given system. Part of this, he said, would have to account for the fact that not all memory on the system would always be equally available to all devices, CPUs or users.

To accomplish all this, his patches first identified four basic elements that could be used to construct an arbitrarily complex graph of CPU, memory and bus topology on a given system.

These included "targets", which were any sort of memory; "initiators", which were CPUs or any other device that might access memory; "links", which were any sort of bus-type connection between a target and an initiator; and "bridges", which could connect groups of initiators to remote targets.

Aspects like bandwidth and latency would be associated with their relevant links and bridges. And, the whole graph of the system would be exposed to userspace via files in the SysFS hierarchy.
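
Jérôme's actual patches define their own structures, so purely as an illustration of how those four elements form a graph, a sketch with hypothetical types (borrowing the HMS name that comes up later in the discussion) might look like this:

    /* Hypothetical types, not the ones from Jérôme's patches. */

    struct hms_target {                  /* any sort of memory */
        unsigned long size;
    };

    struct hms_initiator {               /* CPU, GPU, FPGA or other accessor */
        int id;
    };

    struct hms_link {                    /* bus connection: initiator to target */
        struct hms_initiator *initiator;
        struct hms_target *target;
        unsigned long bandwidth;         /* attributes hang off links/bridges */
        unsigned long latency;
    };

    struct hms_bridge {                  /* joins initiators to remote targets */
        struct hms_initiator **initiators;
        unsigned int nr_initiators;
        struct hms_target *target;
        unsigned long bandwidth;
        unsigned long latency;
    };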

In addition, Jérôme's patches provided a way to express memory policy. A system's memory policy is the mechanism it uses to decide which memory to use for a given task. For example, it might use faster memory first and fall back to slower memory only when the fast memory is full. But the kernel's current memory policy was organized on a per-CPU basis, which Jérôme felt was not good enough. At the same time, he acknowledged that changing that aspect of kernel infrastructure directly might break a lot of existing code. To deal with this, his patch added an entirely new memory policy API that new user code could take advantage of and old user code could simply ignore.
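
For contrast, the existing interface is the NUMA policy API: set_mempolicy(2) and mbind(2) speak entirely in bitmasks of NUMA nodes, with no vocabulary for memory that isn't a node. A minimal sketch, assuming the libnuma <numaif.h> wrapper:

    #include <numaif.h>         /* declares set_mempolicy(2) */

    /* Restrict all future allocations of the calling thread to NUMA
       node 0. Note there is no way here to name, say, non-coherent
       device memory, which is part of Jérôme's complaint. */
    static long bind_to_node_zero(void)
    {
        unsigned long nodemask = 1UL << 0;      /* one bit per node */

        return set_mempolicy(MPOL_BIND, &nodemask,
                             sizeof(nodemask) * 8);
    }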

Aneesh Kumar responded to all of this, in particular praising Jérôme's approach of keeping the new API separate from the old. But Aneesh said, "that is also the drawback isn't it? We now have multiple entities tracking cpu and memory."

Aneesh also wanted to confirm that "once we have these different types of targets, ideally the system should be able to place them in the ideal location based on the affinity of the access. ie. we should automatically place the memory such that initiator can access the target optimally."

Jérôme seemed to agree with this in principle, but he also seemed to feel that automatic placement was by no means guaranteed yet. The first step, he felt, was to expose the APIs and data structures, and then see what could be accomplished.

Meanwhile, Dave Hansen pointed out that there were existing elements of the kernel that dealt with heterogeneous memory. Dave said that HMAT (Heterogeneous Memory Attribute Table) existed in firmware specifically to express the topology to the kernel. Dave also said that NUMA (Non-Uniform Memory Access) was already part of the kernel, and wasn't lying fallow. Additionally, he pointed out that the ACPI (Advanced Configuration and Power Interface) specification had officially embraced NUMA, and there were Linux developers actively contributing patches to support this.

So, Dave was not immediately enthusiastic about abandoning that ongoing momentum in order to accept Jérôme's radical solution, which went in an entirely new direction.

But, Jérôme replied that he was not trying to overthrow the existing work or any kernel patches that made use of HMAT. He said that all of that was still useful just on its own. But he added:

I do not see how to evolve NUMA to support device memory. [...] I can not expose device memory as NUMA node as device memory is not cache coherent on AMD and Intel platform today. [...] In some case that memory is not visible at all by the CPU which is not something you can express in the current NUMA node.

Somewhat mollified, Dave replied:

Yeah, our NUMA mechanisms are for managing memory that the kernel itself manages in the "normal" allocator and supports a full feature set on. That has a bunch of implications, like that the memory is cache coherent and accessible from everywhere.

The HMAT patches only comprehend this "normal" memory, which is why we're extending the existing /sys/devices/system/node infrastructure.

This series has a much more aggressive goal, which is comprehending the connections of every memory-target to every memory-initiator, no matter who is managing the memory, who can access it, or what it can be used for.

Theoretically, HMS could be used for everything that we're doing with /sys/devices/system/node, as long as it's tied back into the existing NUMA infrastructure somehow.

Jérôme agreed with all of the above, and Dave seemed to be on board with Jérôme's approach. But, he did have some practical objections. For one thing, he said:

We support 1024 NUMA nodes on x86. The ACPI HMAT expresses the connections between each node. Let's suppose that each node has some CPUs and some memory.

That means we'll have 1024 target directories in sysfs, 1024 initiator directories in sysfs, and 1024*1024 link directories. Or, would the kernel be responsible for "compiling" the firmware-provided information down into a more manageable number of links?

Some idiot made the mistake of having one sysfs directory per 128MB of memory way back when, and now we have hundreds of thousands of /sys/devices/system/memory/memoryX directories. That sucks to manage. Isn't this potentially repeating that mistake?

Dave was also worried that if Jérôme went forward with his patches, it could be four or five years before all the problems were solved, in which case some portions of memory management would be bottlenecked waiting on problems that could have been solved sooner using existing NUMA projects.

Additionally, Dave was curious how Jérôme's code would scale. He said, "It's quite easy to represent a laptop, but can this scale to the largest systems that we expect to encounter over the next 20 years that this ABI will live?"

At this point, Jérôme and Dave were joined by several other folks and began diving into the technical details, further objections and possible solutions that might come out of Jérôme's work. Ultimately, it seemed as if these patches did not represent a threat to existing approaches to memory, and that Jérôme would have support—or at least tolerance—from NUMA-related projects.

The things I love about this discussion are, first of all, that one developer got a big idea that seemed to go against current thinking, but that solved problems he saw as real. Second, that a developer on the other side of the issue was actually interested in the new approach and willing to take it seriously rather than tear it down.

Also, the whole direction of hardware resources is really becoming so strange. The kernel tries to eke out absolutely everything it can from the various devices on the system—even to the point of going beyond the ways in which those devices thought they would be used! And then once the kernel starts using them that way, other devices come out that use the kernel's new infrastructure. And so we end up with some kind of crazy situation requiring crazy solutions like what Jérôme has proposed.

Note: if you're mentioned in this article and want to send a response, please send a message with your response text to ljeditor@linuxjournal.com and we'll run it in the next Letters section and post it on the website as an addendum to the original article.

About the Author

Zack Brown is a tech journalist at Linux Journal and Linux Magazine, and is a former author of the "Kernel Traffic" weekly newsletter and the "Learn Plover" stenographic typing tutorials. He first installed Slackware Linux in 1993 on his 386 with 8 megs of RAM and had his mind permanently blown by the Open Source community. He is the inventor of the Crumble pure strategy board game, which you can make yourself with a few pieces of cardboard. He also enjoys writing fiction, attempting animation, reforming Labanotation, designing and sewing his own clothes, learning French and spending time with friends'n'family.
