diff -u

What's New in Kernel Development

This month Zack looks at Linus' posting habits, speeding up the un-speed-up-able and Intel design flaw fallout. By Zack Brown

Linus' Posting Habits

Linus Torvalds sometimes is criticized for bombastically cursing out kernel developers. He does do this, but it's not his default behavior, and I think the real nature of when and how he posts to the mailing list is interesting. For example, he stayed out of the whole discussion of how to replace the BitKeeper revision control system for a long time, letting various projects guess frustratingly at his desires, before he finally took a break from Linux development to design and implement git.

In other cases, he's allowed developers to lambaste each other savagely for days or longer over key elements of the kernel or the behaviors of new hardware peripherals, only to enter the debate later on, generally to propose a simpler solution that neither camp had thought of.

Sometimes he'll enter a discussion for apparently no other reason than that a particular bug or piece of code interests him, and he works with whoever posted a given patch to get the kinks out or track down underlying problems.

In general, Linus tends to stay out of most discussions, coming in primarily only on controversial issues after he's developed a position of his own.

But yes, sometimes he comes down pretty hard on developers, generally saying they already should know better than to do a particular thing, or when he feels the developer is hawking a particular corporate goal that goes against the spirit of open-source development—although I doubt he'd put it that way himself. He doesn't seem to like making overtly evangelical statements and tends to stick to the "if it works, I like it" attitude that he took when adopting BitKeeper.

Occasionally, Linus gets a technical issue wrong—he'll claim something about the kernel as being true that isn't. An example of that happened recently, when Amir Goldstein posted a patch to alter a string hashing function. It was a small patch that preserved some dropped bits in the 64-bit case and left the 32-bit case virtually unchanged. Amir asked for Linus' advice because he couldn't find a maintainer for the file, and Linus had been one of the people most recently to change the code.

Linus didn't agree that Amir's code left the 32-bit case unchanged. And he singled out a call to the __hash_32() function as being particularly time-consuming. He rejected Amir's patch, saying, "the patch as-is doesn't seem to buy anything, and only adds cost."

But when Amir pointed out that his patch hadn't actually added the call to __hash_32(), that the call had been there already, Linus took another look and replied, "Oh, duh. My bad. It was indeed there already. Let me go back and look at the history of this thing."

He later replied to his own post, saying:

After having looked more at it, I take back all my complaints about the patch, you were right and I was mis-reading things or just being stupid.

I also don't worry too much about the possible performance impact of this on 64-bit, since most architectures that actually care about performance end up not using this very much (the dcache code is the most performance-critical, but the word-at-a-time case uses its own hashing anyway).

So this ends up being mostly used for filesystems that do their own degraded hashing (usually because they want a case-insensitive comparison function).

A _tiny_ worry remains, in that not everybody uses DCACHE_WORD_ACCESS, and then this potentially makes things more expensive on 64-bit architectures with slow or lacking multipliers even for the normal case.

That said, realistically the only such architecture I can think of is PA-RISC. Nobody really cares about performance on that, it's more of a 'look ma, I've got warts^W an odd machine' platform.

So I think your patch is fine, and all my initial worries were just misplaced from not looking at this properly.

Sorry.

To me the interesting details about this kind of post are that Linus never seems to drag out an admission of error. He won't cling to a mistaken idea out of embarrassment; he also avoids making excuses for why he got the thing wrong or dismissing the opposing side as unimportant. He also seems to put in some real work in developing a proper understanding of the situation after seeing that he made a mistake and moves directly into dealing with whatever the new technical details might be.

I personally like all that. And for me, it adds context to the times when he gets angry at developers. It seems to happen mostly—or only—when he feels that developers should have seen and acknowledged their own error by a given point.

Speeding Up the Un-Speed-Up-able

Sometimes kernel developers can work in parallel for years without realizing it. It's one of the inefficiencies of a distributed system that tends to work out as a benefit when you have enough contributors to be able to spare the extra labor—it's sort of a "with enough eyeballs, all bugs are shallow" kind of thing.

This time Kirill A. Shutemov wanted to scratch a particular itch in the memory encryption code. Memory encryption occurs at such a heavily used place in the kernel that a difference of just a couple opcodes can make a noticeable speed difference to a running system. Anything short of introducing security holes would be worth considering, in order to shave off those few Assembly instructions.

But Kirill had a problem—his code made things worse. He wanted to convert the __PHYSICAL_MASK variable into what he called a "patchable constant"—a value that had the simplicity of a constant at runtime, but that could be set dynamically by configuration options or at the bootloader command line before booting the system.

So far, he had come up with a patch that did this—and that would be applicable to other variables that might be faster to implement as constants. Unfortunately, as implemented, although it achieved the feature he wanted, it caused GCC to produce code that was less efficient than what it had produced before. Instead of speeding things up, this resulted in a 0.2% slowdown on his test system.

He posted his patch, along with a request for ideas, to the linux-kernel mailing list.

Linus Torvalds replied with an analysis of the opcodes produced by GCC, showing why they were slower and why Kirill's general approach always would suffer from the slowdown issue. In particular, Linus pointed out that Kirill moved constant values into registers, which never would be optimal, and his code also included a movabsq instruction, which he felt was rarely needed.

Linus said he actually really wanted this feature in spite of the complex overhead required to accomplish it; in fact, he preferred an even more complex approach, along the lines of something H. Peter Anvin had attempted about 18 months earlier. Of that approach, he said:

It's rather more complex, but it actually gives a much bigger win. The code itself will be much better, and smaller. The *infrastructure* for the code gets pretty hairy, though.

Peter replied that he'd actually been working on this very project for some time from a secret, undisclosed location at the bottom of a well in a town with no name. In fact, his current approach was yet more ambitious (and complex) than what he'd tried 18 months earlier. On the other hand, as he put it, he was now about 85% done with the whole thing, and the code "mostly needs cleanup, testing, and breaking into reasonable chunks." He added:

The main reason I haven't submitted it yet is that I got a bit overly ambitious and wanted to implement a whole bunch of more complex subcases, such as 64-bit shifts on a 32-bit kernel. The win in that case is actually quite huge, but it requires data-dependent code patching and not just immediate patching, which requires augmentation of the alternatives framework.

So it looks like Kirill's itch is about to be scratched...by Peter. An element of the kernel so deep and tangled that it seemed intended never to be touched is being tugged apart beyond its apparent natural limits. And, all of this is achieved by the mere addition of a big honking pile of insanely complex code that even seasoned developers balked at touching.

Intel Design Flaw Fallout

For weeks, the world's been talking about severe Intel design flaws affecting many CPUs and forcing operating systems to look for sometimes costly workarounds.

Linux patches for these issues are in a state of ongoing development. Security is always the first priority, at the expense of any other feature. Next would probably be the general speed of a running system for the average user. After that, the developers might begin piecing together any features that had been pulled as part of the initial security fix.

But while this effort goes on, the kernel developers seem fairly angry at Intel, especially when they feel that Intel is not doing enough to fix the problems in future processors.

In response to one set of patches, for example, Linus Torvalds burst out with, "All of this is pure garbage. Is Intel really planning on making this shit architectural? Has anybody talked to them and told them they are f*cking insane?" He went on, "the IBRS garbage implies that Intel is _not_ planning on doing the right thing for the indirect branch speculation. Honestly, that's completely unacceptable." And then he said:

The whole IBRS_ALL feature to me very clearly says "Intel is not serious about this, we'll have an ugly hack that will be so expensive that we don't want to enable it by default, because that would look bad in benchmarks". So instead they try to push the garbage down to us. And they are doing it entirely wrong.

He went on, even more disturbingly, to say:

The patches do things like add the garbage MSR writes to the kernel entry/exit points. That's insane. That says "we're trying to protect the kernel". We already have retpoline there, with less overhead.

So somebody isn't telling the truth here. Somebody is pushing complete garbage for unclear reasons. Sorry for having to point that out....As it is, the patches are COMPLETE AND UTTER GARBAGE....WHAT THE F*CK IS GOING ON?

At one point, David Woodhouse offered a helpful technical summary of the whole situation for those of us on the edge of our seats:

This is all about Spectre variant 2, where the CPU can be tricked into mispredicting the target of an indirect branch. And I'm specifically looking at what we can do on *current* hardware, where we're limited to the hacks they can manage to add in the microcode.

The new microcode from Intel and AMD adds three new features.

One new feature (IBPB) is a complete barrier for branch prediction. After frobbing this, no branch targets learned earlier are going to be used. It's kind of expensive (order of magnitude ~4000 cycles).

The second (STIBP) protects a hyperthread sibling from following branch predictions which were learned on another sibling. You *might* want this when running unrelated processes in userspace, for example. Or different VM guests running on HT siblings.

The third feature (IBRS) is more complicated. It's designed to be set when you enter a more privileged execution mode (i.e. the kernel). It prevents branch targets learned in a less-privileged execution mode, BEFORE IT WAS MOST RECENTLY SET, from taking effect. But it's not just a "set-and-forget" feature, it also has barrier-like semantics and needs to be set on *each* entry into the kernel (from userspace or a VM guest). It's *also* expensive. And a vile hack, but for a while it was the only option we had.

Even with IBRS, the CPU cannot tell the difference between different userspace processes, and between different VM guests. So in addition to IBRS to protect the kernel, we need the full IBPB barrier on context switch and vmexit. And maybe STIBP while they're running.

Then along came Paul with the cunning plan of "oh, indirect branches can be exploited? Screw it, let's not have any of *those* then", which is retpoline. And it's a *lot* faster than frobbing IBRS on every entry into the kernel. It's a massive performance win.

So now we *mostly* don't need IBRS. We build with retpoline, use IBPB on context switches/vmexit (which is in the first part of this patch series before IBRS is added), and we're safe. We even refactored the patch series to put retpoline first.

But wait, why did I say "mostly"? Well, not everyone has a retpoline compiler yet...but OK, screw them; they need to update.

Then there's Skylake, and that generation of CPU cores. For complicated reasons they actually end up being vulnerable not just on indirect branches, but also on a "ret" in some circumstances (such as 16+ CALLs in a deep chain).

The IBRS solution, ugly though it is, did address that. Retpoline doesn't. There are patches being floated to detect and prevent deep stacks, and deal with some of the other special cases that bite on SKL, but those are icky too. And in fact IBRS performance isn't anywhere near as bad on this generation of CPUs as it is on earlier CPUs *anyway*, which makes it not quite so insane to *contemplate* using it as Intel proposed.

That's why my initial idea, as implemented in this RFC patchset, was to stick with IBRS on Skylake, and use retpoline everywhere else. I'll give you 'garbage patches', but they weren't being 'just mindlessly sent around'. If we're going to drop IBRS support and accept the caveats, then let's do it as a conscious decision having seen what it would look like, not just drop it quietly because poor Davey is too scared that Linus might shout at him again.

I have seen *hand-wavy* analyses of the Skylake thing that mean I'm not actually lying awake at night fretting about it, but nothing concrete that really says it's OK.

If you view retpoline as a performance optimisation, which is how it first arrived, then it's rather unconventional to say "well, it only opens a *little* bit of a security hole but it does go nice and fast so let's do it".

But fine, I'm content with ditching the use of IBRS to protect the kernel, and I'm not even surprised. There's a *reason* we put it last in the series, as both the most contentious and most dispensable part. I'd be *happier* with a coherent analysis showing Skylake is still OK, but hey-ho, screw Skylake.

The early part of the series adds the new feature bits and detects when it can turn KPTI off on non-Meltdown-vulnerable Intel CPUs, and also supports the IBPB barrier that we need to make retpoline complete. That much I think we definitely *do* want. There have been a bunch of us working on this behind the scenes; one of us will probably post that bit in the next day or so.

I think we also want to expose IBRS to VM guests, even if we don't use it ourselves. Because Windows guests (and RHEL guests; yay!) do use it.

If we can be done with the shouty part, I'd actually quite like to have a sensible discussion about when, if ever, we do IBPB on context switch (ptraceability and dumpable have both been suggested) and when, if ever, we set STIPB in userspace.

Most of the discussion on the mailing list focused on the technical issues surrounding finding actual solutions. But Linus was not alone in finding the situation unacceptable. A variety of developers, including David, were horribly offended, not by the design flaw itself, but by the way they perceived Intel to be handling the subsequent situation—the poor technical fixes, the lack of communication between Intel developers and the kernel community, and as Linus pointed out, the potential choice by Intel not to fix some of the problems at all.

About the Author

Zack Brown is a tech journalist at Linux Journal and Linux Magazine, and is a former author of the "Kernel Traffic" weekly newsletter and the "Learn Plover" stenographic typing tutorials. He first installed Slackware Linux in 1993 on his 386 with 8 megs of RAM and had his mind permanently blown by the Open Source community. He is the inventor of the Crumble pure strategy board game, which you can make yourself with a few pieces of cardboard. He also enjoys writing fiction, attempting animation, reforming Labanotation, designing and sewing his own clothes, learning French and spending time with friends'n'family.

Zack Brown