diff -u

What's New in Kernel Development by Zack Brown

Speeding Up Netfilter (by Avoiding Netfilter)

Imre Palik tried to speed up some of Linux's networking code but was met with stubborn opposition. Essentially, he wanted networking packets to bypass the netfilter code unless absolutely necessary. Netfilter, he said, was designed for flexibility at the expense of speed. According to his tests, bypassing it could speed up the system by as much as 15%.

Netfilter is a piece of infrastructure that gives users a tremendous amount of power and flexibility in processing and restricting networking traffic. Imre's idea was that if the user didn't want to filter network packets, the netfilter code shouldn't even be traversed. He therefore wanted to let users disable netfilter for any given firewall that didn't need it.

There was some initial interest and also some questions about how he'd calculated his 15% speed increase. Florian Westphal tried to reason out where the speedup might have come from. But David S. Miller put his foot down, saying that any speedup estimates were just guesses until they were properly analyzed via perf.

David absolutely refused to apply networking patches without a more reliable indication that they would improve the situation.

Imre explained his testing methods and asserted that they seemed sound to him. But Pablo Neira Ayuso felt that Imre's approach was too haphazard. He said there needed to be a more generic way to do that sort of testing.

David was completely unsatisfied by Imre's tests. Instead of trying to work around netfilter, even in cases where there were no actual filters configured, he said, the proper solution was to speed up netfilter so it wouldn't be necessary to bypass it. David said, "We need to find a clean and generic way to make the netfilter hooks as cheap as possible when netfilter rules are not in use."

David Woodhouse, on the other hand, felt that a 15% speedup was a 15% speedup, and we shouldn't look a gift horse in the mouth.

But David M stood firm. The netfilter hooks were the fundamental issue, he said, and "I definitely would rather see the fundamental issue addressed rather than poking at it randomly with knobs for this case and that."

David W and others started hunting around for ways to satisfy David M without actually recoding the netfilter hooks. David W suggested having the hooks disable themselves automatically if they detected that they wouldn't be useful.
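
The idea of making the hooks nearly free when nothing is registered can be pictured with a small user-space sketch in C. This is only an illustration under stated assumptions: every name here (hook_table, register_hook, run_hooks, drop_empty) is hypothetical and none of it is the kernel's actual netfilter API. It simply shows a dispatch routine that returns after a single comparison when no hooks are installed.

```c
#include <stddef.h>

/* Hypothetical verdicts, loosely modeled on accept/drop semantics. */
enum verdict { VERDICT_ACCEPT = 0, VERDICT_DROP = 1 };

struct packet {
    const unsigned char *data;
    size_t len;
};

typedef enum verdict (*hook_fn)(const struct packet *pkt);

#define MAX_HOOKS 8

/* Hypothetical hook table; count == 0 means there is nothing to run. */
struct hook_table {
    hook_fn hooks[MAX_HOOKS];
    size_t count;
};

/* Register a hook; returns 0 on success, -1 if the table is full. */
int register_hook(struct hook_table *t, hook_fn fn)
{
    if (t->count >= MAX_HOOKS)
        return -1;
    t->hooks[t->count++] = fn;
    return 0;
}

/* Run all hooks in order; the first DROP verdict wins. */
enum verdict run_hooks(const struct hook_table *t, const struct packet *pkt)
{
    /* Fast path: no hooks registered, so accept without walking anything. */
    if (t->count == 0)
        return VERDICT_ACCEPT;

    for (size_t i = 0; i < t->count; i++) {
        if (t->hooks[i](pkt) == VERDICT_DROP)
            return VERDICT_DROP;
    }
    return VERDICT_ACCEPT;
}

/* Example hook for illustration: drop packets with no payload. */
enum verdict drop_empty(const struct packet *pkt)
{
    return pkt->len == 0 ? VERDICT_DROP : VERDICT_ACCEPT;
}
```

With an empty table, run_hooks() accepts every packet without calling anything; once drop_empty is registered, zero-length packets are dropped. In a real networking stack even that one comparison sits on a very hot path, which is why the discussion centered on measuring the cost properly.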

Ultimately there was no conclusion to the thread, although it seems clear that for the moment, Imre's code is dead in the water. The problem with that is that 15% really is 15%, and speedups are good even if they're not perfect. It's conceivable that no one will come up with a good way to fix netfilter hooks, and that Imre's patch will receive better testing and more meaningful performance numbers. At that point, it's possible even David M would say okay.

Read-Only Memory

Igor Stoppa posted a patch to allow kernel memory pools to be made read-only. Memory pools are a standard way to group memory allocations in Linux so their time cost is more predictable. With Igor's patch, once a memory pool was made read-only, it could not be made read-write again. This would permanently protect the data against tampering, including by attackers. Of course, you could still free the memory and destroy the pool. But short of that, the data would stay read-only.
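
The article does not show the API of Igor's patch, so the following is only a user-space analogy in C, built on the POSIX mmap(), mprotect() and munmap() calls. The ro_pool type and its functions are hypothetical names invented for this sketch. It shows the same one-way lifecycle: allocate and fill while writable, seal once, and after that only read or destroy.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical write-once pool backed by an anonymous mapping. */
struct ro_pool {
    void  *base;    /* start of the mapping */
    size_t size;    /* mapping size, rounded up to whole pages */
    size_t used;    /* bytes handed out so far */
    int    sealed;  /* nonzero once the pool is read-only */
};

/* Create a writable pool of at least `size` bytes; 0 on success, -1 on error. */
int ro_pool_init(struct ro_pool *p, size_t size)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || size == 0)
        return -1;

    size_t pg = (size_t)page;
    size_t rounded = (size + pg - 1) / pg * pg;
    void *base = mmap(NULL, rounded, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return -1;

    p->base = base;
    p->size = rounded;
    p->used = 0;
    p->sealed = 0;
    return 0;
}

/* Bump-allocate `n` bytes (16-byte aligned); NULL if sealed or out of space. */
void *ro_pool_alloc(struct ro_pool *p, size_t n)
{
    size_t start = (p->used + 15) & ~(size_t)15;
    if (p->sealed || n == 0 || start > p->size || n > p->size - start)
        return NULL;
    p->used = start + n;
    return (char *)p->base + start;
}

/* Make the whole pool read-only. The sketch offers no way to undo this. */
int ro_pool_seal(struct ro_pool *p)
{
    if (mprotect(p->base, p->size, PROT_READ) != 0)
        return -1;
    p->sealed = 1;
    return 0;
}

/* Free the memory and destroy the pool; 0 on success. */
int ro_pool_destroy(struct ro_pool *p)
{
    return munmap(p->base, p->size);
}
```

After ro_pool_seal() succeeds, reading the data still works, but a store into the pool faults instead of modifying it. Unlike this analogy, where the process could simply call mprotect() again, the point of the kernel patch as described is that there is no path back to read-write.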

There was not much controversy about this patch. Kees Cook felt that XFS would work well with the feature, and having an actual user would help Igor clarify the usage and nail down the API.

This apparently had come up at a recent conference, and Dave Chinner was ready for Igor's patch. He remarked, "we have a fair amount of static data in XFS that we set up at mount time and it never gets modified after that. I'm not so worried about VFS level objects (that's a much more complex issue) but there is a lot of low hanging fruit in the XFS structures we could convert to write-once structures."

Igor said this was exactly the kind of thing he'd had in mind.

A bunch of folks started talking about terminology and use cases, and speculated on further abilities. No one had any negative comment, and everyone was excited to get going with it.

The thing about a patch like this is that people can use the feature or not. It helps them with security, or it costs them nothing. It adds an ability but adds no complexity to the code. Unless something weird happens, I'd expect this patch to go into the kernel as soon as the API stabilizes.

Working around Intel Hardware Flaws

Efforts to work around serious hardware flaws in Intel chips are ongoing. Nadav Amit posted a patch to improve compatibility mode with respect to Intel's Meltdown flaw. Compatibility mode is the x86-64 sub-mode in which a 64-bit kernel runs unmodified 32-bit programs, providing a runtime environment for older software that relies on the 32-bit instruction set. The thing to avoid is reopening the massive security holes created by hardware flaws while doing so.

In this case, Linux is already protected from Meltdown by use of PTI (page table isolation), a patch that went into Linux 4.15 and that was subsequently backported all over the place. However, like the BKL (big kernel lock) in the old days, PTI is a heavy-weight solution, with a big impact on system speed. Any chance to disable it without reintroducing security holes is a chance worth exploring.

Nadav's patch was an attempt to do this. The goal was "to disable PTI selectively as long as x86-32 processes are running and to enable global pages throughout this time."

One problem that Nadav acknowledged was that since so many developers were actively working on anti-Meltdown and anti-Spectre patches, there was plenty of opportunity for one patch to step all over what another was trying to do. As a result, he said, "the patches are marked as an RFC since they (specifically the last one) do not coexist with Dave Hansen's enabling of global pages, and might have conflicts with Joerg's work on 32-bit."

Andrew Cooper remarked, chillingly:

Being 32bit is itself sufficient protection against Meltdown (as long as there is nothing interesting of the kernel's mapped below the 4G boundary). However, a 32bit compatibility process may try to attack with Spectre/SP2 to redirect speculation back into userspace, at which point (if successful) the pipeline will be speculating in 64bit mode, and Meltdown is back on the table. SMEP will block this attack vector, irrespective of other SP2 defenses the kernel may employ, but a fully SP2-defended kernel doesn't require SMEP to be safe in this case.

And Dave, nearby, remarked, "regardless of Meltdown/Spectre, SMEP is valuable. It's valuable to everything, compatibility-mode or not."

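SMEP (Supervisor Mode Execution Prevention) is a hardware feature whereby the OS can set a bit in a control register on compatible CPUs to prevent the kernel from executing code that lives in userspace pages. While SMEP is active, code running in supervisor mode can execute only from pages marked as belonging to the kernel, so an attacker can't trick the kernel into jumping to instructions planted in userspace memory.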
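
On x86 Linux systems, CPUs that support SMEP advertise an smep entry on the flags line of /proc/cpuinfo. The following is a minimal C sketch for checking it; the function names flags_contain and cpu_has_smep are hypothetical, and the code assumes the flags line fits in the read buffer. It reports only whether the CPU advertises the feature, not whether the kernel has enabled it.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `name` appears as a whole whitespace-separated token in `line`. */
int flags_contain(const char *line, const char *name)
{
    size_t n = strlen(name);
    const char *p = line;

    if (n == 0)
        return 0;

    while ((p = strstr(p, name)) != NULL) {
        int starts = (p == line) || p[-1] == ' ' || p[-1] == '\t';
        int ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\t' || p[n] == '\n';
        if (starts && ends)
            return 1;
        p += 1;   /* keep searching after a partial match such as "nosmepx" */
    }
    return 0;
}

/*
 * Return 1 if the first "flags" line in /proc/cpuinfo lists smep,
 * 0 if it does not, and -1 if the file or the line cannot be read.
 */
int cpu_has_smep(void)
{
    char buf[8192];   /* assumed large enough for one flags line */
    FILE *f = fopen("/proc/cpuinfo", "r");
    int result = -1;

    if (f == NULL)
        return -1;

    while (fgets(buf, sizeof buf, f) != NULL) {
        if (strncmp(buf, "flags", 5) == 0) {
            result = flags_contain(buf, "smep");
            break;
        }
    }
    fclose(f);
    return result;
}
```

A caller would treat a return value of -1 as "unknown", which is what happens on non-x86 machines or in environments where /proc is not mounted.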

Andy Lutomirski said he didn't like Nadav's patch because it drew a distinction between "compatibility mode" tasks and "non-compatibility mode" tasks. Andy said no such distinction should be made, especially since it's not really clear how to make it, and because getting it wrong might expose significant security holes.

Andy felt that a better solution would be to enable and disable 32-bit mode and 64-bit mode explicitly as needed, rather than guessing at what might or might not be compatibility mode.

The drawback to this approach, Andy said, was that old software would need to be upgraded to take advantage of it, whereas with Nadav's approach, the judgment would be made automatically and would not require old code to be updated.

Linus Torvalds was not optimistic about any of these ideas. He said, "I just feel this all is a nightmare. I can see how you would want to think that compatibility mode doesn't need PTI, but at the same time it feels like a really risky move to do this." He added, "I'm not seeing how you keep user mode from going from compatibility mode to L mode with just a far jump."

In other words, the whole patch, and any alternative, may just simply be a bad idea.

Nadav replied that with his patch, he tried to cover every conceivable case where someone might try to break out of compatibility mode and to re-enable PTI protections if that were to happen. Though he did acknowledge, "There is one corner case I did not cover (LAR) and Andy felt this scheme is too complicated. Unfortunately, I don't have a better scheme in mind."

Linus remarked:

Sure, I can see it working, but it's some really shady stuff, and now the scheduler needs to save/restore/check one more subtle bit.

And if you get it wrong, things will happily work, except you've now defeated PTI. But you'll never notice, because you won't be testing for it, and the only people who will are the black hats.

This is exactly the "security depends on it being in sync" thing that makes me go "eww" about the whole model. Get one thing wrong, and you'll blow all the PTI code out of the water.

So now you tried to optimize one small case that most people won't use, but the downside is that you may make all our PTI work (and all the overhead for all the _normal_ cases) pointless.

And Andy also remarked, "There's also the fact that, if this stuff goes in, we'll be encouraging people to deploy 32-bit binaries. Then they'll buy Meltdown-fixed CPUs (or AMD CPUs!) and they may well continue running 32-bit binaries. Sigh. I'm not totally a fan of this."

The whole thread ended inconclusively, with Nadav unsure whether folks wanted a new version of his patch.

The bottom line seems to be that Linux has currently protected itself from Intel's hardware flaws, but at a cost of perhaps 5% to 30% efficiency (the real numbers depend on how you use your system). And although it will be complex and painful, there is a very strong incentive to improve efficiency by adding subtler and more complicated workarounds that avoid the heavy-handed approach of the PTI patch. Ultimately, Linux will certainly develop a smooth, near-optimal approach to Meltdown and Spectre, and probably do away with PTI entirely, just as it did away with the BKL in the past. Until then, we're in for some very ugly and controversial patches.

Cleaning Up the VFS

Dongsu Park posted a patch in collaboration with Eric W. Biederman, and originally inspired by Seth Forshee, to make an odd adjustment to the filesystem code. Specifically, they wanted any user with the CAP_CHOWN capability over a filesystem's superblock to be able to chown (change the owner of) files within that filesystem.

Apparently, this would become an issue only when running a virtual system (that is, a container) on top of a running Linux system, and only if the underlying filesystem had files with user IDs or group IDs that didn't map to anything in the current user namespace within the container. Before writing such files to disk, you'd have to run chown on them to assign an owner that does map. Otherwise, writing them back without a valid uid or gid mapping would corrupt those fields in the filesystem.

A couple of technical comments were made about the patch, but Miklos Szeredi expressed confusion about why the problem solved by the patch might ever be triggered. If you can't chown the file to be owned by the user doing the writing, he remarked, how can you write the file in order to produce the corruption? To which Eric replied that the patch wasn't actually intended to be a fix for any real problem. No one was in danger of hitting this particular problem.

The patch, he explained, was part of a larger strategy of shoring up the virtual file system (VFS) and making sure it handled all generic cases correctly—whether or not those cases could occur in real life. The goal was to draw a clear distinction between problems showing up in real-world filesystems and problems showing up at the lower VFS level. This way, when bug reports came in, it would be more straightforward to associate them with particular filesystems, rather than trying to debug them in the VFS.

He said, "In this case the generic concern is what happens when the uid is read from the filesystem and it gets mapped to INVALID_UID and then the inode for that file is written back. That is a trap for the unwary filesystem implementation and not a case that I think anyone will actually care about."
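
The trap Eric describes can be pictured with a simplified user-space sketch in C. It is not kernel code: the id_extent table, the sketch_inode type and all function names are hypothetical, and the only detail taken from the kernel is that an unmappable ID is represented by an all-ones sentinel. An on-disk uid with no mapping reads back as the invalid value, and a careless write-back would store that value on disk. The careful version below refuses to write until the owner has been changed to something that maps.

```c
#include <stddef.h>
#include <stdint.h>

/* Sentinel for "no mapping", mirroring the all-ones value used for invalid IDs. */
#define SKETCH_INVALID_UID ((uint32_t)-1)

/* Hypothetical mapping extent: disk IDs [disk_first, disk_first + count). */
struct id_extent {
    uint32_t disk_first;  /* first ID as stored on disk */
    uint32_t mem_first;   /* first ID as seen in memory */
    uint32_t count;       /* number of consecutive IDs covered */
};

/* Translate an on-disk uid to its in-memory value, or the invalid sentinel. */
uint32_t map_to_mem(const struct id_extent *map, size_t n, uint32_t disk_uid)
{
    for (size_t i = 0; i < n; i++) {
        if (disk_uid >= map[i].disk_first &&
            disk_uid - map[i].disk_first < map[i].count)
            return map[i].mem_first + (disk_uid - map[i].disk_first);
    }
    return SKETCH_INVALID_UID;
}

/* Translate an in-memory uid back to its on-disk value, or the sentinel. */
uint32_t map_to_disk(const struct id_extent *map, size_t n, uint32_t mem_uid)
{
    for (size_t i = 0; i < n; i++) {
        if (mem_uid >= map[i].mem_first &&
            mem_uid - map[i].mem_first < map[i].count)
            return map[i].disk_first + (mem_uid - map[i].mem_first);
    }
    return SKETCH_INVALID_UID;
}

struct sketch_inode {
    uint32_t disk_uid;  /* owner as stored in the filesystem */
    uint32_t mem_uid;   /* owner after mapping, possibly invalid */
};

/* Read path: an unmapped owner becomes SKETCH_INVALID_UID in memory. */
void inode_read(struct sketch_inode *ino, const struct id_extent *map, size_t n)
{
    ino->mem_uid = map_to_mem(map, n, ino->disk_uid);
}

/*
 * Careful write-back: refuse (return -1) rather than store an invalid owner.
 * A naive version would write the sentinel over the original disk_uid.
 */
int inode_write_back(struct sketch_inode *ino, const struct id_extent *map, size_t n)
{
    uint32_t disk_uid = map_to_disk(map, n, ino->mem_uid);
    if (disk_uid == SKETCH_INVALID_UID)
        return -1;
    ino->disk_uid = disk_uid;
    return 0;
}
```

In this sketch, assigning a valid mem_uid before calling inode_write_back() plays the role of the chown step: once the owner maps, the write succeeds and the on-disk field stays meaningful.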

So essentially, it was not even a housekeeping patch, but instead a patch to make housekeeping itself easier.

Note: if you're mentioned in this article and have a response, please send the text to ljeditor@linuxjournal.com, and we'll run it in the next Letters section and post it on the website as an addendum to the original article.

About the Author

Zack Brown is a tech journalist at Linux Journal and Linux Magazine, and is a former author of the "Kernel Traffic" weekly newsletter and the "Learn Plover" stenographic typing tutorials. He first installed Slackware Linux in 1993 on his 386 with 8 megs of RAM and had his mind permanently blown by the Open Source community. He is the inventor of the Crumble pure strategy board game, which you can make yourself with a few pieces of cardboard. He also enjoys writing fiction, attempting animation, reforming Labanotation, designing and sewing his own clothes, learning French and spending time with friends'n'family.
