diff -u

What's New in Kernel Development. By Zack Brown

Android Low-Memory Killer—In or Out?

One of the jobs of the Linux kernel—and all operating system kernels—is to manage the resources available to the system. When those resources get used up, what should it do? If the resource is RAM, there's not much choice. It's not feasible to take over the behavior of any piece of user software, understand what that software does, and make it more memory-efficient. Instead, the kernel has very little choice but to try to identify the software that is most responsible for using up the system's RAM and kill that process.
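
As it happens, the kernel exposes the heuristic it uses to rank kill candidates, in the form of a per-process "badness" score under /proc. Here's a minimal C sketch, reading a real file but otherwise just an illustration, that prints the score the OOM killer would consult for a given process:

    #include <stdio.h>

    /* Print the kernel's OOM "badness" score for a given PID.
     * A higher score marks the process as a more likely kill candidate. */
    int main(int argc, char **argv)
    {
        char path[64];
        int score;
        FILE *f;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <pid>\n", argv[0]);
            return 1;
        }
        snprintf(path, sizeof(path), "/proc/%s/oom_score", argv[1]);
        f = fopen(path, "r");
        if (!f) {
            perror("fopen");
            return 1;
        }
        if (fscanf(f, "%d", &score) == 1)
            printf("PID %s oom_score: %d\n", argv[1], score);
        fclose(f);
        return 0;
    }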

The official kernel does this with its OOM (out-of-memory) killer. But, Linux descendants like Android want a little more—they want to perform a similar form of garbage collection, but while the system is still fully responsive. They want a low-memory killer that doesn't wait until the last possible moment to terminate an app. The unspoken assumption is that phone apps are not so likely to run crucial systems like heart-lung machines or nuclear fusion reactors, so one running process (more or less) doesn't really matter on an Android machine.

A low-memory killer did exist in the Linux source tree until recently. It was removed, partly because of the overlap with the existing OOM code, and partly because the same functionality could be provided by a userspace process. And, one element of Linux kernel development is that if something can be done just as well in userspace, it should be done there.

Sultan Alsawaf recently threw open his window, thrust his head out, and shouted, "I'm mad as hell, and I'm not gonna take this anymore!" And, he re-implemented a low-memory killer for the Android kernel. He felt the userspace version was terrible and needed to be ditched. Among other things, he said, it killed too many processes and was too slow. He felt that the technical justification for migrating to the userspace dæmon had never been made clear, and that an in-kernel solution was really the way to go.

In Sultan's implementation, the algorithm was simple—if a memory request failed, then the process was killed—no fuss, no muss and no rough stuff.
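
For the flavor of it, here is a deliberately crude userspace sketch of that sort of policy: attempt an allocation and, if it fails, kill somebody. The victim selection here (largest resident set, scraped from /proc) is my own stand-in for illustration, not Sultan's code, which hooked the allocator inside the kernel. Note too that with Linux's default memory overcommit, malloc() rarely fails this way:

    #include <dirent.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/types.h>

    /* Illustrative heuristic: scan /proc and return the PID with the
     * largest resident set. */
    static pid_t biggest_process(void)
    {
        DIR *proc = opendir("/proc");
        struct dirent *de;
        long best_rss = -1;
        pid_t best_pid = -1;

        while (proc && (de = readdir(proc))) {
            char path[300], line[256];
            pid_t pid = (pid_t)atoi(de->d_name);
            FILE *f;
            long rss;

            if (pid <= 1)   /* skip non-numeric entries and init */
                continue;
            snprintf(path, sizeof(path), "/proc/%d/status", (int)pid);
            f = fopen(path, "r");
            if (!f)
                continue;
            while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "VmRSS: %ld", &rss) == 1 && rss > best_rss) {
                    best_rss = rss;
                    best_pid = pid;
                }
            }
            fclose(f);
        }
        if (proc)
            closedir(proc);
        return best_pid;
    }

    int main(void)
    {
        void *p = malloc(1UL << 30);   /* stand-in for "a memory request" */

        if (!p) {   /* allocation failed: pick a victim, no fuss, no muss */
            pid_t victim = biggest_process();
            if (victim > 0) {
                printf("killing PID %d\n", (int)victim);
                kill(victim, SIGKILL);
            }
        }
        free(p);
        return 0;
    }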

There was a unified wall of opposition to this patch, so much so that it became clear that Sultan's main purpose was not to get the patch accepted, but to light a fire under the asses of the people maintaining the userspace version, in hopes that they might implement some of the improvements he wanted.

Michal Hocko articulated his opposition to Sultan's patch very clearly—the Linux kernel would not have two separate OOM killers sitting side by side. The proper OOM killer would be implemented as well as could be, and any low-memory killers and other memory finaglers would have to exist in userspace for particular projects like Android.

Suren Baghdasaryan also was certain that multiple OOM killers in the kernel source tree would be a non-starter. He invited Sultan to approach the problem from the standpoint of improving the userspace low-memory killer instead.

There also were technical problems with Sultan's code. Michal felt it didn't have a broad enough scope and was really good only for a single very specific use case. And, Joel Fernandes agreed that Sultan's approach was too simple. Joel pointed out that "a transient temporary memory spike should not be a signal to kill _any_ process. The reaction to kill shouldn't be so spontaneous that unwanted tasks are killed because the system went into panic mode." Instead, he said, memory usage statistics needed to be averaged out so that a proper judgment of which process to kill could be made. So, the userspace version was indeed slow, but the slowness was by design, so the code could make subtle judgments about how to proceed.
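
The idea is easy to sketch. The loop below smooths MemAvailable samples from /proc/meminfo with an exponential moving average, so a one-second spike can't trigger a kill on its own. The smoothing factor and threshold are invented values for illustration; this is not the actual algorithm of Android's userspace dæmon:

    #include <stdio.h>
    #include <unistd.h>

    /* Read MemAvailable (in kB) from /proc/meminfo. */
    static long mem_available_kb(void)
    {
        char line[256];
        long kb = -1;
        FILE *f = fopen("/proc/meminfo", "r");

        if (!f)
            return -1;
        while (fgets(line, sizeof(line), f))
            if (sscanf(line, "MemAvailable: %ld kB", &kb) == 1)
                break;
        fclose(f);
        return kb;
    }

    int main(void)
    {
        const double alpha = 0.2;          /* smoothing factor (invented) */
        const long threshold_kb = 65536;   /* kill threshold (invented) */
        double ema = (double)mem_available_kb();

        for (;;) {
            long sample = mem_available_kb();

            /* Average new samples in; transient spikes get damped. */
            ema = alpha * (double)sample + (1.0 - alpha) * ema;
            if (ema < threshold_kb)
                printf("smoothed memory is low: pick a victim now\n");
            sleep(1);
        }
    }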

Suren, on the other hand, agreed that the userspace code could be faster, and said that the developers were working on ways to speed it up.

In this way, the discussion gradually transitioned to identifying the deficiencies in the userspace implementation and finding ways to address them. To that extent, Sultan's code provided a benchmark for where the userspace code would like to be at some point in the future.

It's not unheard of for a developer to implement a whole feature, just to make the point that an existing feature gets it wrong. And in this case, it does seem like that point has been heard.

Securing the Kernel Stack

The Linux kernel stack is a tempting target for attack. This is because the kernel needs to keep track of where it is. If a function gets called, which then calls another, which then calls another, the kernel needs to remember the order in which they were all called, so that each function can return to the function that called it. To do that, the kernel keeps a "stack" of values representing the history of its current context.
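
A tiny program makes the picture concrete. Each nested call gets its own stack frame, and printing the address of a local variable in each frame shows the frames piling up (downward, on most architectures):

    #include <stdio.h>

    /* Each call pushes a new frame onto the stack; the addresses of
     * the locals reveal where each frame landed. */
    static void inner(void)
    {
        int x;
        printf("inner frame near  %p\n", (void *)&x);
    }

    static void middle(void)
    {
        int x;
        printf("middle frame near %p\n", (void *)&x);
        inner();   /* returns here when done, thanks to the stack */
    }

    int main(void)
    {
        int x;
        printf("main frame near   %p\n", (void *)&x);
        middle();
        return 0;
    }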

If an attacker manages to trick the kernel into transferring execution to the wrong location, it's possible the attacker could run arbitrary code with root-level privileges. Once that happens, the attacker has won, and the computer is fully compromised. And one way to pull off that trick is to modify the stack somehow, or make predictions about its layout, or take over programs located where the stack is pointing.

Protecting the kernel stack is crucial, and it's the subject of a lot of ongoing work. There are many approaches to making it difficult for attackers to do this or that little thing that would expose the kernel to being compromised.

Elena Reshetova is working on one such approach. She wants to randomize the kernel stack offset after every system call. Essentially, she wants to obscure the trail left by the stack, so attackers can't follow it or predict it. And, she recently posted some patches to accomplish this.
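
Her patches do this at syscall entry inside the kernel, but the shape of the idea can be sketched in userspace: before running the real code, shift the stack pointer by a small random amount so the locals land somewhere unpredictable on every call. The mask below (offsets of up to 1KB) is an arbitrary choice for illustration:

    #include <alloca.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/random.h>

    static void handler(void)
    {
        int local;
        printf("handler locals near %p\n", (void *)&local);
    }

    /* Shift the stack by a random offset, then call the real code. */
    static void dispatch(void)
    {
        unsigned int r = 0;

        getrandom(&r, sizeof(r), 0);           /* a little entropy */
        void *pad = alloca((r & 0x3ff) + 1);   /* 1..1024 bytes (arbitrary) */
        memset(pad, 0, 1);                     /* keep the alloca alive */
        handler();                             /* locals now unpredictable */
    }

    int main(void)
    {
        for (int i = 0; i < 4; i++)
            dispatch();
        return 0;
    }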

At the time of her post, no specific attacks were known to take advantage of the lack of randomness in the stack. So Elena was not trying to fix any particular security hole. Rather, she said, she wanted to eliminate any possible vector of attack that depended on knowing the order and locations of stack elements.

This is often how it goes—it's fine to cover up holes as they appear, but even better is to cover a whole region so that no further holes can be dug.

There was a lot of interest in Elena's patch, and various developers made suggestions about how much randomness she would need, and where she should find entropy for that randomness, and so on.

In general, Linus Torvalds prefers security patches to fix specific security problems. He's less enthusiastic about adding security to an area where there are no exploits. But in this case, he may feel that Elena's patch adds a level of security that wasn't there before.

Security is always such a nightmare. Often, a perfectly desirable feature may have to be abandoned, not because it's not useful, but because it creates an inherent insecurity. Microsoft's operating system and applications have often suffered from making the wrong decision in those cases, choosing to implement a cool feature in spite of the fact that it could not be done securely. Linux, on the other hand, and other open-source systems like FreeBSD, rarely make that mistake.

Line Length Limits

Periodically, the kernel developers debate something everyone generally takes for granted, such as the length of a line of text. Personally, I like lines of text to reach both sides of my screen—it's just a question of not wasting space.

Alastair D'Silva recently agreed with me. He felt that monitor sizes and screen resolutions had gotten so big in recent years that the kernel should start allowing more data onto a single line of text. It was simple pragmatism: more visible text means more opportunity to spot a bug in a data dump.

Alastair posted a patch to allow hex dumps with 64 bytes per line, instead of the existing options of 16 and 32 bytes. It was met with shock and dismay from Petr Mladek, who said that 64 bytes added up to more than 256 characters per line, which he doubted any human would find easy to read. He pointed out that the resolution needed to fit such long lines on the screen would be greater than standard hi-def. He also pointed out that there were probably many people without high-definition screens who worked on kernel development.
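
Petr's arithmetic is easy to check: each byte costs roughly three characters in hex plus one more in the ASCII column, and there's an address prefix on top. This mock-up of a single row, an approximation of the usual hex dump layout rather than the kernel's own print_hex_dump() code, comes out at 266 characters:

    #include <stdio.h>

    /* Build one 64-byte hex-dump row and report its width:
     * 10 (prefix) + 64 * 3 (hex) + 64 (ASCII) = 266 columns. */
    int main(void)
    {
        unsigned char buf[64];
        char line[512];
        int n = 0, i;

        for (i = 0; i < 64; i++)
            buf[i] = (unsigned char)i;

        n += snprintf(line + n, sizeof(line) - n, "%08x: ", 0);
        for (i = 0; i < 64; i++)
            n += snprintf(line + n, sizeof(line) - n, "%02x ", buf[i]);
        for (i = 0; i < 64; i++)
            n += snprintf(line + n, sizeof(line) - n, "%c",
                          buf[i] >= 32 && buf[i] < 127 ? buf[i] : '.');
        printf("%s\n(%d characters)\n", line, n);
        return 0;
    }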

Alastair noted that regular users would never see this data anyway, and he added that putting the choice in the hands of the calling routine couldn't possibly be a bad thing. In fact, instead of 16, 32 and 64 bytes, Alastair felt the true options should be any multiple of the groupsize variable.

There's very little chance that Alastair's patch will make it into the kernel. Linus Torvalds is very strict about making sure Linux development does not favor wealthy people. He wants developers working on ancient hardware to have the same benefits and capabilities as those working with the benefit of the latest gadgets.

Linus commented about seven years ago on the possibility of changing the maximum patch line length from 80 to 100 characters. At that time he said:

I think we should still keep it at 80 columns.

The problem is not the 80 columns, it's that damn patch-check script that warns about people *occasionally* going over 80 columns.

But usually it's better to have the *occasional* 80+ column line, than try to split it up. So we do have lines that are longer than 80 columns, but that's not because 100 columns is ok - it's because 80+ columns is better than the alternative.

So it's a trade-off. Thinking that there is a hard limit is the problem. And extending that hard limit (and thinking that it's 'ok' to be over 80 columns) is *also* a problem.

So no, 100-char columns are not ok.

Deprecating a.out Binaries

Remember a.out binaries? They were the default executable format on Linux until around 1995, when ELF took over. ELF is better. It allows shared libraries to be loaded anywhere in memory, while a.out requires shared library locations to be registered in advance. That's fine at small scales, but it becomes more and more of a headache as the number of shared libraries grows. Yet a.out was still supported in the Linux source tree, 25 years after ELF became the default format.
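
If you have an old binary lying around and aren't sure which it is, the two formats are easy to tell apart by their first bytes. Here's a small sketch; the octal magic values are the classic OMAGIC, NMAGIC, ZMAGIC and QMAGIC constants from a.out.h:

    #include <stdio.h>

    /* ELF files begin with 0x7f 'E' 'L' 'F'; old Linux a.out binaries
     * begin with a 16-bit octal magic: 0407, 0410, 0413 or 0314. */
    int main(int argc, char **argv)
    {
        unsigned char hdr[4] = {0};
        FILE *f;

        if (argc != 2) {
            fprintf(stderr, "usage: %s <binary>\n", argv[0]);
            return 1;
        }
        f = fopen(argv[1], "rb");
        if (!f) {
            perror("fopen");
            return 1;
        }
        fread(hdr, 1, sizeof(hdr), f);
        fclose(f);

        if (hdr[0] == 0x7f && hdr[1] == 'E' && hdr[2] == 'L' && hdr[3] == 'F') {
            puts("ELF binary");
        } else {
            unsigned int magic = hdr[0] | (hdr[1] << 8);  /* little-endian */

            if (magic == 0407 || magic == 0410 || magic == 0413 ||
                magic == 0314)
                puts("a.out binary");
            else
                puts("unknown format");
        }
        return 0;
    }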

Recently, Borislav Petkov recommended deprecating it in the source tree, with the idea of removing it entirely if it turned out there were no remaining users. He posted a patch to implement the deprecation. Alan Cox also remarked that "in the unlikely event that someone actually has an a.out binary they can't live without, they can also just write an a.out loader as an ELF program entirely in userspace."

Richard Weinberger had no problem deprecating a.out and gave his official approval of Borislav's patch.

There's a reason the issue is coming up now, 25 years after the fact. Linus Torvalds pointed out:

I'd prefer to try to deprecate a.out core dumping first....That's the part that is actually broken, no?

In fact, I'd be happy to deprecate a.out entirely, but if somebody _does_ complain, I'd like to be able to bring it back without the core dumping.

Because I think the likelihood that anybody cares about a.out core dumps is basically zero. While the likelihood that we have some odd old binary that is still a.out is slightly above zero.

So I'd be much happier with this if it was a two-stage thing where we just delete a.out core dumping entirely first, and then deprecate even running a.out binaries separately.

Because I think all the known *bugs* we had were with the core dumping code, weren't they?

Removing it looks trivial. Untested patch attached.

Then I'd be much happier with your "let's deprecate a.out entirely" as a second patch, because I think it's an unrelated issue and much more likely to have somebody pipe up and say "hey, I have this sequence that generates executables dynamically, and I use a.out because it's much simpler than ELF, and now it's broken". Or something.

Jann Horn looked over Linus' patch and identified additional elements of a.out that would no longer be used by anything if core dumping were removed. He suggested those things also could be removed in the same git commit, without risking anyone complaining.

Borislav was a little doubtful about Linus' approach—as he put it, "who knows what else has bitrotten out there through the years". But he wasn't so doubtful as to suggest an alternative. Instead, he said to Linus, "the easiest would be if you apply your patch directly now and add the a.out phase-out strategy we're going for in its commit message so that people are aware of what we're doing." Then, he added, the architecture maintainers could each remove a.out core dump support from their architectures on a case-by-case basis, and Borislav could continue deprecating a.out in its entirety later on.

Linus said he'd be fine with that, but he also said he'd be happy to apply Borislav's a.out deprecation patch immediately on top of his own core-dump removal patch. He saw no need for a time delay, so long as the two patches could be reverted independently if anyone squawked about one of them.

At this point, various architecture maintainers started commenting on a.out on their particular architectures.

Geert Uytterhoeven said, "I think it's safe to assume no one still runs a.out binaries on m68k."

And, Matt Turner said, "I'm not aware of a reason to keep a.out support on alpha."

The alpha architecture, however, proved more difficult than Matt initially thought. Linus looked into the port and found a lot of a.out support still remaining. And certain parts of the port, he said, didn't even make sense without a.out support. So there would actually be a lot more gutting to do in the alpha code, as opposed to a simple amputation.

Måns Rullgård also remarked, "Anyone running an Alpha machine likely also has some old OSF/1 binaries they may wish to use. It would be a shame to remove this feature."

This actually made Linus stop dead in his tracks. He replied to Måns:

If that's the case, then we'd have to keep a.out alive for alpha, since that's the OSF/1 binary format (at least the only one we support - I'm not sure if later versions of OSF/1 ended up getting ELF).

Which I guess we could do, but the question is whether people really do have OSF/1 binaries. It was really useful early on as a source of known-good binaries to test with, but I'm not convinced it's still in use.

It's not like there were OSF/1 binaries that we didn't have access to natively (well, there _were_ special ones that didn't have open source versions, but most of them required more system-side support than Linux ever implemented, afaik).

And Måns replied, "I can well imagine people keeping an Alpha machine for no other reason than the ability to run some (old) application only available (to them) for OSF/1. Running them on Linux rather than Tru64 brings the advantage of being a modern system in other regards."

Matt said he hadn't been aware of this situation on alpha and agreed that it might be necessary to continue to support a.out on that architecture, just for the remaining users who needed it.

As a practical example, Arnd Bergmann recounted, "The main historic use case I've heard of was running Netscape Navigator on Alpha Linux, before there was an open-source version. Doing this today to connect to the open internet is probably a bit pointless, but there may be other use cases."

He also added:

Looking at the system call table in the kernel...we seem to support a specific subset that was required for a set of applications, and not much more. Old system calls...are listed but not implemented, and the same is true for most of the later calls...just the ones in the middle are there. This would also indicate that it never really worked as a general-purpose emulation layer but was only there for a specific set of applications.

And in terms of anyone potentially complaining about the loss of a.out support, Arnd also pointed out that "osf1 emulation was broken between linux-4.13 and linux-4.16 without anyone noticing."

Linus replied:

Yeah, it never supported arbitrary binaries, particularly since there's often lots of other issues too with running things like that (ie filesystem layout etc). It worked for normal fairly well behaved stuff, but wasn't ever a full OSF/1 emulation environment.

I _suspect_ nobody actually runs any OSF/1 binaries any more, but it would obviously be good to verify that. Your argument that timeval handling was broken _may_ be an indication of that (or may just mean very few apps care).

And based on these reassuring considerations, Linus said, "I think we should try the a.out removal and see if anybody notices."

The discussion continued briefly, but it seems like a.out will finally be removed in the relatively near future.

The thing that fascinates me about this is the insistence on continuing to support an ancient feature if even a single user is found who still relies on it. If even one person came forward with a valid use case for a.out, Linus would leave it in the kernel. At the same time, if no users step forward, Linus won't assume they may be lurking secretly out in the wild somewhere—he'll kill the feature. It's not enough simply to use an ancient feature; the user needs to be an active part of the community—or at least, active enough to report his or her desire to continue using the feature. And in that case, Linus probably would invite that user to maintain the feature in question.

Note: if you're mentioned in this article and want to send a response, please send a message with your response text to ljeditor@linuxjournal.com, and we'll run it in the next Letters section and post it on the website as an addendum to the original article.

About the Author

Zack Brown is a tech journalist at Linux Journal and Linux Magazine, and is a former author of the "Kernel Traffic" weekly newsletter and the "Learn Plover" stenographic typing tutorials. He first installed Slackware Linux in 1993 on his 386 with 8 megs of RAM and had his mind permanently blown by the Open Source community. He is the inventor of the Crumble pure strategy board game, which you can make yourself with a few pieces of cardboard. He also enjoys writing fiction, attempting animation, reforming Labanotation, designing and sewing his own clothes, learning French and spending time with friends'n'family.
