Steven Rostedt wanted to do a little housekeeping, specifically with the function tracing code used in debugging the kernel. Up until then, the kernel could enable function tracing using either GCC's -pg flag or a combination of -pg and -mfentry. In each case, GCC would insert a call to a special routine at the start of each function, so the kernel could track calls to all functions. With just -pg, GCC would create a call to mcount() in all C functions, while with -pg coupled with -mfentry, it would create a call to fentry() instead.
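To make the difference concrete, here's a rough stand-alone sketch (not kernel code; the hook and the traced function are invented for illustration, and the details vary by architecture and C library) of what -pg instrumentation looks like from userspace:

    #include <stdio.h>

    static unsigned long calls;

    /* Our own profiling hook. The no_instrument_function attribute keeps
     * GCC from instrumenting the hook itself, which would recurse forever.
     * With plain -pg, GCC emits a "call mcount" near the top of every
     * function, after the stack frame is set up; with -pg -mfentry on x86,
     * the call becomes "call __fentry__" and moves to the function's very
     * first instruction, before the prologue. */
    void __attribute__((no_instrument_function)) mcount(void)
    {
            calls++;
    }

    void traced(void)
    {
            /* building with "gcc -pg demo.c -o demo" inserts a call to
             * mcount() here automatically */
    }

    int main(void)
    {
            for (int i = 0; i < 3; i++)
                    traced();
            printf("profiling hook fired %lu times\n", calls);
            return 0;
    }

Part of what made -mfentry attractive is that the hook fires at a fixed, predictable spot, the function's first instruction, which makes it easy for the kernel to patch the call site in and out at runtime.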
Steven pointed out that using -mfentry was generally regarded as superior, so much so that the kernel build system would always choose it over the mcount() alternative by testing GCC at compile time to see if it actually supported that command-line argument.
This is all very normal. Since any user might have any version of a given piece of software in the toolchain, or any of a variety of CPUs with different capabilities, the kernel build system runs many tests to identify the best available features the kernel will be able to rely on.
But in this case, Steven noticed that for Linux version 4.19, Linus Torvalds had agreed to bump the minimum supported GCC version to 4.6. Coincidentally, as Steven now pointed out, GCC version 4.6 was the first to support the -mfentry argument. And this was his point: all supported versions of GCC now supported the better function tracing option, so there was no need for the kernel build system to cling to the mcount() implementation at all. Steven posted a patch to rip it out by the roots.
Peter Zijlstra gave his support for this plan, as did Jiri Kosina. And Jiri in particular spat upon the face of the mcount() solution.
Linus also liked Steven's patch, and he pointed out that with mcount() out of the picture, several more areas of the kernel that had existed simply to help choose between mcount() and fentry() could now also be removed. But Steven replied that, although this should indeed be done, he wanted to split it out into a separate patch, for cleanliness' sake.
As it turned out, Steven's patch actually applied only to the x86 kernel port. A lot of other architectures still used mcount(), as Josh Poimboeuf pointed out. And Steven confirmed, "fentry works nicely when you have a single instruction that pushes the return address on the stack and then jumps to another location. It's much trickier to implement with link registers. There's a few different implementations for other archs, but mcount happens to be the one supported by most."
And that was that. Steven's patch certainly will go into the kernel as soon as it's fully ready. It's enjoyable to watch these details shake out, after the relatively large decision to change the minimum supported GCC version. I imagine there are several more areas of the kernel that can be simplified and cleaned up, now that they don't have to support older versions of GCC.
Often, a kernel developer will try to reduce the size of an attack surface against Linux, even if it can't be closed entirely. It's generally a toss-up whether such a patch makes it into the kernel. Linus Torvalds always prefers security patches that really close a hole, rather than just give attackers a slightly harder time of it.
Matthew Garrett recognized that userspace applications might have secret data sitting in RAM at any given time, and that those applications might want to wipe that data clean so no one could look at it.
There were various ways to do this already in the kernel, as Matthew pointed out. An application could use mlock() to prevent its memory contents from being pushed into swap, where they might be read more easily by attackers. An application also could use atexit() to cause its memory to be thoroughly overwritten when the application exited, thus leaving no secret data in the general pool of available RAM.
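Taken together, those two techniques might look something like this minimal sketch (the buffer name and size are invented; explicit_bzero() is the glibc wipe function that the compiler won't optimize away):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    #define SECRET_LEN 4096

    static char secret[SECRET_LEN];

    /* Registered with atexit(): scrub the secret on any clean exit. */
    static void wipe_secret(void)
    {
            explicit_bzero(secret, sizeof(secret)); /* wipe survives optimization */
            munlock(secret, sizeof(secret));
    }

    int main(void)
    {
            if (mlock(secret, sizeof(secret)) != 0) { /* keep pages out of swap */
                    perror("mlock");
                    return 1;
            }
            atexit(wipe_secret);

            snprintf(secret, sizeof(secret), "hunter2"); /* stand-in key material */
            /* ... use the secret ... */
            return 0; /* the atexit() handler scrubs the buffer on the way out */
    }

Of course, an atexit() handler runs only if the application gets the chance to exit cleanly.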
The problem, Matthew pointed out, came if an attacker was able to reboot the system at a critical moment—say, before the user's data could be safely overwritten. If attackers then booted into a different OS, they might be able to examine the data still stored in RAM, left over from the previously running Linux system.
As Matthew also noted, the existing way to prevent even that was to tell the UEFI firmware to wipe system memory before booting to another OS, but this would dramatically increase the amount of time it took to reboot. And if the good guys had won out over the attackers, forcing them to wait a long time for a reboot could be considered a denial of service attack—or at least downright annoying.
Ideally, Matthew said, if the attackers were only able to induce a clean shutdown—not simply a cold boot—then there needed to be a way to tell Linux to scrub all data out of RAM, so there would be no further need for UEFI to handle it, and thus no need for a very long delay during reboot.
Matthew explained the reasoning behind his patch. He said:
Unfortunately, if an application exits uncleanly, its secrets may still be present in RAM. This can't be easily fixed in userland (eg, if the OOM killer decides to kill a process holding secrets, we're not going to be able to avoid that), so this patch adds a new flag to madvise() to allow userland to request that the kernel clear the covered pages whenever the page reference count hits zero. Since vm_flags is already full on 32-bit, it will only work on 64-bit systems.
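From userland, the proposed interface might be used something like the following sketch. MADV_WIPEONRELEASE is, as far as I can tell, the flag name used in the posted patch; since the patch hasn't been merged, no system header defines it, and the numeric value below is a pure placeholder:

    #include <stdio.h>
    #include <sys/mman.h>

    #ifndef MADV_WIPEONRELEASE
    #define MADV_WIPEONRELEASE 999 /* placeholder, not a real ABI number */
    #endif

    int main(void)
    {
            size_t len = 4096;
            void *secret = mmap(NULL, len, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (secret == MAP_FAILED) {
                    perror("mmap");
                    return 1;
            }

            /* Ask the kernel to scrub these pages when their reference
             * count hits zero, even if the process is OOM-killed and
             * never runs its exit handlers. */
            if (madvise(secret, len, MADV_WIPEONRELEASE) != 0)
                    perror("madvise"); /* expected to fail on unpatched kernels */

            /* ... store key material in secret ... */
            munmap(secret, len);
            return 0;
    }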
Matthew Wilcox liked this plan and offered some technical suggestions for Matthew G's patch, and Matthew G posted an updated version in response.
Michal Hocko also had some technical suggestions, including the idea that the patch should not just wipe RAM, but also any swap space, for added protection.
But, Christopher Lameter replied to Matthew G's patch, saying that it didn't actually fix the problem, even if it made the attack more difficult to carry out. As he put it:
The pages are cleared anyways when reallocated to another process. This just clears it sooner before reuse. So it will reduce the time that a page contains the secret sauce in case the program is aborted and cannot run its exit handling.
Is that really worth extending system calls and adding kernel handling for this? Maybe the answer is yes given our current concern about anything related to "security".
Matthew G pointed out that if the system was mostly idle, no other process might claim the RAM that still held secret data. In this case, those secrets would sit unguarded. And if someone did reboot the system at that time, the secret data would be exposed.
A bunch of people contributed technical suggestions, and Matthew G submitted several new versions of his patch, before the discussion ended.
There's clearly some interest in this patch, but no one was singing about it on their way to the Grey Havens. It clearly represents a security improvement, in the sense that it makes the time window a bit tighter for an attacker to take advantage of exposed data, but at the same time, that window does remain open for a certain amount of time. Hostile attackers could potentially take advantage of that to gain access to privileged data, even with Matthew G's patch. It's unclear to me whether or not this patch will go into the kernel.
Mike Rapoport from IBM launched a bid to implement address space isolation in the Linux kernel. Address space isolation grows out of the idea of virtual memory, where the system maps all its hardware devices' memory addresses into a clean virtual space so that they all appear to be one smooth range of available RAM. A system that implements virtual memory also can create isolated address spaces that are available only to part of the system or to certain processes.
The idea, as Mike expressed it, is that if hostile users find themselves in an isolated address space, even if they find bugs in the kernel that might be exploited to gain control of the system, the system they would gain control over would be just that tiny area of RAM to which they had access. So they might be able to mess up their own local user, but not any other users on the system, nor would they be able to gain access to root level infrastructure.
In fact, Mike posted patches to implement an element of this idea, called System Call Isolation (SCI). This would cause system calls to each run in their own isolated address space. So if, somehow, an attacker were able to modify the return values stored in the stack, there would be no useful location to which to return.
His approach was relatively straightforward. The kernel already maintains a "symbol table" with the addresses of all its functions. Mike's patches would make sure that any return addresses that popped off the stack corresponded to entries in the symbol table. And since "attacks are all about jumping to gadget code which is effectively in the middle of real functions, the jumps they induce are to code that doesn't have an external symbol, so it should mostly detect when they happen."
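A kernel-side sketch of that sort of check might look like the following. The kallsyms_lookup_size_offset() helper is real, but the wrapper and the policy around it are invented here for illustration; this is not Mike's actual patch:

    #include <linux/kallsyms.h>
    #include <linux/types.h>

    /* Return true if a return address popped off the stack resolves to a
     * known entry in the kernel's symbol table. Mike's real checks were
     * more involved; as he conceded, a scheme like this should "mostly"
     * detect gadget jumps rather than catch every one. */
    static bool sci_return_address_ok(unsigned long addr)
    {
            unsigned long size, offset;

            return kallsyms_lookup_size_offset(addr, &size, &offset) != 0;
    }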
The problem, he acknowledged, was that implementing this would have a speed hit. He saw no way to perform and enforce these checks without slowing down the kernel. For that reason, Mike said, "it should only be activated for processes or containers we know should be untrusted."
There was not much enthusiasm for this patch. As Jiri Kosina pointed out, Mike's code was incompatible with other security projects like retpolines, which try to prevent certain types of data from leaking into an attacker's hands.
Beyond that, there was no real discussion, and no one expressed interest in the patch. The combination of the speed hit, the conflict with existing security projects, and the fact that it tried to secure against only hypothetical security holes rather than actual flaws in the system probably made this patch set less interesting to kernel developers.
It's one of the less pleasant aspects of kernel development. Someone can put a lot of hours into a project, with no way to know in advance what objections might be raised at the end. It wouldn't have been obvious to Mike and his colleagues that a speed hit would be necessary. And the possibility of conflict with other existing kernel projects is always very difficult to predict, especially since there often are workarounds that can be discovered only once members of the two projects start debating the various issues in public.
Only Linus Torvalds' general reluctance to add security features that do not address existing security holes could have been predicted. He seems very consistent on that point, much to the annoyance of security-minded developers throughout the Open Source world. The idea of reducing the size of an attack surface seems self-evident to them, while to Linus it seems self-evident that you shouldn't fix what isn't broken, especially when the fix adds bloat and increases maintenance costs for the whole project. I think it's likely that even if Jiri and other developers had approved of Mike's patches, Linus might have objected later on.
Note: if you're mentioned in this article and want to send a response, please send a message with your response text to ljeditor@linuxjournal.com, and we'll run it in the next Letters section and post it on the website as an addendum to the original article.