Sacrifice a Canary upon the Stack of the Gods: on Canaries, Coal Mines and Stack Sanity

Matt Davis

Issue #222, October 2012

This article is a basic introduction to program execution stacks and how they become corrupted, and it discusses a means of detecting such corruption. GCC provides a compile-time solution for stack corruption detection and a set of protection mechanisms available through an additional GCC plugin.

The Canary, or Serinus canaria, is a species of bird that was bred for captivity as early as the 17th century. These tiny and colorful (Tweety-bird-esque) birds are well known for domestication and singing abilities (en.wikipedia.org/wiki/Domestic_Canary). To their dismay, however, these little feathered-friends often found themselves victims, acting as primitive detectors of methane, carbon monoxide and other toxic gases that lurk in the dark depths of coal mines (en.wikipedia.org/wiki/Animal_sentinels).

Like their sacrificial brethren, a stack canary lies within the dark depths of a stack frame during execution time of a program. When a function is executed, the canary (also known as cookie, en.wikibooks.org/wiki/Reverse_Engineering/Common_Solutions) is placed on the stack, and just before the function returns, it checks to ensure that the canary value has not been modified. If the canary value appears to have been modified (for example, the canary has been stepped on, crushed, slaughtered and so on), the program can terminate prematurely, such as with an abort(). Such a death of our digital fowl suggests that the stack has been modified/corrupted. A premature program termination—for example, abort()—prevents the executable from acting on a corrupted stack, which could be the result of bad programming or a sign of an intentional buffer overflow from a user of malicious intent (getting own3d). If the program did continue and operate on a corrupted stack, it might crash, execute shell code or continue executing but operating on bad data. The latter case of continuing to execute without apparent fault is often difficult to debug.

Personally, I am of the belief that it is best for a program to crash or terminate early, rather than be misleading and produce invalid output. In this article, I pay homage to the feathered martyrs and investigate the stack protector mechanism in GCC, as well as look at a few other implementations presented in SatanicCanary, a GCC plugin that instruments a polymorphic stack canary—a canary on 'roids.

Stacks, Flapjacks and Hungry Canaries

When functions are called during the runtime of a binary, the data for each function exists on the program's stack. The stack grows or shrinks based on the function being executed. And, as I am sure you are aware, a function can call other functions, which can call other functions, and so on and on. And a function can, of course, be recursive and call itself. Well, this path from function caller to callee is termed a call-graph. How, at runtime, is a function to know to whom to return control? This caller-to-callee path can be envisioned as a stack data structure. And, in fact, this stack of function call data, stack frames, resides in the process' stack segment of memory. I am sure most computer science students have regurgitated the vision of a stack being a series of pancakes or as a tower of plates at a buffet. Well, on this stack is the return address and the arguments to the caller, as well as local variables (automatic variables) for the function being called. The stack grows as a function calls a function, which calls another function and so on. When the callee finishes its work, it pops the stack, and the caller is returned control. As you can probably imagine, the stack frame holds quite a bit of program integrity.

If a frame is compromised, the return address, where the callee jumps back into the caller function, can be overridden. Because the memory area of a stack in the typical x86 calling convention is marked as executable, arbitrary code can be written to execute in place of the return address. This is a stack-smash or stack-overflow—whatever you want to call it. These can be intentional (exploit and shell code) or accidental (bad programming). Either way, protecting the stack from these vulnerabilities is key for assuring stack sanity and program safety.

Stack Frames and Calling Conventions

As I suggested previously, there is a bit of black magic that occurs when a program runs and calls functions—mainly, the magic of returning to the call site in the caller function, so that the program can continue on. Well, the magic is pretty simple: a program operates on a stack of call frames, also known as stack frames. These frames are pushed onto the program's stack memory segment each time a function is called. When the function completes, it pops the data consisting of the frame off the stack, and the caller gets control back, by jumping to the return address that was pushed onto the stack as the frame was being constructed.

In this article, I focus on a simple x86 C calling convention, cdecl, but other architectures can provide a playground for other calling conventions. For instance, if the CPU has more registers, arguments to that function can be passed in registers that make their data immediately available to the caller, rather than pushing and popping arguments from the stack. Further, a calling convention also can vary for different languages. The compiler is responsible for actually generating the function calls and doing all of that magic; thus, how a function is called is based on what the compiler decides to do. In addition, a compiler can optimize away arguments and use various hardware features of the underlying architecture to accomplish a function call.

In fact, GCC employs a variety of calling conventions. For a 64-bit Intel x86 architecture, the compiler will pass arguments by using registers instead of pushing to the stack, à la the System V AMD64 ABI. This is similar to the non-standardized “fastcall” calling convention. The 32-bit version of GCC for x86 uses a fastcall convention. The intricacies of these various conventions do not matter for our purposes here. Rather, the important piece to remember is a calling convention is a means of how a function is called, how the arguments are passed to it and how that callee function returns back to the caller.

The simple cdecl x86 calling convention places all return values from a function into the EAX/RAX register (32- or 64-bit, respectively). If you really want to explore other calling conventions, Wikipedia has some great articles explaining various calling conventions in more depth (see en.wikipedia.org/wiki/X86_calling_conventions and en.wikipedia.org/wiki/Calling_convention).

It's probably easier to imagine a stack frame if I provide an example. Here goes:

int foo(int x, int y)
{
    int z = bar(x, y);
    return x + y + z;
}

Let's assume that the program is compiled using the cdecl calling convention on an x86 architecture and is executing the “foo” function. In this case, the current frame on top of the stack is foo. Below the foo frame is foo's caller. And it's turtles all the way down, until the stack hits main's frame. Anyway, foo's frame consists of both its parameters and local variables, and it looks like this:

bottom-of-stack --> [y, x, RIP, caller's stack base pointer, z]

RIP is the instruction pointer register, and it points to the next instruction to be executed; therefore, that value is pushed onto the stack when a function is called, and that value will be the return point when the callee returns. Thus, if the stack is never corrupted, the return address will be valid. There are two more registers to be concerned with, the RBP and RSP (EBP and ESP on 32-bit x86 architectures). The former is the stack base pointer, and it represents the start of a new stack frame. RSP is the top of the stack. Because the x86 stack grows downward, meaning that as things are pushed onto it, the address those things live at approaches the address of zero. So, the RSP always should be at an address smaller than RBP. When the function being called has been jumped to, the first thing that function (the callee) does is sets up the RBP and RSP. It saves the previous stack frame base pointer by pushing it on the stack (the caller's stack base pointer in the example above). The RBP is then updated to reflect the current frame, so it sets the current top of the stack to the base pointer. And the stack is then increased to hold the local variables. To increase the stack, the function subtracts from the base pointer (remember the stack grows downward toward zero). So the range from RBP to RSP contains the address of the previous frame's base pointer and the current function's local variables. By adding an offset to the function's current base pointer, it can access any arguments passed to it. See unixwiz.net/techtips/win32-callconv-asm.html for more detail and lovely pictures.

Returning to our example, when the calling function calls foo, the cdecl convention first will push the arguments given to foo, such as foo(666, 667) from right to left. The construction of the frame begins with 667 and 666, with 666 on the top. Then, foo is called. By calling foo, via a “call” instruction, the return address where foo is to return to is pushed on the stack. This address is actually that of the RIP, since it points to the next instruction to execute, which just happens to be where foo will return control to. The RIP then is magically changed (a call instruction also jumps to the address of foo), and the foo function begins to execute. Because foo now has control, the first thing it does is sets up its stack frame. To do this, foo pushes the base pointer and grows the stack enough for any local variables (in this case z). Inside foo, the stack now looks like this, where top is the stack pointer and points to the top of the stack, z:

--> top [z, base pointer, call return, 666, 667]

The following is assembly output (from gdb) showing the stack frame for the example function above, “foo”:


0x00000000004004a7 <+0>:  push   %rbp
0x00000000004004a8 <+1>:  mov    %rsp,%rbp
0x00000000004004ab <+4>:  sub    $0x18,%rsp

As you can see, the first line saves the caller's base-stack pointer by pushing it on top of the stack. Next, the top of the caller's stack is set to the bottom/base of foo's stack frame. Then, the stack frame is increased by 0x18 (24) bytes, which is enough room to hold the local variables. This can hold three words, or more than enough to hold the three four-byte local variables x, y, z.

Why does this stack frame concern us? Well, the point of return (the RIP that was pushed onto the stack during the function call) is on that puppy. If somehow the stack is flooded with data, it is possible to overwrite the RIP and the function will jump to a bad address or possibly to the address of input data. This input data might be data passed from the user, as program input, and it might contain code that can do some pretty devious things (such as shell code). If the program is, say, a server and running under root privileges, any malicious user might be able to pass instructions as user input data to the server. The server might have a poorly written function that then might become victim of the user input data. If the data was formatted properly and of the right length, the RIP can be overridden, and instead of returning to the proper caller, the user can gain control. Consider the case where the evil user passes code as input, and this code (shell code) contains instructions to spawn another shell (for example, bash). Well, because in our scenario the dæmon, server, program, whatever, is running with root privileges, any commands it executes (even if they were from an angry prepubescent user who hates his daddy) will get executed as root; thus, the emo-black-hat user can gain a shell with root privileges. Crikey!

This is when our little buddy comes into play. The whole purpose of a stack canary (also known as a cookie) is to sit its feathered little butt on the stack and look pretty. If the canary remains intact from function start to function return (prologue to epilogue), chances are the stack was never overwritten by information, the return address is sane, and it has not been compromised by error or malicious intent. See, we like canaries, they are our friends.

Serinus canaria digitalus

The basic idea of the canary is to be a sacrificial beast sitting within the stack frame, and if this canary value is ever modified, that's a sign that the stack has been corrupted. There are a variety of these little sacrificial creatures. The first is the basic, static value canary. These are the easiest to implement from a compiler perspective and probably the easiest to defeat. When the compiler inserts code to summon the canary data into the stack frame, which will occur at runtime during function prologue as the stack frame is being constructed, the value inserted is constant. This makes checking the value upon function epilogue/return quite easy. Because the compiler inserts a known value at compile time, the checking routine just checks the canary data to see if it matches. Of course, if malware authors know what this value is, they can craft their exploit code carefully to overwrite the RIP and the canary. A disassemble of the program will reveal the canary value.

Another type of canary is the terminator canary (shudders). This canary is just a value containing a NULL or some other control character typically used for terminating a string (https://buildsecurityin.us-cert.gov/bsi/articles/knowledge/coding/310-BSI.html). Such a canary forces malware authors to insert a termination character into their malware source code. This means if one of the string manipulation functions (such as strcpy or strcat) is being used to trigger the stack overflow (corruption), it also might terminate the malware shell code prematurely, preventing the corruption from compromising the return address.

Random canaries are another species. These values can be decided upon at either compile or runtime. If the compiler chooses a random value, each function might have a different canary value. The check code, during function epilogue, is easy to create, because the compiler is inserting the value during compilation, much like the static canary, except that the value changes for each function and/or for each compilation of the program. Runtime random canaries are more tricky to implement. The value set at prologue must somehow be used to compare the canary against during epilogue. There are a variety of ways to accomplish this, such as storing the random data somewhere that both the prologue and epilogue easily can re-create.

XOR canaries are used to encode data (possibly random) against another piece of data, such as the return address. These canaries can protect the return address directly, and if the return address and the XOR'd canary value differ, a corruption has been detected.

Stack Protector: the GCC Species of Canary

A good example of a stack canary that is common and out in the wild is that which GCC provides, the Stack Protector. This GCC feature can build canaries and their corresponding validity check into programs via the command-line argument -fstack-protector. This argument guards only certain functions, and if -Wstack-protector also is passed as an argument to GCC, those functions not guarded will be noted by a compiler warning. To enable the Stack Protector on all functions, the argument -fstack-protector-all can be passed. The concept of this feature is pretty simple. Upon function prologue, a value is placed on the stack, and upon prologue epilogue, just before the return point, the canary value is checked to see if it has been corrupted/changed. So, let's peek at what this little sacrificial beast looks like. The following is the stack frame (as GDB output) for the “foo” function when compiled with GCC as:


'gcc -g3 test.c -fstack-protector-all'
0x0000000000400514 <+0>:  push   %rbp
0x0000000000400515 <+1>:  mov    %rsp,%rbp
0x0000000000400518 <+4>:  sub    $0x20,%rsp
0x000000000040051c <+8>:  mov    %edi,-0x14(%rbp)
0x000000000040051f <+11>: mov    %esi,-0x18(%rbp)
0x0000000000400522 <+14>: mov    %fs:0x28,%rax
0x000000000040052b <+23>: mov    %rax,-0x8(%rbp)
0x000000000040052f <+27>: xor    %eax,%eax

Upon function prologue, the stack frame is set up as normal; however, the stack is increased by 0x20 (32bytes) instead of 24. This is enough stack space to hold four one-word values (the local versions of x, y, z and a stack canary). Of course, the neat thing about canaries is how the value is chosen. What does the GCC species of canary look like? Well, this critter is generated at offset +14 and +23 in the function. The word of data located at 0x28 (byte 40) from the start of the fs (an extra segment) of memory is snagged and copied into the rax register. The FS segment is an additional x86 register used for XXXX. This canary value is stored 8 bytes from the base of the stack and will be verified that it has not been changed just before the function returns.

Before control is returned to the caller and the stack frame popped, the additional code injected by GCC for stack protection, the actual safety canary mechanism, must be executed to check for any stack corruption. The assembly dump for the case above in the epilogue of foo looks like this:


0x0000000000400589 <+58>: mov    -0x8(%rbp),%rdx
0x000000000040058d <+62>: xor    %fs:0x28,%rdx
0x0000000000400596 <+71>: je     0x40059d <foo+78>
0x0000000000400598 <+73>: callq  0x400410 <__stack_chk_fail@plt>
0x000000000040059d <+78>: leaveq 
0x000000000040059e <+79>: retq   

As the stack canary was created in the function prologue, the function epilogue essentially operates in reverse, by pulling the canary off the stack and then comparing it to what is 40 bytes from the start of the FS segment. If the values are equal, the canary is unchanged, and the function returns as normal. If the canary differs from where the value the canary was created from, __stack_chk_fail is called, and the program terminates with an error message:

*** buffer overflow detected ***: ./test terminated

The leaveq instruction cleans up the stack, performing an addition on the base pointer restoring it to before we increased the stack size (via the subtraction operation at the <+4> offset listed above). retq then pops the next value off the stack (the return address from the caller) and then jumps to that address.

Of course, the main issue with stack canaries is that they are not zero-sum when it comes to efficiency. While the above canary is relatively cheap on processing, it still adds a few instructions. Further, this also increases the code size of the binary—albeit, it's a very small addition, but I felt such “negative” properties of canaries deserve some mention. The other issue with canaries is that they are not 100% effective. If malware authors know what the value of the canary being used in the program will be, they can craft their malware in just the right fashion so that their code overrides the stack, but where the canary value is, they copy the same canary value the program expects. This is not necessarily an easy task to accomplish, but it is not impossible. Nonetheless, stack canaries still are useful for mere corruption detection and malware detection. Fly on good buddy, flap hard and true.

Birds of Another Feather

So, where do we go from here? Well, I decided to write my own family of stack canaries that operate in a slightly different manner—the idea being to have different canaries inserted into the functions of the program, somewhat of a compile-time metamorphic effort. Further, I also wanted to create a species of canary, a metamorphic canary, that changes each time a function is called. This new breed is called SatanicCanary and is available as a GCC plugin, specifically for x86 64-bit architectures (https://github.com/enferex/sataniccanary).

To become fertile and produce my own flavor of canaries, I decided to create a GCC plugin. The plugin operates toward the end of compilation during the RTL (register transfer language) passes (gcc.gnu.org/onlinedocs/gccint). RTL is GCC's somewhat architecture-independent representation of a machine. RTL code is driven against a machine description (md) for each platform (such as ARM, x86 or MIPS). The RTL code must match a template in the machine description for the platform being compiled. If there is no match for the RTL code in the template, the architecture will not produce the proper code for the RTL, and GCC will leave you with an error.

Now, SatanicCanary is pretty simple as a GCC RTL plugin. The main execution function is called after the pro_and_epilogue RTL pass. From that point, the plugin can look at each RTL expression that makes up the function. I insert the canary in the prologue, and I check the canary in the epilogue portion of the function. It's pretty simple. The most useful snippet of code from the SatanicCanary plugin is summarized below and shows the detection routine used to handle inserting and checking the canary values:


for (insn=get_insns(); insn; insn=NEXT_INSN(insn))
  if (NOTE_P(insn) && (NOTE_KIND(insn) == NOTE_INSN_PROLOGUE_END))
    insert_canary();
  else if (NOTE_P(insn) && NOTE_KIND(insn) == NOTE_INSN_EPILOGUE_BEG)
    insert_canary_check();

This is a summary of what you will find in the Git repository in the main execution routine of the plugin. As stated, we iterate through each RTL expression (insn) and check to see if there is a NOTE rtl expression. Notes do not get compiled into the binary, but are just metadata used during compilation. We look for the prologue end and epilogue begin notes and insert our canary data as necessary.

SatanicCanary provides three canaries so far, the first (basic_canary) is just your typical random number inserted at compile time. Therefore, each function can have a different canary value and would require the malware author to be lucky or to know beforehand what the canary value is. Such data is not secret, as you can disassemble the binary to find it by looking at the canary check code just before each function returns. This does mean that each compilation of the same binary will change, because the random numbers will (should) be different each compile.

The second canary created was more or less something I was playing with, and it is disabled because it is totally insecure. This canary (tsc_canary) uses the timestamp counter (TSC) value from the processor as the canary value. This will change each time the function is called. So malware authors must be really lucky if they want to kill this canary...right? Well, no. The check routine, which occurs at the epilogue phase, must know this same value. By the time the epilogue occurs, the TSC will have elapsed a few cycles or more. To get the epilogue to compare the canary TSC value, it must have the same TSC to compare against. Well, my solution was poor. I just called the TSC (rdtsc) instruction and popped the low 32 bits of it onto the stack twice, figuring that the low 32 bits would change more frequently than the higher bits. This was just blatantly stupid, and even the most amateur malware writers can defeat this canary. Recall that I said we pushed the TSC low 32 bits twice on the stack: once was the canary value, and the second push was for the epilogue to check the canary value against. To defeat this canary, just copy a string of all the same value across the stack, and both the canary and its check will pass the comparison in the epilogue (as long as they have the same data). This is merely an anecdote to say that if you do place your canary check value on the stack, encode it in some obscure way to prevent malware from easily compromising its integrity. Oh, and there is a better way of doing this canary, but the anecdote of the method I mentioned is important.

The third canary provided by SatanicCanary is the TSC canary, but the TSC is XOR'd against read-only data (tscdata_canary). Because the read-only data cannot be modified by the process, without issuing a segmentation fault, we encode our TSC against whatever data is in the code segment of the binary. In this case, we do store the canary (TSC value) and the check (TSC xor DATA) on the stack. Now the TSC and TSC-xor-DATA are two different values. Yes, I just went against what I said, and I placed a canary-check value on the stack. But it is encoded, so it differs from the canary value; thus, a simple stack overwrite will be defeated. This is what I would like to call a polymorphic stack canary. It changes each time the function is called and is also encoded against data that cannot be physically modified by the process (read-only). It does require a few ops: the encoding of TSC-xor-DATA and the check that just xors the TSC-xor-DATA and DATA. This should produce the TSC original canary value.

SatanicCanary is just a proof of concept. Work can be done to optimize it, such as not protecting functions that do not need to be protected, like those that do not manipulate user input.

Conclusion

I hope this exposé into the world of stack canaries was useful. I must say that I did not by any means exhaust this area of research. I owe a lot of kudos to the grsecurity and PaX teams for their GCC plugin for hardening the Linux kernel. They also provide their own species of canary/cookie (grsecurity.net and pax.grsecurity.net). The beauty of canaries is that they can be as wicked and creative as you want, but remember this is just one means of stack protection. Non-executable stacks can get around the malware execution from stack problem; however, just because a stack is not executable does not mean it cannot get corrupted. Canaries help to detect both corruption and potential malware attempts at compromise. So go forthwith my faithful canary-loving friends, and sacrifice your sanity in the name of binary integrity.

Matt Davis is a software engineer on leave from his job in the US to pursue a PhD from the computer science department at the University of Melbourne, where he is focusing his hackery toward the compiler field. He has been involved in both the fields of modeling and simulation, as well as kernel-level high-precision timing. His interests include coding, compilers, kernels, listening to obnoxious music, consuming vast quantities of coffee and being social with wulfpax and 757.