x86-64 Machine Code Obfuscation

Obfuscation is quite a debated topic when it comes to software development.
There are obviously tons of reasons why you should not try to obfuscate your code.

But sometimes, for some very specific parts of your software, you may consider trying it.
Now there are a lot of different ways to achieve this, and in my humble opinion there's no silver bullet when it comes to obfuscation.

In the end, what you are trying to do is prevent someone else from reverse-engineering your code, and from making sense of your logic and algorithms.

In such a situation, making your code «hard to read» may not be the best solution.
No matter how complex your code is, an experienced reverse-engineer will most of the time figure out what you are doing.

In such a context, psychological warfare is often more effective than pure obfuscation.
The goal here is to wear the attacker down, and make them quit.

That being said, you'll often want to combine the psychological effect with some kind of obfuscation.

Again, lots of different ways to achieve this, depending on the language you're using, and your target platform.
But today I want to look at some techniques you can try from a machine code perspective, on x86-64 platforms.

Disassembly

When reverse-engineering some binary, you'll usually use a disassembler.
A disassembler is basically a program that reads a binary executable and displays the machine code as «human-readable» assembly code.
The state of the art is of course IDA, but there's also Hopper on macOS, which is awesome.
Debuggers such as LLDB or GDB also provide disassembly.

Now what can you expect from such a tool?
Well, let's take a look at the following C program:

#include <stdio.h>

void foo( void );
void foo( void )
{
    printf( "hello, world\n" );
}

int main( void )
{
    foo();
    
    return 0;
}

A disassembler might give the following output:

_foo:
    push rbp
    mov  rbp, rsp
    sub  rsp, 0x10
    lea  rdi, qword [ aHelloWorldn ]
    mov  al, 0x0
    call imp___stubs__printf
    mov  dword [ rbp + var_4 ], eax
    add  rsp, 0x10
    pop  rbp
    ret

_main:
    push rbp
    mov  rbp, rsp
    sub  rsp, 0x10
    mov  dword [ rbp + var_4 ], 0x0
    call _foo
    xor  eax, eax
    add  rsp, 0x10
    pop  rbp
    ret

Now even if you're not used to x86-64 assembly code, you might figure out what the program is doing pretty quickly.
main calls foo, which calls printf, with some argument.

Among the different obfuscation techniques, hiding code or trying to fool the disassembler is quite common.
The goal here is to make the disassembler program go crazy, and make it output garbage instead of the actual instructions.

You should of course not rely solely on this.
An experienced reverse-engineer will obviously run your software in a debugger, and machine-code obfuscation alone won't withstand that.
But still, it's great if you can prevent some kind of static analysis.

Now how can we achieve this on x86-64 platforms?

Incomplete instructions

Well, the x86 instruction set is quite a complex beast.
Unlike most RISC architectures, where every instruction has the same size, x86 instructions have a variable length, anywhere from 1 to 15 bytes.

The CPU will (very) basically read the first byte(s) of an instruction, and depending on its value will read additional bytes for the instruction operands.
As an example, in x86-64 assembly:

xor eax, eax
mov eax, [ edi ]

These two instructions have different lengths.
The first one is 2 bytes:

31 C0

While the second one will be 3 bytes:

67 8B 07

This means that, depending on the instruction, the CPU will expect trailing bytes for the operands.
And so will the disassembler.
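
To make this a bit more concrete, here's a minimal C sketch of how a decoder works out an instruction's length from its first bytes. It's a toy that only knows an optional 0x67 prefix and three ModR/M-based opcodes; the function name insn_length is made up for this illustration, and a real decoder handles far more cases.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy length decoder. Illustrative subset only, NOT a real x86-64 decoder:
 * an optional 0x67 address-size prefix, then opcode 0x31 (xor), 0x89 (mov)
 * or 0x8B (mov), each followed by a ModR/M byte.
 * Returns the length the decoder expects, or 0 for anything unsupported.
 * The result may exceed `size` when the trailing bytes are missing.
 */
size_t insn_length(const uint8_t *code, size_t size)
{
    size_t i = 0;

    assert(code != NULL);

    if (i < size && code[i] == 0x67)
        i++;                                /* address-size prefix */
    if (i + 2 > size)
        return 0;                           /* need opcode + ModR/M */

    uint8_t op = code[i++];
    if (op != 0x31 && op != 0x89 && op != 0x8B)
        return 0;

    uint8_t modrm = code[i++];
    uint8_t mod   = (uint8_t)(modrm >> 6);  /* addressing mode */
    uint8_t rm    = (uint8_t)(modrm & 7);   /* register/memory field */

    if (mod != 3 && rm == 4) {              /* a SIB byte follows */
        if (mod == 0 && i < size && (code[i] & 7) == 5)
            i += 4;                         /* SIB base 101: disp32 */
        i += 1;
    }
    if (mod == 1)
        i += 1;                             /* 8-bit displacement */
    else if (mod == 2)
        i += 4;                             /* 32-bit displacement */
    else if (mod == 0 && rm == 5)
        i += 4;                             /* disp32, no base register */

    return i;
}
```

With these rules, 31 C0 comes out as 2 bytes and 67 8B 07 as 3 bytes. Both instructions are decoded the same way; only the ModR/M byte decides how many more bytes get consumed.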

This is a neat opportunity for us, as it implies we can in theory output raw bytes that correspond to valid x86 instructions and omit the operand bytes.
This way, the disassembler will expect the operands and will try to read them. It will swallow the first bytes of the next instructions as operands, and go crazy because it then keeps decoding from the wrong offset.
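
Here's a rough C sketch of that desynchronisation, assuming a naive disassembler that decodes one instruction after the other (a linear sweep). The decoder is a toy covering a handful of opcodes, and the names sweep_code, sweep_length and linear_sweep are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A mov cut short after its SIB byte, followed by real instructions. */
const uint8_t sweep_code[] = {
    0x89, 0x84, 0xD9,   /* 0: mov with its 4 displacement bytes missing */
    0x58,               /* 3: pop rax      */
    0x31, 0xC0,         /* 4: xor eax, eax */
    0xC3,               /* 6: ret          */
    0x90                /* 7: nop          */
};

/*
 * Toy decoder (assumption: only 0x31, 0x89 and 0x8B take a ModR/M byte,
 * every other byte is treated as a one-byte instruction).
 */
static size_t sweep_length(const uint8_t *p, size_t avail)
{
    assert(avail > 0);

    uint8_t op = p[0];
    if ((op != 0x31 && op != 0x89 && op != 0x8B) || avail < 2)
        return 1;

    uint8_t mod = (uint8_t)(p[1] >> 6);
    uint8_t rm  = (uint8_t)(p[1] & 7);
    size_t  len = 2;                        /* opcode + ModR/M */

    if (mod != 3 && rm == 4) {              /* SIB byte */
        if (mod == 0 && avail >= 3 && (p[2] & 7) == 5)
            len += 4;
        len += 1;
    }
    if (mod == 1)
        len += 1;                           /* disp8  */
    else if (mod == 2)
        len += 4;                           /* disp32 */
    else if (mod == 0 && rm == 5)
        len += 4;                           /* disp32 */
    return len;
}

/* Records the offset of every instruction found when decoding from `start`. */
size_t linear_sweep(const uint8_t *code, size_t size, size_t start,
                    size_t *starts, size_t max)
{
    size_t n = 0, off = start;

    while (off < size && n < max) {
        starts[n++] = off;
        off += sweep_length(code + off, size - off);
    }
    return n;
}
```

Sweeping from offset 0 finds only two instructions, at offsets 0 and 7: the pop, the xor and the ret have been eaten as displacement bytes. Sweeping from offset 3, where the real code starts, finds all four.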

The hard part here is that we obviously want our code to be executable.
Trying to execute incomplete instructions will most certainly result in a crash.
So we want these incomplete instructions in a dead branch of our code; that is, a branch that will never be executed, but that will still be read by the disassembler.
Something like:

if( false )
{
    /* Incomplete instructions here... */
}

Of course, we don't want this to be so obvious, so we'll need to use some other technique here.

Faking a return

When calling a function, using the call instruction, a few things happen before the target code is reached.
Mainly, the return address is pushed onto the stack. Later, from the called function, when the ret instruction is executed, the CPU will pop that return address, and jump back to the caller.
This pattern is recognised by all disassemblers. That's how they are able to generate complete call graphs.

Again, this is a nice opportunity for us.
First of all, we might be able to break some disassemblers, as many will expect a function to have a single ret instruction.
Their ability to generate a call graph will then be greatly compromised.

Then it will help us insert dead code in our program, that will still be seen by the disassembler as actual and valid code.
In order to do this, we can basically place another return address on the stack, corresponding to a valid portion of our code.
When ret is executed, the CPU will jump to that portion instead of the original caller, giving us control again.

Just like a local jmp, but hidden from the disassembler.
Let's see how we can implement this in assembly.

But first of all, let's take a look at stack frames.
The stack is a memory region that is basically used as scratch memory for functions.
This is where local and temporary variables are stored.

Now to avoid overwriting values from other functions, each called function will create its own stack frame.
The CPU has two registers for the stack: rbp and rsp.
The first one is the base pointer, and contains the start address of the local stack frame.
The second is a pointer to the top of the stack. When using push or pop instructions, this one will change accordingly.

This is why functions usually start with the following prologue:

push rbp
mov  rbp, rsp

The first instruction saves the original base pointer, and the second one sets it to the top of the stack.
This way, we can have our own stack space, for the current function.

So let's start by saving the registers we are going to use, so we can restore them later and hopefully not break anything:

push rax
push rcx
push rdx

Then we'll save the current stack pointer (rsp) into the rcx register, again so we'll be able to restore it later:

mov rcx, rsp

Now we'll reset the stack pointer (rsp) to the base pointer (rbp).
Doing this effectively rewinds the stack to where it was right after the function's prologue, with the saved base pointer on top.

mov rsp, rbp

This is where it gets interesting, because as we moved the stack pointer, the next two values stored in the stack that we can pop are the original base pointer and the return address.
We'll pop the base pointer in rbp, and the return address in rax. We'll restore them later:

pop rbp
pop rax

Now we can push another return address, of a location we know:

lea  rdx, [ rip + 97 ]
push rdx
push rdx

Here, the first line loads a specific address into the rdx register.
rip is the instruction pointer; in rip-relative addressing, it holds the address of the next instruction. +97 is simply an offset from there, pointing at the target code we'll want to execute.
We'll then have some room for garbage code, and some other neat tricks.

We'll obviously push this new return address, as we popped the old one earlier.
Note that we do it twice, so the stack keeps its original alignment (these two slots used to hold the base pointer and the return address).

And finally:

ret

This is where the magic occurs. For most disassemblers, our function is over here, as we've hit a ret instruction.
Control flow should return to the caller, but instead, it will simply jump a few bytes further, as we've overwritten the return address.
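
If the mechanics feel a bit abstract, here's a small C model of call and ret, under the assumption that a stack of 8-byte words and a rip value are all we need to track. The toy_cpu structure and every function name are hypothetical; nothing here touches the real stack.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 16

/* Toy model: rsp is an index into stack[], and the stack grows downwards. */
struct toy_cpu {
    uint64_t rip;
    uint64_t rsp;
    uint64_t stack[STACK_WORDS];
};

void toy_push(struct toy_cpu *c, uint64_t value)
{
    assert(c->rsp > 0);
    c->stack[--c->rsp] = value;
}

uint64_t toy_pop(struct toy_cpu *c)
{
    assert(c->rsp < STACK_WORDS);
    return c->stack[c->rsp++];
}

/* call: push the address of the next instruction, then jump. */
void toy_call(struct toy_cpu *c, uint64_t target, uint64_t insn_len)
{
    toy_push(c, c->rip + insn_len);
    c->rip = target;
}

/* ret: pop whatever is on top of the stack into rip. */
void toy_ret(struct toy_cpu *c)
{
    c->rip = toy_pop(c);
}

/* A well-behaved function: ret goes back to the caller. */
uint64_t normal_return(uint64_t caller, uint64_t callee)
{
    struct toy_cpu c = { .rip = caller, .rsp = STACK_WORDS };

    toy_call(&c, callee, 5);    /* a call with a 32-bit offset is 5 bytes */
    toy_ret(&c);
    return c.rip;
}

/* Same call, but the return address is swapped before ret runs. */
uint64_t hijacked_return(uint64_t caller, uint64_t callee, uint64_t hidden)
{
    struct toy_cpu c = { .rip = caller, .rsp = STACK_WORDS };

    toy_call(&c, callee, 5);
    (void)toy_pop(&c);          /* take the genuine return address off */
    toy_push(&c, hidden);       /* put our own address in its place    */
    toy_ret(&c);
    return c.rip;
}
```

In this model, normal_return(0x1000, 0x2000) ends at 0x1005, right after the call, while hijacked_return(0x1000, 0x2000, 0x2061) ends at 0x2061. The ret instruction itself is identical in both cases, which is why a static tool has a hard time telling them apart.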

What can we do from here?

Fake functions

Well, we have some room until we reach the code portion that will be executed.
Let's try to fool the disassembler a little more.

Now what logically comes after the ret instruction of a function?
The start of another function.

Let's do one.
Remember this is completely dead code, that will never be executed.
We are just doing some stuff that will seem logical for a disassembler.

Let's start with a standard stack frame:

push rbp
mov  rbp, rsp

Jump all over the place

What can we do next?

Well, disassemblers are smart.
You can try to fool them in many ways, but sometimes they'll eventually recover.

They do this by analysing your program's flow. That is, your jmp and call instructions.
Even if the code seems completely garbage (like if you used the incomplete instruction trick), they might be able to recover if they see a jump to a valid code location.
Instead of reading garbage, they'll just start disassembling again from that location.
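
Here's a hedged C sketch of that recovery, comparing a plain linear sweep with a pass that follows jumps (often called recursive traversal). It assumes a toy instruction set of a few opcodes, and the names flow_code, flow_length, flow_linear and follow_flow are made up for the example; real disassemblers combine many more heuristics.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A short jump over a truncated mov, then the real code. */
const uint8_t flow_code[] = {
    0xEB, 0x03,         /* 0: jmp short +3 (lands on offset 5)       */
    0x89, 0x84, 0xD9,   /* 2: mov with its 4 displacement bytes cut  */
    0x58,               /* 5: pop rax      */
    0x31, 0xC0,         /* 6: xor eax, eax */
    0xC3                /* 8: ret          */
};

/* Toy decoder: unknown bytes count as one-byte instructions. */
static size_t flow_length(const uint8_t *p, size_t avail)
{
    assert(avail > 0);

    uint8_t op = p[0];
    if (op == 0xEB)
        return 2;                           /* jmp rel8 */
    if ((op != 0x31 && op != 0x89 && op != 0x8B) || avail < 2)
        return 1;

    uint8_t mod = (uint8_t)(p[1] >> 6);
    uint8_t rm  = (uint8_t)(p[1] & 7);
    size_t  len = 2;

    if (mod != 3 && rm == 4)
        len += 1;                           /* SIB byte */
    if (mod == 1)
        len += 1;                           /* disp8  */
    else if (mod == 2)
        len += 4;                           /* disp32 */
    else if (mod == 0 && rm == 5)
        len += 4;                           /* disp32 */
    return len;
}

/* Decodes byte after byte, ignoring control flow. */
size_t flow_linear(const uint8_t *code, size_t size, size_t *starts, size_t max)
{
    size_t n = 0, off = 0;

    while (off < size && n < max) {
        starts[n++] = off;
        off += flow_length(code + off, size - off);
    }
    return n;
}

/* Follows jmp rel8 to its target and stops at ret. */
size_t follow_flow(const uint8_t *code, size_t size, size_t *starts, size_t max)
{
    size_t n = 0, off = 0;

    assert(code != NULL && starts != NULL);

    while (off < size && n < max) {
        uint8_t op = code[off];

        starts[n++] = off;
        if (op == 0xC3)
            break;                          /* ret: this path ends */
        if (op == 0xEB && off + 1 < size)
            off = (size_t)((ptrdiff_t)off + 2 + (int8_t)code[off + 1]);
        else
            off += flow_length(code + off, size - off);
    }
    return n;
}
```

On this buffer, the linear sweep reports instructions at offsets 0 and 2 only, while the jump-following pass reports 0, 5, 6 and 8, which are the real ones. That's the behaviour the bogus jumps are meant to disturb: the more misleading targets there are, the less the tool can trust any of them.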

So I found it can actually be useful to write bogus jump instructions, jumping anywhere in your code.
This will usually mess a bit more with the control graph, and the disassembler's ability to recover.

As we have room for some garbage code, let's do this:

xor  rax, rax
cmp  rax, rax

This just zeroes the rax register, and compares it with itself.
Useless, but remember this is dead code.

Now following a cmp instruction, we expect some kind of branching:

je j0

Meaning that if the comparison is true (it surely is), we jump to the local j0 label, which we'll define later.
And let's continue a bit more, with other random comparisons, and other jumps:

cmp rax, rdi
je j1
add rax, 0xCAFE
cmp rax, rsi
je j2
cmp rax, rdx
je j3
cmp rax, rcx
je j4
jmp 24[ rip ]

We are here just comparing useless stuff with useless stuff, and jumping to some local labels.
Again, just to mess with the control graph.
The last instruction jumps to a random location, based on the current instruction pointer.

So this is just:

if( ... )
{
    goto ...;
}
else if( ... )
{
    goto ...;
}
else
{
    goto ...;
}

Now we'll simply define these local labels, and in each one jump to another random location:

j0:
    jmp 16[ rip ]
j1:
    jmp 48[ rip ]
j2:
    jmp 64[ rip ]
j3:
    jmp 128[ rip ]
j4:
    jmp 256[ rip ]

Restoring everything

Now, at this point, the disassembler should be pretty confused.
It's time for us to go back to real code.

Remember our return address override?
It was [ rip + 97 ].

That +97 offset brings us just here, accounting for all the previous instructions we wrote.
So let's undo all the mess we've done:

pop  rdx
push rax

We pushed the new return address twice to keep the stack alignment, and ret only consumed one copy. So we first pop the leftover copy into rdx, which is a safe register for us to use at this point. Then we push the original return address, which we saved in rax, back onto the stack.

The original base pointer was saved in rbp previously, let's push it again:

push rbp

And now we can simply restore our previous stack frame (rsp was saved to rcx):

mov rbp, rsp
mov rsp, rcx

And that gives us the opportunity to restore the three registers we pushed on the stack earlier, because we were going to use them:

pop rdx
pop rcx
pop rax

At this very specific point, it's just as if nothing happened.
The stack frame and the registers are in exactly the same state.
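
If you want to convince yourself of that, here's a small C model that replays the whole sequence on a fake stack and lets you compare the state before and after. It's only a sketch: the toy_state structure, the function names and all the initial values are arbitrary assumptions, and it models nothing but the six registers and a few stack slots.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define WORDS 32

struct toy_state {
    uint64_t rax, rcx, rdx, rbp, rsp, rip;
    uint64_t mem[WORDS];            /* stack memory, address = index * 8 */
};

static void st_push(struct toy_state *s, uint64_t value)
{
    assert(s->rsp >= 8);
    s->rsp -= 8;
    s->mem[s->rsp / 8] = value;
}

static uint64_t st_pop(struct toy_state *s)
{
    assert(s->rsp / 8 < WORDS);
    uint64_t value = s->mem[s->rsp / 8];
    s->rsp += 8;
    return value;
}

/* State inside a function, after its prologue and "sub rsp, 0x10". */
struct toy_state toy_initial(void)
{
    struct toy_state s;

    memset(&s, 0, sizeof s);
    s.rax = 0x1111;
    s.rcx = 0x2222;
    s.rdx = 0x3333;
    s.rbp = 24 * 8;                 /* base of the current frame    */
    s.mem[24] = 0xB0B0;             /* caller's saved base pointer  */
    s.mem[25] = 0x401234;           /* genuine return address       */
    s.rsp = s.rbp - 16;
    s.rip = 0x402000;
    return s;
}

struct toy_state toy_fake_return(struct toy_state s, uint64_t target)
{
    st_push(&s, s.rax);             /* push rax                     */
    st_push(&s, s.rcx);             /* push rcx                     */
    st_push(&s, s.rdx);             /* push rdx                     */
    s.rcx = s.rsp;                  /* mov rcx, rsp                 */
    s.rsp = s.rbp;                  /* mov rsp, rbp                 */
    s.rbp = st_pop(&s);             /* pop rbp                      */
    s.rax = st_pop(&s);             /* pop rax                      */
    s.rdx = target;                 /* lea rdx, [ rip + offset ]    */
    st_push(&s, s.rdx);             /* push rdx                     */
    st_push(&s, s.rdx);             /* push rdx                     */
    s.rip = st_pop(&s);             /* ret                          */

    /* Execution resumes at the target: undo everything. */
    s.rdx = st_pop(&s);             /* pop rdx                      */
    st_push(&s, s.rax);             /* push rax                     */
    st_push(&s, s.rbp);             /* push rbp                     */
    s.rbp = s.rsp;                  /* mov rbp, rsp                 */
    s.rsp = s.rcx;                  /* mov rsp, rcx                 */
    s.rdx = st_pop(&s);             /* pop rdx                      */
    s.rcx = st_pop(&s);             /* pop rcx                      */
    s.rax = st_pop(&s);             /* pop rax                      */
    return s;
}
```

Feeding toy_initial() through toy_fake_return leaves rax, rcx, rdx, rbp and rsp unchanged, with the saved base pointer and the genuine return address back in their slots; only rip has moved to the target. The slots below rsp do get written to, but that's scratch space anyway.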

This is great, because it means our software will run unaffected.
But it's also great because we produced a lot of garbage for the disassembler.

One more thing

Now there's one more thing we can do, before continuing normal code execution.
We spoke about incomplete instructions, but we never actually used them.

Now is the right moment.

The idea of offsetting a disassembler is great, but I found in practice that many disassemblers are quite robust to incomplete instructions.
But now that we've messed so much with the disassembler's ability to generate a control graph, and to detect that we're actually inside a single function, it might be quite effective.

Now we're still in a valid code section, although it might not be recognised as such by the disassembler.
Let's do some shit, and jump to another valid code section:

push rax
xor  rax, rax
jz   done

This pushes the rax register on the stack, zeroes it, and jumps to a done label: the xor sets the zero flag, so the jz is always taken.
Nothing scary here, but I don't expect the disassembler to see the done label because of the mess we just made.

Now let's output an actual incomplete instruction:

.byte 0x89
.byte 0x84
.byte 0xD9

For an x86-64 processor, that is:

0x89 is the opcode for the mov (r/m16/32/64, r16/32/64) instruction, which stores a register into a register or memory operand.
0x84 (1000 0100) is MOD-REG-R/M for a four byte displacement following SIB, with EAX (000) as source register.
0xD9 (1101 1001) is SIB for 8 as scale, RBX as index (011) and RCX as base (001).

As you can see, the four displacement bytes are omitted, so the instruction is incomplete.
Assuming the disassembler is able to reach this location, this will fool it as it will try to interpret the next instructions as the displacement bytes.

Note that the instruction, if complete, would translate to:

mov [ rcx + rbx * 8 + displacement ], eax
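
The field breakdown of the two bytes following the opcode can be double-checked with a few shifts and masks. Here's a minimal C sketch; the structure and function names are made up for the example, and disp_size ignores the special case where a SIB byte alone requests a displacement.

```c
#include <assert.h>
#include <stdint.h>

struct modrm_fields { uint8_t mod, reg, rm; };
struct sib_fields   { uint8_t scale, index, base; };

/* ModR/M layout: mod (2 bits) | reg (3 bits) | r/m (3 bits). */
struct modrm_fields decode_modrm(uint8_t byte)
{
    struct modrm_fields f;

    f.mod = (uint8_t)(byte >> 6);
    f.reg = (uint8_t)((byte >> 3) & 7);
    f.rm  = (uint8_t)(byte & 7);
    return f;
}

/* SIB layout: scale (2 bits, as a power of two) | index (3) | base (3). */
struct sib_fields decode_sib(uint8_t byte)
{
    struct sib_fields f;

    f.scale = (uint8_t)(1u << (byte >> 6));
    f.index = (uint8_t)((byte >> 3) & 7);
    f.base  = (uint8_t)(byte & 7);
    return f;
}

/* Number of displacement bytes implied by the ModR/M byte alone. */
unsigned disp_size(struct modrm_fields f)
{
    assert(f.mod <= 3);

    if (f.mod == 1)
        return 1;                   /* 8-bit displacement  */
    if (f.mod == 2)
        return 4;                   /* 32-bit displacement */
    if (f.mod == 0 && f.rm == 5)
        return 4;                   /* disp32 without base */
    return 0;
}
```

For 0x84 this gives mod 2, reg 0 and r/m 4 (meaning a SIB byte follows), hence 4 displacement bytes. For 0xD9 it gives a scale of 8, index 3 (RBX) and base 1 (RCX).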

Now we simply have to declare our done label, pop rax, and we can continue normal execution:

done:
    pop rax

Wrapping up

We're basically done, and I hope you found this article interesting.

Now remember this is a basic approach to some kind of obfuscation, for a specific platform.
In practice, I found that mixing different techniques in some specific way usually gives the best results.

That being said, disassemblers are very smart, and getting smarter each day.
Each one uses different heuristics, so as I said at the beginning of the article, there's really no silver bullet.

But if you're looking into obfuscation, my only hope is that this article gave you some ideas… : )
As always, you can find the code for the article on my GitHub.
Cheers!