Detours into Arcana: A Static Malware Analysis Trick

June 23, 2015 Rolf Rolles

Several friends asked for my advice on a particular technique seen in some recent malware samples, which is similar in nature to the Nanomites feature of the Armadillo copy protection. There are three parts to this technique:

When creating the malware binary, replace control transfer (call and jmp) instructions with privileged instructions.
At run-time, install an exception handler.
Catch the exceptions thrown by the privileged instructions, perform the intended control transfers, and resume execution.

A concrete example follows.

UPX0:004081F7 in eax, dx
UPX0:004081F8 dw 4EE0h

When the process attempts to execute this instruction, an exception is generated. The exception handler checks to see that the faulting instruction is "in eax, dx", then reads the word following the instruction, and generates a call to location 0x405000 + 0x4EE0. In other words, call instructions within the module are replaced by:

in eax, dx
dw call_destination - 0x405000

As malware analysts, we would like to deobfuscate the binary by replacing these privileged sequences with the original call instructions, both for our own sake as well as the ability to use the Hex-Rays decompiler (which otherwise balks at the use of privileged instructions). However, the particular implementation within this sample poses a slight conundrum. The sequence "in eax, dx / dw XXXXh" is three bytes long, whereas the original "call 409EE0h" is five bytes. Therefore, we cannot merely rewrite the original instruction atop the privileged one without overwriting subsequent instructions.

A second idea is to use detouring, another staple technique in reverse engineering. We could find or create some unused space within the binary, patch a jmp from the privileged instruction to that new location, recreate the original instruction at that location, and then patch a jmp back to the location after the privileged instruction. However, this idea is flawed for the same reason: a long jmp instruction is five bytes long, so we would also alter subsequent instructions.

Bill Gates questionably said "I choose a lazy person to do a hard job, because a lazy person will find an easy way to do it." After some thought, I recalled a bit of x86 processor arcana that can help us fit our detours into the three-byte spaces provided by the obfuscator: the address-size prefix, 0x67. Quoth the Intel manuals: "Address calculations are first truncated to the effective address size of the current mode, as overridden by any address-size prefix. The result is then zero-extended to the full address width." I.e., when operating in 32-bit mode, if we prefix an instruction that references an address with 0x67, the address will be truncated to 16-bits.

To be specific, consider the following:

UPX1:00410320 EB 00 jmp near ptr unk_410322 
; jump to 0x00410320 + 2(length of jmp instruction) + 0x00

When we place an address-size prefix on this instruction, we get:

UPX1:00410320          db 67h
UPX1:00410320 67 EB 00 jmp near ptr unk_323 
; jump to (0x00410320 + 3(length of jmp instruction) + 0x00) & 0xFFFF

To use this trick for deobfuscation, we must first create a segment at address 0x0 of length 0xFFFF.

Recalling our motivating example:

UPX0:004081F7 in eax, dx ; obfuscated call 0x405000 + 0x4EE0
UPX0:004081F8 dw 4EE0h

Let us overwrite these bytes with 67 EB FD:

UPX0:004081F7 db 67h
UPX0:004081F7 jmp short loc_81F7

This is the first half of our detour. Now, at location 81F7h, let's replicate the original control transfer instruction and add a jmp back to the location after the obfuscated sequence:

ROLFMSRE:000081F7 loc_81F7:
ROLFMSRE:000081F7 call    sub_409EE0
ROLFMSRE:000081FC jmp     loc_4081FA

And now we have accomplished our goal. Once we have replaced the obfuscated control transfers, we human analysts can read the listing more easily, and Hex-Rays no longer has trouble decompiling the code.

SMT Solvers for Software Security

August 6, 2012 Rolf Rolles

I co-authored a paper at WOOT '12 on applying SMT solvers to software security issues.

Video of my RECON 2012 Keynote: The Case for Semantics-Based Methods in Reverse Engineering

July 2, 2012 Rolf Rolles

Originally published July 2nd, 2012 on OpenRCE

My previous blog entry concerned the keynote speech that I gave at RECON 2012, entitled "The Case for Semantics-Based Methods in Reverse Engineering". (You can find a link to the slides at that previous entry.) In that blog, based on conversations that I had had with the RECON organizers, I made statements to the effect that the video had been destroyed. It turns out that the video was in fact not destroyed. You can watch it here. I would recommend reading along with the slides, seeing as I apparently skipped a couple of slides during the beginning of my presentation. Blah

RECON 2012 Keynote: The Case for Semantics-Based Methods in Reverse Engineering

June 18, 2012 Rolf Rolles

Originally published June 18th, 2012 on OpenRCE.

The goal of my RECON 2012 keynote speech was to introduce methods in academic program analysis and demonstrate -- intuitively, without drawing too much on formalism -- how they can be used to solve practical problems that are interesting to industrial researchers in the real world. Given that it was the keynote speech, and my goal of making the material as accessible as possible, I attempted to make my points with pictures instead of dense technical explanations. As a result, one might consider this presentation to be a friendly (but decidedly incomplete) introduction to binary program analysis as opposed to a rigorous mathematical monograph. The presentation features five detailed expositions of applying static program analysis (abstract interpretation and SMT solving) towards practically-interesting reverse engineering problems. (Aside: it's quite challenging to present this material without using terms such as "lattice", "Galois connection", etc.!)

Unfortunately, due to an error with the camera, the recording of the talk does not exist. This is problematic: I failed somewhat in walking the sharp edge of Einstein's razor, "as simple as possible, but no simpler" -- it was in fact made simpler than what was possible, and some important details (for example, about relational abstract interpretation and reduced products) were included in the spoken material but not the actual slides. Therefore, the learned reader is advised to imagine judiciously-placed asterisks and the accompanying errata, and the untutored pupil would be well-advised to recognize the incomplete and intuitive nature of the exposition and perhaps consult this program analysis reading list.

~~I would like to give the talk at some other conference at which the video can be reliably recorded, so that it may be published online.~~

Here are the slides.

Compiler 1, X86 Virtualizer 0

April 4, 2008 Rolf Rolles

Originally published on April 4th, 2008. This post won Honorable Mention for Most Innovative Research at the Pwnie Awards 2008.

There are two types of virtual machine software protections:  A) the ones that convert x86 machine code into virtual machine bytecode and execute it at runtime; B) the ones that execute some arbitrary code in a virtual environment.  I've discussed the latter several times in the past, and by now there exists a wealth of literature on that variety.  But breaking the former kind remains an unsolved problem.

In my article I said "basically, reverse engineering a VM with the common tools is like reverse engineering a scripted installer without a script decompiler: it's repetitious, and the high-level details are obscured by the flood of low-level details".  The more I thought about this, the more I realized that the word "basically" is out of place:  virtualizing software protections are programming language interpreters, albeit for weird languages.

Consequently, an idea struck me:  what we want here is not an interpreter, but a compiler to compile the bytecode back into x86 machine code.  I spent a week coding one (~1000 lines) in OCaml to test this theory, and I'm able to report that, indeed, it works.  I chose ReWolf's x86 Virtualizer, a simple target that uses some of the same techniques as the heavy hitters in this area.  Here is a walkthrough of the analysis and recompilation of a small function with one basic block.  The compiler works equally well for arbitrarily-large functions, although that would make this posting unnecessarily long and complicated.

Step -2:  Protect something with the virtualizer.  In this case I just used ReWolf's sample executable itself.

.text:00401896 call ds:GetTickCount
.text:0040189C push eax
.text:0040189D call _srand
.text:004018A2 pop ecx
.text:004018A3 push 0
.text:004018A5 push offset DialogFunc
.text:004018AA push 0
.text:004018AC push 65h
.text:004018AE push [esp+10h+hInstance]
.text:004018B2 call ds:DialogBoxParamA
.text:004018B8 xor eax, eax
.text:004018BA retn 10h

Step -1: Analyze the virtual machine. Although this was not strictly necessary in this case because ReWolf provided source code, I decided to ignore it and reverse the VM manually, since you don't always have such niceties.

Step 0: Break the polymorphism in the instruction set. I made use of two remarkably ghetto hacks here, one of which may be considered elegant. To avoid provoking any arms races I'll omit the details.

Step 1: Disassemble the relevant region into VM bytecode. In the process, construct a graph in which each vertex is an instruction, and the edges are the flows between them.

.VM:004131D0 db 0C2h, 0C9h, 0C0h, 0BDh, 14h, 0DFh, 63h, 9Ah, 86h, 5Eh, 50h, 30h, 0Bh
.VM:004131D0 db 0Ah, 0C0h, 0C7h, 0CEh, 5Eh, 44h, 0E1h, 0E0h, 0C7h, 0FCh, 0FDh, 12h
.VM:004131D0 db 10h, 50h, 0D8h, 0D2h, 0DBh, 0A6h, 3Dh, 34h, 0C9h, 12h, 0DEh, 0E5h, 4Bh
.VM:004131D0 db 2Ch, 2Eh, 6Eh, 23h, 21h, 27h, 0E2h, 0E5h, 0ECh, 99h, 14h, 13h, 0C2h
.VM:004131D0 db 0E5h, 0F9h, 0FDh, 0F4h, 38h, 14h, 0F7h, 0F0h, 0F9h, 0ABh, 79h, 6, 0D7h
.VM:004131D0 db 0F0h, 8Bh, 88h, 81h, 41h, 87h, 8Ch, 85h, 0F8h, 51h, 9Ah, 26h, 0DFh
.VM:004131D0 db 0CFh, 1Eh, 15h, 75h, 76h, 74h, 6Bh, 98h, 9Dh, 94h, 6Eh, 0Ch, 6Bh, 90h
.VM:004131D0 db 93h, 9Ah, 0Fh

becomes

vertexlist =
[{label = 84; instruction = VMExit 16l};
{label = 81; instruction = LiteralInstruction [|51; 192|]};
{label = 69; instruction = ImagebaseFixupInstruction ([|255; 21; 72; 161; 0; 0|], 2l)};
{label = 65; instruction = PushDereferencedTemp};
{label = 57; instruction = AddImmediateToTemp 20l};
{label = 52; instruction = AddRegisterToTemp Esp};
{label = 44; instruction = SetTemp 0l};
{label = 41; instruction = LiteralInstruction [|106; 101|]};
{label = 38; instruction = LiteralInstruction [|106; 0|]};
{label = 27; instruction = ImagebaseFixupInstruction ([|104; 240; 22; 0; 0|], 1l)};
{label = 24; instruction = LiteralInstruction [|106; 0|]};
{label = 22; instruction = LiteralInstruction [|89|]};
{label = 14; instruction = X86Call 6471l};
{label = 12; instruction = LiteralInstruction [|80|]};
{label = 0;instruction = ImagebaseFixupInstruction ([|255; 21; 40; 160; 0; 0|], 2l)}];
edgelist =
[({contents = {label = 0}},{contents = {label = 12}});
({contents = {label = 12}}, {contents = {label = 14}});
({contents = {label = 14}}, {contents = {label = 22}});
({contents = {label = 22}}, {contents = {label = 24}});
(* Lots and lots of edges removed *)]

Step 2: Form basic blocks within the instruction-level CFG. The previous output becomes:

vertexlist =
[{label = 0;
 instruction =
[|ImagebaseFixupInstruction ([|255; 21; 40; 160; 0; 0|], 2l);
LiteralInstruction [|80|]; 
X86Call 6471l; 
LiteralInstruction [|89|];
LiteralInstruction [|106; 0|];
ImagebaseFixupInstruction ([|104; 240; 22; 0; 0|], 1l);
LiteralInstruction [|106; 0|]; 
LiteralInstruction [|106; 101|];
SetTemp 0l; 
AddRegisterToTemp Esp; 
AddImmediateToTemp 20l;
PushDereferencedTemp;
ImagebaseFixupInstruction ([|255; 21; 72; 161; 0; 0|], 2l);
LiteralInstruction [|51; 192|]; VMExit 16l|]}];

Step 3: Optimize the code within the basic block. The goal is to convert sequences of VM instructions into a new language more conducive to being compiled back into X86. The optimizer is the most powerful component of my compiler: it can remove obfuscation automatically simply as a side-effect of being an optimizer (not that ReWolf's has any, but others do), and employs no pattern matching.

vertexlist =
[{label = 0;
 instruction =
[|ImagebaseFixupInstruction ([|255; 21; 40; 160; 0; 0|], 2l);
LiteralInstruction [|80|]; 
X86Call 6471l;
LiteralInstruction [|89|];
LiteralInstruction [|106; 0|];
ImagebaseFixupInstruction ([|104; 240; 22; 0; 0|], 1l);
LiteralInstruction [|106; 0|]; 
LiteralInstruction [|106; 101|];
SyntheticInstruction (Push, Plus (Constant 20l, Register Esp));
ImagebaseFixupInstruction ([|255; 21; 72; 161; 0; 0|], 2l);
LiteralInstruction [|51; 192|]; 
VMExit 16l|]}];

Step 4: Recompile all virtual instructions into x86 machine language.

vertexlist =
[{label = 0;
 instruction =
[|ImagebaseFixupInstruction ([|255; 21; 40; 160; 0; 0|], 2l);
LiteralInstruction [|80|];
RelativeFixupInstruction ([|232; 0; 0; 0; 0|], 6471l, 1l);
LiteralInstruction [|89|]; 
LiteralInstruction [|106; 0|];
ImagebaseFixupInstruction ([|104; 240; 22; 0; 0|], 1l);
LiteralInstruction [|106; 0|]; 
LiteralInstruction [|106; 101|];
LiteralInstruction [|255; 116; 36; 20|];
ImagebaseFixupInstruction ([|255; 21; 72; 161; 0; 0|], 2l);
LiteralInstruction [|51; 192|];
LiteralInstruction [|194; 16; 0|]|]}];

Step 5: Stuff the original bytes back into the binary and perform fixups specified. If you can convert between hex and decimal in your head, you'll notice that the bytes above correspond to those below, modulo fixups. For multi-basic-block functions, this is harder, as you have to sequence the blocks and decide between short and long jumps.

.VM:004131D0 FF 15 28 A0 40 00 call ds:GetTickCount
.VM:004131D6 50                push eax
.VM:004131D7 E8 6B E7 FE FF    call loc_401947
.VM:004131DC 59                pop  ecx
.VM:004131DD 6A 00             push 0
.VM:004131DF 68 F0 16 40 00    push offset loc_4016F0
.VM:004131E4 6A 00             push 0
.VM:004131E6 6A 65             push 65h
.VM:004131E8 FF 74 24 14       push dword ptr [esp+14h]
.VM:004131EC FF 15 48 A1 40 00 call ds:DialogBoxParamA
.VM:004131F2 33 C0             xor eax, eax
.VM:004131F4 C2 10 00          retn 10h

Step 6: Celebrate. ReWolf's X86 Virtualizer was simple, and surely breaking the harder ones is, well, harder, but I believe that the general principles espoused here should be applicable to the others.

Here is the source code.