Compiler Optimizations for Reverse Engineers

Originally written Q1 2007 as part of my Static Reverse Engineering Training Class, first published March 8th, 2010 on OpenRCE.

I've decided to release the part of my training material relating to compiler optimizations.  I created this back in January of 2007; for a while I was teaching classes very often, and so it made sense to keep it private.  Nowadays I only teach a few classes a year, and so this presentation is sort of languishing away on my hard drive, which is a shame since it's my favorite part.  I think people will enjoy reading it, so I have decided to make it public.


Since I made this in early 2007, it lacks some of GCC 4's optimizations, and from time to time I realize that I forgot an optimization or two.  But for the most part, it's fairly complete.  If you have any specific suggestions for optimizations that I missed, please either email me or respond via comment with a link to a binary exhibiting said optimization and the address at which I can find it.  Proper compiler-theoretic names for the optimizations, if applicable, are also appreciated.

Here it is.  Enjoy.

C-subset Compiler in OCaml

Originally published on December 25th, 2009 on OpenRCE

Here is the source code for a compiler that I wrote in Objective Caml this semester, for a subset of the C language.  It requires a standalone C->IR translator which is not included in this release, as the school owns the copyright on that particular piece of code.  Hence one cannot immediately use this compiler to compile C programs without writing a C front end; in any case, an existing compiler such as MSVC or GCC would be a better choice for that.

The portion of the code that I wrote (everything except bitv.ml and .mli) totals roughly 3200 lines of code.  It includes two optimizations based on classical data-flow analysis, constant propagation and dead-statement elimination.  It also supports translation into and out of static single assignment form, as well as two optimizations based on SSA:  constant propagation and loop-invariant code motion.  The code for the graph-coloring register allocator is not included.  As a back-end, the compiler produces C code that can be compiled by any other C compiler.

The code is a pretty good example of how to structure medium-sized programs in functional programming languages.  I tried to adopt a pure functional style throughout most of the codebase.  Sometimes this failed:  cfg_ir.ml is unnecessarily ugly, and should have been re-written in an imperative style with mutability; also, I made the mistake of using a mutable array to describe Phi values in static single assignment whereas pure functional style would have dictated a list.  But those are my only complaints with the code; overall, I'm pretty pleased with how it all turned out.

This code is substantially more sophisticated than the compiler that I wrote to break VMProtect, so if you can read and understand this release, you should be in good shape for breaking virtualization obfuscators.

Switch as Binary Search, Part 1

Originally published on November 28th, 2008 on OpenRCE

(Continued from part 0)

We have seen how a compiler-generated binary search partitions the range of integers into successively smaller intervals, ultimately converging upon either a single value or one contiguous range of values.  We will now develop a tool to deal with this construct in a friendly fashion for the working reverse engineer.

The tool has the following high-level specification:

Given the first address of a compiled binary switch construct:

  • Determine the register used and whether the comparisons are signed (trivial)
  • Determine all terminal labels, at which cases reside, and all values that lead to these cases.

A discussion of the problem and its solution follows.

Upon entering the switch, the register value is entirely unconstrained:  the range of values that it might have is contained in the interval [0, MAX_INT] (or [-(MAX_INT+1)/2, MAX_INT/2] for signed integers).  The following example uses an unsigned integer.

Suppose that the first comparison is of the form

@loc0:
cmp eax, val1
jbe @loc1
@loc2:

If the jump is taken, the value must be below or equal to 'val1'.  In other words, it lies in the range [0, val1].  If the jump is not taken, the value must be above 'val1', so it lies in the range [val1+1, MAX_INT].  Therefore, each comparison in the tree partitions the range of possible values into two disjoint ranges:  one for those values where the jump is taken, and one for those where it is not.

At the second level of the tree, there will be two comparisons:  one for the case where the jump for the first was not taken, and one for the case where it was.  Consider the second set of such instructions:

@loc1:
cmp eax, val2
jbe @loc3
@loc4:

@loc2:
cmp eax, val3
jbe @loc5
@loc6:

Each of these cases further constrains the input:

  • The range of values leading to loc3 is [0,         val2]
  • The range of values leading to loc4 is [val2+1,    val1]
  • The range of values leading to loc5 is [val1+1,    val3]
  • The range of values leading to loc6 is [val3+1, MAX_INT]

This table has a very regular structure, and it should not be too hard to imagine what it would look like for three levels into the tree, or four, or...  The following image summarizes the process.  The unique path to a given vertex specifies the constraints required for input to reach it.

As we walk the comparisons in the tree in this fashion, the binary tree will eventually stop and give way to the code for the cases in the switch.  At this point, our constraints will be terminal cases of either single values (most commonly) or of simple ranges.  For each terminal address, we maintain a dictionary associating it with the corresponding ranges that lead there.

The following pseudocode codifies the discussion above.

// The first range returned is that for which the jump is taken;
// the second is for non-jump-taking values
partition(ea, low, high, compared)
{
  if(comparison at ea is ">")
    return [compared+1, high], [low, compared]

  if(comparison at ea is ">=")
    return [compared, high], [low, compared-1]

  if(comparison at ea is "<")
    return [low, compared-1], [compared, high]

  if(comparison at ea is "<=")
    return [low, compared], [compared+1, high]

  if(comparison at ea is "==")
    return [compared, compared], [low, high]

  if(comparison at ea is "!=")
    return [low, high], [compared, compared]
}

The following recursive algorithm solves the general problem.

analyze_bswitch(ea_t ea, int low, int high, int compared)
{
  if(ea is cmp reg, constant)
    compared = instruction's immediate value 
    ea = address of next instruction

  if(ea is a conditional jump)
    [low1,high1], [low2,high2] = partition(ea, low, high, compared)
    analyze_bswitch(jump taken ea, low1, high1, compared)
    analyze_bswitch(jump not taken ea, low2, high2, compared)

  // Instruction is a leaf in the binary tree
  else
    associate(ea, low, high)
}
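The two routines above can be made concrete as a short, runnable Python sketch.  The node encoding, label names, and comparison values below are my own inventions for illustration, and only the unsigned jump conditions are modelled:

```python
# Runnable model of partition() / analyze_bswitch() from the text.
# A node is ("jbe", compared, taken_target, fallthrough_target) for a
# cmp/jcc pair, or ("leaf",) for a terminal case label.
MAX_INT = 0xFFFFFFFF

def partition(cond, low, high, compared):
    """Split [low, high] into (jump-taken range, fall-through range)."""
    if cond == "ja":   # unsigned >
        return (compared + 1, high), (low, compared)
    if cond == "jae":  # unsigned >=
        return (compared, high), (low, compared - 1)
    if cond == "jb":   # unsigned <
        return (low, compared - 1), (compared, high)
    if cond == "jbe":  # unsigned <=
        return (low, compared), (compared + 1, high)
    raise ValueError(cond)

def analyze_bswitch(tree, ea, low, high, cases):
    node = tree[ea]
    if node[0] == "leaf":
        # Terminal label: remember the range of inputs that leads here.
        cases.setdefault(ea, []).append((low, high))
        return
    cond, compared, taken, fallthrough = node
    (l1, h1), (l2, h2) = partition(cond, low, high, compared)
    analyze_bswitch(tree, taken, l1, h1, cases)
    analyze_bswitch(tree, fallthrough, l2, h2, cases)

# The two-level tree from the text, with val1=100, val2=50, val3=200.
tree = {
    "loc0": ("jbe", 100, "loc1", "loc2"),
    "loc1": ("jbe",  50, "loc3", "loc4"),
    "loc2": ("jbe", 200, "loc5", "loc6"),
    "loc3": ("leaf",), "loc4": ("leaf",),
    "loc5": ("leaf",), "loc6": ("leaf",),
}
cases = {}
analyze_bswitch(tree, "loc0", 0, MAX_INT, cases)
# cases now maps loc3..loc6 to [0,50], [51,100], [101,200], [201,MAX_INT],
# matching the table above.
```

The real IDAPython script reads the comparison values and jump targets out of the disassembly rather than a prebuilt dictionary, but the recursion is the same.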

IDAPython-1.0 compatible source code is available here.

The resulting disassembly is properly annotated with case labels, as IDA does normally:

AUTO:0046348E   jbe     loc_463D16      ; case 0A0h
AUTO:00463494   cmp     eax, 0A2h
AUTO:00463499   jb      loc_462FA0      ; case 0837F81BAh, 0837F81D8h, 0A1h, 091h
AUTO:0046349F   ja      loc_463D21      ; case 0837F90BAh, 0A3h
AUTO:004634A5 loc_4634A5:               ; CODE XREF: sub_462120:loc_462EDAj
AUTO:004634A5                           ; sub_462120+DDDj ...
AUTO:004634A5   push    ebx             ; case 0837F8106h, 0837F8124h, 0A2h, 031h
AUTO:004634A6   call    sub_45C630
AUTO:004634AB   jmp     loc_462217

Switch as Binary Search, Part 0

Originally posted on OpenRCE on November 28th, 2008

For a standard switch with a contiguous region of cases, as in the below, compilers generate constant-time lookups by treating the cases as indexes into an array of pointers.

switch(x)
{
  case 10: /* case 10 code */ break;
  // No case 11 (default)
  case 12: /* case 12 code */ break;
  case 13: /* case 13 code */ break;
  case 14: /* case 14 code */ break;
  // No case 15 (default)
  case 16: /* case 16 code */ break;
  default: /* default code */ break;
}

This might compile into the following:

sub eax, 10 
cmp eax, 6 
ja @default 
jmp dword ptr switch_table[eax*4] 

switch_table dd offset loc_case0
             dd offset default
             dd offset loc_case2
             dd offset loc_case3
             dd offset loc_case4
             dd offset default
             dd offset loc_case6
             ... 

For switches with multiple labels pointing to the same code, some compilers save space by generating doubly-indexed, tabular code.

switch(x)
{
  case 0x00..0x20: /* 0x00 <= x < 0x20 code */ break;
  case 0x21..0x40: /* 0x21 <= x < 0x40 code */ break;
}

This might compile to:

cmp eax, 40h
ja @over
movzx eax, byte ptr index_table[eax]
jmp dword ptr switch_table[eax*4] ; 72 bytes instead of 256

index_table  db 32 DUP(0)
             db 32 DUP(1)
switch_table dd offset loc_case_00_00
             dd offset loc_case_20_40

For switches with non-contiguous, sparsely-distributed cases like the below, the table-based approaches from above are unsuitable, and yet cases like this do appear in practice, so the compiler must have a strategy for dealing with them.

switch(x)
{
  case 0x0:
  case 0x16:
  case 0x326:
  case 0x9821:
  case 0x90826:
  case 0x278946:
  case 0x4578939:
  case 0x12372826:
  default:
}

One obvious solution to this problem would be to generate a sequence of comparisons, one for each case, starting from the lowest case and ending with the highest.  This would work, but the lookup would be O(N) in the number of case labels.  An equally obvious solution, and one that was mentioned in the book "Reversing", is to generate a binary search construct in the generated code to perform lookups in O(log(N)) time.


To briefly review binary searching, consider the (necessarily) sorted sequence [1;2;3;4;5;6;7;8;9], and consider finding the value 2.  We begin by comparing against the middle element, 5.  Since 2 is less than 5, we discard all values 5 and above and consider those below.  Now we have a smaller range of values upon which we can perform the same procedure.

[1;2;3;4]

As before, we compare against the middle element, 3.  Since 2 is less than 3, we can discard 3 and everything above it, leaving [1;2] as the remaining search space.

[1;2]

The middle element is 2, which is our search key, so we stop.  This process took three steps (roughly log2(9)) to complete, compared with up to N=9 steps for a linear scan of the whole array.
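The walkthrough above can be expressed in a few lines of Python; note that `mid` rounds up here so that the probe sequence (5, then 3, then 2) matches the hand trace:

```python
# Binary search over a sorted sequence, counting comparison steps.
def binary_search(seq, key):
    low, high, steps = 0, len(seq) - 1, 0
    while low <= high:
        mid = (low + high + 1) // 2  # round up, as in the walkthrough
        steps += 1
        if seq[mid] == key:
            return mid, steps
        if key < seq[mid]:
            high = mid - 1   # discard the middle element and above
        else:
            low = mid + 1    # discard the middle element and below
    return None, steps

index, steps = binary_search([1, 2, 3, 4, 5, 6, 7, 8, 9], 2)
# index == 1, steps == 3: the probes are 5, then 3, then 2.
```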

The compiler is responsible for generating code that implements these comparisons.  The first comparison will be against the middle element in the collection; whether the input value is above or below it determines whether to repeat the process for the integers above or below the middle one.  At each comparison there is a corresponding equality test to see whether the search key is the same as the comparator.

Below is an example of this construct in the wild.

AUTO:00462141  cmp  eax, 0E3h 
AUTO:00462146  jnb  loc_46220F ; see below 
AUTO:0046214C  cmp  eax, 73h 
AUTO:0046214F  jnb  loc_463131 
AUTO:00462155  cmp  eax, 3Bh 
AUTO:00462158  jnb  loc_463606 
AUTO:0046215E  cmp  eax, 1Fh 
AUTO:00462161  jnb  loc_463835 
AUTO:00462167  cmp  eax, 10h 
AUTO:0046216A  jnb  loc_463948 
AUTO:00462170  cmp  eax, 8 
AUTO:00462173  jnb  loc_4639D1 
AUTO:00462179  cmp  eax, 4 
AUTO:0046217C  jnb  loc_463A1A 
AUTO:00462182  cmp  eax, 2 
AUTO:00462185  jnb  loc_463A2E  
AUTO:0046220F  ja   loc_462257 
AUTO:00462211  push ebx             ; case 0E3h 
AUTO:00462212  call sub_460920

Hex-Rays does a nice job of illustrating how time-consuming and confusing it is to deal with code like this:

if ( v7 <= 0x837FDF02 )
  goto LABEL_755;
if ( v7 >= 0x837FE22C )
{
  if ( v7 <= 0x837FE22C )
    return sub_45EA80(v4);
  if ( v7 < 0x837FE358 )
  {
    if ( 2088770818 != v5 )
      goto LABEL_10;
    goto LABEL_97;
  }
  if ( v7 <= 0x837FE358 )
    goto LABEL_750;
  if ( 2088770608 != v5 )
    goto LABEL_10;
  goto LABEL_100;
}
if ( v7 < 0x837FE0E2 )
  goto LABEL_10;
if ( v7 <= 0x837FE0E2 )
  goto LABEL_535;
if ( 2088771118 == v5 )
{
  sub_460880(v4);
  goto LABEL_23;
}

This entry is continued in part 1.

 

VMProtect, Part 3: Optimization and Code Generation

Originally published on August 6th, 2008 on OpenRCE

This is part #3 of a four-part series on VMProtect. The other parts can be found here:

The reader should inspect this unoptimized IR listing before continuing.  In an attempt to keep this entry from becoming unnecessarily long, the example snippets will be small, but for completeness a more thorough running example is linked throughout the text.

We begin by removing the stack machine features of the IR.  Since VMProtect operates on disassembled x86 code, and x86 itself is not a stack machine, this aspect of the protection is unnatural and easily removed.  Here is a 16-line fragment of VMProtect IR.

push Dword(-88)
push esp
push Dword(4)
pop t3
pop t4
t5 = t3 + t4
push t5
push flags t5
pop DWORD Scratch:[Dword(52)]
pop t6
pop t7
t8 = t6 + t7
push t8
push flags t8
pop DWORD Scratch:[Dword(12)]
pop esp

All but two instructions are pushes or pops, and the pushes can be easily matched up with the pops.  Tracking the stack pointer, we see that, for example, t3 = Dword(4).  A simple analysis allows us to "optimize away" the push/pop pairs into assignment statements.  Simply iterate through each instruction in a basic block and keep a stack describing the source of each push.  For every pop, ensure that the sizes match and record the location of the corresponding push.  We wish to replace the pop with an assignment to the popped expression from the pushed expression, as in

t3 = Dword(4)
t4 = esp 
t7 = Dword(-88)
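The pairing pass described above can be sketched in Python.  Instructions are modelled as plain strings, everything is assumed to be the same size, and only straight-line blocks are handled; the input is the 16-line fragment from the text:

```python
# Rewrite push/pop pairs in a basic block as assignments: keep a stack
# of pushed source expressions and, at each pop, emit "dest = source".
def remove_stack_ops(block):
    stack, out = [], []
    for insn in block:
        if insn.startswith("push "):
            stack.append(insn[len("push "):])
        elif insn.startswith("pop "):
            out.append("%s = %s" % (insn[len("pop "):], stack.pop()))
        else:
            out.append(insn)  # pass non-stack instructions through
    return out

block = [
    "push Dword(-88)", "push esp", "push Dword(4)",
    "pop t3", "pop t4", "t5 = t3 + t4",
    "push t5", "push flags t5",
    "pop DWORD Scratch:[Dword(52)]",
    "pop t6", "pop t7", "t8 = t6 + t7",
    "push t8", "push flags t8",
    "pop DWORD Scratch:[Dword(12)]", "pop esp",
]
optimized = remove_stack_ops(block)
# optimized begins "t3 = Dword(4)", "t4 = esp" and ends "esp = t8".
```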

With the stack aspects removed, we are left with a more conventional listing containing many assignment statements.  This optimization substantially reduces the number of instructions in a given basic block (~40% for the linked example) and opens the door for other optimizations.  The newly optimized code is nine lines, roughly half of the original:

t3 = Dword(4)
t4 = esp
t5 = t3 + t4
DWORD Scratch:[Dword(52)] = flags t5
t6 = t5
t7 = Dword(-88)
t8 = t6 + t7
DWORD Scratch:[Dword(12)] = flags t8
esp = t8

A complete listing of the unoptimized IR versus the one with the stack machine features removed is here, which should be perused before proceeding.

Now we turn our attention to the temporary variables and the scratch area.  Recall that the former were not part of the pre-protected x86 code, nor the VMProtect bytecode -- they were introduced in order to ease the IR translation.  The latter is part of the VMProtect bytecode, but was not part of the original pre-protected x86 code.  Since these are not part of the languages we are modelling, we shall eliminate them wholesale.  On a high level, we treat each temporary variable, each byte of the scratch space, and each register as being a variable defined within a basic block, and then eliminate the former two via the compiler optimizations previously discussed.

Looking again at the last snippet of IR, we can see several areas for improvement.  First, consider the variable t6.  It is clearly just a copy of t5, neither of which is redefined before the next use in the assignment to t8.  Copy propagation will replace variable t6 with t5 and eliminate the former.  More generally, t3, t4, and t7 contain either constants or values that are not modified between their uses.  Constant and copy propagation will substitute the assignments to these variables in for their uses and eliminate them.

The newly optimized code is a slender three lines compared to the original 16; we have removed roughly 80% of the IR for the running example.

DWORD Scratch:[Dword(52)] = flags Dword(4) + esp
esp = Dword(4) + esp + Dword(-88)
DWORD Scratch:[Dword(12)] = flags Dword(4) + esp + Dword(-88)

The side-by-side comparison can be found here.

The IR now looks closer to x86, with the exception that the results of computations are being stored in the scratch area, not into registers.  As before, we apply dead-store elimination, copy and constant propagation to the scratch area, removing dependence upon it entirely in the process.  See here for a comparison with the last phase.

Here is a comparison of the final, optimized code against the original x86:

push ebp                          push ebp
ebp = esp                         mov ebp, esp
push Dword(-1)                    push 0FFFFFFFFh
push Dword(4525664)               push 450E60h
push Dword(4362952)               push offset sub_4292C8
eax = DWORD FS:[Dword(0)]         mov eax, large fs:0
push eax                          push eax
DWORD FS:[Dword(0)] = esp         mov large fs:0, esp
eflags = flags esp + Dword(-88)
esp = esp + Dword(-88)            add esp, 0FFFFFFA8h
push ebx                          push ebx
push esi                          push esi
push edi                          push edi
DWORD SS:[Dword(-24) + ebp] = esp mov [ebp-18h], esp
call DWORD [Dword(4590300)]       call dword ptr ds:unk_460ADC
vmreturn Dword(0) + Dword(4638392)

Code generation is an afterthought.

VMProtect, Part 2: Primer on Optimization

Originally published on OpenRCE on August 6th, 2008

This is part #2 of a four-part series on VMProtect. The other parts can be found here:

Basically, VMProtect bytecode and the IR differ from x86 assembly language in four ways:

  1. They describe a stack machine;
  2. The IR contains "temporary variables";
  3. They contain what I've called a "scratch" area, upon which computations are performed rather than in the registers;
  4. In the case of VMProtect Ultra, the code is obfuscated (or rather, "de-optimized") in certain ways.

It turns out that removing these four aspects from the IR is sufficient preparation for compilation into sensible x86 code.  We accomplish this via standard compiler optimizations applied locally to each basic block.  In general, there are a few main compiler optimizations used in this process.  The first one is "constant propagation".  Consider the following C code.

int x = 1; 
function(x);

Clearly x will always be 1 when function is invoked:  it is defined on the first line, and is not re-defined before it is used in the following line (the definition in line one "reaches" the use in line two; alternatively, the path between the two is "definition-clear" with respect to x).  Thus, the code can be safely transformed into "function(1)".  If the first line is the only definition of the variable x, then we can replace all uses of x with the integer 1.  If the second line is the only use of the variable x, then we can eliminate the variable.

The next is "constant folding".  Consider the following C code.

int x = 1024; 
function(x*1024);

By the above remarks, we know we can transform the second line into "function(1024*1024);".  It would be silly to generate code that actually performed this multiplication at run-time:  the value is known at compile-time, and should be computed then.  We can replace the second line with "function(1048576);", and in general we can replace any binary operation performed upon constant values with the computed result of that operation.
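As a sketch of the idea, here is constant folding over a toy expression type; the tuple encoding is my own invention, not the IR from the later parts:

```python
import operator

# Expressions are ints, variable-name strings, or ("op", lhs, rhs) tuples.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(expr):
    if not isinstance(expr, tuple):  # constant or variable: nothing to do
        return expr
    op, lhs, rhs = expr
    lhs, rhs = fold(lhs), fold(rhs)
    if isinstance(lhs, int) and isinstance(rhs, int):
        return OPS[op](lhs, rhs)     # both operands known: compute now
    return (op, lhs, rhs)

# fold(("*", 1024, 1024)) == 1048576
# fold(("+", "x", ("*", 2, 3))) == ("+", "x", 6)
```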

Similar to constant propagation is "copy propagation", as in the below.

void f(int i)
{
  int j = i;
  g(j);
}

The variable j is merely a copy of the variable i, and so the variable i can be substituted in for j until the point where either is redefined.  Lacking redefinitions, j can be eliminated entirely.
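A minimal sketch of copy propagation over straight-line code, using an invented tuple encoding and assuming no redefinitions between a copy and its uses:

```python
# ("copy", dst, src) records dst as a copy of src; ("use", var) sites
# are rewritten to refer to the original value, chasing copy chains.
def copy_propagate(stmts):
    copies, out = {}, []
    for s in stmts:
        if s[0] == "copy":
            _, dst, src = s
            copies[dst] = copies.get(src, src)  # j = i, k = j  =>  k -> i
        else:
            _, var = s
            out.append(("use", copies.get(var, var)))
    return out

# The f()/g() example: "int j = i; g(j);" becomes g(i), and j dies.
# copy_propagate([("copy", "j", "i"), ("use", "j")]) == [("use", "i")]
```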

The next optimization is "dead-store elimination".  Consider the following C code.

int y = 2; 
y = 1;

The definition of the variable y on line one is immediately un-done by the one on line two.  Therefore, there is no reason to actually assign the value 2 to the variable; the first store to y is a "dead store", and can be eliminated by the standard liveness-based compiler optimization known as "dead-store elimination", or more generally "dead-code elimination".  

Here's an example from the VMProtect IR.

ecx = DWORD Scratch:[Dword(44)]
ecx = DWORD Scratch:[Dword(20)]

After dead-store-eliminating the first instruction, it turns out no other instructions use Scratch:[Dword(44)], and so its previous definition can be eliminated as well.
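A backward liveness pass makes this mechanical.  In the sketch below (my own encoding: each statement is a destination plus the locations it reads, and side effects are ignored), a store is kept only if its destination is live afterwards; running it on the two-line scratch example keeps only the second store:

```python
# Walk the block backwards with a set of live locations; drop any
# store whose destination is not read before being overwritten.
def eliminate_dead_stores(stmts, live_out):
    live, kept = set(live_out), []
    for dest, uses in reversed(stmts):
        if dest in live:
            kept.append((dest, uses))
            live.discard(dest)   # this store satisfies the later use...
            live.update(uses)    # ...and makes its own sources live
    kept.reverse()
    return kept

block = [
    ("ecx", ["Scratch[44]"]),  # ecx = DWORD Scratch:[Dword(44)]
    ("ecx", ["Scratch[20]"]),  # ecx = DWORD Scratch:[Dword(20)]
]
kept = eliminate_dead_stores(block, live_out={"ecx"})
# kept == [("ecx", ["Scratch[20]"])]: the first store is dead.
```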

VMProtect, Part 1: Bytecode and IR

Originally posted on August 6th, 2008 on OpenRCE

This is part #1 of a four-part series on VMProtect. The other parts can be found here:

The approach I took with ReWolf's x86 Virtualizer is also applicable here, although a more sophisticated compiler is required.  What follows is some preliminary notes on the design and implementation of such a component.  These are not complete details on breaking the protection; I confess to having only looked at a few samples, and I am not sure which protection options were enabled.

As before, we begin by constructing a disassembler for the interpreter.  This is immediately problematic, since the bytecode language is polymorphic.  I have created an IDA plugin that automatically constructs OCaml source code for a bytecode disassembler.  In a production-quality implementation, this should be implemented as a standalone component that returns a closure.

The generated disassembler, then, looks like this:

let disassemble bytearray index =
  match (bytearray.(index) land 0xff) with
  | 0x0 -> (VM__Handler0__PopIntoRegister(0), [index+1])
  | 0x1 -> (VM__Handler1__PushDwordFromRegister(0), [index+1])
  | 0x2 -> (VM__Handler2__AddWords, [index+1])
  | 0x3 -> (VM__Handler3__StoreByteIntoRegister(bytearray.(index+1)), [index+2])
  | 0x4 -> (VM__Handler0__PopIntoRegister(4), [index+1])
  | 0x5 -> (VM__Handler1__PushDwordFromRegister(4), [index+1])
  | 0x6 -> (VM__Handler4__ShrDword, [index+1])
  | 0x7 -> (VM__Handler5__ReadDword__FromStackSegment, [index+1])
  | ... -> ...

Were we to work with the instructions individually in their natural granularity, depicted above, the bookkeeping on the semantics of each would likely prove tedious.  For illustration, compare and contrast handlers #02 and #04.  Both have the same basic pattern:  pop two values (words vs. dwords), perform a binary operation (add vs. shr), push the result, then push the flags.  The current representation of instructions does not express these, or any, similarities.

Handler #02:           Handler #04:
mov ax, [ebp+0]        mov eax, [ebp+0]
sub ebp, 2             mov cl, [ebp+4]
add [ebp+4], ax        sub ebp, 2
pushf                  shr eax, cl
pop dword ptr [ebp+0]  mov [ebp+4], eax
pushf
pop dword ptr [ebp+0]

Therefore, we pull a standard compiler-writer's trick and translate the VMProtect instructions into a simpler, "intermediate" language (hereinafter "IR") which resembles the pseudocode snippets atop the handlers in part zero.  Below is a fragment of that language's abstract syntax.

type size = B | W | D | Q
type temp = int * size
type seg = Scratch | SS | FS | Regular
type irbinop = Add | And | Shl | Shr | MakeQword
type irunop = Neg | MakeByte | TakeHighDword | Flags
type irexpr = 
| Reg of register 
| Temp of int 
| Const of const 
| Deref of seg * irexpr * size 
| Binop of irexpr * irbinop * irexpr 
| Unop of irexpr * irunop

type ir = 
DeclareTemps of temp list
| Assign of irexpr * irexpr
| Push of irexpr
| Pop of irexpr
| Return

A portion of the VMProtect -> IR translator follows; compare the translation for handlers #02 and #04.

let make_microcode = function
VM__Handler0__PopIntoRegister(b) -> [Pop(Deref(Scratch, Const(Dword(zero_extend_byte_dword(b land 0x3C))), D))]
| VM__Handler2__AddWords -> [DeclareTemps([(0, W);(1, W);(2, W)]);
 Pop(Temp(0));
 Pop(Temp(1));
 Assign(Temp(2), Binop(Temp(0), Add, Temp(1)));
 Push(Temp(2));
 Push(Unop(Temp(2), Flags))]
| VM__Handler4__ShrDword -> [DeclareTemps([(0, D);(1, W);(2, D)]);
 Pop(Temp(0));
 Pop(Temp(1));
 Assign(Temp(2), Binop(Temp(0), Shr, Temp(1)));
 Push(Temp(2));
 Push(Unop(Temp(2), Flags))]
| VM__Handler7__PushESP-> [Push(Reg(Esp))]
| VM__Handler23__WriteDwordIntoFSSegment -> [DeclareTemps([(0, D);(1, D)]);
 Pop(Temp(0));
 Pop(Temp(1));
 Assign(Deref(FS, Temp(0), D), Temp(1))]
| (*...*) -> (*...*)

To summarize the process, below is a listing of VMProtect instructions, followed by the assembly code that is executed for each, and to the right is the IR translation.

VM__Handler1__PushDwordFromRegister 32

; Push (Deref (Scratch, Const (Dword 32l), D));

and al, 3Ch ; al = 32
mov edx, [edi+eax]
sub ebp, 4
mov [ebp+0], edx

VM__Handler7__PushESP
; Push (Reg Esp);
mov eax, ebp
sub ebp, 4
mov [ebp+0], eax

VM__Handler0__PopIntoRegister 40
; Pop (Deref (Scratch, Const (Dword 40l), D));
and al, 3Ch
mov edx, [ebp+0]
add ebp, 4
mov [edi+eax], edx

VM__Handler19__PushSignedByteAsDword (-1l)
; Push (Const (Dword (-1l))); 
movzx eax, byte ptr [esi] ; *esi = -1
sub esi, 0FFFFFFFFh
cbw
cwde
sub ebp, 4
mov [ebp+0], eax

VM__Handler9__PushDword 4525664l
; Push (Const (Dword 4525664l));
mov eax, [esi] ; *esi = 4525664l
add esi, 4
sub ebp, 4
mov [ebp+0], eax

VM__Handler9__PushDword 4362952l
; Push (Const (Dword 4362952l));
mov eax, [esi] ; *esi = 4362952l
add esi, 4
sub ebp, 4
mov [ebp+0], eax

VM__Handler19__PushSignedByteAsDword 0l
; Push (Const (Dword (0l))); 
movzx eax, byte ptr [esi] ; *esi = 0
sub esi, 0FFFFFFFFh
cbw
cwde
sub ebp, 4
mov [ebp+0], eax

VM__Handler42__ReadDwordFromFSSegment
; DeclareTemps([(0, D)]);
; Pop (Temp 0);
; Push (Deref (FS, Temp 0, D));
mov eax, [ebp+0]
mov eax, fs:[eax]
mov [ebp+0], eax

VMProtect, Part 0: Basics

Originally published on August 6th, 2008 on OpenRCE

This is part #0 of a four-part series on VMProtect. The other parts can be found here:

VMProtect is a virtualization protector.  Like other protections in the genre, such as ReWolf's x86 Virtualizer and CodeVirtualizer, it works by disassembling the x86 bytecode of the target executable and compiling it into a proprietary, polymorphic bytecode which is executed in a custom interpreter at run-time.  This is unlike the traditional notions of packing, in which the x86 bytecode is simply encrypted and/or compressed:  with virtualization, the original x86 bytecode in the protected areas is gone, never to be seen again.  Or so the idea goes.

If you've never looked at VMProtect before, I encourage you to take a five-minute look in IDA (here's a sample packed binary).  As far as VMs go, it is particularly skeletal and easily comprehended.  The difficulty lies in recreating working x86 bytecode from the VM bytecode.  Here's a two-minute analysis of its dispatcher.

push edi; push all registers
push ecx
push edx
push esi
push ebp
push ebx
push eax
push edx
pushf
push 0 ; imagebase fixup
mov esi, [esp+8+arg_0] ; esi = pointer to VM bytecode
mov ebp, esp ; ebp = VM's "stack" pointer
sub esp, 0C0h
mov edi, esp ; edi = "scratch" data area

VM__FOLLOW__Update:
add esi, [ebp+0]

VM__FOLLOW__Regular:
mov al, [esi]; read a byte from EIP
movzx eax, al
sub esi, -1; increment EIP
jmp ds:VM__HandlerTable[eax*4] ; execute instruction handler

A feature worth discussing is the "scratch space", referenced by the register edi throughout the dispatch loop.  This is a 16-dword-sized area on the stack where VMProtect saves the registers upon entering the VM, modifies them throughout the course of a basic block, and from whence it restores the registers upon exit.  For each basic block protected by the VM, the layout of the registers in the scratch space can potentially be different.
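The fetch-decode-execute skeleton can be modelled in a few lines of Python.  The opcode numbering and the two toy handlers below are invented for illustration (they mirror handlers #00 and #01, disassembled next), not VMProtect's actual encoding:

```python
# esi walks the bytecode, `stack` plays the role of ebp's data stack,
# and `scratch` stands in for the 16-dword area addressed via edi.
def run(bytecode, stack):
    scratch = [0] * 16
    esi = 0
    while esi < len(bytecode):
        opcode = bytecode[esi]; esi += 1     # read a byte from EIP
        slot = (opcode & 0x3C) >> 2          # cf. `and al, 3Ch`
        if opcode & 0x40:                    # toy handler #00: pop into scratch
            scratch[slot] = stack.pop()
        else:                                # toy handler #01: push from scratch
            stack.append(scratch[slot])
    return stack, scratch

stack, scratch = run([0x44, 0x04], [7])   # pop 7 into slot 1, push it back
# stack == [7], scratch[1] == 7
```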

Here's a disassembly of some instruction handlers.  Notice that A) VMProtect is a stack machine and that B) each handler -- though consisting of scant few instructions -- performs several tasks, e.g. popping several values, performing multiple operations, pushing one or more values.

#00:x = [EIP-1] & 0x3C; y = popd; [edi+x] = y

.text:00427251 and al, 3Ch; al = instruction number
.text:00427254 mov edx, [ebp+0] ; grab a dword off the stack
.text:00427257 add ebp, 4 ; pop the stack
.text:0042725A mov [edi+eax], edx ; store the dword in the scratch space

#01:x = [EIP-1] & 0x3C; y = [edi+x]; pushd y

.vmp0:0046B0EB and al, 3Ch; al = instruction number
.vmp0:0046B0EE mov edx, [edi+eax] ; grab a dword out of the scratch space
.vmp0:0046B0F1 sub ebp, 4 ; subtract 4 from the stack pointer
.vmp0:0046B0F4 mov [ebp+0], edx ; push the dword onto the stack

#02:x = popw, y = popw, z = x + y, pushw z, pushf

.text:004271FB mov ax, [ebp+0] ; pop a word off the stack
.text:004271FF sub ebp, 2
.text:00427202 add [ebp+4], ax ; add it to another word on the stack
.text:00427206 pushf
.text:00427207 pop dword ptr [ebp+0] ; push the flags

#03:x = [EIP++]; w = popw; [edi+x] = Byte(w)

.vmp0:0046B02A movzx eax, byte ptr [esi] ; read a byte from EIP
.vmp0:0046B02D mov dx, [ebp+0] ; pop a word off the stack
.vmp0:0046B031 inc esi ; EIP++
.vmp0:0046B032 add ebp, 2; adjust stack pointer
.vmp0:0046B035 mov [edi+eax], dl ; write a byte into the scratch area

#04:x = popd, y = popw, z = x >> y, pushd z, pushf

.vmp0:0046B095 mov eax, [ebp+0]; pop a dword off the stack
.vmp0:0046B098 mov cl, [ebp+4] ; pop a word off the stack
.vmp0:0046B09B sub ebp, 2
.vmp0:0046B09E shr eax, cl ; shr the dword by the word
.vmp0:0046B0A0 mov [ebp+4], eax; push the result
.vmp0:0046B0A3 pushf
.vmp0:0046B0A4 pop dword ptr [ebp+0] ; push the flags

#05:x = popd, pushd ss:[x]

.vmp0:0046B5F7 mov eax, [ebp+0]; pop a dword off the stack
.vmp0:0046B5FA mov eax, ss:[eax] ; read a dword from ss
.vmp0:0046B5FD mov [ebp+0], eax; push that dword