Switch as Binary Search, Part 1

Originally published on November 28th, 2008 on OpenRCE

(Continued from part 0)

We have seen how a compiler-generated binary search partitions the range of integers into successively smaller intervals and ultimately converging upon either a single value or one contiguous range of values.  We will now develop a tool to deal with this construct in a friendly fashion for the working reverse engineer.  

The tool has the following high-level specification:

Given the first address of a compiled binary switch construct:

  • Determine the register used and whether the comparisons are signed (trivial)
  • Determine all terminal labels, at which cases reside, and all values that lead to these cases.

A discussion of the problem and its solution follows.

Upon entering the switch, the register value is entirely unconstrained:  the range of values that it might have are contained in the interval [0, MAX_INT] (or [-(MAX_INT+1)/2, MAX_INT/2] for signed integers).  The following example uses an unsigned integer.

Suppose that the first comparison is of the form

@loc0:
cmp eax, val1
jbe @loc1
@loc2:

If the jump is taken, the value must be below or equal to 'val1'.  In other words, it lies in the range [0, val1].  If the jump is not taken, the value must be above 'val1', so it lies in the range [val1+1, MAX_INT].  Therefore, each comparison in the tree partitions the range of possible values into two disjoint ranges:  one for those values where the jump is taken, and one for those where it is not.

At the second level of the tree, there will be two comparisons:  one for the case where the jump for the first was not taken, and one for the case where it was.  Consider the second set of such instructions:

@loc1:
cmp eax, val2
jbe @loc3
@loc4:

@loc2:
cmp eax, val3
jbe @loc5
@loc6:

Each of these cases further constrains the input:

  • The range of values leading to loc3 is [0,         val2]
  • The range of values leading to loc4 is [val2+1,    val1]
  • The range of values leading to loc5 is [val1+1,    val3]
  • The range of values leading to loc6 is [val3+1, MAX_INT]

This table has a very regular structure, and it should not be too hard to imagine what it would look like for three levels into the tree, or four, or...  The following image summarizes the process.  The unique path to a given vertex specifies the constraints required for input to reach it.

As we walk the comparisons in the tree in this fashion, the binary tree will eventually stop and give way to the code for the cases in the switch.  At this point, our constraints will be terminal cases of either single values (most commonly) or of simple ranges.  For each terminal address, we maintain a dictionary associating it with the corresponding ranges that lead there.

The following pseudocode codifies the discussion above.

// The first range returned is that for which the jump is taken;
// the second is for non-jump-taking values
partition(ea, low, high, compared)
{
  if(comparison at ea is ">")
    return [compared+1, high], [low, compared]

  if(comparison at ea is ">=")
    return [compared, high], [low, compared+1]

  if(comparison at ea is "<")
    return [low, compared-1], [compared, high]

  if(comparison at ea is "<=")
    return [low, compared], [compared+1, high]

  if(comparison at ea is "==")
    return [compared, compared], [low, high]

  if(comparison at ea is "!=")
    return [low, high], [compared, compared]
}

The following recursive algorithm solves the general problem.

analyze_bswitch(ea_t ea, int low, int high, int compared)
{
  if(ea is cmp reg, constant)
    compared = instruction's immediate value 
    ea = address of next instruction

  if(ea is a conditional jump)
    [low1,high2], [low2,high2] = partition(ea, low, high, compared)
    analyze_bswitch(jump taken ea, low1, high1, compared)
    analyze_bswitch(jump not taken ea, low2, high2, compared)

  // Instruction is a leaf in the binary tree
  else
    associate(ea, low, high)
}

IDAPython-1.0 compatible source code is available here.

The resulting disassembly is properly annotated with case labels, as IDA does normally:

AUTO:0046348E   jbe     loc_463D16      ; case 0A0h
AUTO:00463494   cmp     eax, 0A2h
AUTO:00463499   jb      loc_462FA0      ; case 0837F81BAh, 0837F81D8h, 0A1h, 091h
AUTO:0046349F   ja      loc_463D21      ; case 0837F90BAh, 0A3h
AUTO:004634A5 loc_4634A5:               ; CODE XREF: sub_462120:loc_462EDAj
AUTO:004634A5                           ; sub_462120+DDDj ...
AUTO:004634A5   push    ebx             ; case 0837F8106h, 0837F8124h, 0A2h, 031h
AUTO:004634A6   call    sub_45C630
AUTO:004634AB   jmp     loc_462217

Switch as Binary Search, Part 0

Originally posted on OpenRCE on November 28th, 2008

For a standard switch with a contiguous region of cases, as in the below, compilers generate constant-time lookups by treating the cases as indexes into an array of pointers.

switch(x)
{
  case 10: /* case 10 code */ break;
  // No case 11 (default)
  case 12: /* case 12 code */ break;
  case 13: /* case 13 code */ break;
  case 14: /* case 14 code */ break;
  // No case 15 (default)
  case 16: /* case 16 code */ break;
  default: /* default code */ break;
}

This might compile into the following:

sub eax, 10 
cmp eax, 6 
ja @default 
jmp dword ptr switch_table[eax*4] 

switch_table dd offset loc_case0
             dd offset default
             dd offset loc_case2
             dd offset loc_case3
             dd offset loc_case4
             dd offset default
             dd offset loc_case6
             ... 

For switches with multiple labels pointing to the same code, some compilers save space by generating doubly-indexed, tabular code.

switch(x)
{
  case 0x00..0x20: /* 0x00 <= x < 0x20 code */ break;
  case 0x21..0x40: /* 0x21 <= x < 0x40 code */ break;
}

This might compile to:

cmp eax, 40h 
ja @over 
movzx eax, byte ptr index_table[eax*4] 
jmp dword ptr switch_table[eax*4] ; 
72 bytes instead of 256 
index_table  db 32 DUP(0)
             db 32 DUP(1) 
switch_table dd offset loc_case_00_00
             dd offset loc_case_20_40

For switches with non-contiguous, sparsely-distributed cases like the below, the table-based approaches from above are unsuitable, and yet cases like this do appear in practice, so the compiler must have a strategy for dealing with them.

switch(x)
{
  case 0x0:
  case 0x16:
  case 0x326:
  case 0x9821:
  case 0x90826:
  case 0x278946:
  case 0x4578939:
  case 0x12372826:
  default:
}

One obvious solution to this problem would be to generate a sequence of comparisons, one for each case, starting from the lowest case and ending with the highest.  This would work, but the lookup would be O(N) in the number of case labels.  An equally obvious solution, and one that was mentioned in the book "Reversing", is to generate a binary search construct in the generated code to perform lookups in O(log(N)) time.


To briefly review binary searching, consider the (necessarily) sorted sequence [1;2;3;4;5;6;7;8;9], and consider finding the value 2.  We begin by comparing against the middle element, 5. Since 2 is lesser, we discard all values 5 and above and consider those below it.  Now we have a smaller range of values upon which we can perform the same procedure.

[1;2;3;4]

As before, we compare against the middle element, 3.  Since it is lesser, we can discard anything above 3, leaving [1;2] as the remaining search space.

[1;2]

The middle element is 2, which is our search key, so we stop.  This process took three steps (log(9)) to complete, compared with N=9 steps for searching the whole array.

The compiler is responsible for generating code that implements these comparisons.  The first comparison will be against the middle element in the collection; whether the case value is above or below determines whether to repeat the process for the integers above or below the middle one.  At each comparison there is a corresponding equality test to see whether the search key is the same as the comparator.

Below is an example of this construct in the wild.

AUTO:00462141  cmp  eax, 0E3h 
AUTO:00462146  jnb  loc_46220F ; see below 
AUTO:0046214C  cmp  eax, 73h 
AUTO:0046214F  jnb  loc_463131 
AUTO:00462155  cmp  eax, 3Bh 
AUTO:00462158  jnb  loc_463606 
AUTO:0046215E  cmp  eax, 1Fh 
AUTO:00462161  jnb  loc_463835 
AUTO:00462167  cmp  eax, 10h 
AUTO:0046216A  jnb  loc_463948 
AUTO:00462170  cmp  eax, 8 
AUTO:00462173  jnb  loc_4639D1 
AUTO:00462179  cmp  eax, 4 
AUTO:0046217C  jnb  loc_463A1A 
AUTO:00462182  cmp  eax, 2 
AUTO:00462185  jnb  loc_463A2E  
AUTO:0046220F  ja   loc_462257 
AUTO:00462211  push ebx             ; case 0E3h 
AUTO:00462212  call sub_460920

Hex-Rays does a nice job of illustrating the time-consuming ugly confusingness of dealing with code like this:

if ( v7 <= 0x837FDF02 )
  goto LABEL_755;
if ( v7 >= 0x837FE22C )
{
  if ( v7 <= 0x837FE22C )
    return sub_45EA80(v4);
  if ( v7 < 0x837FE358 )
  {
    if ( 2088770818 != v5 )
      goto LABEL_10;
    goto LABEL_97;
  }
  if ( v7 <= 0x837FE358 )
    goto LABEL_750;
  if ( 2088770608 != v5 )
    goto LABEL_10;
  goto LABEL_100;
}
if ( v7 < 0x837FE0E2 )
  goto LABEL_10;
if ( v7 <= 0x837FE0E2 )
  goto LABEL_535;
if ( 2088771118 == v5 )
{
  sub_460880(v4);
  goto LABEL_23;
}

This entry is continued in part 1.

 

VMProtect, Part 3: Optimization and Code Generation

Originally published on August 6th, 2008 on OpenRCE

This is part #3 of a four-part series on VMProtect. The other parts can be found here:

The reader should inspect this unoptimized IR listing before continuing.  In an attempt to keep this entry from becoming unnecessarily long, the example snippets will be small, but for completeness a more thorough running example is linked throughout the text.

We begin by removing the stack machine features of the IR.  Since VMProtect operates on disassembled x86 code, and x86 itself is not a stack machine, this aspect of the protection is unnatural and easily removed.  Here is a 15-line fragment of VMProtect IR.

push Dword(-88)
push esp
push Dword(4)
pop t3
pop t4
t5 = t3 + t4
push t5
push flags t5
pop DWORD Scratch:[Dword(52)]
pop t6
pop t7
t8 = t6 + t7
push t8
push flags t8
pop DWORD Scratch:[Dword(12)]
pop esp

All but two instructions are pushes or pops, and the pushes can be easily matched up with the pops.  Tracking the stack pointer, we see that, for example, t3 = Dword(4).  A simple analysis allows us to "optimize away" the push/pop pairs into assignment statements.  Simply iterate through each instruction in a basic block and keep a stack describing the source of each push.  For every pop, ensure that the sizes match and record the location of the corresponding push.  We wish to replace the pop with an assignment to the popped expression from the pushed expression, as in

t3 = Dword(4)
t4 = esp 
t7 = Dword(-88)

With the stack aspects removed, we are left with a more conventional listing containing many assignment statements.  This optimization substantially reduces the number of instructions in a given basic block (~40% for the linked example) and opens the door for other optimizations.  The newly optimized code is eight lines, roughly half of the original:

t3 = Dword(4)
t4 = esp
t5 = t3 + t4
DWORD Scratch:[Dword(52)] = flags t5
t6 = t5
t7 = Dword(-88)
t8 = t6 + t7
DWORD Scratch:[Dword(12)] = flags t8
esp = t8

A complete listing of the unoptimized IR versus the one with the stack machine features removed is here, which should be perused before proceeding.

Now we turn our attention to the temporary variables and the scratch area.  Recall that the former were not part of the pre-protected x86 code, nor the VMProtect bytecode -- they were introduced in order to ease the IR translation.  The latter is part of the VMProtect bytecode, but was not part of the original pre-protected x86 code.  Since these are not part of the languages we are modelling, we shall eliminate them wholesale.  On a high level, we treat each temporary variable, each byte of the scratch space, and each register as being a variable defined within a basic block, and then eliminate the former two via the compiler optimizations previously discussed.

Looking again at the last snippet of IR, we can see several areas for improvement.  First, consider the variable t6.  It is clearly just a copy of t5, neither of which are redefined before the next use in the assignment to t8.  Copy propagation will replace variable t6 with t5 and eliminate the former.  More generally, t3, t4, and t7 contain either constants or values that are not modified between their uses.  Constant and copy propagation will substitute the assignments to these variables in for their uses and eliminate them.

The newly optimized code is a slender three lines compared to the original 15; we have removed 80% of the IR for the running example.

DWORD Scratch:[Dword(52)] = flags Dword(4) + esp
esp = Dword(4) + esp + Dword(-88)
DWORD Scratch:[Dword(12)] = flags Dword(4) + esp + Dword(-88)

The side-by-side comparison can be found here.

The IR now looks closer to x86, with the exception that the results of computations are being stored in the scratch area, not into registers.  As before, we apply dead-store elimination, copy and constant propagation to the scratch area, removing dependence upon it entirely in the process.  See here for a comparison with the last phase.

Here is a comparison of the final, optimized code against the original x86:

push ebp                          push ebp
ebp = esp                         mov ebp, esp
push Dword(-1)                    push 0FFFFFFFFh
push Dword(4525664)               push 450E60h
push Dword(4362952)               push offset sub_4292C8
eax = DWORD FS:[Dword(0)]         mov eax, large fs:0
push eax                          push eax
DWORD FS:[Dword(0)] = esp         mov large fs:0, esp
eflags = flags esp + Dword(-88)
esp = esp + Dword(-88)            add esp, 0FFFFFFA8h
push ebx                          push ebx
push esi                          push esi
push edi                          push edi
DWORD SS:[Dword(-24) + ebp] = esp mov [ebp-18h], esp
call DWORD [Dword(4590300)]       call dword ptr ds:unk_460ADC
vmreturn Dword(0) + Dword(4638392)

Code generation is an afterthought.

VMProtect, Part 2: Primer on Optimization

Originally published on OpenRCE on August 6th, 2008

This is part #2 of a four-part series on VMProtect. The other parts can be found here:

Basically, VMProtect bytecode and the IR differ from x86 assembly language in four ways:

  1. It's a stack machine;
  2. The IR contains "temporary variables";
  3. It contains what I've called a "scratch" area, upon which computations are performed rather than in the registers;
  4.  In the case of VMProtect Ultra, it's obfuscated (or rather, "de-optimized") in certain ways.

It turns out that removing these four aspects from the IR is sufficient preparation for compilation into sensible x86 code.  We accomplish this via standard compiler optimizations applied locally to each basic block.  In general, there are a few main compiler optimizations used in this process.  The first one is "constant propagation".  Consider the following C code.

int x = 1; 
function(x);

Clearly x will always be 1 when function is invoked:  it is defined on the first line, and is not re-defined before it is used in the following line (the definition in line one "reaches" the use in line two; alternatively, the path between the two is "definition-clear" with respect to x).  Thus, the code can be safely transformed into "function(1)".  If the first line is the only definition of the variable x, then we can replace all uses of x with the integer 1.  If the second line is the only use of the variable x, then we can eliminate the variable.

The next is "constant folding".  Consider the following C code.

int x = 1024; 
function(x*1024);

By the above remarks, we know we can transform the second line into "function(1024*1024);".  It would be silly to generate code that actually performed this multiplication at run-time:  the value is known at compile-time, and should be computed then.  We can replace the second line with "function(1048576);", and in general we can replace any binary operation performed upon constant values with the computed result of that operation.

Similar to constant propagation is "copy propagation", as in the below.

void f(int i)
{
  int j = i;
  g(j);
}

The variable j is merely a copy of the variable i, and so the variable i can be substituted in for j until the point where either is redefined.  Lacking redefinitions, j can be eliminated entirely.

The next optimization is "dead-store elimination".  Consider the following C code.

int y = 2; 
y = 1;

The definition of the variable y on line one is immediately un-done by the one on line two.  Therefore, there is no reason to actually assign the value 2 to the variable; the first store to y is a "dead store", and can be eliminated by the standard liveness-based compiler optimization known as "dead-store elimination", or more generally "dead-code elimination".  

Here's an example from the VMProtect IR.

ecx = DWORD Scratch:[Dword(44)]
ecx = DWORD Scratch:[Dword(20)]

After dead-store-eliminating the first instruction, it turns out no other instructions use Scratch:[Dword(44)], and so its previous definition can be eliminated as well.

VMProtect, Part 1: Bytecode and IR

Originally posted on August 6th, 2008 on OpenRCE

This is part #1 of a four-part series on VMProtect. The other parts can be found here:

The approach I took with ReWolf's x86 Virtualizer is also applicable here, although a more sophisticated compiler is required.  What follows is some preliminary notes on the design and implementation of such a component.  These are not complete details on breaking the protection; I confess to having only looked at a few samples, and I am not sure which protection options were enabled.

As before, we begin by constructing a disassembler for the interpreter.  This is immediately problematic, since the bytecode language is polymorphic.  I have created an IDA plugin that automatically constructs OCaml source code for a bytecode disassembler.  In a production-quality implementation, this should be implemented as a standalone component that returns a closure.

The generated disassembler, then, looks like this:

let disassemble bytearray index =
match (bytearray.(index) land 0xff) with
| 0x0-> (VM__Handler0__PopIntoRegister(0),[index+1])
| 0x1-> (VM__Handler1__PushDwordFromRegister(0),[index+1])
| 0x2-> (VM__Handler2__AddWords,[index+1])
| 0x3-> (VM__Handler3__StoreByteIntoRegister(bytearray.(index+1)),[index+2])
| 0x4-> (VM__Handler0__PopIntoRegister(4),[index+1])
| 0x5-> (VM__Handler1__PushDwordFromRegister(4),[index+1])
| 0x6-> (VM__Handler4__ShrDword,[index+1])
| 0x7-> (VM__Handler5__ReadDword__FromStackSegment,[index+1])
| ...-> ...

Were we to work with the instructions individually in their natural granularity, depicted above, the bookkeeping on the semantics of each would likely prove tedious.  For illustration, compare and contrast handlers #02 and #04.  Both have the same basic pattern:  pop two values (words vs. dwords), perform a binary operation (add vs. shr), push the result, then push the flags.  The current representation of instructions does not express these, or any, similarities.

Handler #02:           Handler #04:
mov ax, [ebp+0]        mov eax, [ebp+0]
sub ebp, 2             mov cl, [ebp+4]
add [ebp+4], ax        sub ebp, 2
pushf                  shr eax, cl
pop dword ptr [ebp+0]  mov [ebp+4], eax
pushf
pop dword ptr [ebp+0]

Therefore, we pull a standard compiler-writer's trick and translate the VMProtect instructions into a simpler, "intermediate" language (hereinafter "IR") which resembles the pseudocode snippets atop the handlers in part zero.  Below is a fragment of that language's abstract syntax.

type size = B | W | D | Q
type temp = int * size
type seg= Scratch | SS | FS | Regular
type irbinop= Add | And | Shl | Shr | MakeQword
type irunop= Neg | MakeByte | TakeHighDword | Flags
type irexpr = 
| Reg of register 
| Temp of int 
| Const of const 
| Deref of seg * irexpr * size 
| Binop of irexpr * irbinop * irexpr 
| Unop of irexpr * irunop

type ir = 
DeclareTemps of temp list
| Assign of irexpr * irexpr
| Push of irexpr
| Pop of irexpr
| Return

A portion of the VMProtect -> IR translator follows; compare the translation for handlers #02 and #04.

let make_microcode = function
VM__Handler0__PopIntoRegister(b) -> [Pop(Deref(Scratch, Const(Dword(zero_extend_byte_dword(b land 0x3C))), D))]
| VM__Handler2__AddWords -> [DeclareTemps([(0, W);(1, W);(2, W)]);
 Pop(Temp(0));
 Pop(Temp(1));
 Assign(Temp(2), Binop(Temp(0), Add, Temp(1)));
 Push(Temp(2));
 Push(Unop(Temp(2), Flags))]
| VM__Handler4__ShrDword -> [DeclareTemps([(0, D);(1, W);(2, D)]);
 Pop(Temp(0));
 Pop(Temp(1));
 Assign(Temp(2), Binop(Temp(0), Shr, Temp(1)));
 Push(Temp(2));
 Push(Unop(Temp(2), Flags))]
| VM__Handler7__PushESP-> [Push(Reg(Esp))]
| VM__Handler23__WriteDwordIntoFSSegment -> [DeclareTemps([(0, D);(1, D)]);
 Pop(Temp(0));
 Pop(Temp(1));
 Assign(Deref(FS, Temp(0), D), Temp(1))]
| (*...*) -> (*...*)

To summarize the process, below is a listing of VMProtect instructions, followed by the assembly code that is executed for each, and to the right is the IR translation.

VM__Handler1__PushDwordFromRegister 32

; Push (Deref (Scratch, Const (Dword 32l), D));

and al, 3Ch ; al = 32
mov edx, [edi+eax]
sub ebp, 4
mov [ebp+0], edx

VM__Handler7__PushESP
; Push (Reg Esp);
mov eax, ebp
sub ebp, 4
mov [ebp+0], eax

VM__Handler0__PopIntoRegister 40
; Pop (Deref (Scratch, Const (Dword 40l), D));
and al, 3Ch
mov edx, [ebp+0]
add ebp, 4
mov [edi+eax], edx

VM__Handler19__PushSignedByteAsDword (-1l)
; Push (Const (Dword (-1l))); 
movzx eax, byte ptr [esi] ; *esi = -1
sub esi, 0FFFFFFFFh
cbw
cwde
sub ebp, 4
mov [ebp+0], eax

VM__Handler9__PushDword 4525664l
; Push (Const (Dword 4525664l));
mov eax, [esi] ; *esi = 4525664l
add esi, 4
sub ebp, 4
mov [ebp+0], eax

VM__Handler9__PushDword 4362952l};
; Push (Const (Dword 4362952l));
mov eax, [esi] ; *esi = 4362952l
add esi, 4
sub ebp, 4
mov [ebp+0], eax

VM__Handler19__PushSignedByteAsDword 0l};
; Push (Const (Dword (0l))); 
movzx eax, byte ptr [esi] ; *esi = 0
sub esi, 0FFFFFFFFh
cbw
cwde
sub ebp, 4
mov [ebp+0], eax

VM__Handler42__ReadDwordFromFSSegment};
;Pop (Temp 0);
;Push (Deref (FS, Temp 0, D));
mov eax, [ebp+0]DeclareTemps([(0,D)])
mov eax, fs:[eax]
mov [ebp+0], eax

VMProtect, Part 0: Basics

Originally published on August 6th, 2008 on OpenRCE

This is part #0 of a four-part series on VMProtect. The other parts can be found here:

VMProtect is a virtualization protector.  Like other protections in the genre, among others ReWolf's x86 Virtualizer and CodeVirtualizer, it works by disassembling the x86 bytecode of the target executable and compiling it into a proprietary, polymorphic bytecode which is executed in a custom interpreter at run-time.  This is unlike the traditional notions of packing, in which the x86 bytecode is simply encrypted and/or compressed:  with virtualization, the original x86 bytecode in the protected areas is gone, never to be seen again.  Or so the idea goes.

If you've never looked at VMProtect before, I encourage you to take a five-minute look in IDA (here's a sample packed binary).  As far as VMs go, it is particularly skeletal and easily comprehended.  The difficulty lies in recreating working x86 bytecode from the VM bytecode.  Here's a two-minute analysis of its dispatcher.

push edi; push all registers
push ecx
push edx
push esi
push ebp
push ebx
push eax
push edx
pushf
push 0 ; imagebase fixup
mov esi, [esp+8+arg_0] ; esi = pointer to VM bytecode
mov ebp, esp ; ebp = VM's "stack" pointer
sub esp, 0C0h
mov edi, esp ; edi = "scratch" data area

VM__FOLLOW__Update:
add esi, [ebp+0]

VM__FOLLOW__Regular:
mov al, [esi]; read a byte from EIP
movzx eax, al
sub esi, -1; increment EIP
jmp ds:VM__HandlerTable[eax*4] ; execute instruction handler

A feature worth discussing is the "scratch space", referenced by the register edi throughout the dispatch loop.  This is a 16-dword-sized area on the stack where VMProtect saves the registers upon entering the VM, modifies them throughout the course of a basic block, and from whence it restores the registers upon exit.  For each basic block protected by the VM, the layout of the registers in the scratch space can potentially be different.

Here's a disassembly of some instruction handlers.  Notice that A) VMProtect is a stack machine and that B) each handler -- though consisting of scant few instructions -- performs several tasks, e.g. popping several values, performing multiple operations, pushing one or more values.

#00:x = [EIP-1] & 0x3C; y = popd; [edi+x] = y

.text:00427251 and al, 3Ch; al = instruction number
.text:00427254 mov edx, [ebp+0] ; grab a dword off the stack
.text:00427257 add ebp, 4 ; pop the stack
.text:0042725A mov [edi+eax], edx ; store the dword in the scratch space

#01:x = [EIP-1] & 0x3C; y = [edi+x]; pushd y

.vmp0:0046B0EB and al, 3Ch; al = instruction number
.vmp0:0046B0EE mov edx, [edi+eax] ; grab a dword out of the scratch space
.vmp0:0046B0F1 sub ebp, 4 ; subtract 4 from the stack pointer
.vmp0:0046B0F4 mov [ebp+0], edx ; push the dword onto the stack

#02:x = popw, y = popw, z = x + y, pushw z, pushf

.text:004271FB mov ax, [ebp+0] ; pop a word off the stack
.text:004271FF sub ebp, 2
.text:00427202 add [ebp+4], ax ; add it to another word on the stack
.text:00427206 pushf
.text:00427207 pop dword ptr [ebp+0] ; push the flags

#03:x = [EIP++]; w = popw; [edi+x] = Byte(w)

.vmp0:0046B02A movzx eax, byte ptr [esi] ; read a byte from EIP
.vmp0:0046B02D mov dx, [ebp+0] ; pop a word off the stack
.vmp0:0046B031 inc esi ; EIP++
.vmp0:0046B032 add ebp, 2; adjust stack pointer
.vmp0:0046B035 mov [edi+eax], dl ; write a byte into the scratch area

#04:x = popd, y = popw, z = x << y, pushd z, pushf

.vmp0:0046B095 mov eax, [ebp+0]; pop a dword off the stack
.vmp0:0046B098 mov cl, [ebp+4] ; pop a word off the stack
.vmp0:0046B09B sub ebp, 2
.vmp0:0046B09E shr eax, cl ; shr the dword by the word
.vmp0:0046B0A0 mov [ebp+4], eax; push the result
.vmp0:0046B0A3 pushf
.vmp0:0046B0A4 pop dword ptr [ebp+0] ; push the flags

#05:x = popd, pushd ss:[x]

.vmp0:0046B5F7 mov eax, [ebp+0]; pop a dword off the stack
.vmp0:0046B5FA mov eax, ss:[eax] ; read a dword from ss
.vmp0:0046B5FD mov [ebp+0], eax; push that dword

Compiler 1, X86 Virtualizer 0

Originally published on April 4th, 2008.  This post won Honorable Mention for Most Innovative Research at the Pwnie Awards 2008.

There are two types of virtual machine software protections:  A) the ones that convert x86 machine code into virtual machine bytecode and execute it at runtime; B) the ones that execute some arbitrary code in a virtual environment.  I've discussed the latter several times in the past, and by now there exists a wealth of literature on that variety.  But breaking the former kind remains an unsolved problem.

In my article I said "basically, reverse engineering a VM with the common tools is like reverse engineering a scripted installer without a script decompiler: it's repetitious, and the high-level details are obscured by the flood of low-level details".  The more I thought about this, the more I realized that the word "basically" is out of place:  virtualizing software protections are programming language interpreters, albeit for weird languages.

Consequently, an idea struck me:  what we want here is not an interpreter, but a compiler to compile the bytecode back into x86 machine code.  I spent a week coding one (~1000 lines) in OCaml to test this theory, and I'm able to report that, indeed, it works.  I chose ReWolf's x86 Virtualizer, a simple target that uses some of the same techniques as the heavy hitters in this area.  Here is a walkthrough of the analysis and recompilation of a small function with one basic block.  The compiler works equally well for arbitrarily-large functions, although that would make this posting unnecessarily long and complicated.

Step -2:  Protect something with the virtualizer.  In this case I just used ReWolf's sample executable itself.

.text:00401896 call ds:GetTickCount
.text:0040189C push eax
.text:0040189D call _srand
.text:004018A2 pop ecx
.text:004018A3 push 0
.text:004018A5 push offset DialogFunc
.text:004018AA push 0
.text:004018AC push 65h
.text:004018AE push [esp+10h+hInstance]
.text:004018B2 call ds:DialogBoxParamA
.text:004018B8 xor eax, eax
.text:004018BA retn 10h

Step -1:  Analyze the virtual machine.  Although this was not strictly necessary in this case because ReWolf provided source code, I decided to ignore it and reverse the VM manually, since you don't always have such niceties.


Step 0:  Break the polymorphism in the instruction set.  I made use of two remarkably ghetto hacks here, one of which may be considered elegant.  To avoid provoking any arms races I'll omit the details.

Step 1:  Disassemble the relevant region into VM bytecode.  In the process, construct a graph in which each vertex is an instruction, and the edges are the flows between them.

.VM:004131D0 db 0C2h, 0C9h, 0C0h, 0BDh, 14h, 0DFh, 63h, 9Ah, 86h, 5Eh, 50h, 30h, 0Bh
.VM:004131D0 db 0Ah, 0C0h, 0C7h, 0CEh, 5Eh, 44h, 0E1h, 0E0h, 0C7h, 0FCh, 0FDh, 12h
.VM:004131D0 db 10h, 50h, 0D8h, 0D2h, 0DBh, 0A6h, 3Dh, 34h, 0C9h, 12h, 0DEh, 0E5h, 4Bh
.VM:004131D0 db 2Ch, 2Eh, 6Eh, 23h, 21h, 27h, 0E2h, 0E5h, 0ECh, 99h, 14h, 13h, 0C2h
.VM:004131D0 db 0E5h, 0F9h, 0FDh, 0F4h, 38h, 14h, 0F7h, 0F0h, 0F9h, 0ABh, 79h, 6, 0D7h
.VM:004131D0 db 0F0h, 8Bh, 88h, 81h, 41h, 87h, 8Ch, 85h, 0F8h, 51h, 9Ah, 26h, 0DFh
.VM:004131D0 db 0CFh, 1Eh, 15h, 75h, 76h, 74h, 6Bh, 98h, 9Dh, 94h, 6Eh, 0Ch, 6Bh, 90h
.VM:004131D0 db 93h, 9Ah, 0Fh

becomes

vertexlist =
[{label = 84; instruction = VMExit 16l};
{label = 81; instruction = LiteralInstruction [|51; 192|]};
{label = 69; instruction = ImagebaseFixupInstruction ([|255; 21; 72; 161; 0; 0|], 2l)};
{label = 65; instruction = PushDereferencedTemp};
{label = 57; instruction = AddImmediateToTemp 20l};
{label = 52; instruction = AddRegisterToTemp Esp};
{label = 44; instruction = SetTemp 0l};
{label = 41; instruction = LiteralInstruction [|106; 101|]};
{label = 38; instruction = LiteralInstruction [|106; 0|]};
{label = 27; instruction = ImagebaseFixupInstruction ([|104; 240; 22; 0; 0|], 1l)};
{label = 24; instruction = LiteralInstruction [|106; 0|]};
{label = 22; instruction = LiteralInstruction [|89|]};
{label = 14; instruction = X86Call 6471l};
{label = 12; instruction = LiteralInstruction [|80|]};
{label = 0;instruction = ImagebaseFixupInstruction ([|255; 21; 40; 160; 0; 0|], 2l)}];
edgelist =
[({contents = {label = 0}},{contents = {label = 12}});
({contents = {label = 12}}, {contents = {label = 14}});
({contents = {label = 14}}, {contents = {label = 22}});
({contents = {label = 22}}, {contents = {label = 24}});
(* Lots and lots of edges removed *)]

Step 2:  Form basic blocks within the instruction-level CFG.  The previous output becomes:

vertexlist =
[{label = 0;
 instruction =
[|ImagebaseFixupInstruction ([|255; 21; 40; 160; 0; 0|], 2l);
LiteralInstruction [|80|]; 
X86Call 6471l; 
LiteralInstruction [|89|];
LiteralInstruction [|106; 0|];
ImagebaseFixupInstruction ([|104; 240; 22; 0; 0|], 1l);
LiteralInstruction [|106; 0|]; 
LiteralInstruction [|106; 101|];
SetTemp 0l; 
AddRegisterToTemp Esp; 
AddImmediateToTemp 20l;
PushDereferencedTemp;
ImagebaseFixupInstruction ([|255; 21; 72; 161; 0; 0|], 2l);
LiteralInstruction [|51; 192|]; VMExit 16l|]}];

Step 3:  Optimize the code within the basic block.  The goal is to convert sequences of VM instructions into a new language more conducive to being compiled back into X86.  The optimizer is the most powerful component of my compiler:  it can remove obfuscation automatically simply as a side-effect of being an optimizer (not that ReWolf's has any, but others do), and employs no pattern matching.

vertexlist =
[{label = 0;
 instruction =
[|ImagebaseFixupInstruction ([|255; 21; 40; 160; 0; 0|], 2l);
LiteralInstruction [|80|]; 
X86Call 6471l;
LiteralInstruction [|89|];
LiteralInstruction [|106; 0|];
ImagebaseFixupInstruction ([|104; 240; 22; 0; 0|], 1l);
LiteralInstruction [|106; 0|]; 
LiteralInstruction [|106; 101|];
SyntheticInstruction (Push, Plus (Constant 20l, Register Esp));
ImagebaseFixupInstruction ([|255; 21; 72; 161; 0; 0|], 2l);
LiteralInstruction [|51; 192|]; 
VMExit 16l|]}];

Step 4:  Recompile all virtual instructions into x86 machine language.

vertexlist =
[{label = 0;
 instruction =
[|ImagebaseFixupInstruction ([|255; 21; 40; 160; 0; 0|], 2l);
LiteralInstruction [|80|];
RelativeFixupInstruction ([|232; 0; 0; 0; 0|], 6471l, 1l);
LiteralInstruction [|89|]; 
LiteralInstruction [|106; 0|];
ImagebaseFixupInstruction ([|104; 240; 22; 0; 0|], 1l);
LiteralInstruction [|106; 0|]; 
LiteralInstruction [|106; 101|];
LiteralInstruction [|255; 116; 36; 20|];
ImagebaseFixupInstruction ([|255; 21; 72; 161; 0; 0|], 2l);
LiteralInstruction [|51; 192|];
LiteralInstruction [|194; 16; 0|]|]}];

Step 5:  Stuff the original bytes back into the binary and perform fixups specified.  If you can convert between hex and decimal in your head, you'll notice that the bytes above correspond to those below, modulo fixups.  For multi-basic-block functions, this is harder, as you have to sequence the blocks and decide between short and long jumps.

.VM:004131D0 FF 15 28 A0 40 00 call ds:GetTickCount
.VM:004131D6 50                push eax
.VM:004131D7 E8 6B E7 FE FF    call loc_401947
.VM:004131DC 59                pop  ecx
.VM:004131DD 6A 00             push 0
.VM:004131DF 68 F0 16 40 00    push offset loc_4016F0
.VM:004131E4 6A 00             push 0
.VM:004131E6 6A 65             push 65h
.VM:004131E8 FF 74 24 14       push dword ptr [esp+14h]
.VM:004131EC FF 15 48 A1 40 00 call ds:DialogBoxParamA
.VM:004131F2 33 C0             xor eax, eax
.VM:004131F4 C2 10 00          retn 10h

Step 6:  Celebrate.  ReWolf's X86 Virtualizer was simple, and surely breaking the harder ones is, well, harder, but I believe that the general principles espoused here should be applicable to the others.


Here is the source code.

Industrial-Grade Binary-Only Profiling and Coverage

Originally published on February 16th, 2008 on OpenRCE

There are a few options for profiling or performing code-coverage analysis on a per-module binary level:

* Run traces (very slow and generate a huge amount of uninteresting data, but it works);
* MSR tracing (strengths and weaknesses remain to be seen, but seems fairly promising);
* BinNavi/CoverIt/PaiMei/presumably Inspector:  put a breakpoint on every function you found in a static disassembly (doesn't work in general; I explained why here)

There are more options rooted in academia, the most practical of which being dynamic binary instrumentation (DBI), the technology behind tools such as valgrind and DynamoRIO.  The inner workings of this technology are very interesting, but they are rather involved and their precise technical details are beyond the scope of this entry.  Informally speaking, they disassemble a basic block, convert the instructions into an intermediate language like the ones you find inside of a compiler, and finally re-compile the IL with the "instrumentation" code baked directly into the new assembly language.  For more information, read the original Ph.D. thesis describing Valgrind and then read the source to libVEX, a component thereof.  Valgrind is slow and linux-only, but DynamoRIO was specifically designed with speed in mind (hence the "Dynamo") and runs on Windows.

Here I present a DynamoRIO extension for code coverage and profiling.  It works on a function-level (although block-level support could be added easily -- the source weighs in at a measly 70 lines in 2kb, so if you want some other feature, just code it), and it can either be a profiler or a code coverage analyzer.  All it does is instrument the code such that each call instruction, direct or indirect, will write its source and target addresses into a file.  This data can then be used for either profiling or code coverage purposes:  simply discard all of the duplicates for the latter, and use the data as-is for the former.  This is just the back-end, but I imagine that this could be easily integrated into PaiMei's front end to provide an industrial-grade coverage and profiling tool.

Strengths of DynamoRIO:
* speed (you might not even notice the slowdown);
* stability (there used to be a commercial security product based on this technology -- it is literally industrial grade);
* trivial to code extensions for (70 lines, 2kb for this simple yet powerful extension).

Weaknesses:
* definitely won't work with self-modifying code
* probably won't work with obfuscated or "self-protecting" code (there's particularly a problem with so-called "pc-relative" addressing, such as call $ / pop ebp).

Studious readers may note that automatic indirect call resolution is exceptionally useful for C++ reverse engineering;  comment out the direct call resolution, recompile, write a quick IDC script to add the x-refs to the disassembly listing, and you've got a killer C++ RE tool.  Credit goes to spoonm for having and implementing this idea initially.

Note that in the six years since this was published, new binary instrumentation tools such as Intel's PIN have emerged with ameliorate some of the weaknesses of the tools described in this post from 2008.

Code Generation Quirk Involving Array Indexing

Originally published on February 13th, 2008 on OpenRCE.

In this post, we shall investigate some strange-looking code generated in the context of an array index.

.text:10002D49 mov eax, [esp+arg_0]
.text:10002D4D lea ecx, [eax-9C40h]
.text:10002D53 cmp ecx, 50h
.text:10002D56 ja short loc_10002D60
.text:10002D58 mov eax, dword ptr ds:(loc_1000EF5B+1)[eax*8]
.text:10002D5F retn
.text:10002D60
.text:10002D60 loc_10002D60:
.text:10002D60 lea edx, [eax-0A029h]
.text:10002D66 cmp edx, 9
.text:10002D69 ja short loc_10002D73
.text:10002D6B mov eax, dword ptr ds:loc_1000D344[eax*8]
.text:10002D72 retn

We don't find any arrays at the locations referenced on lines -D58 and -D6B (in fact we find code) which is unusual:

; First target
.text:1000EF57 movzx eax, word ptr [esi+18h]
.text:1000EF5B loc_1000EF5B: ; DATA XREF: 10002D58
.text:1000EF5B add dword_10065280, eax
.text:1000EF61 xor eax, eax
.text:1000EF63 pop esi
.text:1000EF64 mov esp, ebp
.text:1000EF66 pop ebp

; Second target
.text:1000D342 mov esp, ebp
.text:1000D344 loc_1000D344: ; DATA XREF: 10002D6B
.text:1000D344 pop ebp

Looking closer at the code, the trick lies in the fact that the arrays are not being indexed starting at zero.

.text:10002D58 mov eax, dword ptr ds:(loc_1000EF5B+1)[eax*8] ; <- 0x9C40 <= eax < 0x9C90
.text:10002D6B mov eax, dword ptr ds:loc_1000D344[eax*8] ; <- 0xA029 <= eax < 0xA032

So the first array actually begins at 0x1000EF5B+1+0x9C40*8 == 0x1005D15C, and the second array begins at 0x1000D344+0x0A029*8 == 0x1005D48C.  What happened here is that the pointer expression has been simplified to conform to x86's instruction encoding:

[1005D15Ch + (eax - 0x9C40) * 8] => [1005D15Ch - 4E200h + eax*8] => [1000EF5Ch + eax*8]

This is pretty uncommon; I've only seen it a handful of times in my reversing endeavors over the years.