Control Flow Deobfuscation via Abstract Interpretation

Work done May 30th, 2010.  Originally published on July 9th, 2011 on OpenRCE.

I present some work I did on automatically deobfuscating obfuscated control flow constructs with abstract interpretation.  Consider the image below: this project is responsible for taking graphs like the one on the left (where most of the "conditional" branches actually only go in one direction and are present only to thwart static analysis) and converting them into graphs like the one on the right.

Much work on deobfuscation relies on pattern-matching at least to some extent; I have coded such tools myself.  I have some distaste for such methods, since they stop working when the patterns change (they are "syntactic").  I prefer to code my deobfuscation tools as generically ("semantically") as possible, such that they capture innate properties of the obfuscation method in question, rather than hard-coding individual instances of the obfuscation.

The slides present a technique based on abstract interpretation, a form of static program analysis, for deobfuscating control flow transfers.  I translate the x86 code into a different ("intermediate") language, and then perform an analysis based on three-valued logic over the translated code.  The end result is that certain classes of opaque predicates (conditional jumps that are either always taken or always not taken) are detected and resolved.  I have successfully used this technique to break several protections making use of similar obfuscation techniques.  
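The slides carry the details, but the flavor of the analysis can be conveyed in a few lines.  The sketch below is a toy model of the idea, not the code from this release: each bit of a register is tracked as 0, 1, or unknown.  An instruction like "or al, 1" forces the low bit to 1 even when the register is otherwise unknown, so a subsequent "test al, 1 / jnz" is revealed as an always-taken opaque predicate.

```python
# Three-valued bits: 0, 1, or None for "unknown".
def tv_and(a, b):
    if a == 0 or b == 0:
        return 0
    if a == 1 and b == 1:
        return 1
    return None

def tv_or(a, b):
    if a == 1 or b == 1:
        return 1
    if a == 0 and b == 0:
        return 0
    return None

def const(n, width=8):
    # concrete constant as a list of bits, least significant first
    return [(n >> i) & 1 for i in range(width)]

TOP = [None] * 8                       # a completely unknown byte

# "or al, 1" forces the low bit to 1 no matter what al held ...
r = [tv_or(a, b) for a, b in zip(TOP, const(1))]

# ... so the result of "test al, 1" is fully known: low bit 1, rest 0.
masked = [tv_and(a, b) for a, b in zip(r, const(1))]
always_taken = (masked == const(1))    # the jnz is an opaque predicate
```

The real analysis works over an intermediate language rather than raw x86, but the detection principle is the same: if every bit feeding a conditional jump's flag computation is known, the branch direction is decided statically.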

Although I invented and implemented these techniques independently, given the wealth of work in program analysis, it wouldn't surprise me to learn that the particular technique has been previously invented.  Proper references are appreciated.

Code is also included.  The source relies upon my Pandemic program analysis framework, which is not publicly available.  Hence, the code is for educational purposes only.  Nonetheless, I believe it is one of very few examples of publicly-available source code involving abstract interpretation on binaries.

PPTX presentation, OCaml source code (for educational purposes only -- does not include my framework).

PatchDiff2 Analysis and Decompilation

Originally published June 10th, 2010 on OpenRCE

Now that PatchDiff2 is open-source, I release my IDB and the initial decompilation that I did.  You can find it here.

At RECon 2008 I met the author, Nicolas Pouvesle, who told me of his upcoming plans to release this plugin.  He mentioned that the license would be friendly towards reverse engineering, so I asked him if I could decompile the plugin and release the source code.  We must have miscommunicated, because I could have sworn he gave me the green light.  

When it was released, I spent about three days reverse engineering and decompiling it, resulting in the work product linked above.  I had intended to spend an extra couple of days on the decompilation to make sure it was functionally equivalent, if not byte-perfect (which is always the ultimate goal).

After I finished the initial phase of the decompilation, I sent it over to Nicolas to solicit his feedback.  He informed me that releasing the source code would violate PatchDiff2's EULA.  Therefore, I abandoned the project.  As it stands, I never even tried to compile the source code, so I'm afraid it's not worth much beyond the mere curiosity.  Still, I'm releasing this hoping that somebody might find it interesting.  The IDB itself is very thorough, e.g. all structures are recovered.

Enjoy.

Compiler Optimizations for Reverse Engineers

Originally written Q1 2007 as part of my Static Reverse Engineering Training Class, first published March 8th, 2010 on OpenRCE.

I've decided to release the part of my training material relating to compiler optimizations.  I created this back in January of 2007; for a while I was teaching classes very often, and so it made sense to keep it private.  Nowadays I only teach a few classes a year, and so this presentation is sort of languishing away on my hard drive, which is a shame since it's my favorite part.  I think people will enjoy reading it, so I have decided to make it public.


Since I made this in early 2007, it lacks some of GCC 4's optimizations, and from time to time I realize that I forgot an optimization or two.  But for the most part, it's fairly complete.  If you have any specific suggestions for optimizations that I missed, please either email me or respond via comment with a link to a binary exhibiting said optimization and the address at which I can find it.  Proper compiler-theoretic names for the optimizations, if applicable, are also appreciated.

Here it is.  Enjoy.

C-subset Compiler in OCaml

Originally published on December 25th, 2009 on OpenRCE

Here is the source code for a compiler that I wrote in Objective Caml this semester, for a subset of the C language.  It requires a standalone C->IR translator which is not included in this release, as the school owns the copyright on that particular piece of code.  Hence one cannot immediately use this compiler to compile C programs without writing a C front end; in any case, an existing compiler such as MSVC or GCC would be a better choice.

The portion of the code that I wrote (everything except bitv.ml and bitv.mli) totals roughly 3200 lines of code.  It includes two optimizations based on classical data-flow analysis, constant propagation and dead-statement elimination.  It also supports translation into and out of static single assignment form, as well as two optimizations based on SSA:  constant propagation and loop-invariant code motion.  The code for the graph-coloring register allocator is not included.  As a back-end, the compiler produces C code that can be compiled by any other C compiler.

The code is a pretty good example of how to structure medium-sized programs in functional programming languages.  I tried to adopt a pure functional style throughout most of the codebase.  Sometimes this failed:  cfg_ir.ml is unnecessarily ugly, and should have been re-written in an imperative style with mutability; also, I made the mistake of using a mutable array to describe Phi values in static single assignment whereas pure functional style would have dictated a list.  But those are my only complaints with the code; overall, I'm pretty pleased with how it all turned out.

This code is substantially more sophisticated than the compiler that I wrote to break VMProtect, so if you can read and understand this release, you should be in good shape for breaking virtualization obfuscators.

Switch as Binary Search, Part 1

Originally published on November 28th, 2008 on OpenRCE

(Continued from part 0)

We have seen how a compiler-generated binary search partitions the range of integers into successively smaller intervals, ultimately converging upon either a single value or one contiguous range of values.  We will now develop a tool that deals with this construct in a friendly fashion for the working reverse engineer.  

The tool has the following high-level specification:

Given the first address of a compiled binary switch construct:

  • Determine the register used and whether the comparisons are signed (trivial)
  • Determine all terminal labels, at which cases reside, and all values that lead to these cases.

A discussion of the problem and its solution follows.

Upon entering the switch, the register value is entirely unconstrained:  the range of values that it might have are contained in the interval [0, MAX_INT] (or [-(MAX_INT+1)/2, MAX_INT/2] for signed integers).  The following example uses an unsigned integer.

Suppose that the first comparison is of the form

@loc0:
cmp eax, val1
jbe @loc1
@loc2:

If the jump is taken, the value must be below or equal to 'val1'.  In other words, it lies in the range [0, val1].  If the jump is not taken, the value must be above 'val1', so it lies in the range [val1+1, MAX_INT].  Therefore, each comparison in the tree partitions the range of possible values into two disjoint ranges:  one for those values where the jump is taken, and one for those where it is not.

At the second level of the tree, there will be two comparisons:  one for the case where the jump for the first was not taken, and one for the case where it was.  Consider the second set of such instructions:

@loc1:
cmp eax, val2
jbe @loc3
@loc4:

@loc2:
cmp eax, val3
jbe @loc5
@loc6:

Each of these cases further constrains the input:

  • The range of values leading to loc3 is [0,         val2]
  • The range of values leading to loc4 is [val2+1,    val1]
  • The range of values leading to loc5 is [val1+1,    val3]
  • The range of values leading to loc6 is [val3+1, MAX_INT]

This table has a very regular structure, and it should not be too hard to imagine what it would look like for three levels into the tree, or four, or...  The following image summarizes the process.  The unique path to a given vertex specifies the constraints required for input to reach it.

As we walk the comparisons in the tree in this fashion, the binary tree will eventually stop and give way to the code for the cases in the switch.  At this point, our constraints will be terminal cases of either single values (most commonly) or of simple ranges.  For each terminal address, we maintain a dictionary associating it with the corresponding ranges that lead there.

The following pseudocode codifies the discussion above.

// The first range returned is that for which the jump is taken;
// the second is for non-jump-taking values
partition(ea, low, high, compared)
{
  if(comparison at ea is ">")
    return [compared+1, high], [low, compared]

  if(comparison at ea is ">=")
    return [compared, high], [low, compared-1]

  if(comparison at ea is "<")
    return [low, compared-1], [compared, high]

  if(comparison at ea is "<=")
    return [low, compared], [compared+1, high]

  if(comparison at ea is "==")
    return [compared, compared], [low, high]

  if(comparison at ea is "!=")
    return [low, high], [compared, compared]
}

The following recursive algorithm solves the general problem.

analyze_bswitch(ea_t ea, int low, int high, int compared)
{
  if(ea is cmp reg, constant)
    compared = instruction's immediate value 
    ea = address of next instruction

  if(ea is a conditional jump)
    [low1,high1], [low2,high2] = partition(ea, low, high, compared)
    analyze_bswitch(jump taken ea, low1, high1, compared)
    analyze_bswitch(jump not taken ea, low2, high2, compared)

  // Instruction is a leaf in the binary tree
  else
    associate(ea, low, high)
}
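The algorithm above can be exercised in a few lines of ordinary Python.  In this sketch the comparison is passed in directly as a string rather than decoded from an instruction at an address, and the values are from the two-level example earlier; everything else follows the pseudocode.

```python
MAX_INT = 0xFFFFFFFF        # unsigned 32-bit case

def partition(cond, low, high, compared):
    """Return (taken_range, not_taken_range) for one conditional jump."""
    if cond == '>':  return (compared + 1, high), (low, compared)
    if cond == '>=': return (compared, high), (low, compared - 1)
    if cond == '<':  return (low, compared - 1), (compared, high)
    if cond == '<=': return (low, compared), (compared + 1, high)
    if cond == '==': return (compared, compared), (low, high)
    if cond == '!=': return (low, high), (compared, compared)

# Walking the two-level tree from the running example:
# loc0: jbe val1; then loc1: jbe val2 and loc2: jbe val3.
val1, val2, val3 = 100, 50, 200
taken, not_taken = partition('<=', 0, MAX_INT, val1)
r3, r4 = partition('<=', *taken, val2)       # ranges reaching loc3/loc4
r5, r6 = partition('<=', *not_taken, val3)   # ranges reaching loc5/loc6
```

Running this reproduces the table above: loc3 gets [0, val2], loc4 gets [val2+1, val1], loc5 gets [val1+1, val3], and loc6 gets [val3+1, MAX_INT].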

IDAPython-1.0 compatible source code is available here.

The resulting disassembly is properly annotated with case labels, as IDA does normally:

AUTO:0046348E   jbe     loc_463D16      ; case 0A0h
AUTO:00463494   cmp     eax, 0A2h
AUTO:00463499   jb      loc_462FA0      ; case 0837F81BAh, 0837F81D8h, 0A1h, 091h
AUTO:0046349F   ja      loc_463D21      ; case 0837F90BAh, 0A3h
AUTO:004634A5 loc_4634A5:               ; CODE XREF: sub_462120:loc_462EDAj
AUTO:004634A5                           ; sub_462120+DDDj ...
AUTO:004634A5   push    ebx             ; case 0837F8106h, 0837F8124h, 0A2h, 031h
AUTO:004634A6   call    sub_45C630
AUTO:004634AB   jmp     loc_462217

Switch as Binary Search, Part 0

Originally posted on OpenRCE on November 28th, 2008

For a standard switch with a contiguous region of cases, as in the example below, compilers generate constant-time lookups by treating the cases as indexes into an array of pointers.

switch(x)
{
  case 10: /* case 10 code */ break;
  // No case 11 (default)
  case 12: /* case 12 code */ break;
  case 13: /* case 13 code */ break;
  case 14: /* case 14 code */ break;
  // No case 15 (default)
  case 16: /* case 16 code */ break;
  default: /* default code */ break;
}

This might compile into the following:

sub eax, 10 
cmp eax, 6 
ja @default 
jmp dword ptr switch_table[eax*4] 

switch_table dd offset loc_case0
             dd offset default
             dd offset loc_case2
             dd offset loc_case3
             dd offset loc_case4
             dd offset default
             dd offset loc_case6
             ... 
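In Python terms, the generated code is doing nothing more than the following (handler names are stand-ins for the real labels).  Note the trick in the bounds check: after "sub eax, 10", any value below 10 wraps around to a huge unsigned number, so a single unsigned "ja" rejects both ends of the range.

```python
def dispatch(x):
    i = (x - 10) & 0xFFFFFFFF      # sub eax, 10 (32-bit wraparound)
    if i > 6:                      # cmp eax, 6 / ja @default
        return 'default'
    table = ['case10', 'default', 'case12', 'case13',
             'case14', 'default', 'case16']
    return table[i]                # jmp dword ptr switch_table[eax*4]
```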

For switches with multiple labels pointing to the same code, some compilers save space by generating doubly-indexed, tabular code.

switch(x)
{
  case 0x00..0x20: /* 0x00 <= x < 0x20 code */ break;
  case 0x21..0x40: /* 0x21 <= x < 0x40 code */ break;
}

This might compile to:

cmp eax, 40h 
jnb @over 
movzx eax, byte ptr index_table[eax] 
jmp dword ptr switch_table[eax*4] ; 72 bytes instead of 256 

index_table  db 32 DUP(0)
             db 32 DUP(1) 
switch_table dd offset loc_case_00_20
             dd offset loc_case_21_40
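The double lookup is easy to model directly (handler names are again stand-ins, and the bounds check is simplified): the byte-sized index table collapses 64 inputs onto two handlers, so the pointer table needs only two entries -- 64 bytes plus two dwords instead of 64 dwords.

```python
index_table  = [0] * 32 + [1] * 32      # one byte per input value
switch_table = ['case_00_20', 'case_21_40']

def dispatch(x):
    if x >= 0x40:                       # bounds check (simplified)
        return 'over'
    return switch_table[index_table[x]]
```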

For switches with non-contiguous, sparsely-distributed cases like the below, the table-based approaches from above are unsuitable, and yet cases like this do appear in practice, so the compiler must have a strategy for dealing with them.

switch(x)
{
  case 0x0:
  case 0x16:
  case 0x326:
  case 0x9821:
  case 0x90826:
  case 0x278946:
  case 0x4578939:
  case 0x12372826:
  default:
}

One obvious solution to this problem would be to generate a sequence of comparisons, one for each case, starting from the lowest case and ending with the highest.  This would work, but the lookup would be O(N) in the number of case labels.  An equally obvious solution, and one that was mentioned in the book "Reversing", is to generate a binary search construct in the generated code to perform lookups in O(log(N)) time.


To briefly review binary searching, consider the (necessarily) sorted sequence [1;2;3;4;5;6;7;8;9], and suppose we are searching for the value 2.  We begin by comparing against the middle element, 5.  Since 2 is less than 5, we discard 5 and everything above it and consider the values below.  Now we have a smaller range of values upon which we can perform the same procedure.

[1;2;3;4]

As before, we compare against the middle element, 3.  Since 2 is less than 3, we discard 3 and everything above it, leaving [1;2] as the remaining search space.

[1;2]

The middle element is 2, which is our search key, so we stop.  This process took three steps (roughly log2(9)) to complete, compared with N=9 steps for a linear scan of the whole array.

The compiler is responsible for generating code that implements these comparisons.  The first comparison will be against the middle element in the collection; whether the case value is above or below determines whether to repeat the process for the integers above or below the middle one.  At each comparison there is a corresponding equality test to see whether the search key is the same as the comparator.
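Seen from the source level, the compiler is just performing an ordinary binary search over the sorted case list.  A sketch using the sparse cases from the earlier example (returning the case's index, with Python's standard bisect module doing the searching):

```python
import bisect

cases = [0x0, 0x16, 0x326, 0x9821, 0x90826,
         0x278946, 0x4578939, 0x12372826]

def lookup(x):
    # O(log N) comparisons instead of one comparison per case
    i = bisect.bisect_left(cases, x)
    if i < len(cases) and cases[i] == x:
        return i                  # index of the matching case
    return 'default'
```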

Below is an example of this construct in the wild.

AUTO:00462141  cmp  eax, 0E3h 
AUTO:00462146  jnb  loc_46220F ; see below 
AUTO:0046214C  cmp  eax, 73h 
AUTO:0046214F  jnb  loc_463131 
AUTO:00462155  cmp  eax, 3Bh 
AUTO:00462158  jnb  loc_463606 
AUTO:0046215E  cmp  eax, 1Fh 
AUTO:00462161  jnb  loc_463835 
AUTO:00462167  cmp  eax, 10h 
AUTO:0046216A  jnb  loc_463948 
AUTO:00462170  cmp  eax, 8 
AUTO:00462173  jnb  loc_4639D1 
AUTO:00462179  cmp  eax, 4 
AUTO:0046217C  jnb  loc_463A1A 
AUTO:00462182  cmp  eax, 2 
AUTO:00462185  jnb  loc_463A2E  
AUTO:0046220F  ja   loc_462257 
AUTO:00462211  push ebx             ; case 0E3h 
AUTO:00462212  call sub_460920

Hex-Rays does a nice job of illustrating how time-consuming, ugly, and confusing it is to deal with code like this:

if ( v7 <= 0x837FDF02 )
  goto LABEL_755;
if ( v7 >= 0x837FE22C )
{
  if ( v7 <= 0x837FE22C )
    return sub_45EA80(v4);
  if ( v7 < 0x837FE358 )
  {
    if ( 2088770818 != v5 )
      goto LABEL_10;
    goto LABEL_97;
  }
  if ( v7 <= 0x837FE358 )
    goto LABEL_750;
  if ( 2088770608 != v5 )
    goto LABEL_10;
  goto LABEL_100;
}
if ( v7 < 0x837FE0E2 )
  goto LABEL_10;
if ( v7 <= 0x837FE0E2 )
  goto LABEL_535;
if ( 2088771118 == v5 )
{
  sub_460880(v4);
  goto LABEL_23;
}

This entry is continued in part 1.


VMProtect, Part 3: Optimization and Code Generation

Originally published on August 6th, 2008 on OpenRCE

This is part #3 of a four-part series on VMProtect. The other parts can be found here:

The reader should inspect this unoptimized IR listing before continuing.  In an attempt to keep this entry from becoming unnecessarily long, the example snippets will be small, but for completeness a more thorough running example is linked throughout the text.

We begin by removing the stack machine features of the IR.  Since VMProtect operates on disassembled x86 code, and x86 itself is not a stack machine, this aspect of the protection is unnatural and easily removed.  Here is a 16-line fragment of VMProtect IR.

push Dword(-88)
push esp
push Dword(4)
pop t3
pop t4
t5 = t3 + t4
push t5
push flags t5
pop DWORD Scratch:[Dword(52)]
pop t6
pop t7
t8 = t6 + t7
push t8
push flags t8
pop DWORD Scratch:[Dword(12)]
pop esp

All but two instructions are pushes or pops, and the pushes can be easily matched up with the pops.  Tracking the stack pointer, we see that, for example, t3 = Dword(4).  A simple analysis allows us to "optimize away" the push/pop pairs into assignment statements.  Simply iterate through each instruction in a basic block and keep a stack describing the source of each push.  For every pop, ensure that the sizes match and record the location of the corresponding push.  We wish to replace the pop with an assignment to the popped expression from the pushed expression, as in

t3 = Dword(4)
t4 = esp 
t7 = Dword(-88)

With the stack aspects removed, we are left with a more conventional listing containing many assignment statements.  This optimization substantially reduces the number of instructions in a given basic block (~40% for the linked example) and opens the door for other optimizations.  The newly optimized code is nine lines, roughly half of the original:

t3 = Dword(4)
t4 = esp
t5 = t3 + t4
DWORD Scratch:[Dword(52)] = flags t5
t6 = t5
t7 = Dword(-88)
t8 = t6 + t7
DWORD Scratch:[Dword(12)] = flags t8
esp = t8

A complete listing of the unoptimized IR versus the one with the stack machine features removed is here, which should be perused before proceeding.
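The matching pass itself fits in a few lines.  The toy below uses illustrative (op, operand) tuples rather than my actual IR types, and it only handles the straight-line case: a real implementation must also verify that operand sizes match and treat the "push flags" instructions.

```python
def eliminate_stack(block):
    stack, out = [], []
    for op, arg in block:
        if op == 'push':
            stack.append(arg)           # remember this push's source
        elif op == 'pop':
            src = stack.pop()           # the matching push for this pop
            out.append(('assign', arg, src))
        else:
            out.append((op, arg))
    return out

block = [('push', 'esp'), ('push', 'Dword(4)'),
         ('pop', 't3'), ('pop', 't4')]
# eliminate_stack(block) yields t3 = Dword(4) and t4 = esp,
# matching the first assignments in the listing above.
```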

Now we turn our attention to the temporary variables and the scratch area.  Recall that the former were not part of the pre-protected x86 code, nor the VMProtect bytecode -- they were introduced in order to ease the IR translation.  The latter is part of the VMProtect bytecode, but was not part of the original pre-protected x86 code.  Since these are not part of the languages we are modelling, we shall eliminate them wholesale.  On a high level, we treat each temporary variable, each byte of the scratch space, and each register as being a variable defined within a basic block, and then eliminate the former two via the compiler optimizations previously discussed.

Looking again at the last snippet of IR, we can see several areas for improvement.  First, consider the variable t6.  It is clearly just a copy of t5, and neither variable is redefined before the next use in the assignment to t8.  Copy propagation will replace t6 with t5 and eliminate the former.  More generally, t3, t4, and t7 contain either constants or values that are not modified between their definitions and uses.  Constant and copy propagation will substitute the assigned values in for the uses of these variables and eliminate them.
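Over straight-line code where each temporary is assigned once, both propagations reduce to a single forward pass carrying an environment of known values.  A minimal sketch, with sums represented as lists of terms and names purely illustrative:

```python
def propagate(stmts):
    env, out = {}, []
    for dst, terms in stmts:
        terms = [env.get(t, t) for t in terms]  # substitute known values
        if len(terms) == 1:
            env[dst] = terms[0]   # constant or copy: remember, emit nothing
        else:
            out.append((dst, terms))            # real computation: keep it
    return out

stmts = [('t3', ['Dword(4)']), ('t4', ['esp']),
         ('t5', ['t3', 't4']), ('t6', ['t5']),
         ('t7', ['Dword(-88)']), ('t8', ['t6', 't7'])]
```

On this input the pass leaves only the two additions, with all six temporaries' copies and constants folded into them.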

The newly optimized code is a slender three lines compared to the original 16; we have removed roughly 80% of the IR for the running example.

DWORD Scratch:[Dword(52)] = flags Dword(4) + esp
esp = Dword(4) + esp + Dword(-88)
DWORD Scratch:[Dword(12)] = flags Dword(4) + esp + Dword(-88)

The side-by-side comparison can be found here.

The IR now looks closer to x86, with the exception that the results of computations are being stored in the scratch area, not into registers.  As before, we apply dead-store elimination, copy and constant propagation to the scratch area, removing dependence upon it entirely in the process.  See here for a comparison with the last phase.
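Dead-store elimination over the scratch area is a single backward pass: a store is dead if the same location is overwritten before any read of it.  A minimal sketch, ignoring aliasing and using illustrative tuples:

```python
def eliminate_dead_stores(stmts):
    # stmts: ('store', loc, expr) or ('read', loc); walk backwards
    live, out = set(), []
    for s in reversed(stmts):
        if s[0] == 'store':
            if s[1] in live:
                out.append(s)
                live.discard(s[1])   # this store supplies the pending read
            # else: overwritten before any read -- drop it
        else:
            live.add(s[1])
            out.append(s)
    return list(reversed(out))

stmts = [('store', 'Scratch[52]', 'flags t5'),
         ('store', 'Scratch[52]', 'flags t8'),
         ('read',  'Scratch[52]')]
# Only the second store survives: the first is overwritten before any read.
```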

Here is a comparison of the final, optimized code against the original x86:

push ebp                          push ebp
ebp = esp                         mov ebp, esp
push Dword(-1)                    push 0FFFFFFFFh
push Dword(4525664)               push 450E60h
push Dword(4362952)               push offset sub_4292C8
eax = DWORD FS:[Dword(0)]         mov eax, large fs:0
push eax                          push eax
DWORD FS:[Dword(0)] = esp         mov large fs:0, esp
eflags = flags esp + Dword(-88)
esp = esp + Dword(-88)            add esp, 0FFFFFFA8h
push ebx                          push ebx
push esi                          push esi
push edi                          push edi
DWORD SS:[Dword(-24) + ebp] = esp mov [ebp-18h], esp
call DWORD [Dword(4590300)]       call dword ptr ds:unk_460ADC
vmreturn Dword(0) + Dword(4638392)

Code generation is an afterthought.