Detours into Arcana: A Static Malware Analysis Trick

Several friends asked for my advice on a particular technique seen in some recent malware samples, which is similar in nature to the Nanomites feature of the Armadillo copy protection.  There are three parts to this technique:

  1. When creating the malware binary, replace control transfer (call and jmp) instructions with privileged instructions.
  2. At run-time, install an exception handler.
  3. Catch the exceptions thrown by the privileged instructions, perform the intended control transfers, and resume execution.

A concrete example follows.

UPX0:004081F7 in eax, dx
UPX0:004081F8 dw 4EE0h

When the process attempts to execute this instruction, an exception is generated.  The exception handler checks to see that the faulting instruction is "in eax, dx", then reads the word following the instruction, and generates a call to location 0x405000 + 0x4EE0.  In other words, call instructions within the module are replaced by:

in eax, dx
dw call_destination - 0x405000

As malware analysts, we would like to deobfuscate the binary by replacing these privileged sequences with the original call instructions, both for our own sake as well as the ability to use the Hex-Rays decompiler (which otherwise balks at the use of privileged instructions).  However, the particular implementation within this sample poses a slight conundrum.  The sequence "in eax, dx / dw XXXXh" is three bytes long, whereas the original "call 409EE0h" is five bytes.  Therefore, we cannot merely rewrite the original instruction atop the privileged one without overwriting subsequent instructions.

A second idea is to use detouring, another staple technique in reverse engineering.  We could find or create some unused space within the binary, patch a jmp from the privileged instruction to that new location, recreate the original instruction at that location, and then patch a jmp back to the location after the privileged instruction.  However, this idea is flawed for the same reason:  a long jmp instruction is five bytes long, so we would also alter subsequent instructions.

Bill Gates questionably said "I choose a lazy person to do a hard job, because a lazy person will find an easy way to do it."  After some thought, I recalled a bit of x86 processor arcana that can help us fit our detours into the three-byte spaces provided by the obfuscator:  the address-size prefix, 0x67.  Quoth the Intel manuals:  "Address calculations are first truncated to the effective address size of the current mode, as overridden by any address-size prefix.  The result is then zero-extended to the full address width."  I.e., when operating in 32-bit mode, if we prefix an instruction that references an address with 0x67, the address will be truncated to 16-bits.  

To be specific, consider the following:

UPX1:00410320 EB 00 jmp near ptr unk_410322 
; jump to 0x00410320 + 2(length of jmp instruction) + 0x00 

When we place an address-size prefix on this instruction, we get:

UPX1:00410320          db 67h
UPX1:00410320 67 EB 00 jmp near ptr unk_323 
; jump to (0x00410320 + 3(length of jmp instruction) + 0x00) & 0xFFFF

To use this trick for deobfuscation, we must first create a segment at address 0x0 of length 0xFFFF.

Recalling our motivating example:

UPX0:004081F7 in eax, dx ; obfuscated call 0x405000 + 0x4EE0
UPX0:004081F8 dw 4EE0h 

Let us overwrite these bytes with 67 EB FD:

UPX0:004081F7 db 67h
UPX0:004081F7 jmp short loc_81F7

This is the first half of our detour.  Now, at location 81F7h, let's replicate the original control transfer instruction and add a jmp back to the location after the obfuscated sequence:

ROLFMSRE:000081F7 loc_81F7:
ROLFMSRE:000081F7 call    sub_409EE0
ROLFMSRE:000081FC jmp     loc_4081FA

And now we have accomplished our goal.  Once we have replaced the obfuscated control transfers, we human analysts can read the listing more easily, and Hex-Rays no longer has trouble decompiling the code.

Trivia Questions for X86 Nerds

Googling, referencing the Intel manuals, and using a debugger are all discouraged.  Please don't post the answers in the comments!

  1. Name two instructions that have a memory expression for an operand, but do not access memory.
  2. Conditional jumps with 16-/32-bit displacements were not available on the 8086. How did compilers generate long conditional jumps back then?
  3. For ModRM-32 memory expressions (such as dword ptr [eax], byte ptr [eax+ebx], word ptr [eax+ebx*4], qword ptr [ebx*8]), what are the rules for determining the segment against which the address is applied?  What about ModRM-16 memory expressions (like [bx+si])?
  4. The instruction "bswap r32" endian-swaps the specified 32-bit register.  I.e., if eax = 12345678h, after executing bswap eax, eax = 78563412h.  The behavior of "bswap r16" (i.e., bswap ax) is undefined as per the Intel manuals.  Name a behavior exhibited by an actual processor when "bswap r16" executes.
  5. Name two single-byte, undocumented instructions, and describe their behavior.
  6. Name a circumstance under which the "aam" instruction can fault.
  7. Name an instruction that writes to memory in some specific segment, where the segment cannot be overridden by a segment prefix.
  8. The "nop r/m32" instruction (e.g., "nop [eax]"), introduced in the Pentium Pro series of processors, behaves identically to the "nop" instruction which has been present since the original 8088.  Why does the Pentium Pro instruction exist?
  9. For SSE instructions with mandatory prefixes (66/F1/F3), what happens if you put two such prefixes on an instruction?
  10. Name a 32-bit instruction that is not encodable in 64-bit mode due to its assimilation into the VEX prefix schema.
  11. "mov eax, [reg32]" is an invalid instruction (i.e., cannot be encoded) for which general-purpose 32-bit register (eax, ebx, ecx, edx, esp, ebp, esi, edi)?
  12. Comparing "inc eax" and "add eax, 1", what is the difference in processor state (i.e. the registers, flags, and memory, without considering EIP) after execution?
  13. Name a register that existed before the Pentium series, and ceased to exist beginning with the Pentium series.
  14. What happens when you put an address size (67) prefix on a conditional jump?
  15. "movsb" implicitly references two memory operands, ds:[esi] and es:[edi].  What happens when you put a segment prefix on this instruction?
  16. The "bit-scan in reverse" instruction, "bsr eax, ebx", sets eax to the bit number of the least significant 1-bit set within ebx.  If ebx is zero, the value placed into eax is undefined as per the Intel manuals.  Name a behavior exhibited by an actual processor when executing "bsr" with a right-hand size of zero.
  17. Arithmetic comparison operations are not commutative.  I.e., "cmp eax, ebx" is not the same as "cmp ebx, eax".  In the instruction "cmpxchg cl, bl", which comparison is performed?
  18. In terms of processor state, is "rol al, 0" the same as "rol al, 8"?
  19. The auxiliary carry flag (AF) is similar to the carry flag (CF), albeit for X-bit quantities instead of 8/16/32/64. What is X?
  20. Apart from "pushf" and "lahf", name an instruction that uses the value of the AF flag (as opposed to merely defining AF without using it).
  21. "shld r32, r/m32, r/imm8" shifts bits from the second operand into the first operand (from the left, i.e., the bottom).  For example, if eax = 0x40000001, edx = 0x80000000, and cl = 1, after executing "shld eax, edx, cl", eax = 0x80000003.  The shld instruction behaves analogously for 16-bit operands, but its behavior is undefined as per the Intel manuals if the shift count (third operand) specifies a shift of more than 16.  Name a behavior exhibited by an actual processor when "shld ax, dx, cl" executes with 0x10 <= cl < 0x20.
  22. After executing "shl eax, 32", is the overflow flag (OF) defined?  If so, what is its value?
  23. After executing "shl ax, 16", is the overflow flag (OF) defined?  If so, what is its value?
  24. In terms of processor state, is there any difference between: "btc eax, ebx" and "push eax / btc [esp], ebx / pop eax" (apart from the values of EIP and dword ptr [esp-4])?
  25. In 16-bit real mode, segments are 64k windows into a 1MB address space.  This coincides with the range of a 16-bit near call or near jump.  Name a strategy that 16-bit linkers employ to allow seamless merging of the control flow between object files whose combined code size exceeds 64kb.

Program Synthesis in Reverse Engineering

The slides and video for my keynote speech, "Program Synthesis in Reverse Engineering", are now online.  

Abstract

Program synthesis is an academic discipline devoted to creating computer programs automatically, given a precise specification of how the program should operate.  It works on small scales and is mostly researched for programs without loops in them.  We apply and adapt existing academic work in program synthesis to solve problems in reverse engineering.

  1. Semi-automated synthesis of CPU emulators (academic inspiration here)
  2. Automated generation of deobfuscators for peephole-expansion obfuscators (academic inspiration here)
  3. Reconstruction of obfuscated, metamorphic code sequences (academic inspiration here)

Viewing Instructions

Open the video on half of your screen and the slides on the other, switching through the slides as I do so during the video.  The presentation uses a lot of in-frame animations, so for best results, you will want to view the PDF in contiguous rather than contiguous mode.  I.e., only one slide should be on-screen at a time (no fractions of subsequent slides visible), and that advancing the slide should bring up an entirely new slide.  This is easy to accomplish with the full-page mode of standalone PDF viewers.  In Chrome's PDF viewer, you can use the left and right arrow keys to advance or retreat one slide at a time.  Alternatively, there is an icon bar at the bottom right of each slide.  The first two buttons from the left retreat and advance by one slide, respectively.  Failing all of these options, use a different PDF viewer.

Program Analysis Course

Information about the new course described at the end of the talk can be found here.  At present, for logistical reasons, only private on-site offerings can be accommodated.  I hope to install a web application to help track demand (availability in time and in region) to schedule public courses.

Video of my RECON 2012 Keynote: The Case for Semantics-Based Methods in Reverse Engineering

Originally published July 2nd, 2012 on OpenRCE

My previous blog entry concerned the keynote speech that I gave at RECON 2012, entitled "The Case for Semantics-Based Methods in Reverse Engineering".  (You can find a link to the slides at that previous entry.)  In that blog, based on conversations that I had had with the RECON organizers, I made statements to the effect that the video had been destroyed.  It turns out that the video was in fact not destroyed.  You can watch it here.  I would recommend reading along with the slides, seeing as I apparently skipped a couple of slides during the beginning of my presentation.  Blah

RECON 2012 Keynote: The Case for Semantics-Based Methods in Reverse Engineering

Originally published June 18th, 2012 on OpenRCE.

The goal of my RECON 2012 keynote speech was to introduce methods in academic program analysis and demonstrate -- intuitively, without drawing too much on formalism -- how they can be used to solve practical problems that are interesting to industrial researchers in the real world.  Given that it was the keynote speech, and my goal of making the material as accessible as possible, I attempted to make my points with pictures instead of dense technical explanations.  As a result, one might consider this presentation to be a friendly (but decidedly incomplete) introduction to binary program analysis as opposed to a rigorous mathematical monograph.  The presentation features five detailed expositions of applying static program analysis (abstract interpretation and SMT solving) towards practically-interesting reverse engineering problems.  (Aside:  it's quite challenging to present this material without using terms such as "lattice", "Galois connection", etc.!)

Unfortunately, due to an error with the camera, the recording of the talk does not exist. This is problematic: I failed somewhat in walking the sharp edge of Einstein's razor, "as simple as possible, but no simpler" -- it was in fact made simpler than what was possible, and some important details (for example, about relational abstract interpretation and reduced products) were included in the spoken material but not the actual slides. Therefore, the learned reader is advised to imagine judiciously-placed asterisks and the accompanying errata, and the untutored pupil would be well-advised to recognize the incomplete and intuitive nature of the exposition and perhaps consult this program analysis reading list.

I would like to give the talk at some other conference at which the video can be reliably recorded, so that it may be published online.

Here are the slides.