A Walk-Through Tutorial, with Code, on Statically Unpacking the FinSpy VM: Part One, x86 Deobfuscation

1. Introduction

Normally when I publish about breaking virtual machine software protections, I do so to present new techniques. Past examples have included:

Today's document has a different focus. I am not going to be showcasing any particularly new techniques. I will, instead, be providing a step-by-step walk-through of the process I used to analyze the FinSpy VM, including my thoughts along the way, the procedures and source code I used, and summaries of the notes I took. The interested reader is encouraged to obtain the sample and walk through the analysis process for themselves.

I have three motives in publishing this document:

  1. I think it's in the best interest of the security defense community if every malware analyst is able to unpack the FinSpy malware VM whenever they encounter it (for obvious reasons).
  2. Reverse engineering is suffering from a drought of hands-on tutorial material in modern times. I was fortunate to begin reverse engineering when such tutorials were common, and they were invaluable in helping me learn the craft. Slides are fine for large analyses, but for smaller ones, let's bring back tutorials for the sake of those that have followed us.
  3. Publications on obfuscation, especially virtualization obfuscation, have become extremely abstruse, particularly in the past five years. Many of these publications are largely inaccessible to those not well-versed in master's degree-level program analysis (or above). I want to demonstrate that easier techniques can still produce surprisingly fast and useful results for some contemporary obfuscation techniques. (If you want to learn more about program analysis-based approaches to deobfuscation, there is currently a public offering of my SMT-based program analysis training class, which has over 200 slides on modern deobfuscation with working, well-documented code.)

2. Initial Steps

The first thing I did upon learning that a new FinSpy sample with VM was publicly available was, of course, to obtain the sample. VirusTotal gave the SHA256 hash, and I obtained the corresponding sample from Hybrid-Analysis.

The next step was to load the sample into IDA. The navigation bar immediately tipped me off that the binary was obfuscated:

NavBar-Original.png
  • The first half of the .text section is mostly colored grey and red, indicating data and non-function code respectively.
  • The second half of the .text section is grey in the navigation bar, indicating data turned into arrays.

A normal binary would have a .text section that was mostly blue, indicating code within functions.

3. Analysis of WinMain: Suspicions of VM-Based Obfuscation

IDA's auto-analysis feature identified that the binary was compiled by the Microsoft Visual C compiler. I began by identifying the WinMain function. Normally IDA would do this on my behalf, but the code at that location is obfuscated, so IDA did not name it or turn it into a function. I located WinMain by examining the ___tmainCRTStartup function from the Visual C Run-Time and finding where it called into user-written code. The first few instructions resembled a normal function prologue; from there, the obfuscation immediately began.

.text:00406154     mov     edi, edi                 ; Normal prologue
.text:00406156     push    ebp                      ; Normal prologue
.text:00406157     mov     ebp, esp                 ; Normal prologue
.text:00406159     sub     esp, 0C94h               ; Normal prologue
.text:0040615F     push    ebx                      ; Save registers #1
.text:00406160     push    esi                      ; Save registers #1
.text:00406161     push    edi                      ; Save registers #1
.text:00406162     push    edi                      ; Save registers #2
.text:00406163     push    edx                      ; Save registers #2
.text:00406164     mov     edx, offset byte_415E41  ; Obfuscation - #1
.text:00406169     and     edi, 0C946B9C3h          ; Obfuscation - #2
.text:0040616F     sub     edi, [edx+184h]          ; Obfuscation - #3
.text:00406175     imul    edi, esp, 721D31h        ; Obfuscation - #4
.text:0040617B     stc                              ; Obfuscation
.text:0040617C     sub     edi, [edx+0EEh]          ; Obfuscation - #5
.text:00406182     shl     edi, cl                  ; Obfuscation
.text:00406184     sub     edi, [edx+39h]           ; Obfuscation - #6
.text:0040618A     shl     edi, cl                  ; Obfuscation
.text:0040618C     imul    edi, ebp                 ; Obfuscation
.text:0040618F     mov     edi, edi                 ; Obfuscation
.text:00406191     stc                              ; Obfuscation
.text:00406192     sub     edi, 0A14686D0h          ; Obfuscation

; ... obfuscation continues ...

.text:004065A2     pop     edx                      ; Restore registers
.text:004065A3     pop     edi                      ; Restore registers

The obfuscation in the sequence above continues for several hundred instructions, nearly all of them consisting of random-looking modifications to the EDI register. I wanted to know A) whether the computations upon EDI were entirely immaterial junk instructions, or whether a real value was being produced by this sequence, and B) whether the memory references in the lines labeled #1, #3, #5, and #6 were meaningful.

As for the first question, note that the values of the registers upon entering this sequence are unknown. We are, after all, in WinMain(), which uses the __stdcall calling convention, meaning that the caller did not pass arguments in registers. Therefore, the value computed on line #2 is unpredictable and can potentially change across different executions. Also, the value computed on line #4 is pure gibberish -- the value of the stack pointer will change across runs (and the modification to EDI overwrites the values computed on lines #1-#3).

As for the second question, I skimmed the obfuscated listing and noticed that there were no writes to memory, only reads, all intertwined with gibberish instructions like the ones just described. Finally, the original value of edi is popped off the stack at the location near the end labeled "restore registers". So I was fairly confident that I was looking at a sequence of instructions meant to do nothing, producing no meaningful change to the state of the program.

Following that was a short sequence:

.text:004065A4     push    5A403Dh                  ; Obfuscation
.text:004065A9     push    ecx                      ; Obfuscation
.text:004065AA     sub     ecx, ecx                 ; Obfuscation
.text:004065AC     pop     ecx                      ; Obfuscation
.text:004065AD     jz      loc_401950               ; Transfer control elsewhere
.text:004065AD ; ---------------------------------------------------------------------------
.text:004065B3     db 5 dup(0CCh)
.text:004065B8 ; ---------------------------------------------------------------------------
.text:004065B8     mov     edi, edi
.text:004065BA     push    ebp
.text:004065BB     mov     ebp, esp
.text:004065BD     sub     esp, 18h

; ... followed by similar obfuscation to what we saw above ...

By inspection, this sequence just pushes the value 5A403Dh onto the stack, and transfers control to loc_401950. (The "sub ecx, ecx" instruction above sets the zero flag to 1, therefore the JZ instruction will always branch.) 

Next we see the directive "db 5 dup(0CCh)" followed by "mov edi, edi". Reverse engineers will recognize these sequences as the Microsoft Visual C compiler's implementation of hot-patching support. The details of hot-patching are less important than the observation that I expected that the original pre-obfuscated binary contained a function that began at the address of the first sequence, and ended before the "db 5 dup(0CCh)" sequence. I.e. I expect that the obfuscator disassembled all of the code within this function, replaced it with gibberish instructions, placed a branch at the end to some other location, and then did the same thing with the next function.

This is a good sign that we're dealing with a virtualization-based obfuscator: namely, it looks like the binary was compiled with an ordinary compiler, then passed to a component that overwrote the original instructions (rather than merely encrypting them in-place, as would normal packers). 

4. Learning More About the VM Entrypoint and VM Pre-Entry

Recall again the second sequence of assembly code from the previous section:

.text:004065A4     push    5A403Dh                  ; Obfuscation - #1
.text:004065A9     push    ecx                      ; Obfuscation
.text:004065AA     sub     ecx, ecx                 ; Obfuscation
.text:004065AC     pop     ecx                      ; Obfuscation
.text:004065AD     jz      loc_401950               ; Transfer control elsewhere

Since -- by supposition -- all of the code from this function was replaced with gibberish, there wasn't much to meaningfully analyze. My only real option was to examine the code at loc_401950, the target of the JZ instruction on the last line. The first thing I noticed at this location was that there were 125 incoming references, nearly all of them of the form "jz loc_401950", with some of the form "jmp loc_401950". Having analyzed a number of VM-based obfuscators in my day, I recognized that this location fits the pattern of the part of the VM known as the "entrypoint" -- the part where the virtual CPU begins to execute. Usually this location will save the registers and flags onto the stack, then perform any necessary setup, and finally begin executing VM instructions. VM entrypoints usually require a pointer or other identifier to the bytecode that will be executed by the VM; maybe that's the value from the instruction labeled #1 in the sequence above? Let's check another incoming reference to that location to verify:

.text:00408AB8     push    5A7440h ; #2
.text:00408ABD     push    eax
.text:00408ABE     sub     eax, eax
.text:00408AC0     pop     eax
.text:00408AC1     jz      loc_401950

The other location leading to the entrypoint is functionally identical, apart from pushing a different value onto the stack. This value is not a pointer; it does not correspond to an address within the executable's memory image. Nevertheless, we expect that this value is somehow responsible for telling the VM entrypoint where the bytecode is located.
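The two stubs above are regular enough to be matched by byte pattern. Here is a minimal sketch of that idea in plain Python, working on a byte string rather than an IDA database. The function name parse_pre_entry is hypothetical, and the byte encodings are my assumption, taken from the standard x86 opcode map (68 for "push imm32", 50+r for "push reg", 29 or 2B for "sub reg, reg", 58+r for "pop reg", 0F 84 for the long form of JZ) rather than read from the sample. It does not cover the less common "jmp loc_401950" form.

```python
import struct

STUB_LEN = 15  # 5 (push imm32) + 1 (push reg) + 2 (sub) + 1 (pop) + 6 (jz rel32)

def parse_pre_entry(code, ea):
    """Return (pushed_dword, branch_target) if `code`, located at address `ea`,
    looks like: push imm32 / push reg / sub reg, reg / pop reg / jz rel32.
    Return None otherwise."""
    if len(code) < STUB_LEN or code[0] != 0x68:      # 68 id = push imm32
        return None
    key = struct.unpack_from("<I", code, 1)[0]
    push_reg, sub_op, modrm, pop_reg = code[5], code[6], code[7], code[8]
    if not 0x50 <= push_reg <= 0x57:                 # 50+r = push r32
        return None
    reg = push_reg - 0x50
    if sub_op not in (0x29, 0x2B):                   # both encodings of sub r32, r32
        return None
    if modrm != (0xC0 | (reg << 3) | reg):           # mod=11, same register twice
        return None
    if pop_reg != 0x58 + reg:                        # 58+r = pop r32
        return None
    if code[9] != 0x0F or code[10] != 0x84:          # 0F 84 cd = jz rel32
        return None
    rel = struct.unpack_from("<i", code, 11)[0]
    return key, (ea + STUB_LEN + rel) & 0xFFFFFFFF
```

Run over every address that references loc_401950, something like this would collect the full set of pushed DWORDs, which is the raw material for working out how the VM locates its bytecode.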

5. Analyzing the VM Entrypoint Code

So far we have determined that loc_401950 is the VM entrypoint, targeted by 125 branching locations within the binary, which each push a different non-pointer DWORD before branching. Let's start analyzing that code:

.text:00401950                   loc_401950:
.text:00401950 0F 82 D1 02 00 00     jb      loc_401C27
.text:00401956 0F 83 CB 02 00 00     jnb     loc_401C27

Immediately we see an obvious and well-known form of obfuscation. The first line jumps to loc_401C27 if the "below" conditional is true, and the second line jumps to loc_401C27 if the "not below" conditional is true. I.e., execution will reach loc_401C27 if either "below" or "not below" is true in the current EFLAGS context. I.e., these two instructions will transfer control to loc_401C27 no matter what is in EFLAGS -- and in particular, we might as well replace these two instructions with "jmp loc_401C27", as the effect would be identical.

Continuing to analyze at loc_401C27, we see another instance of the same basic idea:

.text:00401C27                   loc_401C27:
.text:00401C27 77 CD                 ja      short loc_401BF6
.text:00401C29 76 CB                 jbe     short loc_401BF6

Here we have an unconditional branch to loc_401BF6, split across two instructions -- a "jump if above", and "jump if below or equals", where "above" and "below or equals" are logically opposite and mutually exclusive conditions.
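To convince yourself that such a pair really is unconditional, you can enumerate the flag states. The sketch below is illustrative only: it models just the four conditions seen so far, using their standard definitions in terms of the carry and zero flags, and the names CONDITIONS and always_taken are mine.

```python
from itertools import product

# Branch conditions in terms of the carry flag (cf) and zero flag (zf).
CONDITIONS = {
    "jb":  lambda cf, zf: cf == 1,
    "jnb": lambda cf, zf: cf == 0,
    "ja":  lambda cf, zf: cf == 0 and zf == 0,
    "jbe": lambda cf, zf: cf == 1 or zf == 1,
}

def always_taken(first, second):
    """True if exactly one of the two branches fires in every flag state,
    i.e. the pair behaves like a single unconditional jump."""
    return all(
        CONDITIONS[first](cf, zf) != CONDITIONS[second](cf, zf)
        for cf, zf in product((0, 1), repeat=2)
    )

print(always_taken("jb", "jnb"))  # True
print(always_taken("ja", "jbe"))  # True
print(always_taken("jb", "ja"))   # False: neither fires when cf=0, zf=1
```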

After this, at location loc_401BF6, there is a legitimate-looking instruction (push eax), followed by another conditional jump pair to loc_401D5C. At that location, there is another legitimate-looking instruction (push ecx), followed by a conditional jump pair to loc_4019D2. At that location, there is another legitimate-looking instruction (push edx), followed by another conditional jump pair. It quickly became obvious that every legitimate instruction was interspersed between one or two conditional jump pairs -- there are hundreds or thousands of these pairs throughout the binary.

Though an extremely old and not particularly sophisticated form of obfuscation, it is nevertheless annoying and degrades the utility of one's disassembler. As I discussed in a previous entry on IDA processor module extensions, IDA does not automatically recognize that two opposite conditional branches to the same location are an unconditional branch to that location. As a result, IDA thinks that the address following the second conditional branch must necessarily contain code. Obfuscation authors exploit this by putting junk bytes after the second conditional branch, which then causes the disassembler to generate garbage instructions, which may overlap and occlude legitimate instructions following the branch due to the variable-length encoding scheme for X86. (Note that IDA is not to blame for this conundrum -- ultimately these problems are undecidable under ordinary Von Neumann-based models of program execution.) The result is that many of the legitimate instructions get lost in the dreck generated by this process, and that, in order to follow the code as usual in manual static analysis, one would spend a lot of time manually undefining the gibberish instructions and re-defining the legitimate ones.

6. Deobfuscating the Conditional Branch Obfuscation: Theory and Practice

Manually undefining and redefining instructions as just described, however, would be a waste of time, so let's not do that. Speaking of IDA processor modules, once it became clear that this pattern repeated between every legitimate non-control-flow instruction, I got the idea to write an IDA processor module extension to remove the obfuscation automatically. IDA processor module extensions give us the ability to have a function of ours called every time the disassembler encounters an instruction. If we could recognize that the instruction we were disassembling was a conditional branch, and determine that the following instruction contains its opposite conditional branch to the same target as the first, we could replace the first one with an unconditional branch and NOP out the second branch instruction.

Thus, the first task is to come up with a way to recognize instances of this obfuscation. It seemed like the easiest way would be to do this with byte pattern-recognition. In my callback function that executes before an instruction is disassembled, I can inspect the raw bytes to determine whether I'm dealing with a conditional branch, and if so, what the condition is and the branch target. Then I can apply the same logic to determine whether the following instruction is a conditional branch and determine its condition and target. If the conditions are opposite and the branch targets are the same, we've found an instance of the obfuscation and can neutralize it.

In practice, this is even easier than it sounds! Recall the first example from above, reproduced here for ease of reading:

.text:00401950 0F 82 D1 02 00 00     jb      loc_401C27
.text:00401956 0F 83 CB 02 00 00     jnb     loc_401C27

Each of these two instructions is six bytes long. They both begin with the byte 0F (the x86 two-byte escape opcode stem), are then followed by a byte in the range of 80 to 8F, and are then followed by a DWORD encoding the displacement from the end of the instructions to the branch targets. As a fortuitous quirk of x86 instruction encodings, opposite conditional branches are encoded with adjacent bytes. I.e. 82 represents the long form of JB, and 83 represents the long form of JNB. Two long branches have opposite condition codes if and only if their second opcode byte differs from one another in the lowest bit (i.e. 0x82 ^ 0x83 == 0x01). And note also that the DWORDs following the second opcode byte differ by exactly 6 -- the length of a long conditional branch instruction.
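The checks just described fit in a few lines. The following is a sketch in plain Python over a byte string, not the processor module code itself; long_pair_target is a name I made up for illustration.

```python
import struct

def long_pair_target(code, ea):
    """If `code` (the bytes at address `ea`) begins with two opposite long
    conditional branches to the same target, return that target; else None."""
    if len(code) < 12 or code[0] != 0x0F or code[6] != 0x0F:
        return None
    op1, op2 = code[1], code[7]
    # Second opcode byte must be 0x80-0x8F, and the pair must differ only in bit 0.
    if not 0x80 <= op1 <= 0x8F or op2 != op1 ^ 0x01:
        return None
    d1 = struct.unpack_from("<i", code, 2)[0]   # signed rel32 of the first branch
    d2 = struct.unpack_from("<i", code, 8)[0]   # signed rel32 of the second branch
    # The second branch starts 6 bytes later, so its displacement is 6 smaller.
    if d1 - 6 != d2:
        return None
    return (ea + 6 + d1) & 0xFFFFFFFF

pair = bytes.fromhex("0F82D1020000" "0F83CB020000")
print(hex(long_pair_target(pair, 0x401950)))  # 0x401c27
```

The displacement is measured from the end of the instruction, which is why the target is computed as ea + 6 + d1.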

That's all we need to know for the long conditional branches. There is also a short form for conditionals, shown in the second example above and reproduced here for ease of reading:

.text:00401C27 77 CD                 ja      short loc_401BF6
.text:00401C29 76 CB                 jbe     short loc_401BF6

Virtually identical comments apply to these sequences. The first bytes of both instructions are in the range of 0x70 to 0x7F, opposite conditions have differing lowest bits, and the second bytes differ from one another by exactly 2 -- the length of a short conditional branch instruction.
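The short form can be checked the same way. This is again a plain-Python sketch with a hypothetical name, short_pair_target. It treats the displacement bytes as signed 8-bit values, which is how the processor interprets them.

```python
def short_pair_target(code, ea):
    """If `code` (the bytes at address `ea`) begins with two opposite short
    conditional branches to the same target, return that target; else None."""
    if len(code) < 4:
        return None
    b1, d1, b2, d2 = code[0], code[1], code[2], code[3]
    # Opcode must be 0x70-0x7F, and the pair must differ only in bit 0.
    if not 0x70 <= b1 <= 0x7F or b2 != b1 ^ 0x01:
        return None
    # rel8 displacements are signed: 0x80-0xFF mean -128 to -1.
    s1 = d1 - 0x100 if d1 >= 0x80 else d1
    s2 = d2 - 0x100 if d2 >= 0x80 else d2
    # The second branch starts 2 bytes later, so its displacement is 2 smaller.
    if s1 - 2 != s2:
        return None
    return (ea + 2 + s1) & 0xFFFFFFFF

print(hex(short_pair_target(bytes.fromhex("77CD76CB"), 0x401C27)))  # 0x401bf6
```

Here 0xCD is -51 as a signed byte, and 0x401C27 + 2 - 51 = 0x401BF6, matching the disassembly.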

7. Deobfuscating the Conditional Branch Obfuscation: Implementation

I started by copying and pasting my code from the last time I did something like this. I first deleted all the code that was specific to the last protection I broke with an IDA processor module extension. Since I've switched to IDA 7.0 in the meantime, and since IDA 7.0 made breaking changes vis-a-vis prior APIs, I had to make a few modifications -- namely, renaming the custom analysis function from deobX86Hook::custom_ana(self) to deobX86Hook::ev_ana_insn(self, insn), and replacing every reference to idaapi.cmd.ea with insn.ea. Also, my previous example would only run if the binary's MD5 matched a particular sum, so I copied and pasted the sum of my sample out of IDA's database preamble over the previous MD5.

From there I had to change the logic in custom_ana. The result was even simpler than my last processor module extension. Here is the logic for recognizing and deobfuscating the short form of the conditional branch obfuscation:

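import idaapi

# Inside ev_ana_insn(self, insn): inspect the raw bytes at the current address
b1 = idaapi.get_byte(insn.ea)
# 0x70-0x7F are the short conditional branch opcodes
if b1 >= 0x70 and b1 <= 0x7F:
    d1 = idaapi.get_byte(insn.ea+1)  # displacement of the first branch
    b2 = idaapi.get_byte(insn.ea+2)  # opcode of the following instruction
    d2 = idaapi.get_byte(insn.ea+3)  # displacement of the following instruction
    # Opposite conditions differ in the lowest bit; same target means d2 is d1 less 2
    if b2 == b1 ^ 0x01 and d1-2 == d2:
        # Replace first byte of first conditional with 0xEB, the opcode for "JMP rel8"
        idaapi.put_byte(insn.ea, 0xEB)
        # Replace the following instruction with two 0x90 NOP instructions
        idaapi.put_word(insn.ea+2, 0x9090)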

Deobfuscating the long form is nearly identical; see the code for details.
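For readers without the released code to hand, here is one way the long-form patch could look. This is not the code from the processor module: it is a stand-alone sketch that patches a bytearray instead of an IDA database, the name patch_long_pair is hypothetical, and the choice of "JMP rel32" (opcode 0xE9) followed by NOPs is my assumption about a reasonable replacement.

```python
import struct

def patch_long_pair(buf, off):
    """Rewrite an opposite-condition long Jcc pair at buf[off:off+12] in place
    as `jmp rel32` followed by NOPs. Return True if a pair was patched."""
    if off + 12 > len(buf) or buf[off] != 0x0F or buf[off + 6] != 0x0F:
        return False
    op1, op2 = buf[off + 1], buf[off + 7]
    if not 0x80 <= op1 <= 0x8F or op2 != op1 ^ 0x01:
        return False
    d1 = struct.unpack_from("<i", buf, off + 2)[0]
    d2 = struct.unpack_from("<i", buf, off + 8)[0]
    if d1 - 6 != d2:
        return False
    # jmp rel32 is 5 bytes, one shorter than the 6-byte Jcc it replaces, so the
    # displacement must grow by one to reach the same target.
    buf[off] = 0xE9
    struct.pack_into("<i", buf, off + 1, d1 + 1)
    # Fill the leftover byte and the whole second branch with NOPs.
    buf[off + 5:off + 12] = b"\x90" * 7
    return True

code = bytearray.fromhex("0F82D1020000" "0F83CB020000")
patch_long_pair(code, 0)
print(code.hex())  # e9d202000090909090909090
```

The one detail that differs from the short form is the displacement adjustment: the short "JMP rel8" is the same length as the short conditional branch, so there the displacement byte can be left alone.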

8. Admiring My Handiwork, Cleaning up the Database a Bit

Now I copied the processor module extension to %IDA%\plugins and re-loaded the sample. It had worked! The VM entrypoint had been replaced with:

.text:00401950 loc_401950:
.text:00401950     jmp     loc_401C27

Though the navigation bar was still largely red and ugly, I immediately noticed a large function in the middle of the text section:

NavBar-Middle.png

Looking at it in graph mode, we can see that it's kind of ugly and not entirely as nice as analyzing unobfuscated X86, but considering how trivial it was to get here, I'll take it over the obfuscated version any day. The red nodes denote errant instructions physically located above the valid ones in the white nodes. IDA's graphing algorithm includes any code within the physically contiguous region of a function's chunks in the graph display, regardless of whether they have incoming code cross-references, likely to make displays of exception handlers nicer. It would be easy enough to remove these and strip the JMP instructions if you wanted to write a plugin to do so.

Graph.png

Next I was curious about what the grey areas in the .text section navigation bar held. (Those areas denote defined data items, mixed in with the obfuscated code in the .text section.) I figured that the data held there was most likely related to the obfuscator. I spent a minute looking at the grey regions and found this immediately after the defined function:

.text:00402AE0     dd offset loc_402CF2
.text:00402AE4     dd offset loc_402FBE

; ... 30 similar lines deleted ...

.text:00402B60     dd offset loc_4042DC
.text:00402B64     dd offset loc_40434D

34 offsets, each of which contains code. Those are probably the VM instruction handlers. For good measure, let's turn those into functions with an IDAPython one-liner:

for pFuncEa in xrange(0x00402AE0, 0x00402B68, 4):
    idaapi.add_func(idaapi.get_long(pFuncEa))
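The same table walk can be done outside IDA on a raw dump of the section, which also makes it easy to check the count of 34. This is a sketch under assumptions: read_handler_table and its parameters are hypothetical names, and it expects you to supply the bytes and the address of the first byte yourself.

```python
import struct

TABLE_START, TABLE_END = 0x00402AE0, 0x00402B68  # bounds used in the loop above

def read_handler_table(image, image_base_ea):
    """Return the little-endian DWORDs stored in [TABLE_START, TABLE_END).
    `image` holds raw bytes whose first byte sits at address `image_base_ea`."""
    count = (TABLE_END - TABLE_START) // 4
    return list(struct.unpack_from("<%dI" % count, image, TABLE_START - image_base_ea))

print((TABLE_END - TABLE_START) // 4)  # 34
```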

Now a large, contiguous chunk of the navigation bar for the .text section is blue. And at this point I realized I had forgotten to create a function at the original dispatcher location, so I did that manually and here was the resulting navigation bar:

NavBar-Final.png

Hex-Rays doesn't do a very good job with any of the functions we just defined, since they were originally written in assembly language and use instructions and constructs not ordinarily produced by compilers. I don't blame Hex-Rays for that and I hope they continue to optimize for standard compiler-based use cases and not weird ones like this.

Lastly, I held PageDown scrolling through the text section to see what was left. The majority of it was VM entrypoints like those we saw in section 3. There were a few functions that appeared like they had been produced by a compiler.

So now we have assessed what's in the text section -- a VM with 34 handlers, 125+ virtualized functions, and a handful of unvirtualized ones. Next time we'll take a look at the VM.

9. Preview of Parts 2 and 3, and Beyond

After this I spent a few hours analyzing the VM entrypoint and VM instruction handlers. Next, through static analysis I obtained the bytecode for the VM program contained within this sample. I then wrote a disassembler for the VM. That's part two.

From there, by staring at the disassembled VM bytecode I was able to write a simple pattern-based deobfuscator. After that I re-generated the X86 machine code, which was not extremely difficult, but it was more laborious than I had originally anticipated. That's part three.

After that, I re-inserted the X86 machine code into the original binary and analyzed it. It turned out to be a fairly sophisticated dropper for one of two second-stage binaries. It was fairly heavy on system internals and had a few tricks that aren't widely documented, so I may publish one or more of those as separate entries, and/or I may publish an analysis of the entire dropper.

Finally, I analyzed -- or rather, still am analyzing -- the second-stage binaries. They may or may not prove worthy of publication.
 

The Synesthesia Shellcode Generator: Code Release and Future Directions

Synesthesia is an idea that I published at EkoParty last year (slides and video are available) regarding automated shellcode generation under encoding restrictions. The presentation walked through an extended tutorial on program synthesis, and showed how the idea would be implemented using those techniques. I promised to release code; this blog entry is the notification of such release, and some explanation of what the code is, what it is not, and what I hope it shall be in future releases. Here is the GitHub repository.

Limitations of current release

As envisioned in that presentation, the ideal implementation of Synesthesia is a stand-alone compiler with three modes: 1) generate shellcode under input restrictions given a specification for its behavior; 2) re-compile existing shellcode under input restrictions; and 3) encode and generate decoder loops for existing, non-encoded shellcode binary blobs. No matter which mode, in the ideal fully-automated implementation, the user should be able to write programs that dictate their requirements on the shellcode, invoke the Synesthesia compiler, and receive machine code as output. As with any compiler, Synesthesia should be a black box: to use it, the user should not have to be an expert user of SMT solvers or possess advanced education in theoretical computer science and mathematics. At present, the current implementation falls short of the goal of being an actual compiler: the process is not fully (or even largely) automated.

The current implementation of Synesthesia is a set of .ys files, scripts written in the YICES 2.x SMT solver presentation language. Each one defines an SMT query corresponding to an example given in the presentation. To obtain a result, the user must pass the .ys files into yices 2.x with the --mode=ef command-line option, and then manually interpret the results. To solve a different problem from the ones given in the presentation, the user must manually create a .ys file (perhaps using the existing ones as a template) and pass it to Yices for solving.

Lately, I have been working on (and enjoying) bringing Synesthesia closer to the ideal vision. The in-progress version is a legitimate, stand-alone compiler with its own programming language, where the machine language decoding routines are written in that language. However, it is in early development at the moment and is not yet suitable for release.

Example code

One of the most interesting files in the repository is the x86 version. It implements several of the examples from the presentation. However, understanding it may be difficult before reading the implementations for the earlier tutorial examples given in the presentation, and those examples also contain sophisticated techniques regarding loops. They are described subsequently.

The first part of the presentation walks through a simple example of synthesizing C programs. The .ys implementation can be found here. Hopefully, with the comments in the code and its short length, this should be easy to understand if you read the presentation.

Next, the presentation extends the ideas first to synthesizing assembly language programs, and then to synthesizing machine language programs. For demonstration, it uses two imaginary languages, the "Simple" assembly language and its "SimpleMC" machine code. The first example involves synthesizing the "increment" operator in Simple assembly language. That example is found here.

Shortly after in the presentation, I demonstrate how to obtain the longest or shortest program satisfying the constraints.

Much of the remaining material is dedicated to synthesizing decoder loops. The first examples involve simple loops. The next two examples synthesize complex decoders that take two bytes of input to produce one byte of output. The first example restricts the input to printable bytes; the second restricts to alphanumeric bytes.

That's all for now; I hope somebody finds it useful.

Synesthesia: Modern Shellcode Synthesis (Ekoparty 2016 Talk)

Here are the slides and (soon!) code for my recent presentation at Ekoparty 2016. The full abstract can be found below. In brief, this research involved automatically generating shellcodes when there are restrictions on legal encodings. We explore examples ranging from well-known ones (no NULL bytes, no '%' character, printable, alphanumeric, all letters uppercase, etc.) to weird and challenging ones (bytes must alternate even and odd, no duplicate bytes, all words are prime numbers, etc). We also explore automated encoding and decoder generation: e.g., given some existing shellcode, transform it into (for example) alphanumeric bytes, and generate a decoder to revert the encoding at run-time. We also explore tasks like finding the shortest or longest solutions, re-writing existing shellcodes into a given encoding, exploiting known facts about the input state, and integration with automated exploit generation.
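To make the notion of an encoding restriction concrete, here is a small illustrative checker for a few of the restrictions named above. It is not part of Synesthesia, which expresses such restrictions as SMT constraints. The names RESTRICTIONS and satisfied are hypothetical, and the definition of "printable" as the range 0x20 to 0x7E is my assumption.

```python
import string

ALNUM = set((string.ascii_letters + string.digits).encode("ascii"))

# Each restriction is a predicate over a bytes object.
RESTRICTIONS = {
    "no_null":            lambda bs: 0 not in bs,
    "printable":          lambda bs: all(0x20 <= b <= 0x7E for b in bs),
    "alphanumeric":       lambda bs: all(b in ALNUM for b in bs),
    "alternating_parity": lambda bs: all(a % 2 != b % 2 for a, b in zip(bs, bs[1:])),
    "no_duplicates":      lambda bs: len(set(bs)) == len(bs),
}

def satisfied(blob):
    """Return the sorted names of all restrictions that `blob` meets."""
    return sorted(name for name, ok in RESTRICTIONS.items() if ok(blob))

print(satisfied(b"\x31\xc0"))  # ['alternating_parity', 'no_duplicates', 'no_null']
print(satisfied(b"PYj0X"))     # ['alphanumeric', 'no_duplicates', 'no_null', 'printable']
```

Checking a finished blob like this is the easy direction. The hard direction, which the research addresses, is producing a blob that both passes the chosen predicate and has the desired behavior.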

SLIDES, CODE, AND INFORMATION

Slides: HERE
Slides with presenter notes: HERE
Video: HERE (does not include the material on decoder loop synthesis due to time constraints)
Code:  HERE
Information about the training classes mentioned in the presentation: HERE

ABSTRACT

The problems of shellcode generation and of memory corruption exploit development share a birthday. In brief, memory corruption exploits must trick a program into executing machine code ("shellcode") provided as input. Each individual exploit scenario may place constraints upon the allowable machine code bytes: NULL bytes (or any arbitrary bytes) may be disallowed; the input may be constrained to be alphanumeric; all ASCII characters may be required to be uppercase; certain characters may be filtered; the input may be transformed in arbitrary ways; the input may be required to lie within UTF-8 or UTF-16; and so on.

Historically, the security community has dealt with these problems on a case-by-case basis. Many papers were written regarding various processor architectures and some common encoding restriction schema. Generally, these publications describe patterns for performing common operations (setting registers to constants, obtaining the program counter, etc.) within the given encoding restriction. From these publications came shellcode encoding; rather than writing the entire shellcode within the encoding restriction, we encode a shellcode blob within the encoding, and generate a machine code program within that encoding to decode the blob and execute it. Shellcode encoders are useful, but they suffer from a number of issues. They expand the size of the shellcode blob, which can render an exploit unworkable. They often contain common sequences of machine code, for which IDS detections are readily available. They are not guaranteed to find an encoding and a decoder, even if one exists. In short, shellcode generation is still a real-world problem, despite the existence of shellcode encoders.

In this publication, we provide a novel technique based on program synthesis for creating machine code programs that fall within a given encoding. Essentially, Synesthesia is a compiler whose inputs are a specification of the desired functionality, together with a specification of the allowable encodings. Synesthesia enjoys a number of nice theoretical properties: it is guaranteed to find an encoding for the desired functionality if one exists; the user can search for the shortest program in terms of byte length or number of instructions; it does not rely upon pattern databases of any sort, so each run can potentially produce an entirely unique output; and it can produce self-modifying code. The ideas behind Synesthesia are not tied to any specific processor architecture, and do not require emulation, access to such a processor, or brute-forcing.

This presentation will discuss the context wherein Synesthesia exists, the concepts behind its design, case studies on more than one assembly language, a performance evaluation, and a discussion of the theoretical limitations (i.e., permanent issues) and practical ones (i.e., limitations of contemporary SMT solvers). Synesthesia shall be made available as open source.

VIEWING INSTRUCTIONS

Open the video on half of your screen and the slides on the other, switching through the slides as I do so during the video.  The presentation uses a lot of in-frame animations, so for best results, you will want to view the PDF in single-page rather than contiguous mode.  I.e., only one slide should be on-screen at a time (no fractions of subsequent slides visible), and advancing the slide should bring up an entirely new slide.  This is easy to accomplish with the full-page mode of standalone PDF viewers.  In Chrome's PDF viewer, you can use the left and right arrow keys to advance or retreat one slide at a time.  Alternatively, there is an icon bar at the bottom right of each slide.  The first two buttons from the left retreat and advance by one slide, respectively.  Failing all of these options, use a different PDF viewer.

 

 

SMT-Based Binary Program Analysis Course Sample: X86 Assembly/Disassembly

Möbius Strip Reverse Engineering is pleased to offer a representative sample of its new course offering on SMT-based binary program analysis.  The first public offering of this course is the week of September 21st, 2015 in Northern Virginia; see the link for details.

The sample is meant to demonstrate the lecture format, the nature of the code provided in the course, and the nature of the exercises done throughout the class.  Accordingly, the course sample consists of four components:

  • Lecture slides, describing every facet of the x86 instruction set, and the process of assembling and disassembling it.
  • Python code for an entire x86 assembler and disassembler suite.  It is heavily documented, with accompanying external HTML documentation, and thorough test suites.
  • The implementation manual, which describes the design philosophy behind each module of the code.  Additionally, the first chapter is an introduction to Python to assist those familiar with other programming languages.
  • The exercise manual, which contains two varieties of exercises:
      1. Hand-written exercises involving encoding and decoding X86 instructions and their operands.
      2. Programming exercises involving completing portions of the Python code.  As with all programming exercises in the course, the student can check their progress by running the provided, comprehensive test suites.

We at Möbius Strip Reverse Engineering hope you find this material educational and valuable.

Memory Lane: Hacking Renovo

Renovo is an automated "unpacking" tool developed by BitBlaze at UC Berkeley.  The notion behind Renovo is that packers frequently encrypt and/or compress regions of code at the time of protection, and decrypt/decompress these regions while the packer executes.  The Renovo paper terms these regions "hidden code", and the goal of Renovo as a system is to retrieve the hidden code regions generated throughout packer execution.

(Note that Renovo is not what we might consider a truly automated unpacker, as it does not attempt to reconstruct working executables.  I.e., protections regarding imported symbols are not resolved, and other forms of protection such as virtualization obfuscation are ignored.)

Renovo is built atop QEMU, and performs a watered-down form of dynamic taint analysis.  Namely, every time the packer code writes to memory, the written addresses are considered "dirty", with such information being recorded in a table.  Then, for every instruction executed throughout the course of packer execution, Renovo queries the dirty-address table to determine whether the instruction's address has previously been overwritten.  If this is the case, Renovo considers this moment in time as beginning the execution of "hidden code".  It makes a note of the event, and dumps the surrounding dirty regions.  This simple technique is very effective in tracking execution within memory regions that have previously been written.
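
The write-then-execute check described above can be modeled in a few lines.  This is only a toy sketch of the idea, not Renovo's actual implementation: the class name, the callbacks, and the byte-granular set are assumptions made for illustration.

```python
class HiddenCodeTracker:
    """Toy model of the dirty-address table (hypothetical names throughout)."""

    def __init__(self):
        self.dirty = set()   # byte addresses written so far
        self.events = []     # addresses where previously written code ran

    def on_write(self, addr, size):
        # Every byte touched by a memory write becomes "dirty".
        self.dirty.update(range(addr, addr + size))

    def on_execute(self, addr):
        # Executing a dirty address marks the start of "hidden code".
        if addr in self.dirty:
            self.events.append(addr)
            return True
        return False
```

A real system would also dump the surrounding dirty region when on_execute() fires; the sketch only records the event.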

For a brief period of time, the folks at BitBlaze put Renovo online for public evaluation.  It had a web interface allowing the user to upload malicious binaries.  The system then ran the binaries through Renovo, collected all of the hidden code regions gathered on a particular run, and emailed the results to the user.

Since I am a reverse engineer, I could not resist the temptation to screw around with this system.  In particular, I wanted to know whether there was any secret sauce running inside of the emulated environment (beyond the modifications to QEMU).  The nature of the public demonstration allowed me to run code of my choosing within the Renovo environment.  I.e., I could enumerate the file system, the drivers, registry keys of my choosing, and so on.  But how was I going to exfiltrate the results?

After some thought, I realized that if I could turn the data into code and execute it, Renovo would happily email it back to me, because that is exactly what it was designed to do.  

In particular, suppose I wish to exfiltrate the contents of some buffer filled with reconnaissance data.  First, allocate RWX memory of an appropriate size.  Now, let's consider our data buffer as a collection of 32-bit integers.  Take the first integer, dw0, and use it to create the instruction "add eax, dw0", i.e., "05 XX YY ZZ WW", where dw0 is 0xWWZZYYXX.  Repeat this process for all integers within the buffer.  At the end, write a "retn" instruction, i.e., 0xC3.  Now execute this piece of freshly-generated code.  

Renovo will detect this as "hidden code" executed by the process, and send the entire piece of allocated memory back to me.  From there, it is a simple matter to strip out the "05" bytes (corresponding to "add eax, dword") and the trailing "C3" byte ("retn"), and reconstruct the original data buffer.  The code looked roughly as follows:

 

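#include <windows.h>
#include <string.h>

// Emit "ADD EAX, imm32": opcode 0x05 followed by the 32-bit value.
#define DWORD_AS_CODE(ptr,value) {\
  *ptr++ = 0x05;\
  *(unsigned int *)ptr = (value);\
  ptr += 4;}

typedef void (*fptr)(void);

void exfiltrate(unsigned int len, const char *buf)
{
  // Round the length up to whole DWORDs.
  unsigned int ndwords = (len + 3) >> 2;
  // Five code bytes per DWORD (plus one DWORD for the length), and one byte for RETN.
  unsigned char *exfil = (unsigned char *)VirtualAlloc(0, (ndwords + 1) * 5 + 1,
    MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
  fptr fp = (fptr)exfil;

  if(!exfil)
    return;

  DWORD_AS_CODE(exfil,len);

  for(unsigned int i = 0; i < ndwords; ++i) {
    unsigned int dw = 0, n = len - i*4;
    memcpy(&dw, buf + i*4, n < 4 ? n : 4); // zero-pad a final partial DWORD
    DWORD_AS_CODE(exfil,dw);
  }

  *exfil = 0xC3; // RETN

  fp();
}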

It worked like a charm.  I played around dumping various things out of the virtual environment until I got bored and emailed the project maintainers with my findings, who promptly took the service offline and never brought it back.
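
For the receiving side, stripping the "05" and "C3" bytes from the returned dump might look something like the following.  This is a sketch under assumptions: the function names are invented here, the first DWORD is taken to be the buffer length (as in the C code above), and any bytes after the "retn" are assumed to be allocation padding that can be ignored.

```python
import struct

def encode_as_code(data: bytes) -> bytes:
    """Wrap data as 'add eax, imm32' instructions followed by 'retn'."""
    padded = data + b"\x00" * (-len(data) % 4)
    payload = struct.pack("<I", len(data)) + padded  # length DWORD first
    out = bytearray()
    for (dw,) in struct.iter_unpack("<I", payload):
        out += b"\x05" + struct.pack("<I", dw)
    return bytes(out) + b"\xC3"

def decode_from_code(blob: bytes) -> bytes:
    """Recover the original buffer from a dump of the generated code."""
    dwords = bytearray()
    i = 0
    # Walk the 5-byte instructions until something other than 05 appears.
    while i + 5 <= len(blob) and blob[i] == 0x05:
        dwords += blob[i + 1:i + 5]
        i += 5
    if i >= len(blob) or blob[i] != 0xC3 or len(dwords) < 4:
        raise ValueError("dump does not look like the generated code")
    (length,) = struct.unpack_from("<I", dwords, 0)
    return bytes(dwords[4:4 + length])
```

Round-tripping a buffer through encode_as_code() and decode_from_code() is a quick way to check the layout before sending anything through the sandbox.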

Transparent Deobfuscation with IDA Processor Module Extensions

The previous blog entry touched on a form of compile-time protection that complicated static and dynamic analysis. Namely, the protection replaced control-transfer instructions with privileged instructions, followed by data indicating to which location to transfer control.  At run-time, attempting to execute those privileged instructions raises an exception. The exception handler, in turn, catches the exceptions, performs the intended transfers, and resumes execution. The resulting disassembly listing is difficult to read in several capacities.

The previous entry attacked the obfuscation to some extent, but we can do more. IDA processor module extensions are a perfect match for this problem. We can essentially trick IDA into thinking that the obfuscated instructions are their unobfuscated originals, so that the static analyst can read the disassembly listing (and use all of IDA's and Hex-Rays' functionality) as though the code had never been obfuscated.  The processor module extension route provides a seamless, slipstream implementation that integrates directly into IDA's analysis facilities, thereby performing deobfuscation as the code is disassembled.  Fast, cheap, and good:  it turns out that you can have all three.

Control Transfer Obfuscation

The previous blog entry described one privileged instruction employed by the protection, and a technique for mitigating its deleterious effects on analysis. In reality, the protection employs three different privileged instructions, enumerated below. The first two use 16 bits' worth of data following the privileged instruction to describe the address to which control should be transferred. The third one uses an immediate constant within the instruction as an index into a function pointer table.

  • in eax, dx / dw XXYYh => call 0x405000+XXYYh, as discussed previously.
  • in al, dx / dw XXYYh => call 0x405000+XXYYh, where the code bytes at the destination are encrypted, and must be decrypted prior to execution.
  • in al, XXh => call dword ptr [4011ACh+XX*4].
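
The three mappings above are mechanical enough to express as a small decoder.  The following is a sketch rather than the plugin's code: the function name and the returned tuple are invented for illustration, the two constants are the ones quoted in the list, and the opcode bytes (0xED, 0xEC, 0xE4) are the standard x86 encodings of the three "in" forms.

```python
import struct

CODE_BASE = 0x405000    # base added to the 16-bit word (from the list above)
FPTR_TABLE = 0x4011AC   # function pointer table (from the list above)

def decode_transfer(code: bytes):
    """Return (length, kind, address) for an obfuscated transfer, else None."""
    if not code:
        return None
    op = code[0]
    if op in (0xED, 0xEC) and len(code) >= 3:
        # "in eax, dx" / "in al, dx" followed by a little-endian word.
        (word,) = struct.unpack_from("<H", code, 1)
        kind = "call" if op == 0xED else "call_encrypted"
        return 3, kind, CODE_BASE + word
    if op == 0xE4 and len(code) >= 2:
        # "in al, imm8": the immediate indexes the function pointer table.
        return 2, "call_indirect", FPTR_TABLE + code[1] * 4
    return None
```

For example, the bytes ED E0 4E decode to a three-byte direct call to 0x405000 + 0x4EE0 = 0x409EE0.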

This form of protection poses an obvious hindrance to reverse engineering and automated analysis: the control-transfer instructions no longer exist, having been replaced by smaller, privileged instructions. As a result, the static analyst must resolve the original branch destinations manually. 

Introduction of Bogus Instructions

The technical specifics of the protection mechanisms provide other irritations for static analysis. Upon decoding an instruction, most disassemblers will employ some logic to determine which address(es) should be decoded next. The typical logic is that, in the case of ...: 

  1. A conditional jump or call, both the targeted address and the address following the jump or call should be decoded. 
  2. An unconditional jump, only the targeted address should be decoded. 
  3. A return instruction, no further instructions should be decoded.
  4. Any other instruction, the address following the present one should be decoded.
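
The four cases above amount to a small successor function.  This is a simplified sketch of the typical logic, not any particular disassembler's implementation; the instruction-kind strings are made up, and branch targets are assumed to be statically known.

```python
def next_addresses(kind, addr, length, target=None):
    """Addresses a disassembler would queue after decoding one instruction."""
    fallthrough = addr + length
    if kind in ("jcc", "call"):
        return [target, fallthrough]   # case 1: both paths
    if kind == "jmp":
        return [target]                # case 2: target only
    if kind == "ret":
        return []                      # case 3: nothing further
    return [fallthrough]               # case 4: next sequential address
```

Under this logic, a one-byte instruction of the "other" kind at address 0x1000 always queues 0x1001 for decoding, whatever bytes happen to be stored there.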

Since the privileged instructions fall under category #4 above, the disassembler will assume that the address following the instruction contains code and should be decoded. However, with this protection scheme, the address following the privileged instructions may contain data, and hence decoding such data will produce bogus instructions. x86's variable-length instruction encoding scheme magnifies the effects of this problem. When the data decodes to an instruction that is more than two bytes in length, the disassembler will miss valid instructions that begin after the data.

Altogether, the result is a messy disassembly listing that does not reflect which instructions actually execute at run-time. The following figure illustrates the problems discussed above. On the left, we see the obfuscated disassembly listing, with the obfuscated control transfers and bogus instructions indicated. On the right, I have manually cleaned up the listing, indicating the proper control transfers. Each of the types of privileged instructions already described is represented within the figure. Additionally, we see the occluding effects of the bogus instructions: at address 0040824A on the left, there is a three-byte instruction, causing the disassembler to miss the valid instruction at address 0040824C shown on the right, and producing the bogus instructions at addresses 0040824D and 0040824E on the left.

Encrypted Code Regions

The first variety of control-transfer obfuscation, listed above, merely masks calls to functions within the module. For the second variety, the code to which control is transferred is actually encrypted within the binary. The exception handler is responsible for allocating executable memory, decrypting the code, copying it into the allocated memory, and transferring execution there.

The "encryption" employed is more tedious than interesting. It consists of merely permuting and/or incrementing bytes within 8-byte blocks of the function's code. The permutation is controlled by a key, allowing each block to be permuted individually. Each such encrypted code region is preceded by 32 bits' worth of metadata:

  1. The length of the encrypted code (as a 16-bit integer),
  2. The key to use for permutation (as a byte),
  3. An unused byte.

It is a simple matter to replicate the permutation logic and write a function implementing it.
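
Reading the metadata is the easy part and can be sketched as follows.  The field order follows the list above; the little-endian byte order (natural for x86) and the names used here are assumptions.  The permutation itself is not shown, since its details are not spelled out in this entry.

```python
import struct
from typing import NamedTuple

class RegionHeader(NamedTuple):
    length: int   # length of the encrypted code, in bytes
    key: int      # key byte controlling the permutation

def parse_region_header(raw: bytes) -> RegionHeader:
    """Parse the 4 metadata bytes that precede an encrypted region."""
    # <HBB: 16-bit length, key byte, unused byte (assumed little-endian).
    length, key, _unused = struct.unpack_from("<HBB", raw, 0)
    return RegionHeader(length, key)
```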

IDA Processor Module Extensions

Despite the hassles it causes IDA and the reverse engineer, the obfuscation employed by this binary is particularly easy to bypass automatically. After familiarizing ourselves with the obfuscation scheme, we as humans can recognize that every time we encounter one of the three privileged instructions in the disassembly listing, we know to which address control is being transferred, and the nature of the transfer. And since IDA's disassembler logic is extensible by plugins, we can write a short piece of code to perform this recognition on our behalf, and automatically alter the listing such that the obfuscation no longer appears.

IDA processor module extensions allow plugin code to take control of the disassembler logic before the ordinary processor module has a chance to do so, in a manner similar to how filter drivers operate. In particular, IDA processor modules are implemented largely via callbacks that the IDA kernel invokes while disassembling a given binary. Processor module extensions can register callbacks that execute before the original processor module's. They can choose to either handle the events presented by the IDA kernel, or pass them on to the original processor module. 

IDA processor modules are complex, but for the purpose of deobfuscating this binary, we only need to talk about the callback responsible for decoding instructions (namely, the ana() callback). That function, which is invoked when the kernel needs to decode an instruction:

  1. Consumes bytes from the instruction stream,
  2. Decodes the bytes to determine the specifics of the instruction and its operands,
  3. Sets fields inside of IDA's global cmd structure to represent the instruction.

For more information on IDA processor module construction, see an old article of mine on VM deobfuscation (particularly appendix B), Chris Eagle's IDA Pro book, or this article.

Deobfuscation

Fortunately for us, the processor module extension mechanism is available in IDAPython. All we have to do is derive a class from idaapi.IDP_Hooks and hook ana() by implementing the custom_ana() method. The logic is trivial. We fetch a byte from the address at which the IDA kernel is requesting disassembly.  If the byte is...:

  • 0xED, this corresponds to the "in eax, dx" instruction, which is used to obfuscate direct call instructions. We consume the word following that byte, determine the call destination, and set up the cmd structure as though the instruction were "call dest_addr".
  • 0xE4, this corresponds to the "in al, imm8" instruction, which is used to obfuscate indirect call instructions. We consume the following byte, determine which function pointer is being called, and set up the cmd structure as though the instruction were "call [dest_addr]".
  • 0xEC, this corresponds to the "in al, dx" instruction, which is used to obfuscate direct call instructions to encrypted code regions. First, we consume the word following that byte to determine the call destination. Next, we need to decrypt the code regions and patch the database to reflect the decrypted bytes. Some care needs to be taken here so that we do not decrypt the same region twice. We make use of IDA's persistent storage features, called netnodes, to attach a marker to addresses that we've already decrypted. When we encounter this variety of obfuscated instruction, we check to see whether we've decrypted the bytes at the destination address already. If not, we decrypt the region and set the marker for the address. Finally, we set up the cmd structure as though the instruction were "call dest_addr". 
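
The "do not decrypt the same region twice" bookkeeping from the last case can be illustrated outside of IDA.  In this sketch a plain Python set stands in for the netnode markers, and decrypt_region is a hypothetical callback; neither is taken from the plugin itself.

```python
class DecryptOnce:
    """Remember which destinations have already been decrypted."""

    def __init__(self, decrypt_region):
        self._decrypt = decrypt_region   # hypothetical callable(dest_addr)
        self._done = set()               # stand-in for netnode markers

    def ensure_decrypted(self, dest):
        # Returns True only when this call actually performed the decryption.
        if dest in self._done:
            return False
        self._decrypt(dest)
        self._done.add(dest)
        return True
```

Unlike netnodes, the set does not persist across sessions; persistence is what keeps a reopened database from being decrypted a second time.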

The resulting IDAPython processor module extension (password: "malware") is less than 100 lines of code, the majority of which is the logic for creating the proper instructions and decoding encrypted regions. To use the plugin, simply copy the .py file to %IDA%\plugins\.

Addendum

Though the Python code may look simple, some complexity lurks near the setting of the processor module-specific fields cmd.specflags and cmd.Op[N].specval.  For x86, many details can be found in the SDK's intel.hpp.  Should you find yourself wanting to replicate this method upon another binary, you might run into weird issues with respect to the output disassembly listing. Igor Skochinsky imparted a good debugging tip:  find the type of instruction you want to replicate in a "clean", ordinary binary, dump its insn_t/op_t representations, and ensure that your replacements resemble the "clean" instructions.  If you encounter bugs (especially related to cross-references or the display of the instruction/operands), they probably stem from deviations in these structures.  I have provided Debug.py in the archive linked above, a (trivial) script implementing Igor's suggestion that I used for debugging.

Detours into Arcana: A Static Malware Analysis Trick

Several friends asked for my advice on a particular technique seen in some recent malware samples, which is similar in nature to the Nanomites feature of the Armadillo copy protection.  There are three parts to this technique:

  1. When creating the malware binary, replace control transfer (call and jmp) instructions with privileged instructions.
  2. At run-time, install an exception handler.
  3. Catch the exceptions thrown by the privileged instructions, perform the intended control transfers, and resume execution.

A concrete example follows.

UPX0:004081F7 in eax, dx
UPX0:004081F8 dw 4EE0h

When the process attempts to execute this instruction, an exception is generated.  The exception handler checks to see that the faulting instruction is "in eax, dx", then reads the word following the instruction, and generates a call to location 0x405000 + 0x4EE0.  In other words, call instructions within the module are replaced by:

in eax, dx
dw call_destination - 0x405000

As malware analysts, we would like to deobfuscate the binary by replacing these privileged sequences with the original call instructions, both for our own sake as well as the ability to use the Hex-Rays decompiler (which otherwise balks at the use of privileged instructions).  However, the particular implementation within this sample poses a slight conundrum.  The sequence "in eax, dx / dw XXXXh" is three bytes long, whereas the original "call 409EE0h" is five bytes.  Therefore, we cannot merely rewrite the original instruction atop the privileged one without overwriting subsequent instructions.

A second idea is to use detouring, another staple technique in reverse engineering.  We could find or create some unused space within the binary, patch a jmp from the privileged instruction to that new location, recreate the original instruction at that location, and then patch a jmp back to the location after the privileged instruction.  However, this idea is flawed for the same reason:  a long jmp instruction is five bytes long, so we would also alter subsequent instructions.

Bill Gates questionably said "I choose a lazy person to do a hard job, because a lazy person will find an easy way to do it."  After some thought, I recalled a bit of x86 processor arcana that can help us fit our detours into the three-byte spaces provided by the obfuscator:  the address-size prefix, 0x67.  Quoth the Intel manuals:  "Address calculations are first truncated to the effective address size of the current mode, as overridden by any address-size prefix.  The result is then zero-extended to the full address width."  I.e., when operating in 32-bit mode, if we prefix an instruction that references an address with 0x67, the address will be truncated to 16 bits.  

To be specific, consider the following:

UPX1:00410320 EB 00 jmp near ptr unk_410322 
; jump to 0x00410320 + 2(length of jmp instruction) + 0x00 

When we place an address-size prefix on this instruction, we get:

UPX1:00410320          db 67h
UPX1:00410320 67 EB 00 jmp near ptr unk_323 
; jump to (0x00410320 + 3(length of jmp instruction) + 0x00) & 0xFFFF

To use this trick for deobfuscation, we must first create a segment at address 0x0 of length 0xFFFF.

Recalling our motivating example:

UPX0:004081F7 in eax, dx ; obfuscated call 0x405000 + 0x4EE0
UPX0:004081F8 dw 4EE0h 

Let us overwrite these bytes with 67 EB FD:

UPX0:004081F7 db 67h
UPX0:004081F7 jmp short loc_81F7

This is the first half of our detour.  Now, at location 81F7h, let's replicate the original control transfer instruction and add a jmp back to the location after the obfuscated sequence:

ROLFMSRE:000081F7 loc_81F7:
ROLFMSRE:000081F7 call    sub_409EE0
ROLFMSRE:000081FC jmp     loc_4081FA

And now we have accomplished our goal.  Once we have replaced the obfuscated control transfers, we human analysts can read the listing more easily, and Hex-Rays no longer has trouble decompiling the code.
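
The arithmetic behind the detour can be checked with a short script.  This is a sketch with invented helper names; it follows the description above of how the prefixed jump's target is truncated to 16 bits, and it assumes the stub bytes will be written into the segment created at address 0x0.

```python
import struct

def rel32(opcode, src, dst):
    """Encode a 5-byte near call (E8) or jmp (E9) at src targeting dst."""
    return bytes([opcode]) + struct.pack("<i", dst - (src + 5))

def build_detour(site, call_target):
    """Return (patch, stub_addr, stub) for a 3-byte obfuscated call at site."""
    patch = bytes([0x67, 0xEB, 0xFD])   # prefixed "jmp short", displacement -3
    stub_addr = site & 0xFFFF           # (site + 3 - 3) truncated to 16 bits
    stub = rel32(0xE8, stub_addr, call_target)     # the original call
    stub += rel32(0xE9, stub_addr + 5, site + 3)   # jmp back past the patch
    return patch, stub_addr, stub
```

For the motivating example, build_detour(0x4081F7, 0x409EE0) places the stub at 0x81F7, with the return jump targeting 0x4081FA.  Each stub occupies ten bytes, so the sketch does not handle two obfuscated sites whose low 16 bits lie within ten bytes of each other.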

Trivia Questions for X86 Nerds

Googling, referencing the Intel manuals, and using a debugger are all discouraged.  Please don't post the answers in the comments!

  1. Name two instructions that have a memory expression for an operand, but do not access memory.
  2. Conditional jumps with 16-/32-bit displacements were not available on the 8086. How did compilers generate long conditional jumps back then?
  3. For ModRM-32 memory expressions (such as dword ptr [eax], byte ptr [eax+ebx], word ptr [eax+ebx*4], qword ptr [ebx*8]), what are the rules for determining the segment against which the address is applied?  What about ModRM-16 memory expressions (like [bx+si])?
  4. The instruction "bswap r32" endian-swaps the specified 32-bit register.  I.e., if eax = 12345678h, after executing bswap eax, eax = 78563412h.  The behavior of "bswap r16" (i.e., bswap ax) is undefined as per the Intel manuals.  Name a behavior exhibited by an actual processor when "bswap r16" executes.
  5. Name two single-byte, undocumented instructions, and describe their behavior.
  6. Name a circumstance under which the "aam" instruction can fault.
  7. Name an instruction that writes to memory in some specific segment, where the segment cannot be overridden by a segment prefix.
  8. The "nop r/m32" instruction (e.g., "nop [eax]"), introduced in the Pentium Pro series of processors, behaves identically to the "nop" instruction which has been present since the original 8088.  Why does the Pentium Pro instruction exist?
  9. For SSE instructions with mandatory prefixes (66/F2/F3), what happens if you put two such prefixes on an instruction?
  10. Name a 32-bit instruction that is not encodable in 64-bit mode due to its assimilation into the VEX prefix schema.
  11. "mov eax, [reg32]" is an invalid instruction (i.e., cannot be encoded) for which general-purpose 32-bit register (eax, ebx, ecx, edx, esp, ebp, esi, edi)?
  12. Comparing "inc eax" and "add eax, 1", what is the difference in processor state (i.e. the registers, flags, and memory, without considering EIP) after execution?
  13. Name a register that existed before the Pentium series, and ceased to exist beginning with the Pentium series.
  14. What happens when you put an address size (67) prefix on a conditional jump?
  15. "movsb" implicitly references two memory operands, ds:[esi] and es:[edi].  What happens when you put a segment prefix on this instruction?
  16. The "bit-scan in reverse" instruction, "bsr eax, ebx", sets eax to the bit number of the most significant 1-bit set within ebx.  If ebx is zero, the value placed into eax is undefined as per the Intel manuals.  Name a behavior exhibited by an actual processor when executing "bsr" with a right-hand side of zero.
  17. Arithmetic comparison operations are not commutative.  I.e., "cmp eax, ebx" is not the same as "cmp ebx, eax".  In the instruction "cmpxchg cl, bl", which comparison is performed?
  18. In terms of processor state, is "rol al, 0" the same as "rol al, 8"?
  19. The auxiliary carry flag (AF) is similar to the carry flag (CF), albeit for X-bit quantities instead of 8/16/32/64. What is X?
  20. Apart from "pushf" and "lahf", name an instruction that uses the value of the AF flag (as opposed to merely defining AF without using it).
  21. "shld r32, r/m32, r/imm8" shifts bits from the second operand into the first operand (from the right, i.e., the bottom).  For example, if eax = 0x40000001, edx = 0x80000000, and cl = 1, after executing "shld eax, edx, cl", eax = 0x80000003.  The shld instruction behaves analogously for 16-bit operands, but its behavior is undefined as per the Intel manuals if the shift count (third operand) specifies a shift of more than 16.  Name a behavior exhibited by an actual processor when "shld ax, dx, cl" executes with 0x10 <= cl < 0x20.
  22. After executing "shl eax, 32", is the overflow flag (OF) defined?  If so, what is its value?
  23. After executing "shl ax, 16", is the overflow flag (OF) defined?  If so, what is its value?
  24. In terms of processor state, is there any difference between: "btc eax, ebx" and "push eax / btc [esp], ebx / pop eax" (apart from the values of EIP and dword ptr [esp-4])?
  25. In 16-bit real mode, segments are 64KB windows into a 1MB address space.  This coincides with the range of a 16-bit near call or near jump.  Name a strategy that 16-bit linkers employ to allow seamless merging of the control flow between object files whose combined code size exceeds 64KB.

Program Synthesis in Reverse Engineering

The slides and video for my keynote speech, "Program Synthesis in Reverse Engineering", are now online.  

Abstract

Program synthesis is an academic discipline devoted to creating computer programs automatically, given a precise specification of how the program should operate.  It works on small scales and is mostly researched for programs without loops in them.  We apply and adapt existing academic work in program synthesis to solve problems in reverse engineering.

  1. Semi-automated synthesis of CPU emulators (academic inspiration here)
  2. Automated generation of deobfuscators for peephole-expansion obfuscators (academic inspiration here)
  3. Reconstruction of obfuscated, metamorphic code sequences (academic inspiration here)

Viewing Instructions

Open the video on half of your screen and the slides on the other, switching through the slides as I do so during the video.  The presentation uses a lot of in-frame animations, so for best results, you will want to view the PDF in single-page rather than continuous mode.  I.e., only one slide should be on-screen at a time (no fractions of subsequent slides visible), and advancing the slide should bring up an entirely new slide.  This is easy to accomplish with the full-page mode of standalone PDF viewers.  In Chrome's PDF viewer, you can use the left and right arrow keys to advance or retreat one slide at a time.  Alternatively, there is an icon bar at the bottom right of each slide.  The first two buttons from the left retreat and advance by one slide, respectively.  Failing all of these options, use a different PDF viewer.

Program Analysis Course

Information about the new course described at the end of the talk can be found here.  At present, for logistical reasons, only private on-site offerings can be accommodated.  I hope to install a web application to help track demand (availability in time and in region) to schedule public courses.