Weekend Project: A Custom IDA Loader Module for the Hidden Bee Malware Family

Here's a half-day project that I did this weekend for my own edification. Perhaps someone will benefit from the source code in the future.

While reading hasherezade's research on the Hidden Bee malware family's custom file format (samples here), I was struck with the thought that this use-case seemed particularly well-suited for an IDA custom loader module. The IDA loader module approach has a few advantages over the previous approach: it's fully automated, requiring no additional programs, plugins, or scripts; the imports have proper names and type information, allowing IDA's ordinary P.I.T. algorithms to propagate the information; and the user can relocate the database to an arbitrary base address.


Given that custom loaders are the only variety of IDA plugin that I haven't yet written, this seemed like a nice small-scope project for the weekend to round out my knowledge. My very minor contribution with this entry is the IDA custom loader for the Hidden Bee format, which can be found on my GitHub. The IDAPython code requires that Ero Carrera's pefile module be installed, say via pip. 

Hidden Bee

In brief, the Hidden Bee malware family distributes payloads in a customized file format, which is a majorly stripped-down version of the PE file format. You can see all of the details in hasherezade's write-up. I did no original malware analysis for this project; I merely read her blog entry, figured out how to convert the details into a loader plugin, and then debugged it against the sample links she gave. As usual, Chris Eagle's The IDA Pro Book, 2nd Edition was useful. Some details about the loader API have changed with the IDA 7.x API port, but Hex-Rays' porting guide was informative, and the loader examples in the IDA 7.1 SDK have also been ported to the newest API.

IDA Loader Modules in Brief

An IDA loader module is simply an IDA plugin with a well-defined interface. IDA loader modules will be called when loading any file into IDA. They have two primary responsibilities: 

  1. Given access to the bytes of a file, determine whether the file is of a format that the loader module can handle. Every IDA loader module must export a function named accept_file for this purpose. This function returns 0 if it can't recognize the file format, or a non-zero value if it can.
  2. If the file type can be loaded by the module, and the user chooses to use this module to load the file, perform the actual loading process e.g. creating segments within the IDB, copying bytes out of the file into the segments, processing relocations, parsing imports, adding entrypoints, and so on. Every IDA loader module must export a function named load_file for this purpose.

Both of these functions take as input an "linput_t *" object that behaves like a C FILE * object, which supports seeking to specified positions, reading byte arrays out of the file, and so on. Since Hidden Bee's format includes relocations, I chose to implement a third, optional IDA loader module function: move_segm. This function will be called by the IDA kernel when the user requests that the database be relocated to another address.

Writing a Loader Module for Hidden Bee

After reading the aforementioned write-up, I figured that the only difficulties in loading Hidden Bee images in IDA would be A) that the Hidden Bee customized header specifies API imports via hash rather than by name, and B) that it includes relocation information. Relocations and import lookup via hash are simple enough conceptually, but the precise details about how best to integrate them with IDA are not obvious. Sadly, I did not feel confident in these tasks even after reading the loader module examples in the SDK. Four out of the five hours I spent on this project were reverse engineering %IDADIR%\loaders\pe.dll -- the loader module for the PE file format -- focusing in particular on its handling of relocations and imports. As expected, the results are idiosyncratic and I don't expect them to generalize well. 


For dealing with the imports by hash, hasherezade's toolchain ultimately generates a textual file with the addresses of the import hash names and their corresponding plaintext API string. Then, she uses one of her other plugins to create repeating comments at the addresses of the import hash DWORDs. Instead, I wanted IDA to show me the import information the same way it would in a normal binary -- i.e., I wanted IDA to set the proper type signature on each import. I figured this might be difficult, but after a few hours reverse engineering the virtual functions for the pe_import_visitor_t class (partially documented in %IDASDK%\ldr\pe\common.hpp), it turns out that all you have to do to trigger this functionality is simply to set the name of the DWORD to something from a loaded type library.

Here's a screenshot showing IDA successfully applying the type information to the APIs:



For the IMAGE_REL_BASED_HIGHLOW relocations common in PE files, each can ultimately be processed via straightforward translation of the relocation information into IDA's fixup_data_t data structures, and then passing them to the set_fixup API. The SDK examples did not give a straightforward idea of what I needed to do to handle PE IMAGE_REL_BASED_HIGHLOW relocations properly, so I reverse engineered pe.dll to figure out exactly what needed to happen with the relocations. (Fortunately, reverse engineering IDA is trivial due to the availability of its SDK.) If you wish, you can see the results in the do_reloc function. Don't ask me to explain why it works; however, it does work.

Here's a before and after comparison of rebasing the database from base address 0x0 to base address 0x12340000. Note particularly that the red underlined bytes change. Before:




The Atredis BlackHat 2018 CTF Challenge

This post covers my solution to the Atredis BlackHat 2018 challenge, for which I won second place and a ticket to BlackHat. I'd like to express my gratitude to the author, the increasingly-reclusive Dionysus Blazakis, as well as Atredis for running the contest.

Initial Recon

As you can see from the screenshot in the tweet linked above, and reproduced below, once you connect to the network service on arkos.atredis.com:4444, the system prints information about the firmware and hardware version, as well as the memory map. Following that is the message-of-the-day, which contains a reference to 8-bit computing. 


It also prints the contents of an email, which makes reference to an attachment having been written to disk. It seems like the immediate goal of the challenge is clear: retrieve the attachment from disk.


At about 8:30PM, I logged into the system for the first time. The challenge had been released at 2PM and I figured many people would have already solved it. Not knowing what to do at the prompt, I typed "help". The system replied by printing information about two commands: "help" and "read", with an example of how to use "read":

'read F400' reads the byte at $F400

I ran that command and it came back with the value 31, or '1' in ASCII. I ran the command for a few adjacent addresses and this location seemed to contain the hardware version string "1.0.3 rev A", part of what was printed initially in the first messages upon connecting to the service.

Dumping the Firmware

At first blush, there wasn't much more to do than read bytes off of the system at specified addresses. However, this challenge was not my first rodeo in the embedded reverse engineering space. I was immediately reminded of an exploit I once wrote for a SOHO router, where, once I obtained its memory map, I used an arbitrary read vulnerability to dump its memory contents so I could further analyze the software in IDA. I decided to do something very similar here, minus the need for an arbitrary read vulnerability.

Although I don't like Python much as a programming language, I do have to credit it for having an absurdly large standard library. In particular, while previously writing the aforementioned exploit, I made use of Python's Telnetlib module to automate interaction with the router. Nothing seemed to be stopping me from doing the same thing in this situation, so I spent about 10 minutes writing a 30-or so line Python script to log into the device, repeatedly send "read" commands, parse the resulting output, and save the results to a binary file. That functionality combined with the memory map printed by the device upon connection was all that was needed. You can find the dumped memory as .bin files here.

My script took nearly four hours to dump the memory. I don't know how much of that was related to the crappy Wi-Fi at my hotel, and how much had to do with other contestants hammering the server. Nevertheless, by the time I had the memory dump, it was 12:15AM and I had a business engagement in the morning. I needed to finish quickly so I could sleep.

Inspecting the Firmware

I began by briefly inspecting the dumped memory contents in a hex editor. The three regions which, in total, encompassed the range 0x0000-0x1FFF, were entirely zero. Only a few bytes within 0xF000-0xFEFF were non-zero. In the range 0xFF00-0xFFFF, only the final three words were non-zero.

The most interesting dumped memory range was 0x4000-0xEFFF. It began with the strings that the server printed upon connection, as well as others that had not been printed. The most interesting strings were "write" and "call", which seem like they might have been commands that the user could execute on the system. After these strings, the memory dump had zeroes until address 0x8000.


At address 0x8000, there were 0x542 bytes of unknown binary data, with the remainder of the region being zeroes. Now that I've inspected the entire memory space, if this thing has any code in it at all, it must be the bytes at 0x8000-0x8542. The only other non-zero, unknown data is the few sporadic, isolated bytes previously mentioned in the range of 0xF000-0xFFF.

I connected to the system again and tried executing the "call" command I had discovered in the strings. I provided it the address 8000, which seemed to be the beginning of the code in the memory image. The thing printed:

JSR $8000

Apart from that, nothing happened. Next, I executed "call 7FFF" and the system reset. I took that as a positive sign.

Determining the Architecture

At this point, I did not know what architecture I was dealing with. I had two hints. First, the message-of-the-day string made reference to 8-bit computing. Second, the "call" command had printed the string "JSR", which on some architectures is a mnemonic for "jump to subroutine" (i.e., the same functionality known as "call" on x86). The best logical guess right now is that we are dealing with an 8-bit architecture where the mnemonic for the call instruction is "JSR".

I wish I could tell you I used a systematic procedure in searching for such an architecture, but I would be lying. In retrospect, I could have loaded %IDASDK%\include\allins.hpp and searched for "jsr". The string occurs 32 times in that file, which would have given me a pile of possibilities:

  • Angstrem KR1878
  • DEC PDP-11
  • Freescale M68HC12 or M68HC16 or HCS12 
  • Java bytecode
  • Mitsubishi 740 or 7700 or 7900
  • MOS M65 [e.g. 6502] or M65816
  • Motorola 6809 or 68000
  • Motorola DSP56000
  • Panasonic MN102
  • Renesas H8, H8500, SuperH, or M16C
  • Rockwell C39

Instead what I ended up doing was searching Google for things like "assembly jsr". For any architecture suggested by one of my search results, I loaded the .bin file for the memory at 0x4000, changed the processor type to the one I was testing, and tried to disassemble the file using the relevant processor module. I ran through a few theories, among them System/360 and 68K. For each of my theories, IDA either refused to disassemble most of the code, or the resulting disassembly was gibberish. For example, when loaded as 68000, IDA refused to disassemble much of the code:


I almost gave up after about 15 minutes, since I had to be somewhere in the morning. But, a final Google search for "8-bit assembly jsr" brought me to a Wikibooks page on 6502 assembly language, which also has a "jsr" instruction. I loaded the binary as 6502, and lo and behold, the resulting disassembly listing looked legitimate. There were, for example, several loads followed by compares, followed by relatively-encoded short jumps to the same nearby location; loads of 0 or 1 immediately before a RTS (return) instruction, and so on.


It looked like I had found my winner.

Loading the Binary Properly

Now that I seemed to know the architecture, I needed to load all of the memory dumps into a single IDA database for analysis. This is easy if you know what you're doing. If you do, you may as well skip this section. If you don't know how to do this, read on.

First, I loaded the file for memory range 0x4000-0xEFFF into IDA and changed the processor type to M6502.


Next I had to change the loading offset and the region that IDA considered as containing the ROM.


From there, I loaded each of the remaining memory dumps as additional binary files.


For each of those, I had to change their loading offset so they'd be represented at the proper location within the database.


That's all; now the IDB has all of the memory ranges properly loaded. The clean IDB can be found here.

Reverse Engineering the Binary

I only have a few notes on the process of statically reverse engineering this particular binary. That's because, for the most part, I used the same process as I would while statically reverse engineering any binary on any architecture. In particular, I followed the same procedure that I teach in my static reverse engineering training classes. The whole process took 30-40 minutes. You can find the final IDB here.

Reset Vector

It's often useful to know where execution begins when a system or program executes. For embedded devices, the "system entrypoint" is often held as a pointer in the system's reset vector. For example, on 16-bit x86, the memory location F000:FFF0 contains the address to which the processor should branch upon reset.

For this binary, I noticed that the final three words in the memory dump -- at addresses 0xFFFA, 0xFFFC, and 0xFFFE -- all contained the value 0x8530, which happens to be near the end of the region with the code in it. It seems likely that this is our entrypoint. I renamed this location BOOTLOC and began reading.


6502 Documentation and Auto Comments

Now my job is to analyze and document 1300 bytes of 6502 code. A slight problem is that I don't know anything in particular about 6502 beyond some tutorials I read years ago. Nevertheless, this was not a major impediment to reverse engineering this binary. Two things helped. First, the Wikibooks web page I had found previously (which had tipped me off that the binary might be 6502) was a reasonably good layman's guide to the instruction set. Whenever I wanted to know more about a particular mnemonic, I searched for it on that page. Most of my questions were answered immediately; a few required me to correlate information from a few different parts within the document. It was good enough that I didn't have to consult any outside sources.

The second feature that helped me was IDA's "auto comments" feature, under Options->General->Auto Comments.


Once enabled, IDA leaves brief one-line descriptions of the mnemonics in the same color that it uses for repeating comments. You can see an example in the screenshot below. Although the comments may confuse you if you don't know 6502, the Wikibooks link above was enough to fill in the missing details, at which point the auto comments assisted in rapidly disassembling the binary.


Disjointed References

One slightly abnormal feature of this binary -- probably 6502 binaries in general -- is the fact that the instructions usually operate on 8-bit values, but the memory addresses are 16-bits. Thus, when the code needs to write a value into memory that is more than 8 bits, it does so one byte at a time. The following screenshot illustrates:


Lines 0x817D-0x8183 write the constant 0x405F into memory (at location 0x80) as a little-endian word. Lines 0x8188-0x818E writes the constant 0x406C to the same location. Lines 0x8193-0x8199 write the constant 0x4081 to the same location.

Those constants stuck out to me as being in the area where the strings are located. Indeed, here's the view of that location:


Thus, the addresses written by the code previously are those of the first three strings making up the textual description of the memory map that is printed when connecting to the system. (The string references also tip us off that the called function likely prints strings.)

Although splitting larger constants into smaller ones is not unique to M6502 -- for example, compiled binaries for most processors with fixed-width instructions do this -- nevertheless, as far as I know, IDA's support for recognizing and recombining these constructs into macroinstructions must be implemented per processor (as part of the processor module). Evidently, the 6502 processor module does not support fusing writes into pseudo-operations. Therefore, we have to resolve these references manually while reverse engineering. Nevertheless, this is not difficult (and we only have about 1300 bytes of code to analyze). Simply use the "G" keyboard shortcut to jump to the referenced locations when you see them. For strings, just copy and paste the string at the referencing location.

Memory-Mapped I/O

When the system prints its memory map upon startup, the region from 0xF000-0xFF00 is labeled "MMIO", which presumably is short for memory-mapped I/O. Memory-mapped I/O addresses are effectively global variables stored in memory. When you write to a memory-mapped I/O location, the data is accessible to other components such as peripherals (i.e., a screen controller or a disk controller). Similarly, peripherals can also share data with the 6502 component via memory-mapped I/O, such as keyboard input or the results of disk I/O.

Reverse engineering the variables in this range is effectively the same as reverse engineering accesses to any global variable, although you must keep in mind that MMIO output locations might never be read, and MMIO input locations might never be written.

System Description

Once familiar with the basics of 6502, and the particular points described above, I statically reverse engineered the binary's scant 25 functions.

  • Upon boot, the system prints the diagnostic messages we see in the screenshot at the top of this post. 
  • Next, it loads the message-of-the-day off of the file system (from the file /etc/motd) and prints it. 
  • Next, it checks mail (under the file /var/mail/spool/atredis) and prints it. 
  • Finally, it enters into an infinite loop processing user input and dispatching commands.

Reading the code for the final step -- command dispatch -- we can see indeed that "write" and "call" are valid, undocumented commands. The "write" command is ultimately very straightforward: it converts the address and value arguments into hexadecimal, and then write the value to the address. The "call" command is also straightforward, but I found it neat. It creates 6502 code at runtime for a JSR and RTS instruction, which it then invokes. My daily work does not usually involve examining JIT code generation on archaic platforms.


File I/O

It's been a while since we mentioned it, but recall from the introduction that when we connected to the system, it printed the most recent mail, and dropped a hint about the attachment having been written to disk. Our ultimate goal in reverse engineering this binary is to find and exfiltrate this mystery file from the disk. The message did not actually give us a filename, or anything of that sort. Let's take a closer look at how the system loads and prints files off of the disk, such as the message-of-the-day and the most recent email.

The full details can be found in the IDB, but the process generally works like this:

  • PrintMOTD() writes a pointer to "/etc/motd" into a global variable X, then calls Load()
  • Load() reads, in a loop, each of the 0x40 disk sectors into a global array Y and compares the first bytes of Y against X
  • Once found, Load() returns 1. The contents of the disk sector are still in Y.
  • If Load() succeeded, PrintMOTD() calls PrintDiskSectors().
  • At this point, the global buffer Y contains not only the name of the file, but also two words; let's call them Z and W. Z indicates the number of the first disk sector which contains the file's contents, and W is the number of sectors that the file occupies.
  • PrintDiskSectors() then consults Z and W from within the global array. Beginning at sector Z, it prints the raw contents of the file onto the screen, and repeats for W sectors.

(My analysis at the time was slightly sloppier than the above. I did not fully understand the role of the values I've called Z and W.)

Enumerating the Disk Contents, Finishing the Challenge

I now had a rough understanding of the mechanism of how files were read from disk and printed to the screen. My understanding also indicated that I could dump the underlying sectors of the disk without needing to know the names of the files comprising those sectors.

In particular: PrintDiskSectors(), at address 0x822C, reads its two arguments from global memory. Those arguments are: 1) Z, i.e., which sector to begin dumping data from, and 2) W, i.e., how many sectors to dump.

And so it was immediately obvious that I could use the undocumented "write" and "call" commands to write whatever I wanted into Z and W and then invoke PrintDiskSectors(). I tried it out in netcat and it worked on the first try -- I was able to dump sector #0.

Thus, I incorporated this functionality into my scripts based on Python's Telnetlib that I had previously used to dump the memory. Since my understanding at the time was a bit off, I ended up with a loop that executed 0x40 times (the number of sectors), which wrote 0x0000 in for Z (i.e., start reading at sector 0), and each iteration of the loop wrote an increasing value into W, starting with 1 and ending with 0x3F. The script would dump the returned data as a binary file, as well as printing it onto the screen. You can find the script here, and its output here.

I let my script run and once it got to the final iteration of the loop, there was a message congratulating me and telling me where to send an email, as well at what the contents should be. I immediately sent an email at 1:35AM, about 80 minutes after I'd first dumped the memory off of the device. Shortly thereafter, I received a personalized email congratulating me on my second-place finish.

Concrete and Abstract Interpretation, Explained through Chess

I've decided to release my presentation (two slide decks) on the theoretical foundations of abstract interpretation, illustrated through the game of chess. It has been collecting dust on my hard drive for five years, so I figured I may as well give it a proper burial.

This was the first thing I wrote as part of what eventually became my SMT-Based Program Analysis training, back when I planned for that course to cover abstract interpretation as well as logical analysis. The idea to use chess as a medium came to me in a nightmare. I spent about six weeks working on it, polishing it, and extending my original ideas to cover broader areas of abstract interpretation than what my nightmare originally entailed. Writing it was a useful exercise; it forced me to muddle through all of the formalism you see in the preliminaries section of an abstract interpretation paper, and digest them well enough that I thought I could present them to other people. And I personally liked what I wrote, especially the detailed figures.

Next, I sent it off to proofreaders. Two of them -- people with graduate degrees in program analysis and mathematics -- also liked it. The other 18 or so of them -- bright people, mostly professional security researchers -- seemed to utterly despise it. I think they were angry at me for sending it to them with a straight face and expecting a response. You know who you are, and I'm sorry for putting you through that. There were some more changes I wanted to make, but the response made me finally realize that you need at least an undergraduate education in mathematics to understand abstract interpretation, no matter how hard I might try to make it friendly. The experience ultimately convinced me that my time was better spent in teaching people about SMT solvers.

Now it's my turn to preemptively offer my apologies, and best wishes for luck, to anybody else who would like to read this. Chances are very good that you are not going to like it, either. The second deck could use some more polishing, but that's the least of your worries. The bigger problem is that the audience for this document is virtually non-existent.

After that compelling interlude, I present:

P.S. if you find any corrections, don't bother sending them since I have no intention of working on this anymore.

Bonus picture, a preview of what awaits you.


FinSpy VM Unpacking Tutorial Part 3: Devirtualization. Phase #4: Second Attempt at Devirtualization

[Note: if you've been linked here without context, the introduction to Part #3 describing its four phases can be found here.]

1. Introduction

In Part #3, Phase #1, we deobfuscated the FinSpy VM bytecode program by removing the Group #2 instructions. In Part #3, Phase #2, we made a first attempt to devirtualize the FinSpy VM bytecode program back into x86 code. This was mostly successful, except for a few issues pertaining to functions and function calls, which we examined in Part #3, Phase #3.

Now we are ready to take a second stab at devirtualizing our FinSpy VM sample. We need to incorporate the information from Part #3, Phase #3 into our devirtualization of X86CALLOUT instructions. After having done so, we will take a second look at the devirtualized program to see whether any issues remain. After addressing one more major observation and a small one, our devirtualization will be complete.

2. Devirtualization, Take Two

We are finally ready to choose an address in the original FinSpy sample at which to insert the devirtualized code, devirtualize the FinSpy VM program, and copy the devirtualized machine code into the original binary. I chose the address 0x500000, for no particular reason other than that it was after any of the existing sections in the binary.

If everything we've done so far has worked correctly, now we have all of the information we need to generate proper functions in our devirtualized program. We have a set containing the non-virtualized functions called by the FinSpy VM program. For virtualized function targets, we have a list containing tuples of the function addresses, the VM instruction key corresponding to the first virtualized instruction in the function, and a list of prologue bytes to prepend before the devirtualization of the first virtualized instruction. 

We derive two dictionaries from the virtualized function information. 

  1. The dictionary named X86_VMENTRY_TO_KEY_DICT maps an X86CALLOUT target to the VM instruction key corresponding to the beginning of the virtualized function body. 
  2. The dictionary named KEY_TO_PROLOGUE_BYTES_DICT maps the VM instruction key to the copied x86 prologue machine code bytes for the function beginning at the VM instruction with that key.

Now we make two changes to our instruction-by-instruction devirtualization process:

  • In the loop that iterates over all VM bytecode instructions and produces the devirtualized output, consult KEY_TO_PROLOGUE_BYTES_DICT to see if the instruction corresponds to the beginning of a virtualized function. If so, insert the prologue bytes before devirtualizing the instruction.
  • When devirtualizing X86CALLOUT instructions, look up the address of the target in the NOT_VIRTUALIZED set. 
    • If the target is in the set, then nothing special needs to be done to devirtualize the X86CALLOUT instruction; emit an x86 CALL instruction with a dummy displacement DWORD of 0x0, and a fixup to later replace the 0x0 value with the proper distance from the source to the target (similarly to how we devirtualized VM jump instructions).
    • If the target is not in the set, then we need to generate an x86 CALL instruction to the devirtualized address of the target's VM instruction key. Emit a dummy x86 CALL instruction as before. Also generate a fixup specifying the offset of the dummy 0x0 displacement DWORD in the x86 CALL instruction, and the target of the X86CALLOUT instruction.

After the instruction-by-instruction devirtualization process, we need to process the fixups generated for the two varieties of X86CALLOUT instructions mentioned above (i.e., based on whether the destination is virtualized or not).

Here is partial code from the second approach at devirtualization.

# This is the same devirtualization function from before, but
# modified to devirtualize X86CALLOUT instructions and insert
# function prologues where applicable.
# It has a new argument: "newImageBase", the location in the
# FinSpy-virtualized binary at which we emit our devirtualized
# code.
def RebuildX86(insns, newImageBase):

    # These are the same as before:
    mcArr  = [] # Machine code array into which we generate code
    locsDict = dict() # VM location -> x86 position dictionary
    keysDict = dict() # VM key -> x86 position dictionary
    locFixups = []    # List of fixup locations for jumps

    # New: fixup locations for calls to virtualized functions
    keyFixups = []

    # New: fixup locations for calls to non-virtualized functions
    binaryRelativeFixups = []

    # Same as before: iterate over all instructions
    for i in insns:

        # Same as before: memorize VM position/key to x86 mapping
        currLen = len(mcArr)
        locsDict[i.Pos] = currLen
        keysDict[i.Key] = currLen

        # New: length of prologue instructions inserted before
        # devirtualized FinSpy VM instruction. Only obtains a
        # non-zero value if this instruction corresponds to the
        # beginning of a virtualized function.   
        prologueLen = 0

        # New: is this VM instruction the beginning of a
        # virtualized function?
        if i.Key in KEY_TO_PROLOGUE_BYTES_DICT:

            # Get the prologue bytes that should be inserted
            # before this VM instruction.
            prologueBytes = KEY_TO_PROLOGUE_BYTES_DICT[i.Key]

            # Increase the length of the instruction.
            prologueLen += len(prologueBytes)

            # Copy the raw x86 machine code for the prologue
            # into the mcArr array before devirtualizing the
            # instruction.

        # Now devirtualize the instruction. Handling of
        # "Raw X86", "Indirect Call", and jumps are identical
        # to before, so the code is not duplicated here.

        # Is this an "X86CALLOUT" ("Direct Call")?
        if isinstance(i,RawX86Callout):

            # New: emit 0xE8 (x86 CALL disp32)

            # Was the target a non-virtualized function?
            if i.X86Target in NOT_VIRTUALIZED:

                # Emit a fixup from to the raw target

            # Otherwise, the target was virtualized
                # Emit a fixup to the devirtualized function body
                # specified by the key of the destination

            # Write the dummy destination DWORD in the x86 CALL
            # instruction that we just generated. This will be
            # fixed-up later.
            mcArr.extend([0x00, 0x00, 0x00, 0x00])

The Python code above generates additional fixup information for devirtualized X86CALLOUT instructions. The two cases of the destination being virtualized or not are handled similarly, though they are placed in two different lists ("keyFixups" for virtualized targets, and "binaryRelativeFixups" for non-virtualized targets). After the main devirtualization loop shown above, we must process the fixups just generated, the same way we did for the jump instructions. The process of applying the fixups is nearly identical to what we did for jump instructions, except that for virtualized targets, we need to determine the VM instruction key corresponding to the x86 address of the X86CALLOUT target. Here is the code for fixing up calls to virtualized functions:

# Fixups contain:
# * srcBegin: beginning of devirtualized CALL instruction
# * srcFixup: distance into devirtualized CALL instruction
#             where displacement DWORD is located
# * dst:      the X86CALLOUT target address
for srcBegin, srcFixup, dst in keyFixups:

    # Find the machine code address for the source
    mcSrc = locsDict[srcBegin]

    # Lookup the x86 address of the target in the information
    # we extracted for virtualized functions. Extract the key 
    # given the function's starting address.
    klDst = X86_VMENTRY_TO_KEY_DICT[dst]
    # Find the machine code address for the destination
    mcDst = keysDict[klDst]

    # Set the displacement DWORD within x86 CALL instruction
    StoreDword(mcArr, mcSrc+srcFixup, mcDst-(mcSrc+srcFixup+4))

Next, and more simply, here is the code for fixing up calls to non-virtualized functions:

# Same comments as above
for srcBegin, srcFixup, dst in binaryRelativeFixups:

    # Find the machine code address for the source
    mcSrc = locsDict[srcBegin]
    # Compute the distance between the end of the x86
    # CALL instruction (at the address at which it will
    # be stored when inserted back into the binary) and
    # the raw x86 address of the X86CALLOUT target
    fixup = dst-(newImageBase+mcSrc+srcFixup+4)
    # Set the displacement DWORD within x86 CALL instruction
    StoreDword(mcArr, mcSrc+srcFixup, fixup)

3. Inspecting the Devirtualization

Now we are in a similar place to where we were after our initial devirtualization attempt in Part #3, Phase #2; let's look at the devirtualized code in IDA and see if anything jumps out as being obviously incorrect.

IDA's navigation bar shows a few things.

  • The first third of the binary -- in the transparently-colored regions -- contains data defined as arrays. IDA has not identified code in these regions.
  • The red-colored areas have properly been indentified as code, but don't have any incoming references, therefore they have not been defined as functions.

These two issues are related: if the regions currently marked as data are actually code, and if they make function calls to the code in red, then perhaps IDA can tell us that the red regions do have incoming references and should be defined as functions. I selected the entire text section, undefined it by pressing 'U', and then selected it again and pressed 'C' to turn it into code. The result was much more pleasing:


Now the whole devirtualized blob is defined as code. There is still an obvious cluster of functions that don't have incoming references.

Next we list the remaining issues that appear when inspecting the new devirtualization.

3.1 Some Functions Don't Have Incoming References

As we just saw from the navigation bar, there is a cluster of functions with no incoming references. Furthermore, inspecting these functions shows that they all lack prologues, like we noticed originally for all functions in our first devirtualization. If we turn them into functions, IDA makes its objections well-known with its black-on-red text:


So apparently, our prologue extraction scripts have missed these functions. We'll have to figure out why.

3.2 Many Call Instructions Have Invalid Destinations

IDA's "Problems" window (View->Open Subviews->Problems) helpfully points us to another category of errors. Many function calls have unresolved addresses, which IDA highlights as black-on-red text.


These issues have an innocuous explanation. In this Phase #4, we made the decision to choose the address 0x500000 as the base address at which to install the devirtualized code within the original binary. The x86 CALL instructions targeting non-virtualized functions are thus computed relative to an address in that region of the binary. Since we are currently inspecting the .bin file on its own, its base address is 0x0, and not 0x500000 like it will be when we insert it the devirtualized code into IDA. The x86 CALL displacements are indeed nonsensical at the moment, but we'll double check on them after we've inserted the devirtualized code back into the binary.

3.3 One Call in Particular is Weird

All of the x86 CALL instructions described in the previous issue have displacements that begin with the nibbles 0xFFF....., indicating that the destination of those CALL instructions lies at an address located physically before the CALL instruction. However, one x86 CALL instruction at the beginning of a devirtualized function has a positive displacement, also colored black-on-red:

seg000:00000E70 sub_E70 proc near
seg000:00000E70     push    0B6Ch
seg000:00000E75     push    40B770h
seg000:00000E7A     call    near ptr 6D85h ; <- bogus destination

I looked at the corresponding function in the original binary from which this prologue had been copied, and the situation became clear.

.text:00404F77         push    0B6Ch
.text:00404F7C         push    offset stru_40B770
.text:00404F81         call    __SEH_prolog
.text:00404F86         xor     esi, esi
.text:00404F88         mov     [ebp-20h], esi
.text:00404F8B         push    edi        ; Save obfuscation register #1
.text:00404F8C         push    ebp        ; Save obfuscation register #1 
.text:00404F8D         mov     ebp, offset word_411A6E ; Junk obfuscation
.text:00404F92         shrd    edi, esi, 0Eh           ; Junk obfuscation

The prologue for the pre-virtualized function installed an exception handler by calling __SEH_prolog. Our prologue extraction script simply copied the raw bytes for the prologue. Since x86 CALL instructions are encoded relative to their source and destination addresses, we can't simply copy a CALL instruction somewhere else without updating the destination DWORD; if we don't, the destination will be incorrect.

Since this issue appeared only once, instead of re-architecting the prologue extraction functionality to deal with this special case, I decided to just manually byte-patch my devirtualized code once I've copied it into the original binary. If I wanted to write a more fully-automated FinSpy devirtualization tool, I would tackle this issue more judiciously.

3.4 What are These Function Pointers?

The second devirtualization contains many pointers that reference hard-coded addresses within the original FinSpy binary from which we extracted the VM bytecode. For example, the following example references a function pointer and an address in the .text section:

seg000:00004BB9     push    0
seg000:00004BBE     push    0
seg000:00004BC3     push    400h
seg000:00004BC8     push    0FFFFh
seg000:00004BCD     call    dword ptr ds:401088h
seg000:00004BD3     mov     eax, ds:41FF38h
seg000:00004BD8     push    0
seg000:00004BDD     call    dword ptr [eax+38h]

Since we are examining the devirtualized code in isolation from the original binary, IDA cannot currently provide us meaningful information about the addresses in question. We can check the addresses in the FinSpy sample IDB to see if they make any sense; for example, here's the address referenced by the CALL: 

.idata:00401088     ; LRESULT __stdcall SendMessageW(HWND hWnd, UINT Msg, WPARAM wParam, LPARAM lParam)
.idata:00401088     extrn SendMessageW:dword

Things look good; we see four arguments pushed in the code above, and the function pointer references a function with four arguments. Once we've inserted our devirtualization back into the original binary, IDA will resolve the references seamlessly, and allow us to make full use of its normal facilities for cross-referencing, naming, type inference, and parameter tracking.

I also noticed that some of the instructions made reference to items within the original binary's .text section. 

seg000:0000333E     mov     dword ptr [esi+4], 4055D5h
seg000:00003345     mov     dword ptr [esi], 40581Eh

; ... later ... 

seg000:000034C4     mov     dword ptr [esi+4], 40593Ch
seg000:000034CB     mov     dword ptr [esi], 4055FEh

; ... later ...

seg000:000034DC     mov     dword ptr [esi+4], 40593Ch
seg000:000034E3     mov     dword ptr [esi], 405972h

; ... more like the above ...

Looking at these addresses in the original binary, I found that they corresponded to virtualized functions in the .text section. For example, here are the contents of the first pointer from the snippet above -- 0x4055D5 -- within the original binary:

.text:004055D5     mov     edi, edi                ; Original prologue
.text:004055D7     push    ebp                     ; Original prologue
.text:004055D8     mov     ebp, esp                ; Original prologue
.text:004055DA     push    ebp                     ; Push obfuscation register #1
.text:004055DB     push    esi                     ; Push obfuscation register #2
.text:004055DC     mov     esi, offset word_41CCBA ; Junk obfuscation
.text:004055E1     mov     ebp, 7C9E085h           ; Junk obfuscation
.text:004055E6     bswap   ebp                     ; Junk obfuscation
.text:004055E8     pop     esi                     ; Pop obfuscation register #2
.text:004055E9     pop     ebp                     ; Pop obfuscation register #1
.text:004055EA     push    5A329Bh                 ; Push VM instruction entry key
.text:004055EF     push    ecx                     ; Obfuscated JMP
.text:004055F0     sub     ecx, ecx                ; Obfuscated JMP
.text:004055F2     pop     ecx                     ; Obfuscated JMP
.text:004055F3     jz      GLOBAL__Dispatcher      ; Enter FinSpy VM

And it turns out that the VM key pushed by this sequence, namely 0x5A329B, references one of the functions in the devirtualized binary which otherwise did not have a incoming reference. Great! We would like to extract the addresses of the pointed-to functions so that we can process them with the scripts we developed in Part #3, Phase #3 in order to extract their prologues. We'd also like to alter the raw x86 instructions that reference the function pointers to make them point to their devirtualized targets within the devirtualized blob instead.

4. Next Step: Function Pointers

At this point, only two issues remain. First, we noticed that some devirtualized functions still don't have prologues. The explanation for this behavior must be that the addresses of their virtualized function stubs must not have been passed to the scripts. If we had provided those virtualized functions' addresses, our scripts would have found something for their prologues, even if it was an incorrect prologue. Yet the scripts found nothing.

Secondly, we noticed that the devirtualized code contained function pointers referencing the addresses of the prologue-lacking functions from the previous paragraph. We would like to replace the raw function pointers within the x86 instructions with the addresses of the corresponding functions in the devirtualized code.

4.1 Extracting Function Pointers from the VM Bytecode Disassembly

It seems like the first step in resolving both issues is to locate the function pointers within the FinSpy VM. I took a look at the raw FinSpy VM instructions from the snippet above with the function pointer references.

Here's the first VM instruction:

0x030630: X86 mov dword ptr [esi+4h], 4055D5h

Here's the raw bytes that encode that VM instruction:

seg000:00030630 dd 5A5C54h ; <- VM instruction key
seg000:00030634 db  1Bh    ; <- Opcode: raw x86
seg000:00030635 db    7    ; <- Length of x86 instruction: 7
seg000:00030636 db    3    ; <- Fixup offset: #3
seg000:00030637 db    0    ; <- Unused
seg000:00030638 db 0C7h    ; <- x86: mov dword ptr [esi+4], 55D5h
seg000:00030639 db  46h    
seg000:0003063A db    4
seg000:0003063B db 0D5h    ; <- 3: offset of 0x55D5 in x86 instruction
seg000:0003063C db  55h
seg000:0003063D db    0
seg000:0003063E db    0

The important thing to notice is that the x86 machine code contained within this instruction disassembles to:

mov dword ptr [esi+4], 55D5h

Whereas the x86 instruction shown in the VM bytecode disassembly listing is, instead:

mov dword ptr [esi+4h], 4055D5h

The difference is in the DWORD value (55D5h in the raw VM bytecode versus 4055D5h in the VM bytecode disassembly). 

The reason for this difference lies in the line labeled "Fixup offset: #3". You may recall from part two that all FinSpy VM instructions have two byte fields at offsets +6 and +7 into the VM instruction structure that were named "RVAPosition1" and "RVAPosition2". To quote the description of those fields from part two:

"Some instructions specify locations within the x86 binary. Since the binary's base address may change when loaded as a module by the operating system, these locations may need to be recomputed to reflect the new base address. FinSpy VM side-steps this issue by specifying the locations within the binary as relative virtual addresses (RVAs), and then adding the base address to the RVA to obtain the actual virtual addresses within the executing module. If either of [RVAPosition1 or RVAPosition2] is not zero, the FinSpy VM treats it as an index into the instruction's data area at which 32-bit RVAs are stored, and fixes the RVA up by adding the base address to the RVA."

In a bit of unplanned, happy serendipity, when I was writing my FinSpy VM bytecode disassembler, I made a Python class called "GenericInsn" that served as the base class for all other Python representations of FinSpy VM instruction types. Its Init() method is called in the constructor for every VM instruction type. And in particular, Init() includes the following code:

if self.Op1Fixup:
    ApplyFixup(self.Remainder, self.Op1Fixup & 0x7F, self.Pos)
if self.Op2Fixup:
    ApplyFixup(self.Remainder, self.Op2Fixup & 0x7F, self.Pos)

Thus, we are in the fortunate position where FinSpy VM helpfully tags all pointers to items in the original binary by setting these RVAPosition1 and RVAPosition2 fields. And furthermore, our existing function "ApplyFixup" already receives all of these values when we disassemble a FinSpy VM bytecode program. Thus, all we need to do to extract the function pointers is to include some logic inside of ApplyFixup that detects when one of these embedded RVAs refers to a function pointer, and if it does, to store the virtual address of the function pointer into a global set. The logic I used to determine function pointers was simply checking whether the virtual address was between the beginning of the first function in the .text section, and the last address in the .text section.

To wit, I changed my implementation of ApplyFixup as follows:

# New: constants describing the first function in the
# .text section and the end of the .text section.

# New: a global dictionary whose keys are fixed-up
# virtual addresses, and whose values are lists of 
# VM instruction positions whose bodies reference
# those virtual addresses.
from collections import defaultdict
ALL_FIXED_UP_DWORDS = defaultdict(list)

# Existing ApplyFixup function
def ApplyFixup(arr, FixupPos, InsnPos):
    # New: Python scoping statement
    # Existing ApplyFixup logic
    OriginalDword = ExtractDword(arr, FixupPos)
    FixedDword = OriginalDword + IMAGEBASE_FIXUP
    StoreDword(arr, FixupPos, FixedDword)
    # New: if the fixed-up DWORD is in the .text 
    # section, save it in ALL_FIXED_UP_DWORDS and
    # add the VM instruction position (InsnPos) to
    # the list of positions referencing that DWORD.
    if FixedDword >= TEXT_FUNCTION_BEGIN and FixedDword <= TEXT_FUNCTION_END:

4.2 Extracting Prologues from Virtualized Function Pointers

Next, I also wanted to treat these virtual addresses as though they were the beginnings of virtualized functions, so that my existing machinery for extracting function prologues would incorporate them. In Part #3, Phase #3, I had written a function called "ExtractCalloutTargets" that scanned the FinSpy VM instructions looking for X86CALLOUT instructions and extracted their target addresses. This was then passed to the function prologue extraction scripts to collect the data that was used in this Phase #4 to devirtualize X86CALLOUT instructions and insert the function prologues from virtualized functions into the devirtualization.

It seemed natural to modify ExtractCalloutTargets to incorporate the virtual addresses we collected in the previous subsection. To wit, I modified that function as such:

# Existing ExtractCalloutTargets function
def ExtractCalloutTargets(insns, vmEntrypoint):
    # New: Python scoping statement
    # New: initialize the set to the function pointer 
    # addresses collected in ApplyFixup
    calloutTargets = set(ALL_FIXED_UP_DWORDS.keys())
    # Existing: add vmEntrypoint to set
    # Existing: extract X86CALLOUT targets
    for i in insns:
        if isinstance(i,RawX86Callout):
            if i.X86Target not in NOT_VIRTUALIZED:
    # Existing: return list of targets
    return list(calloutTargets)

Now I ran the function prologue extraction scripts from Part #3, Phase #3 again, to re-generate the virtualized function prologue and VM instruction entry keys for the function pointers in addition to the existing data for the X86CALLOUT targets. I then pasted the output data back into the second devirtualization program we wrote in this Phase #4 (remembering to copy in the data I'd generated manually for those virtualized functions without junk obfuscation sequences), and ran the devirtualization again. This time, the unreferenced functions had proper prologues, though they were still unreferenced.

4.3 Fixing the Function Pointers in the Devirtualized x86 Code

The last remaining issue is that the devirtualized x86 instructions which reference the function pointers still use the addresses of the virtualized functions in the .text section, whereas we want to modify them to point to their devirtualized equivalents instead. This is implemented in the RebuildX86 devirtualization function after the machine code array for the devirtualized program has been fully generated.

Fortunately for us, we already know which VM instructions reference the function pointers -- we collected that information when we modified ApplyFixup() to locate and log virtualized function pointers. Not only did we log the virtual addresses of the purported function pointers, but we also logged a list of VM instruction positions referencing each such function pointer in the ALL_FIXED_UP_DWORDS dictionary.

4.3.1 A Slight Complication

A slight complication lead me to a solution that perhaps could have been more elegant. Namely, we collect the positions of the VM instructions referencing function pointers within ApplyFixup() at the time that we disassemble the VM bytecode program. However, the simplifications in Part #3, Phase #1 can potentially merge instructions together when collapsing patterns of VM instructions into smaller sequences. Therefore, it might be the case that the VM instruction positions that we have collected no longer refer to valid locations in the VM program after the simplifications have been applied. However, we'd still expect the function pointers to appear in the machine code for the VM instructions into which the removed instruction was merged.

To work around this issue, I made use of the locsDict dictionary that we generated through devirtualization. Namely, that dictionary recorded the offset within the devirtualized x86 blob of each VM instruction processed in the main devirtualization loop. We find the offset within the devirtualized x86 machine code array of the prior VM instruction with an entry within locsDict, and we find the devirtualized offset of the next VM instruction with an entry within locsDict. This gives us a range of bytes to search in the devirtualized machine code looking for the byte pattern corresponding to the function pointer for the virtualized function. Once found, we can replace the raw bytes with the address of the devirtualized function body for that virtualized function.

4.3.2 Locating the Function Pointers in the Devirtualized Blob

Here is the code for locating function pointers as just described; if it is still unclear, read the prose remaining in this subsection.

# dword: the virtual address of a virtualized function
# posList: the list of VM instruction positions 
# referencing the value of dword
for dword, posList in ALL_FIXED_UP_DWORDS.items():
    # For each position referencing dword:
    for pos in posList:
        # Set the low and high offset within the
        # devirtualized blob to None
        lowPos,highPos = None,None
        # posSearchLow is the backwards iterator
        posSearchLow = pos
        # Continue while we haven't located a prior
        # instruction with a devirtualization offset
        while not lowPos:
            # Does posSearchLow correspond to a 
            # devirtualized instruction? I.e., not
            # something eliminated by a pattern
            # substitution.
            if posSearchLow in locsDict:

                # Yes: get the position and quit
                lowPos = locsDict[posSearchLow]

                # No: move to the previous instruction
                posSearchLow -= INSN_DESC_SIZE

        # Now search for the next higher VM position
        # with a devirtualization offset
        posSearchHigh = pos+INSN_DESC_SIZE

        # Continue while we haven't located a later
        # instruction with a devirtualization offset
        while not highPos:

            # Does posSearchLow correspond to a 
            # devirtualized instruction? I.e., not
            # something eliminated by a pattern
            # substitution.
            if posSearchHigh in locsDict:
                # Yes: get the position and quit
                highPos = locsDict[posSearchHigh]
                # No: move to the next instruction
                posSearchHigh += INSN_DESC_SIZE

For each instruction position X that references one of the pointers to virtualized functions, I locate the last VM instruction at or before X in the locsDict array. This is implemented as a loop that tries to find X in locsDict. If locsDict[X] exists, we save that value -- the offset of the corresponding devirtualized instruction within the devirtualized blob. If locsDict[X] does not exist, then the VM instruction must have been removed by one of the pattern simplifications, so we move on to the prior VM instruction by subtracting the size of an instruction -- 0x18 -- from X. We repeat until we find an X that has been devirtualized; if X becomes 0x0, then we reach the beginning of the VM instructions, i.e., devirtualized offset 0x0.

We do much the same thing to find the next VM instruction with a corresponding devirtualized offset: add the size of a VM instruction -- 0x18 -- to X and look it up in locsDict. If it's not a member of locsDict, add 0x18 and try again. Once we find it, If X ever exceeds the last legal VM location, set the offset to the end of the machine code array.  Once we've found the next VM instruction's devirtualized position, we record it and stop searching.

4.3.3 Rewriting the Function Pointers

Immediately after the code just shown, we have now found a suitable range within the x86 machine code array that ought to contain the raw bytes corresponding to the virtual address of a virtualized function referenced via pointer. Next we simply byte-search this portion of the array looking for that address, and once found, replace the address with that of the corresponding devirtualized function body. There is nothing especially complicated about this; we simply consult the book-keeping metadata that we gathered through devirtualization to locate the devirtualized offset of the virtualized function pointer, add the offset of the new image base at which we are inserting the devirtualized blob within the binary, and store the DWORD value at the found position within the machine code array.

4.3.4 Done

After writing the code just described, now the pointers to virtualized functions have been modified to point to their devirtualized counterparts. Here again are two references to virtualized functions before the modifications just described:

seg000:00003555     mov     dword ptr [esi+4], 4055D5h
seg000:0000355C     mov     dword ptr [esi], 40581Eh

And, the same code after modification:

seg000:00003555     mov     dword ptr [esi+4], 50154Ah
seg000:0000355C     mov     dword ptr [esi], 5017B3h

5. Inserting the Devirtualized Code back Into the Binary

The last step before we can analyze the FinSpy dropper is to re-insert our devirtualized blob back into the binary. We have already chosen an address for it: 0x500000, which was important in generating the devirtualized code.

At this point I struggled to load the devirtualized code with with IDA's File->Load File->Additional binary file... and Edit->Segments->Create Segment menu selections. Although both of these methods allowed me to load the raw devirtualized machine code into the database, I experienced weird issues with both methods. Namely, the cross-section data references and/or function cross-references were broken. IDA might display the correct addresses for data items, and allow you to follow cross-references by pressing "enter" over an address, but it would not show symbolic names or add cross-references. For example we might see something like this:

.data:00504C3C call    dword ptr ds:401088h

Rather than what we see when things are working properly:

.data:00504C3C call    ds:SendMessageW

I tried screwing with every option in the two dialogs mentioned, especially the segment attributes ("CODE" instead of "DATA"). For some attempts, the code references worked properly but the data references didn't; and for other attempts, the opposite was true. More often neither would work. Igor Skochinsky from Hex-Rays has always been very helpful to me, but this time he was away from his keyboard and did not hear my cries of anguish until it was too late. (Thanks anyway, Igor.)

That being the case, although it wasn't my first choice, I ended up enlarging the .data section via Edit->Segments->Edit Segment and then loading the binary contents with a one-liner in IDC:

loadfile(fopen("mc-take2.bin", "rb"), 0, 0x500000, 26840);

And this time, everything worked. You can see the IDB with the devirtualized code here.

From there I reverse engineered the devirtualized FinSpy dropper program. Whereas I was not impressed with the FinSpy VM, the dropper was more sophisticated than I was expecting. You can see a mostly-complete analysis in the linked IDB; start reading from address 0x50207E, the address of WinMain() within the devirtualized code. I've tried to comment most of the assembly language, but a lot of the action is inside of Hex-Rays (i.e., look at those functions inside of Hex-Rays to see a lot of comments that aren't in the assembly language view).

6. Conclusion 

FinSpy VM is weak. It's closer in difficulty to a crackme than to a commercial-grade VM. (I suppose it is slightly more polished than your average VM crackme.) Having a "Raw x86" instruction in the FinSpy VM instruction set, where those instructions make up about 50% of the VM bytecode program, makes devirtualization trivial. 

I almost didn't publish this series because I personally didn't find anything interesting about the FinSpy VM. But, hopefully, through all the tedium I've managed to capture the trials and tribulations of writing a deobfuscator from scratch. If you're still reading at this point, hopefully you found something interesting about it, and hopefully this slog wasn't all for nothing. Hopefully next time you come across a FinSpy-protected sample, this series will help you make short work of it.

FinSpy VM Unpacking Tutorial Part 3: Devirtualization. Phase #3: Fixing the Function-Related Issues

[Note: if you've been linked here without context, the introduction to Part #3 describing its four phases can be found here.]

1. Introduction

At the end of Part #3, Phase #2, we noted that our first attempt at devirtualization is still subject to two remaining major issues.

  • We had deferred generating proper displacements when devirtualizing the FinSpy VM X86CALLOUT instructions into x86 CALL instructions.
  • We discovered that the functions in the devirtualized code were missing their prologues, since the prologues of virtualized function execute in x86 before entering the VM.

Resolving these issues was the most difficult part of devirtualizing FinSpy, and took about half of the total time I spent on devirtualization. I dreamed up one "nice" solution, and implemented instead one "hacky" solution, but both of them fall short of full automation.

Phase #3, this entry, discusses how FinSpy virtualizes functions and function calls. We illustrate the problems that FinSpy's design decisions present us in properly devirtualizing functions and function calls. We state precisely what problems need to be solved. Then, we discuss several ideas for solving them, and eventually write an IDAPython script to collect the information we need. We finish by attending to some issues that arise in the course of writing and using this script.

Part #3, Phase #4, the next entry, will incorporate this information into the devirtualization process. After fixing a few more issues, our journey in devirtualizing FinSpy will be complete.

2. How FinSpy Virtualizes Functions and Calls

Ultimately, our major remaining task in devirtualizing FinSpy VM bytecode programs is to devirtualize the X86CALLOUT instructions. To do that, we need to attend to the several issues described in Part #3, Phase #2 and in the introduction above. It turns out that these issues arise due to how the FinSpy VM virtualizes functions, and due to the mechanics of the X86CALLOUT VM instruction. 

In looking at the FinSpy VM bytecode, we can see that function calls have been virtualized in a way that is very natural, identical to what we'd expect to see in an x86 binary compiled with ordinary compilation techniques. For example:

0x006ff0: X86 push ebx
0x007020: X86 push dword ptr [ebp+0FFFFFFFCh]
0x007098: X86 push eax
0x0070c8: X86CALLOUT 0x40ae74

This sequence almost looks like an x86 function call sequence already. And for this particular example, if we inspect the code at the destination of the X86CALLOUT instruction -- 0x40ae74 -- we see a normal x86 function body:

.text:0040AE74     ; void *__cdecl memcpy(void *Dst, const void *Src, size_t Size)
.text:0040AE74     memcpy proc near
.text:0040AE74     Dst= dword ptr  4
.text:0040AE74     Src= dword ptr  8
.text:0040AE74     Size= dword ptr  0Ch
.text:0040AE74 000 jmp     ds:__imp_memcpy
.text:0040AE74     memcpy endp

If we were to devirtualize the "X86CALLOUT" VM instruction from above into an x86 CALL instruction to the same location, this is indeed exactly what we would see in a normal binary. The obvious approach to devirtualize the X86CALLOUT instruction in the snippet above would be to replace it with an ordinary x86 CALL instruction targeting the function above, and there seem to be no drawbacks or serious complications with doing so in this case.

However, other X86CALLOUT instructions tell a different tale. Here's another, similar sequence of FinSpy VM instructions:

0x00f570: X86 push 40h
0x00f5a0: X86 push 1000h
0x00f5d0: X86 push eax
0x00f600: X86 push esi
0x00f630: X86CALLOUT 0x408810

In this example, at the location specified in the X86CALLOUT instruction -- 0x408810 -- we do not see an ordinary x86 function. Rather, like we discussed in part one, we see that the original, pre-virtualized function has been overwritten by a sequence of code that ultimately enters the FinSpy VM. To re-state: virtualized functions still reside at their original addresses in the binary, and can be executed by calling that location as usual; however, virtualized functions have been modified to transfer control to the suitable location inside of the VM bytecode program. (I suspect the FinSpy authors chose to retain virtualized function addresses in order to preserve certain metadata in the PE64 header surrounding exception records and function start addresses.) The x86 code at the location from above shows:

.text:00408810     mov     edi, edi                ; Ordinary prologue #0
.text:00408812     push    ebp                     ; Ordinary prologue #0
.text:00408813     mov     ebp, esp                ; Ordinary prologue #0
.text:00408815     push    ecx                     ; Ordinary prologue #0
.text:00408816     push    ecx                     ; Ordinary prologue #0
.text:00408817     push    eax                     ; Save obfuscation register #1.1
.text:00408818     push    ebx                     ; Save obfuscation register #2.1
.text:00408819     mov     ebx, offset unk_41B164  ; Junk obfuscation
.text:0040881E     mov     eax, 0B3BF98Dh          ; Junk obfuscation

; ... more junk obfuscation ...

.text:0040883E     bswap   eax                     ; Junk obfuscation
.text:00408840     stc                             ; Junk obfuscation
.text:00408841     pop     ebx                     ; Restore obfuscation register #2.2
.text:00408842     pop     eax                     ; Restore obfuscation register #1.2
.text:00408843     push    5A7314h                 ; Push VM instruction entry key #3
.text:00408848     push    eax                     ; Obfuscated JMP
.text:00408849     xor     eax, eax                ; Obfuscated JMP
.text:0040884B     pop     eax                     ; Obfuscated JMP
.text:0040884C     jz      GLOBAL__Dispatcher      ; Enter FinSpy VM #4

As in the code above, virtualized functions have been overwritten by sequences of x86 instructions that do up to four things.

  1. Execute the original prologue of the pre-virtualized function (on the lines labeled #0 above)
  2. After saving two registers on the lines labeled #1.1 and #2.1, perform junk obfuscation (meaningless computations), and then restore the values of those registers on the lines labeled #2.2 and #1.2.
  3. Push the VM instruction key for the virtualized function body onto the stack (on the line labeled #3 above)
  4. Jump to the FinSpy VM interpreter (on the line labeled #4 above)

The most important observations from the code above are:

  1. Whenever an X86CALLOUT instruction targets the x86 location 0x408810, the result is that the VM begins executing VM instructions starting with the one whose key is 0x5A7314.
  2. For the virtualized function at x86 location 0x408810 (at the VM instruction keyed 0x5A7314), the virtualized function's prologue is the first five x86 instructions from the sequence above (all of which are labeled #0).

2.1 Comments on Devirtualization Strategy

In discussing X86CALLOUT VM instructions to non-virtualized targets, we suggested that we could devirtualize them into an x86 CALL instruction directly targeting the specified location. However, that strategy will not produce good results when the targeted address is a virtualized function. If we were to devirtualize the X86CALLOUT instruction in the second example above with an x86 CALL instruction to the second targeted function, our resulting "devirtualization" would still contain obfuscation and references to the VM. I.e., upon encountering an x86 CALL instruction in the "devirtualized" code, we as analysts would have to analyze the x86 code shown above to determine which VM instruction ended up executing as a result, and to inspect the prologue for the targeted function, which would not be present at the site of the devirtualized function body. This is hardly a "devirtualization", given that we would still need to analyze obfuscated code involving the FinSpy VM, and that IDA and Hex-Rays will balk at functions with no prologues.

Therefore, to truly "devirtualize" FinSpy VM programs, we need to replicate what happens when a virtualized function is called: namely, we need to replicate the original function's prologue from before the VM entry sequence, and also determine the VM location at which the virtualized function's body resides, in order to emit an X86 CALL instruction from the site of the X86CALLOUT VM instruction to the devirtualized address of the destination, thereby entirely eliminating reliance on the FinSpy VM.

Hence, it is clear that we need two different strategies for devirtualizing X86CALLOUT instructions. For non-virtualized targets, we could simply emit x86 CALL instructions to the destination. For virtualized targets, we would need to determine the VM address of the VM bytecode translation for the function, locate the devirtualized versions of those VM instructions within our devirtualized array, and replace the X86CALLOUT VM instruction with an x86 CALL instruction pointing to the proper location within the devirtualized code.

Additionally, for each virtualized function, we need to extract its function prologue from the .text section, and insert those instructions before the devirtualized body of the corresponding function.

3. Stating Our Task Precisely

To devirtualize X86CALLOUT instructions, we need the following information:

  • A list of all x86 locations used as call targets, and whether or not the functions at those locations are virtualized.
  • For each virtualized function:
    • The VM instruction key pushed before entering the VM
    • The raw machine code comprising the function's prologue

With this information, we could solve our remaining issues as follows:

  • When devirtualizing any FinSpy VM instruction, look up its VM instruction key to see if it's the first VM instruction implementing a virtualized function body. If so, insert the prologue bytes from the corresponding x86 function in the .text section before devirtualizing the instruction as normal.
  • When devirtualizing X86CALLOUT instructions, check to see whether the destination is a virtualized function or not. 
    • If the destination is virtualized, emit an x86 CALL instruction whose displacement points to the x86 address of the devirtualized function body within the blob of devirtualized code. 
    • If the destination is not virtualized, emit an X86 CALL instruction pointing to the referenced location in the .text section. This requires knowledge of the base address at which the devirtualized code shall be stored in the binary containing the FinSpy VM.

3.1 Challenges in Obtaining the Requisite Information

Unfortunately for us, we've got a bit of work ahead of us, since the FinSpy VM does not readily provide the information we're interested in. 

  • The FinSpy VM metadata does not contain a list of called function addresses (virtualized or not). We have to perform our own analysis to determine which x86 functions may be called through VM execution, and whether or not the called functions are virtualized. 
  • The FinSpy VM makes no distinction between the cases where the destination address is virtualized, and the case in which it is not virtualized. In both cases, the target of an X86CALLOUT location is simply an address in the .text section. If the target is virtualized, then there will be a FinSpy VM entry sequence at the destination. If the target is not virtualized, there will not be a FinSpy VM entry sequence at the destination. Thus, the onus is upon us to determine whether a particular call target is virtualized or not, and if it is, to extract its function prologue and determine the VM key of the VM instruction that implements the virtualized function's body.

4. Initial Approach to Discovering Virtualized Function Addresses

Now that our tasks are laid out, let's set about solving them. For the first task -- obtaining the list of virtualized function addresses -- my first thought was to extract the targets from all of the X86CALLOUT instructions. That idea was easy enough to implement. I wrote a small Python script to dump all of the X86CALLOUT targets from the FinSpy VM bytecode program:

# Iterate through the VM instructions and extract direct call targets
# Inputs:
# * insns: a list of VM instruction objects
# * vmEntrypoint: the address of WinMain() in the original binary
# Output:
# * A list of call targets
def ExtractCalloutTargets(insns, vmEntrypoint):

    # Create a set containing just the address of the VM entrypoint
    calloutTargets = set([vmEntrypoint])

    # Iterate through all VM instructions
    for i in insns:
        # Was the instruction an X86CALLOUT?
        if isinstance(i,RawX86Callout):
            # Add it to the set

    # Return a list instead of a set
    return list(calloutTargets)

We add the address of the VM entrypoint because it isn't explicitly called from anywhere within the VM program -- WinMain() contains a VM entry sequence that causes it to execute.

The code above was a good start toward obtaining the list of x86 functions that may be called by virtualized code, and as far as I knew at this point, this list contained all such function addresses. I later discovered that some virtualized functions are not referenced via X86CALLOUT instructions (analogously to how WinMain()'s virtualized entrypoint is not the target of an X86CALLOUT instruction), and hence are not included in the list of virtualized functions generated above. Later it became clear that not including those virtualized function addresses not referenced by X86CALLOUT instructions causes serious problems while collecting the function information in preparation for devirtualization. Still, all of those problems were yet to materialize; this was my initial, blissfully ignorant approach.

5. A Nice Approach to Extracting Information from Virtualized Functions

For extracting the VM entry keys and function prologues, the first solution that came to mind was the best one I came up with. It is not, however, what I ended up actually doing.

Some modern reverse engineering tools (such as the Unicorn Engine) support custom emulation of assembly language snippets (as opposed to full-system emulation). The user supplies a starting address and an initial state for the registers and memory. From there, the emulation is controllable programmatically: you can emulate one instruction at a time and inspect the state after each to determine when to stop. (My own program analysis framework, Pandemic, part of my SMT training class, also supports this functionality.)

With access to a customizable x86 emulator, and assuming that we knew the address of the VM interpreter, we could extract the VM entry instruction keys for a virtualized function as follows:

  • Put the emulator's registers and memory into some initial state.
  • Set the initial x86 EIP to the beginning of the virtualized function's x86-to-VM entrypoint.
  • Emulate instructions until EIP reaches the address of the VM interpreter. Save each emulated instruction into a list.
  • Extract the DWORD from the bottom of the stack. This is the VM key corresponding to the beginning of the function.
  • Scan the list of emulated instructions in reverse order to determine which two registers are being used for junk obfuscation.
  • Scan the list of emulated instructions in forward order until the point at which those two registers are pushed. The instructions before these two x86 PUSH instructions comprise the prologue for the function beginning at the starting address.    

This is the best solution I have come up with so far, which mirrors the approaches I've taken for similar problems arising from different obfuscators. The emulation-based solution is very generic. For the task of recovering the VM entry key for a given entrypoint, it uses a semantic pattern -- no matter how the code at the VM entrypoint is implemented, the end result is that the VM entry key is pushed on the stack before the VM begins executing. Semantic pattern-matching is preferable in the extreme to syntactic pattern-matching. It would not matter if the FinSpy authors change the obfuscation at the virtualized functions, so long as the fundamental architecture of VM entry remains the same. I expect that this method would continue to work until and unless there was a major overhaul of the FinSpy VM entry schema.

The logic for extracting the function prologues is based on syntactic pattern-matching. Therefore it is more brittle, and would require tweaking if the FinSpy VM authors changed the VM entrypoint obfuscation.

The reason I did not pursue the emulation-based solution is not especially interesting: I had just gotten a new laptop and had not installed my development environment at that time, and I wanted a quick solution. Anyway, even if I had, I'm not sure I would have released the code for it. While staring at build environment errors, I started to wonder if there was another way to obtain the x86 call target/VM key information. What I came up to replace the VM entry key extraction did work, but in all fairness, it wasn't a very good solution and is not suitable for a fully-automated solution, since it does involve some manual work.

6. Hacky Solution, Overview

Remember, our overall goal is to discover, for each virtualized function: 

  1. The key for the VM instruction implementing the virtualized function's body
  2. The non-virtualized prologue instructions for the virtualized function

In lieu of the nicer, emulation-based solution, the hacky solution I came up with uses the IDA API to extract information in several separate parts before eventually combining them. The entire script can be found here.

  1. First, we find all x86 locations that jump to the FinSpy VM interpreter.
  2. Second, for each such referencing address, we extract the key pushed shortly beforehand, and also extract the names of the registers used by the junk obfuscation following the non-virtualized function prologue.
  3. For each virtualized function start address, match it up with the subsequent address that enters the VM (i.e., match it with one of the addresses described in #1). From step #2, we know which VM key is pushed before entering the VM interpreter; thus, we now know which VM key corresponds to the function start address. We also know which registers are used for the junk obfuscation for the virtualized function beginning at that address.
  4. Now that we know the identities of the registers used in junk obfuscation for a given virtualized function, scan forwards from its starting address looking for the beginning of the junk obfuscation -- i.e., two consecutive x86 PUSH instructions pushing those register values. Copy the raw bytes before the beginning of the junk obfuscation; these are the prologue bytes.
  5. Correlate the addresses of the virtualized functions with the prologue bytes and VM instruction key extracted in previous steps.

6.1 Steps #1 and #2: Extract VM Entry Keys and Obfuscation Registers

The implementation for the first two steps just described is not very sophisticated or interesting. The first function, ExtractKey, takes as input the address of an instruction that jumps to the VM entrypoint. I.e., its input is the address of an instruction such as the one labeled #4 below:

; ... prologue, junk obfuscation above this ...
.text:004083FC     pop     eax                     ; Pop obfuscation register #2.2
.text:004083FD     pop     ebx                     ; Pop obfuscation register #1.2
.text:004083FE     push    5A6C26h                 ; Push VM instruction key -- #3
.text:00408403     push    eax                     ; Obfuscated JMP
.text:00408404     xor     eax, eax                ; Obfuscated JMP
.text:00408406     pop     eax                     ; Obfuscated JMP
.text:00408407     jz      GLOBAL__Dispatcher      ; Enter FinSpy VM #4

From there, the script disassembles instructions backwards, up to some user-specifiable amount, and inspects them one-by-one looking for the first x86 PUSH instruction (i.e., the one on the line labeled #3 above) that pushes the VM key of the first instruction in the virtualized function's body. 

Once it finds the x86 PUSH instruction on line #3, it inspects the previous two x86 instructions -- those on lines #1.2 and #2.2 -- to see if they are x86 POP instructions, corresponding to the restoration of the registers the VM entry sequences use for obfuscation.

If all checks pass, the script returns a 4-tuple containing:

  • The address of the branch to the VM (i.e., that of line #4 above) 
  • The VM key DWORD (from line #3, e.g. 0x5A6C26 in the above)
  • The two x86 register names from line #1.2 and #2.2. 

That code is shown below (excerpted from the complete script):

# Given an address that references the VM dispatcher, extract
# the VM instruction key pushed beforehand, and the names of
# the two registers used for junk obfuscation. Return a tuple
# with all information just described.
def ExtractKey(vmXref):
    currEa = vmXref
    Key = None
    # Walk backwards from the VM cross-reference up to a 
    # specified number of instructions.
        # Is the instruction "PUSH DWORD"?
        if idc.GetMnem(currEa) == "push" and idc.GetOpType(currEa, 0) == o_imm:
            # Extract the key, exit loop
            Key = idc.GetOperandValue(currEa, 0)
        # Otherwise, move backwards by one instruction
            currEa = idc.PrevHead(currEa, 0)
    # After looping, did we find a key?
    if Key is not None:
        # Extract the first operands of the two subsequent instructions
        prevEa1 = idc.PrevHead(currEa,  0)
        prevEa2 = idc.PrevHead(prevEa1, 0)
        # Were they pop instructions?
        if idc.GetMnem(prevEa1) == "pop" and idc.GetMnem(prevEa2) == "pop":
            # Return the information we just collected
            return (vmXref,Key,idc.GetOpnd(prevEa1,0),idc.GetOpnd(prevEa2,0))
        # If not, be noisy about the error
            print "%#lx: found key %#lx, but not POP reg32, inspect manually" % (xref, Key)
            return (vmXref,Key,"","")
    # Key not found, be noisy
        print "%#lx: couldn't locate key within %d instructions, inspect manually" % (xref,MAX_NUM_OBFUSCATED_JMP_INSTRUCTIONS)
        return None

There is a second function, ExtractKeys, which iterates over all cross-references to the VM entrypoint, and calls ExtractKey for each. It is not interesting, and so the code is not reproduced in this document, though it is in the included source code.

The results of ExtractKeys -- the tuples from above -- are then saved in a Python variable called locKey. 

locKey = ExtractKeys(0x00401950)

This script can go wrong in several ways. Some of the references to the VM entrypoint might not have been properly disassembled, and so those references won't be in the set returned by IDA's cross-reference functionality. Secondly, if the FinSpy VM authors modified their virtualized entrypoint obfuscation strategy, it might necessitate changes to the strategy of simply walking backwards. Also, I found out later that not every x86-to-VM entry sequence used junk obfuscation, so the pattern-matching looking for the two x86 POP instructions failed. But, we're getting ahead of ourselves; I found and fixed those issues later on.

6.2 Step #3: Correlating Virtualized Function Beginnings with VM Entrypoints

Now that we have information about each location that enters the VM, we need to correlate this information with the list of X86CALLOUT targets corresponding to virtualized functions. I.e., given an address that is targeted by an X86CALLOUT instruction, we want to determine which address will subsequently transfer control into the VM (i.e., one of the addresses inspected in the previous steps). To assist in explanation, an example of a VM entry sequence is shown.

.text:00408360     mov     edi, edi                ; Original function prologue #0
.text:00408362     push    ebp                     ; Original function prologue #0
.text:00408363     mov     ebp, esp                ; Original function prologue #0
.text:00408365     sub     esp, 320h               ; Original function prologue #0
.text:0040836B     push    ebx                     ; Original function prologue #0
.text:0040836C     push    edi                     ; Original function prologue #0
.text:0040836D     push    ebx                     ; Push obfuscation register #1.1
.text:0040836E     push    eax                     ; Push obfuscation register #2.1

; ... junk obfuscation not shown ... 

.text:004083FC     pop     eax                     ; Pop obfuscation register #2.2
.text:004083FD     pop     ebx                     ; Pop obfuscation register #1.2
.text:004083FE     push    5A6C26h                 ; Push VM instruction key -- #3
.text:00408403     push    eax                     ; Obfuscated JMP
.text:00408404     xor     eax, eax                ; Obfuscated JMP
.text:00408406     pop     eax                     ; Obfuscated JMP
.text:00408407     jz      GLOBAL__Dispatcher      ; Enter FinSpy VM #4

In the previous section, we extracted information about each location -- such as the address labeled #4 above -- that enters the VM. Namely, we extract the VM instruction key for the virtualized function body from the line labeled #3, and the names of the two registers used for obfuscation on the lines labeled #2.2 and #1.2.

Now, given the beginning of a virtualized function's VM entrypoint -- such as the first address above labeled #0 -- we want to know the address of the instruction that ultimately transfers control into the VM, the one labeled #4 in the example above. Once we know that, we can look up the address of that instruction within the information we've just collected, which will then tell us which two registers are used for the junk obfuscation (the ones popped on the lines labeled #2.2 and #1.2 -- namely, EAX and EBX). From there, we can scan forward in the prologue (the instructions labeled #0 above) until we find x86 PUSH instructions that save those two registers before the junk obfuscation begins (namely the x86 PUSH instructions on the lines labeled #1.1 and #2.1 above). Once we find those two registers being pushed in that order, we know that we've reached the end of the prologue. Therefore, every instruction leading up to that point is the virtualized function's prologue. We also know the VM instruction key for the first instruction in the virtualized function's body, the one labeled #3 in the sequence above.

Correlating virtualized function addresses with VM entry addresses is simple. Since there is no control flow in the junk obfuscation, given the address of a virtualized function's entrypoint, the address of the sequentially-next JZ instruction that enters the VM is the corresponding VM entry address.

Therefore, I chose to sort the VM entry information from the previous step -- called locKey -- by the address of the JZ instruction, and stored the sorted list in a Python variable called sLocKey:

sLocKey = sorted(locKey, key=lambda x: x[0])

And then I wrote a small function that, given the start address for a virtualized function, iterates through sLocKey and finds the next greater address that enters the VM, and returns the corresponding tuple of information about the VM entry sequence.

# Given:
# * ea, the address of a virtualized function
# Find the entry in sLocKey whose VM entry branch
# location is the closest one greater than ea
# Ouput:
# The tuple of information for the next VM entry
# instruction following ea
def LookupKey(ea):
    global sLocKey
    for i in xrange(len(sLocKey)):
        if sLocKey[i][0] < ea:
        return sLocKey[i]
    return sLocKey[i-1]

Now by passing any virtualized function start address into the LookupKey function above, we will obtain the VM key and junk obfuscation register information from the next-subsequent address after the start address that enters the VM.

Two functions collate the virtualized function addresses with the register data:

# Given the address of a virtualized function, look up the VM key and
# register information. Return a new tuple with the virtualized function
# start address in place of the address that jumps to the VM interpreter.
def CollateCalloutTarget(tgt):
    assocData = LookupKey(tgt)
    return (tgt, assocData[1], assocData[2], assocData[3])

# Apply the function above to all virtualized function addresses.
def CollateCalloutTargets(targets):
    return map(CollateCalloutTarget, targets)

The only thing that can go wrong here is if we haven't located all of the JZ instructions that enter the VM. If we don't, then the information returned will correspond to some other virtualized function -- which is a real problem that did happen in practice, and required some manual effort to fix. Those issues will be discussed later when they arise.

6.3 Step #4: Extract Prologues for Virtualized Functions

At this point, for every virtualized function's starting address, we have information about the registers used in the junk obfuscation, as well as the VM instruction key for the virtualized function body. All that's left to do is extract the x86 function's prologue.

This is a very simple affair using standard pattern-matching tricks, very similar to the techniques we used to extract the VM instruction key and junk obfuscation register names. We simply iterate through the instructions at the beginning of a function, looking for two x86 PUSH instructions in a row that push the two registers used in junk obfuscation. (I noticed that the first junk instruction after the two x86 PUSH instructions was "mov reg2, address", where "address" was within the binary. I added this as a sanity check.) The very simple code is shown below (excerpted from the complete script).

# Given:
# * CallOut, the target of a function call
# * Key, the VM Instruction key DWORD pushed
# * Reg1, the name of the first obfuscation register
# * Reg2, the name of the second obfuscation register
# Extract the prologue bytes and return them.
def ExtractPrologue(CallOut, Key, Reg1, Reg2):
    # Ensure we have two register names.
    if Reg1 == "" or Reg2 == "":
        return (Key, CallOut, [])
    currEa = CallOut
    # Walk forwards from the call target.
    for i in xrange(MAX_NUM_PROLOG_INSTRUCTIONS):
        # Look for PUSH of first obfuscation register.
        if idc.GetMnem(currEa) == "push" and idc.GetOpnd(currEa, 0) == Reg1:
            nextEa = idc.NextHead(currEa, currEa+16)

            # Look for PUSH of second obfuscation register.
            if idc.GetMnem(nextEa) == "push" and idc.GetOpnd(nextEa, 0) == Reg2:
                thirdEa = idc.NextHead(nextEa, nextEa+16)

                # Sanity check: first junk instruction is "mov reg2, address"
                if idc.GetMnem(thirdEa) == "mov" and idc.GetOpnd(thirdEa, 0) == Reg2:
                    destAddr = idc.GetOperandValue(thirdEa, 1)
                    # Was "address" legal?
                    if destAddr <= PROG_END and destAddr >= PROG_BEGIN:
                        return (Key, CallOut, [ idc.Byte(CallOut + j) for j in xrange(currEa-CallOut) ])
                    # If not, be noisy
                        print "# 0x%08lx/0x%08lx: found push %s / push %s, not followed by mov %s, address" % (CallOut, currEa, Reg1, Reg2, Reg2)

        # Move forward by one instruction
        currEa = idc.NextHead(currEa, currEa+16)

    # If we didn't find the two PUSH instructions within some
    # specified number, be noisy.
    print "# 0x%08lx: did not find push %s / push %s sequence within %d instructions" % (CallOut, Reg1, Reg2, MAX_NUM_PROLOG_INSTRUCTIONS)
    return (Key, CallOut, [])

Now we have all of the information we need to devirtualize X86CALLOUT instructions properly. One final function collates the information for virtualized function entrypoints with the information collected for jump instructions into the VM:

# Extract the prologues from all virtualized functions.
def GetPrologues(calloutTargets):
    return map(lambda data: ExtractPrologue(*data),CollateCalloutTargets(calloutTargets))

6.4 All Together

First, we run our script to extract the targets of the FinSpy VM X86CALLOUT instructions. This script uses the VM bytecode disassembler, which runs outside of IDA. Thus, we take the output of this script, and copy and paste it into the IDAPython script we developed above to extract function information.

Next, inside of the IDA database for the FinSpy sample, we run our scripts above. They first extract the VM entry instruction information, and then use the X86CALLOUT targets generated by the previous scripts to extract the function prologue and VM entry key for each virtualized function. We print that information out, and copy and paste it into the devirtualization program.

A portion of the output is shown below; the full data set can be seen in the complete Python file:

(0x5a4b3a, 0x406a02, [139, 255, 85, 139, 236, 81, 81, 83, 86, 87]),
(0x5a19e6, 0x40171c, [85, 139, 236]),
(0x5a7a1d, 0x408e07, [139, 255, 85, 139, 236, 81, 131, 101, 252, 0]),
(0x5a7314, 0x408810, [139, 255, 85, 139, 236, 81, 81]),
(0x5a8bf9, 0x409a11, [139, 255, 85, 139, 236, 83, 86, 87, 179, 254]),

Each tuple contains:

  1. The VM instruction key for the first instruction of the virtualized function's body
  2. The x86 address of the virtualized function
  3. A list of x86 machine code bytes for the virtualized function's prologue

6.5 Issues in Collecting Function Information

The goal of the script detailed in pieces above is to collect the VM instruction keys, junk obfuscation register names, and function prologues for each virtualized function. Ultimately we will use this information to devirtualize X86CALLOUT instructions and prepend the prologues to the devirtualized versions of the virtualized functions. The scripts above assist in substantially automating this process, but it still requires manual intervention if any of its assumptions are violated. We can detect these erroneous situations and manually edit the data. Below we discuss the failure cases for the scripts above.

(The scripts just created are the primary reason my approach is "semi-automated" and not "fully-automated".)

6.5.1 Issue #1: Not All Referenced Callout Targets Were Defined as Code

First, not every address extracted as the target of an X86CALLOUT instruction was defined as code in the IDA database for the FinSpy VM sample. This lead to two issues. First, the script for extracting the prologue for that location would fail. Second, since the subsequent reference to the VM entrypoint was consequently also not defined as code, that reference would not be included in the set of addresses generated by the script for extracting the key and register sequence. 

This situation was easy to identify. The script for extracting the prologue would give an error about not being able to find the PUSH instructions for the two obfuscation registers. I inspected the locations for which the prologue script failed, and if that location was not defined as code, I manually defined the locations as code and ran the scripts again. Thereafter, the scripts would properly process these locations automatically.

6.5.2 Issue #2: Not Every Virtualized Function Used the Same VM Entry Schema

Second, the FinSpy VM did not always generate identical schema for x86-to-VM entrypoints at the sites of virtualized functions. Since the FinSpy virtualization obfuscator tool overwrites the original function with a VM entry sequence, small functions might not provide enough space for all parts of the obfuscation sequences. Either there would be no junk obfuscation sequence after the prologue, or the JMP to the VM interpreter would not be obfuscated as an XOR REG, REG / JZ sequence. Also, small functions might not have prologues. 

The script to extract virtualized function VM key / obfuscation register information would fail for these functions. Two errors are shown, as well as the corresponding code at the erroneous locations.

0x408412: found key 0x5a6d38, but not POP reg32, inspect manually
0x408d17: found key 0x5a7952, but not POP reg32, inspect manually

.text:0040840D         push    5A6D38h
.text:00408412         jmp     GLOBAL__Dispatcher

.text:00408D0C         mov     edi, edi
.text:00408D0E         push    5A7952h
.text:00408D13         push    ecx
.text:00408D14         xor     ecx, ecx
.text:00408D16         pop     ecx
.text:00408D17         jz      GLOBAL__Dispatcher

Our ultimate goal with this script is to collect key and prologue information for all virtualized functions. Automation has failed us, because our assumption that each x86-to-VM entrypoint had the same format was wrong. Thus, for these functions, I manually collected the information for these functions. In particular, I added eight manually-created entries to the list generated by the scripts; the first two shown below correspond to the examples above. (For the second entry, the prologue bytes [0x8B, 0xFF] correspond to the "mov edi, edi" instruction on line 0x408D0C above.)


6.5.3 Issue #3: Not All Callout Targets Were Virtualized Functions

Third, not all of the targets of X86CALLOUT instructions were virtualized -- ten addresses used as X86CALLOUT targets were not virtualized. This included a handful of C runtime functions for string manipulation and exception handling support, but also a few important functions used by the virtualized program. These functions were easy to identify since the prologue extraction script would fail for them. The script issued ten errors, the first three of which are shown:

# 0x0040a07c: did not find push ebp / push edx sequence within 20 instructions
# 0x0040ae80: did not find push ebp / push edx sequence within 20 instructions
# 0x0040709e: did not find push ecx / push edx sequence within 20 instructions

I inspected these addresses manually, and upon determining that the functions weren't virtualized, I created a Python set called NOT_VIRTUALIZED containing the addresses of these functions. 


Then, in my second devirtualization attempt in Part #3, Phase #4, I used this set to determine whether or not to emit an x86 CALL instruction directly to the referenced target.

7. Conclusion

In Phase #3, we examined FinSpy's mechanism for virtualizing functions and function calls. We determined that we would need several pieces of information to devirtualize these elements correctly, and then wrote scripts to collect the information. We attended to issues that arose in running the scripts, and ended up with the data we needed for devirtualization.

In the next and final phase, Part #3, Phase #4, we will incorporate this information into devirtualization. After fixing a few remaining issues, we will have successfully devirtualized our FinSpy VM sample.

FinSpy VM Unpacking Tutorial Part 3: Devirtualization. Phase #2: First Attempt at Devirtualization

[Note: if you've been linked here without context, the introduction to Part #3 describing its four phases can be found here.]

1. Introduction

In the previous Part #3, Phase #1, we inspected our FinSpy VM bytecode disassembly listing and discovered that the Group #2 instructions were used for obfuscation. After discovering obfuscation patterns and replacing them with simpler sequences, we obtained a simpler FinSpy VM bytecode program without any Group #2 instructions. After some further small improvements to the FinSpy VM bytecode disassembly process, we resume with a FinSpy VM bytecode listing that is suitable for devirtualization.

This Phase #2 discusses our first attempt at devirtualization. In fact, given the work done heretofore, the devirtualization code comes to about 100 lines of thoroughly-commented Python code. We then inspect our devirtualization, and discover several remaining issues that require correction before our task is complete. This phase fixes one of them, and the subsequent Part #3, Phase #3 examines the genesis of the remaining issues before fixing them. Part #3, Phase #4 makes a second (and eventually successful) approach toward devirtualizing FinSpy VM bytecode programs.

2. Whole-Program Devirtualization, First Attempt

After the previous Part #3, Phase #1 substitutions, our remaining FinSpy VM bytecode contains only Group #3 VM instructions (containing raw x86 machine code) and Group #1 VM instructions (all 16 varieties of x86 conditional branch instructions, and the now-correctly-identified unconditional JMP VM instructions). The following snippet is fairly representative of our FinSpy VM bytecode program at present. It contains only "Raw X86" instructions, "X86CALLOUT", "X86JUMPOUT", and conditional/unconditional branch instructions -- the only categories of remaining VM instructions after the previous phase. It basically looks like x86 already, except the control flow instructions are FinSpy VM instructions instead of x86 instructions. 

0x008e50: X86 mov ebx, edi
0x008e80: JMP VM[0x008aa8]
0x008e98: X86 push 2E50340Bh
0x008ec8: X86 push dword ptr [420358h]
0x008f28: X86CALLOUT 0x408360
0x008f40: X86 test eax, eax
0x008f58: JZ VM[0x009108] (fallthrough VM[0x008f70])
0x008f70: X86 lea ecx, dword ptr [ebp+0FFFFFFFCh]
0x008fd0: X86 push ecx
0x009000: X86 push 0FFFFFFFFh
0x009030: X86JUMPOUT jmp eax
0x009048: X86 test eax, eax
0x009060: JZ VM[0x009108] (fallthrough VM[0x009078])
0x009078: X86 cmp dword ptr [ebp+0FFFFFFFCh], 0h
0x009090: JZ VM[0x009108] (fallthrough VM[0x0090a8])
0x0090a8: X86 xor eax, eax
0x0090c0: X86 inc eax
0x0090d8: X86 leave 
0x0090f0: X86 ret 
0x009108: X86 xor eax, eax
0x009120: X86 leave 
0x009138: X86 ret 

2.1 Instruction-By-Instruction Devirtualization

At this point -- very quickly and easily into the analysis of the FinSpy VM bytecode program -- I felt comfortable writing a tool to reconstruct an x86 machine code program from the VM bytecode program. (Perhaps I was too comfortable, considering that a lot of work remained after my first attempt.) All of the remaining instructions seemed amenable to one-by-one translation back into x86. Namely, I wrote a simple loop to iterate over each VM instruction and create an x86 machine code array for the devirtualized program. At this point, there are four cases for the remaining VM instructions: the three varieties of Group #3 instructions, and the Group #1 branch instructions. We show simplified code for the devirtualizer, and then discuss the cases for the instruction types in more detail.

# Given: insns, a list of FinSpy VM instructions
# Generate and return an array of x86 machine code bytes for the VM program
def RebuildX86(insns):
    # Array of x86 machine code, built instruction-by-instruction
    mcArr  = []
    # Bookkeeping: which VM position/key corresponds to which
    # position within the x86 machine code array mcArr above
    locsDict = dict()
    keysDict = dict()
    # List of fixups for branch instructions
    locFixups = []
    # Iterate through the FinSpy VM instructions
    for i in insns:
        # currLen is the current position within mcArr
        currLen = len(mcArr)
        # Bookkeeping: memorize the x86 position for the 
        # VM instruction's VM position and VM key 
        locsDict[i.Pos] = currLen
        keysDict[i.Key] = currLen

        # Is it "Raw X86" or "X86JUMPOUT"? Just emit the
        # raw x86 machine code if so
        if isinstance(i,RawX86StraightLine):
        elif isinstance(i,RawX86Jumpout):

        # Is it a branch instruction?
        elif isinstance(i, ConditionalBranch):

            # Get the name of the branch
            jccName = INSN_NAME_DICT[i.Opcode]

            # Is this an unconditional jump?
            if jccName == "JMP":

                # Emit 0xE9 (x86 JMP disp32)

                # Opcode is 1 byte
                dispPos = 1

            # Otherwise, it's a conditional jump
                # Conditional jumps begin with 0x0F
                # Second byte is specific to the condition code

                # Opcode is 2 bytes
                dispPos = 2

            # Emit the displacement DWORD (0 for now)
            mcArr.extend([0x00, 0x00, 0x00, 0x00])

            # Emit a fixup: the JMP displacement targets
            # the VM location specified by i.VMTarget

        # Is it X86CALLOUT?
        elif isinstance(i,RawX86Callout):
            # We aren't handling this just yet.
            # Emit E8 00 00 00 00 (CALL $+5)
            # Revisit later
            mcArr.extend([0x00, 0x00, 0x00, 0x00])

Now we discuss the cases above:

  • The instruction is Group #3 "Raw X86", which already contains raw x86 machine code in an array inside of the VM instruction object. In this case, we simply append the raw x86 machine code to the array.
  • The instruction is Group #3 "Call Indirect". As this instruction also contains raw x86 machine code, the initial idea and approach was to simply append it to the array identically to the previous case. (I later discovered that this case required more care, though it ended up not being very difficult.)
  • The instruction is a Group #1 conditional or unconditional jump. We discuss these in a subsequent subsection.
  • The instruction is "Call Direct" (symbolized as X86CALLOUT in the FinSpy VM disassembly listing). We discuss these later in this subsection.

2.2 Bookkeeping Cross-References

While translating the instructions as described above, we update two dictionaries to keep track of the position of each VM instruction within the x86 machine code array. Recall from the introduction in Part #3, Phase #1 that each VM instruction has two uniquely identifying characteristics: 1) its offset within the VM bytecode array, a multiple of 0x18 (the length of a VM instruction), and 2) a "key" DWORD used to locate instructions. The two dictionaries, called locsDict and keysDict respectively, map each instruction's position, and each instruction's key, to the offset within the x86 machine code array where the devirtualized instruction resides. Reproducing the pertinent snippets from the simplified code above:

# Given: insns, a list of FinSpy VM instructions
# Generate and return an array of x86 machine code bytes for the VM program
def RebuildX86(insns):
    # Array of x86 machine code, built instruction-by-instruction
    mcArr  = []
    # Bookkeeping: which VM location/key corresponds to which
    # position within the x86 machine code array mcArr above
    locsDict = dict()
    keysDict = dict()
    # Iterate through the instructions
    for i in insns:
        # currLen is the current position in mcArr
        currLen = len(mcArr)
        # Bookkeeping: memorize the x86 position for the 
        # VM instruction's VM position and VM key 
        locsDict[i.Pos] = currLen
        keysDict[i.Key] = currLen
        # ... code to devirtualize was shown above ... 

For example, the first VM instruction (the one at position 0x0 within the VM instruction array) is emitted at position 0 within the x86 machine code blob. Thus, we update the dictionary locsDict with the binding locsDict[0x0] = 0x0 (i.e., VM bytecode position 0x0 -> x86 machine code position 0x0). The first instruction has key 0x5A145D, so we update the dictionary keysDict with the binding: keysDict[0x5A145D] = 0x0.

The second VM instruction shall be devirtualized after the first instruction. The devirtualized first instruction was three bytes in length, thus the second VM instruction begins at position 3 within the x86 machine code array. Since the second VM bytecode instruction corresponds to position 0x78 within the VM bytecode array (after the simplifications from the previous Part #3, Phase #1), we update our dictionary with the information: locsDict[0x78] = 3 (i.e., VM bytecode position 0x78 -> x86 machine code instruction position 0x3). The second instruction has key 0x5A1461, so we update the dictionary keysDict with the binding: keysDict[0x5A1461] = 0x3. 

This process continues sequentially for every instruction that we translate. We start a counter at 0x0, add the length of the devirtualized x86 machine code after each step, and at each iteration we associate the current value of the counter with the instruction's key and position before generating x86 machine code.

2.3 De-Virtualizing Branch Instructions

We have deferred discussing conditional branches; now we shall take the subject up again. 

x86 branch instructions are encoded based on the distance between the address of the branch instruction and the address of the target. The addresses themselves of the source and destination locations aren't important, only the difference between them. The opcode of each of the 16 x86 conditional branch types is 0F 8x for some value of x; the opcode for an unconditional branch is E9. To encode a jump targeting the address 0x1234 bytes after the JMP instruction, this would be encoded as E9 34 12 00 00. If we needed to encode an x86 "JZ" instruction, whose destination was 0x1234 bytes after the JZ instruction, the encoding would be 0F 84 34 12 00 00.

Since we are devirtualizing VM instructions one-by-one in forward order, if the branch target is in the reverse direction, that means we've already devirtualized the destination, and hence we could immediately write the proper displacement. However, if the branch target is in the forwards direction, that means the destination lies at an address of a VM instruction that we haven't translated yet, and so we don't know the address of the destination, and so we can't immediately generate the x86 branch instruction. This conundrum -- common in the world of linkers -- necessitates a two-phase approach.

  1. Phase #1: During devirtualization, the first thing to do is to emit the opcode for the jump (recalling the previous examples, e.g., E9 for an unconditional JMP, or 0F 84 for a conditional JZ). This is easy; we just look up the proper opcode for a given jump type within a dictionary that maps VM branch instruction types to x86 opcodes. After the x86 opcode for the branch type, we write a displacement DWORD of 0x0. Crucially, we also generate a "fixup": we add an entry to a list that contains both the position of the displacement DWORD within the x86 machine code array, as well as the VM bytecode instruction EIP to which the jump must eventually point. These fixups are processed by phase two. (Note that the code for phase #1 was shown in the previous section.)
  2. Phase #2: After devirtualizing the entire FinSpy VM bytecode program, the "locsDict" dictionary described in the previous section on bookkeeping now has information about where each VM instruction lies within the devirtualized x86 machine code array. We also have a list of fixups, telling us which locations in the x86 machine code program correspond to the dummy 0x0 DWORDs within the branch instructions that we generated in the previous phase, which now need to be fixed up to point to the correct locations. Each such fixup also tells us the position of the VM bytecode instruction (within the VM bytecode array) to which the branch should point. Thus, we simply look up the destination's devirtualized x86 machine code position within locsDict, and compute the difference between the position after the jump displacement DWORD and the position of the destination. We replace the dummy 0x0 DWORD with the correct value. Voila, the branches now have the correct destinations.

Here is the code for phase #2, which executes after the main loop in "RebuildX86" has emitted the devirtualized x86 machine code for all VM instructions:

# Fixups contain:
# * srcBegin: beginning of devirtualized branch instruction
# * srcFixup: distance into devirtualized branch instruction
#             where displacement DWORD is located
# * dst:      the position within the VM program where the
#             branch destination is located
for srcBegin, srcFixup, dst in locFixups:
    # Find the machine code address for the source
    mcSrc = locsDict[srcBegin]
    # Find the machine code address for te destination
    mcDst = locsDict[dst]
    # Set the displacement DWORD within x86 branch instruction
    StoreDword(mcArr, mcSrc+srcFixup, mcDst-(mcSrc+srcFixup+4))

(Pedantic optional note: x86 also offers short forms of the conditional and unconditional branch instructions: if the displacement of the destination fits within a single signed byte, there are shorter instructions that can be used to encode the branch. I.e., EB 50 is the encoding for unconditionally jumping to 50 bytes after the source instruction, and 75 F0 performs a JNZ to 0xFFFFFFF0 bytes after the source instruction. Because using the short forms requires a more sophisticated analysis, I chose to ignore the short forms of the branch instructions (and use only the long forms) when performing devirtualization. The only side effect of this is that some branch instructions in the devirtualized program are longer than necessary -- but in terms of their semantic effects, the long and short branches function identically. If you wanted to be less lazy about this, and emit small jumps when applicable, you should read this paper or the source code to an x86 assembler.)

2.4 Devirtualizing X86CALLOUT Instructions, Take One

The last variety of VM instruction that we need to handle is the X86CALLOUT instruction. These instructions specify an RVA within the .text section at which the targeted x86 function begins. x86 CALL instructions are encoded identically to x86 branch instructions -- following the x86 CALL opcode E8, there is a DWORD specifying the distance from the end of the x86 CALL instruction to the target address. Thus, like we just discussed for devirtualizing branch instructions, we need to know the source and destination addresses for the CALL to generate the proper displacement for the x86 CALL instruction. 

The targets of the virtualized branch instructions are specified as locations within the VM program. I.e., none of the jumps exit the VM. Thus, our task in fixing the branch instructions was simply to locate the distance between the branch instruction in the devirtualization and its target. Since x86 branch instructions are relatively-addressed, we do not need to take into account where within the original binary we will eventually reinsert the devirtualized code -- just the distance between the source and target address is enough information.

On the other hand, the targets of all X86CALLOUT instructions are outside of the VM. Like with jumps, we'll need to know the distance between the source and target addresses. Unlike with jumps, for the devirtualized CALL instructions, we do need to know where within the original binary our devirtualized code shall be located in order to compute the CALL displacement DWORD correctly.

Since I hadn't decided where I was going to store the devirtualized code, and yet I wanted to see if my devirtualization process was otherwise working, I decided to postpone resolving this issue. I decided for the time being to just generate dummy x86 instructions to devirtualize X86CALLOUT instructions. I.e., for each "Call Direct" VM instruction, at this stage, I chose to emit E8 00 00 00 00 (x86 CALL $+5) for its machine code. Clearly this is an incorrect translation -- to reiterate, this approach is just a placeholder for now -- and the devirtualized program will not work at this stage (and we won't see proper call destinations in the devirtualized program), but doing this allowed me to proceed without having to decide immediately where to put the devirtualized code. 

(Note that when it came time to fix this issue, I actually discovered a second and much more severe set of issues with devirtualizing X86CALLOUT instructions, which ended up consuming roughly half of my total time spent in the devirtualization phase. We will return to those issues when they arise naturally.)

3. Moment of Truth, and Stock-Taking

At last, we have our first devirtualized version of the FinSpy VM bytecode program for this sample. If you would like to see the results for yourself, you can load into IDA the binary for the first devirtualization. Now for the real test: did it work? How close to being done are we?

I took the output .bin file and loaded it into IDA. It looked pretty good! Most of it looks like something that was generated by a compiler. After the thrill of initial success wore off, I began looking for mistakes, things that hadn't been properly translated, or opportunities to improve the output. After this investigation, I'll go back to the drawing board and see what I can do about fixing them. Then I'll do the same thing again: look at the second iteration of the output, find more issues, fix them, and repeat. Hopefully, this process will eventually terminate. (It did.)

3.1 Problem #1: Call Targets are Missing

This problem was anticipated. As discussed in the last section, since I wasn't sure just yet how to handle the X86CALLOUT targets, I deliberately emitted broken x86 CALL instructions (namely E8 00 00 00 00, CALL $+5), with the plan to fix them later. Not surprisingly, those instructions were broken in the devirtualized output:

seg000:000005D6 BE 08 02 00 00        mov     esi, 208h
seg000:000005DB 56                    push    esi
seg000:000005DC 8D 85 C0 FB FF FF     lea     eax, [ebp-440h]
seg000:000005E2 57                    push    edi
seg000:000005E3 50                    push    eax
seg000:000005E4 E8 00 00 00 00        call    $+5 ; <- call target missing
seg000:000005E9 56                    push    esi
seg000:000005EA 8D 85 C8 FD FF FF     lea     eax, [ebp-238h]
seg000:000005F0 57                    push    edi
seg000:000005F1 50                    push    eax
seg000:000005F2 E8 00 00 00 00        call    $+5 ; <- call target missing
seg000:000005F7 56                    push    esi
seg000:000005F8 8D 85 A8 F5 FF FF     lea     eax, [ebp-0A58h]
seg000:000005FE 57                    push    edi
seg000:000005FF 50                    push    eax
seg000:00000600 E8 00 00 00 00        call    $+5 ; <- call target missing
seg000:00000605 83 C4 24              add     esp, 24h

Therefore I don't know where any of the calls are going, and so there are no function call cross-references. Obviously we will need to stop kicking the can down the road at some point and properly attend to this issue.

3.2 Observation #2: What are These Indirect Jump Instructions?

Another thing I noticed was an odd juxtaposition of indirect jump instructions where the surrounding context indicated that an indirect call would have been more sensible. For example:

seg000:00000B51     push    dword ptr [ebp-0Ch]
seg000:00000B54     mov     eax, ds:41FF2Ch
seg000:00000B59     jmp     dword ptr [eax+14h]
seg000:00000B5C     mov     eax, ds:41FF38h
seg000:00000B61     push    esi
seg000:00000B62     jmp     dword ptr [eax+90h]
seg000:00000B68     mov     eax, ds:41FF38h
seg000:00000B6D     jmp     dword ptr [eax+74h]

We see arguments being pushed on the stack, as though in preparation for a call. Then, instead of a call, we see a jump. Then, the assembly language indicates that another call should be taking place, but instead there's another indirect jump. There are no intervening cross-references jumping to the locations after the indirect jumps. What's up with that?

3.3 Problem #3: None of the Functions Have Prologues

Although a lot of the functions looked exactly like x86 machine code emitted by Microsoft Visual Studio to me, something was off. IDA tipped me off to the problem, since every function had a red message indicating a problem with the function's stack frame. Here is a screenshot of IDA, showing the colored message (and with the stack pointer shown for illustration):


Most of the code looks good, but the "leave" instruction is supposed to pop the EBP register off the stack -- and yet the prologue does not push the EBP register onto the stack. Also, the first instruction makes reference to "[EBP+8]", which assumes that EBP has previously been established as the frame pointer for this function -- and yet, being the first instruction in the function, clearly the devirtualized code has not yet established EBP as the frame pointer. Which memory location is actually being accessed by this instruction?

In other functions, we see the epilogues popping registers off the stack -- but inspecting the beginnings of those functions, we see that those instructions were not previously pushed onto the stack. Here's an example from the beginning of a function:

seg000:000004FB     mov     ebx, [ebp+8]     ; Overwrite EBX
seg000:000004FE     mov     edi, [ebx+3Ch]   ; Overwrite EDI
seg000:00000501     add     edi, ebx
seg000:00000503     mov     eax, [edi+0A0h]
seg000:00000509     test    eax, eax         ; Early return check
seg000:0000050B     jz      return_ebx       ; JZ taken => fail, return

; ... more function code ...

seg000:000005A3 return_ebx:
seg000:000005A3     pop     edi              ; Restore EDI (NOT PREVIOUSLY SAVED)
seg000:000005A4     mov     eax, ebx
seg000:000005A6     pop     ebx              ; Restore EBX (NOT PREVIOUSLY SAVED)
seg000:000005A7     leave                    ; Restore EBP (NOT PREVIOUSLY SAVED)
seg000:000005A8     retn    8                ; Return

We're going to need to know why these functions seemingly are missing instructions at their beginnings.

I can see why IDA is complaining about these functions -- despite being mostly coherent x86 implementations of C functions, they are clearly slightly broken. But how are they broken, and how can I fix them?

At this point I remembered something about the original binary. Back in part one, we analyzed a few VM entrypoints, and noticed that they began with normal-looking function prologues, before a segue into gibberish instructions, before finally pushing a VM key on the stack and transferring control to the VM entrypoint. To wit, here is the x86 code corresponding to the virtualized function from the screenshot above:

.text:00401340     push    ebp                     ; Ordinary prologue
.text:00401341     mov     ebp, esp                ; Ordinary prologue
.text:00401343     push    esi                     ; Save obfuscation register #1
.text:00401344     push    edx                     ; Save obfuscation register #2
.text:00401345     mov     edx, offset word_403DFE ; Junk obfuscation
.text:0040134A     xor     esi, esp                ; Junk obfuscation
.text:0040134C     shl     esi, cl                 ; Junk obfuscation
.text:0040134E     shr     esi, 1                  ; Junk obfuscation
.text:00401350     shl     esi, cl                 ; Junk obfuscation
.text:00401352     pop     edx                     ; Restore obfuscation register #2
.text:00401353     pop     esi                     ; Restore obfuscation register #1
.text:00401354     push    5A145Dh                 ; Push VM instruction key
.text:00401359     push    edx                     ; Obfuscated JMP
.text:0040135A     xor     edx, edx                ; Obfuscated JMP
.text:0040135C     pop     edx                     ; Obfuscated JMP
.text:0040135D     jz      GLOBAL__Dispatcher      ; Enter VM

That's where the missing function prologues went -- they are still at the locations where at which original functions resided within the binary. The prologues have not been virtualized. Upon calling an x86 function that has been virtualized, the function prologue will execute in x86 as usual, before the virtualized body of the function is executed within the VM. So in order to properly devirtualize the functions within the VM bytecode, we will need to extract their prologues from the x86, and during devirtualization, prepend those prologue bytes before the devirtualization of the function bodies. 

4. Fixing the Indirect Jumps

Now that we've identified a few issues with the devirtualized code, our next task is to fix them. We'll start with the lowest-hanging of the fruit, the weird indirect jumps, reproduced here for coherence:

seg000:00000B51     push    dword ptr [ebp-0Ch]
seg000:00000B54     mov     eax, ds:41FF2Ch
seg000:00000B59     jmp     dword ptr [eax+14h]
seg000:00000B5C     mov     eax, ds:41FF38h
seg000:00000B61     push    esi
seg000:00000B62     jmp     dword ptr [eax+90h]
seg000:00000B68     mov     eax, ds:41FF38h
seg000:00000B6D     jmp     dword ptr [eax+74h]

The x86 JMP instructions were copied verbatim from machine code stored within X86JUMPOUT instructions. So, in a sense, the JMP instructions are "correct". Nevertheless, this disassembly listing looks wrong. The indirect jump does not push a return address, so the code at the destination doesn't know where to resume executing. Consequently, the instruction following one of the indirect jumps will never execute unless its address is referenced somewhere else in the devirtualized code -- but that didn't seem to be the case, and it would have been weird if it was the case, since none of the situations in which a compiler would have emitted an address contained within the body of a function (a switch case, or an exception handler) seemed to apply. Another possibility would have been tail calls to function pointers, but the context of the indirect jumps did not indicate that a tail call was about to take place.

It struck me that these indirect jumps probably ought to be indirect calls instead. To investigate, I looked again at the disassembly of the VM instruction handler for the "Indirect Call" instructions. The "Indirect Call" FinSpy VM instruction is one of the ones that generates code dynamically while executing. Reproduced from part two, here's the code that the FinSpy VM generates dynamically when interpreting an "Indirect Call" instruction:

popf                          ; Restore flags from VMContext
popa                          ; Restore registers from VMContext
push offset @RESUME           ; PUSH RETURN ADDRESS (@RESUME)
[X86 machine code for indirect jump, copied from VM instruction]
@RESUME:                      ; <- RETURN LOCATION
push offset NextVMInstruction ; Resume at next VM insn
pusha                         ; Save registers
pushf                         ; Save flags
push offset VMContext
pop ebx                       ; EBX := VMContext *
push offset fpVMReEntryReturn
ret                           ; Re-enter VM

So indeed, before executing the indirect jump instruction from the x86 machine code stored within the instruction, the FinSpy VM "Indirect Call" handler pushes a return address on the stack. The return address points to the dynamically-generated code at the @RESUME label in the snippet above, which then continues execution at the next VM instruction following the "Indirect Call" instruction. That's why the devirtualized code isn't pushing a return address -- the VM takes care of that.

I expect that the FinSpy VM authors have code to translate x86 indirect call instructions into x86 indirect jump instructions. Thus I will need to do the reverse process: I will need to take the raw machine code contained in the "Indirect Call" VM instructions, and convert the x86 indirect jump instructions into call instructions instead. I wrote the following function, and added a call to it in the constructor for the Python object representing an "Indirect Call" VM instruction:

# This function disassembles the raw machine code for indirect jump instructions,
# changes the instructions therein to indirect call instructions, re-assembles
# them, and returns the new machine code with its textual disassembly.
# Input: bytes, an array of machine code for an indirect jump.
# Output: a tuple (string: disassembly, bytes: machine code for indirect call)

def ChangeJumpToCall(bytes):

    # Decode the x86 machine code
    i2container = X86Decoder(StreamObj(bytes)).Decode(0)

    # Fetch the instruction from the decoded bundle
    insn = i2container.instr

    # Ensure it's an indirect jump!
    assert(insn.mnem == XM.Jmp)

    # Change the mnemonic to call
    insn.mnem = XM.Call

    # Return new textual disassembly and machine code
    return (str(insn),EncodeInstruction(insn))

In reality, using my disassembler library was overkill here. There are only a few ways to encode indirect calls and indirect jumps in x86 machine code; simple pattern-replacement could have identified and re-written them very easily. But despite being overkill, there are no technical drawbacks with the solution based on rewriting and reassembling the instructions.

The devirtualized binary after the above changes can be found here.

5. Conclusion

We began this Phase #2 with a deobfuscated FinSpy VM bytecode disassembly listing. From there, we formulated a strategy to devirtualize the program instruction-by-instruction. This was mostly straightforward, except we deliberately did not devirtualize the X86CALLOUT instructions from Group #3. After devirtualization, we discovered several remaining issues, all relating to functions and function calls. This phase ended by examining and fixing one of those issues.

In the next Part #3, Phase #3, we will examine the source of the function-related issues in our current devirtualization. In so doing, we will learn that we need to obtain some specific information about the X86CALLOUT VM instructions and virtualized functions that we shall need to obtain and incorporate into our devirtualization process. Part #3, Phase #3 will then write scripts to collect this information. The final Part #3, Phase #4 will then incorporate this information into the devirtualization process, and then discover and correct a few remaining issues before producing the final devirtualized listing for the FinSpy VM bytecode in our running example.

FinSpy VM Unpacking Tutorial Part 3: Devirtualization. Phase #1: Deobfuscating FinSpy VM Bytecode Programs

[Note: if you've been linked here without context, the introduction to Part #3 describing its four phases can be found here.]

1. Introduction

In part one of this series, we analyzed the obfuscation on the x86 implementation of the FinSpy VM, and wrote a tool to deobfuscate it to allow easier analysis. In the second part of this series, we analyzed the VM instruction set, wrote a disassembler that works for the running example, and obtained the VM bytecode listing. Now we are left with devirtualization: we want to regenerate a facsimile of the original x86 program prior to virtualization.

This task winds up being fairly lengthy, so we have divided the approach into four phases, which corresponds to the work I did in devirtualizing FinSpy in the same order in which I did it. (That being said, don't mistake "lengthy" for "difficult".) We begin Phase #1 by inspecting the FinSpy VM bytecode program, discovering obfuscation involving the "Group #2" instructions, and removing it via pattern-substitution.

2. Devirtualizing FinSpy: Initial Observations and Strategy

At this point in the process, I had my first listing of the VM bytecode program in disassembled form. I began by inspecting the VM bytecode program and identifying ways to simplify the bytecode. This ultimately resulted in the wholesale elimination of the Group #2 instructions.

You may wish to inspect the initial VM bytecode disassembly, which we shall progressively refine throughout this Phase #1. Along the way, after discovering and applying simplifications, we will gradually obtain new VM bytecode disassembly listings, which will be linked throughout.

2.1 Recap of FinSpy VM Instruction Set and Bytecode Programs

To summarize part two, the FinSpy VM uses a fixed-length instruction encoding, with each VM instruction represented by a structure that is 0x18 bytes in length. 

Each instruction within a FinSpy VM program has two uniquely identifying characteristics associated with it. For one, we can refer to the raw position of the VM instruction within the array of VM bytecode instructions. So for example, the first instruction is located at position 0x0. Since each instruction is 0x18 bytes in length, the second instruction is located at position 0x18, and generally instruction #N is located at position 0x18*N within the bytecode array.

The second identifying characteristic for an instruction is its "key", a 32-bit value (the first DWORD in an encoded FinSpy VM instruction) that can be used to locate a particular VM instruction. Specifically, before entering the VM, the x86 code that executes before transferring control to the FinSpy VM interpreter first pushes a key DWORD onto the stack. Upon entering the VM, the FinSpy VM initalization code loads the key DWORD pushed by the x86 portion, searches through the VM bytecode array to locate the VM instruction with that key, and then begins interpreting VM instructions starting from that location. (It turns out that FinSpy VM instruction keys cause some complications in devirtualization, as we shall see later.)

Most FinSpy VM instructions are assumed to execute sequentially. That is, after one VM instruction completes, the x86 code for the instruction handler adds 0x18 to the current VM EIP, to advance it to the next VM instruction. VM control flow instructions may behave differently, e.g., the conditional branch instructions are relatively-addressed like their x86 counterparts (if the branch is taken, the VM instruction adds a displacement, a multiple of 0x18, to VM EIP; if the branch is not taken, execution continues at the VM instruction 0x18 bytes subsequent to the current instruction). The FinSpy VM also has an instruction for direct calls, and another one for indirect calls. These call instructions behave like you'd expect; they push a return address onto the stack to re-enter the VM after the called function returns.

The FinSpy VM's instruction set consists of three instruction groups:

  • Group #1: Conditional and unconditional jumps, (namely, JMP and all 16 of x86's conditional branches such as JZ, JNS, and JP).
  • Group #2: VM instructions that access the FinSpy VM's single dedicated register.
  • Group #3: VM instructions that contain raw x86 instructions inside of them.

2.2 Preliminary Thoughts on Devirtualization

After analyzing the VM's instruction set and perusing the VM bytecode program, it seemed like the group #3 VM instructions -- those with embedded x86 machine code blobs -- would be easy to convert back into x86. There are three VM instructions in this group: "Raw X86", "Call Direct", and "Call Indirect". (In practice, the latter two VM instructions wound up presenting more complications than anticipated, with direct calls being the most difficult aspect of the VM to devirtualize.)

Group #1, the conditional and unconditional branch instructions, seemed similarly easy to translate back to x86, since their implementations are virtually identical to how x86 conditional branches operate internally. Since conditional branches use relative displacements to determine the address of the "taken" branch, the only challenge is to determine the relative locations of each devirtualized instruction. Thus, once we know exactly how far the branch instruction is from its destination, we can simply compute the displacement. (Indeed, in practice, this was easy.)

Group #2, the set of VM instructions that access the FinSpy VM's single dedicated register, was the wild-card. Although from analyzing the FinSpy VM harness I knew the raw functionality of these instructions, I did not know before analyzing the VM program how these instructions would be used, and nor how to devirtualize them. For ease of reference, those instructions are summarized below from their more detailed expositions in part two:

mov scratch, 0                      [Operand: none]
mov scratch, imm32                  [Operand: imm32]
shl scratch, imm32                  [Operand: imm32]
add scratch, imm32                  [Operand: imm32]
mov scratch, savedReg32             [Operand: index of savedReg32]
add scratch, savedReg32             [Operand: index of savedReg32]
mov savedReg32, scratch             [Operand: index of savedReg32]
mov dword ptr [scratch], savedReg32 [Operand: index of savedReg32]
mov scratch, dword ptr [scratch]    [Operand: none]
mov dword ptr [scratch], imm32      [Operand: imm32]
mov dword ptr [imm32], scratch      [Operand: imm32]
push scratch                        [Operand: none]

Given that the other two groups seem straightforward, and the heretofore-unknown nature of Group #2, this seems like a good first task to devirtualize our FinSpy VM program: analyze the usage patterns of group #2 instructions, and formulate a strategy for converting them back to x86 machine code. (This turned out to be very easy and intuitive in practice, with no real complications to speak of.)

3. Step #1: Tweak the Output to Print the x86 Disassembly

Upon perusing the output of my FinSpy VM disassembler, my first observation was that my task would benefit from knowing the textual disassembly for the "Raw x86" instructions, which were being rendered as raw machine code bytes. For example, the first few instructions in the VM bytecode disassembly were:

0x000000: MOV SCRATCH, 0
0x000018: ADD SCRATCH, EBP
0x000030: ADD SCRATCH, 0x000008
0x000060: MOV EAX, SCRATCH
0x000078: X86 [199, 0, 215, 1, 0, 0] ; <- x86 machine code
0x000090: X86 [184, 0, 240, 65, 0]   ; <- would prefer x86 assembly
0x0000a8: X86 [201]
0x0000c0: X86 [194, 4, 0]

My proximate goal was to figure out how the Group #2 instructions (those involving the SCRATCH register, the first five in the listing above) were used within the VM program. For context, it would be nice to see what the X86 instructions on the four subsequent lines were doing. I.e., it would be convenient to have a textual representation for the x86 machine code, instead of the above display as lists of bytes. 

For this task, I was fortunate to have already written an x86 disassembler and assembler library in Python. (I have previously released this code to the public as part of the course sample for my SMT-Based Program Analysis training class.) For the two group #3 VM instructions with embedded x86 machine code -- "Raw X86" and "Call Indirect" -- I modified my FinSpy VM disassembler to invoke the x86 disassembler functionality when printing them. This was trivial to accomplish in the __str__() method for these Python classes, for example:

        d = X86Decoder(self.Remainder[0:self.DataLen])
        i2container = d.Decode(0)

After doing this, the VM bytecode disassembly listing was easier to follow. An example is shown below; the full disassembly listing can be found here.

0x000000: MOV SCRATCH, 0
0x000018: ADD SCRATCH, EBP
0x000030: ADD SCRATCH, 0x000008
0x000060: MOV EAX, SCRATCH
0x000078: X86 mov dword ptr [eax], 1D7h ; <- new: x86 assembly, not machine code
0x000090: X86 mov eax, 41F000h
0x0000a8: X86 leave
0x0000c0: X86 ret 4h

After some quick perusal, the x86 assembly language mostly looks like ordinary C code compiled with MSVC, though some functions use instructions like PUSHA and POPA that are not ordinarily generated by normal compilers, and hence were probably written in assembly language prior to virtualization.

4. Step #2: Pattern-Based Simplification for Scratch Register Access

Having a clearer FinSpy VM bytecode disassembly listing, I returned to trying to figure out what the Group #2 "dedicated register" instructions were accomplishing. It's easy enough to locate these instructions; all of them use the scratch register, which is denoted "SCRATCH" in the VM bytecode disassembly, so you can simply search for "SCRATCH" to see where these instructions are used. (Or just look at the disassembly listing -- group #2 instructions comprise a significant fraction of the VM instructions for our sample.)

Here I went a bit too fast and would have benefited from being a bit more deliberate. Inspecting the beginning of the VM bytecode program, I noticed a few different patterns involving loading x86 register values into the scratch register. For example, repeating the first two lines from the example in the last section, we see:

0x000000: MOV SCRATCH, 0
0x000018: ADD SCRATCH, EBP

This code first sets the scratch register to zero, and then adds the value of EBP. Effectively, this two-instruction sequence sets the scratch register to EBP. However, elsewhere there were different patterns of VM instructions used to accomplish the same goal. Here was the second pattern:

0x03c270: MOV SCRATCH, 0
0x03c288: MOV SCRATCH, EDX

And still elsewhere in the FinSpy VM bytecode disassembly, we see more natural one-instruction sequences to load register values into the stack register, for example:

0x001b48: MOV SCRATCH, ECX

Faced with these observations, I mistakenly identified this as a form of obfuscation: it seemed as though there were several ways to load the value of an x86 register into the scratch register, and that FinSpy was choosing between them arbitrarily in order to introduce some randomness into the VM bytecode program. Such randomness could make pattern-recognition more laborious. It might end up that I would need to write multiple patterns to match these sequences, thereby multiplying the number of patterns I needed to write involving x86 register loads by the number of patterns that the FinSpy VM employed for loading an x86 register into the SCRATCH register.

I figured that I could save work for myself down the road if I coalesced the two-instruction sequences down into a canonical representation of loading a register into the stack, namely, one instruction sequences as in the third example. (I later realized that this isn't really a form of obfuscation -- the FinSpy VM developers were just being lazy.)

To be specific, I wanted to write a search-and-replace rule that would take VM instruction sequences of the form:

MOV SCRATCH, 0      | MOV SCRATCH, 0      

And replace them with one-instruction sequences of the form:


It was easy enough to accomplish the pattern recognition and replacement using Python's type introspection functionality. Since my Python FinSpy VM disassembler from part two used Python class types for these individual VM bytecode instruction types, I could simply invoke the "isinstance" function to find instances of these patterns in the VM bytecode disassembly listing. The code for this can be found in the function "FirstSimplify" (search the code for that name to find the exact point of definition).

For instance, here is the code to identify the "MOV SCRATCH, 0" VM instruction. It encompasses two different possibilities: the dedicated VM instruction that moves 0 into SCRATCH, and the VM instruction that moves an arbitrary DWORD value into SCRATCH, when the DWORD is zero.

def IsMovScratch0(insn):
    if isinstance(insn,MovScratch0):
        return True
    if isinstance(insn,MovScratchImm32):
        return insn.Imm32 == 0
    return False

The complete code for recognition and replacement is shown in the following sequence. Please take the time to examine it, as all of the pattern-matching and replacement in the rest of this section have similar implementations, and won't be discussed in much detail for the sake of brevity.

# If the first VM instruction is "MOV SCRATCH, 0"...
if IsMovScratch0(i1):

    # Check to see if the second instruction is "MOV SCRATCH, REG32". If 
    # this function returns None, it wasn't. If it returns something other
    # than None, then it returns the register number.
    mr = IsMovScratchReg(i2)
    # Otherwise, check to see if the instruction was "ADD SCRACH, REG32",
    # and get the register number if it was (or None if it wasn't).
    if mr is None:
        mr = IsAddScratchReg(i2)
    # Did one of the two patterns match?
    if mr is not None:
        # Yes: make a new VM instruction, namely "MOV SCRATCH, REG32" to 
        # replace the two instruction sequence we just matched. Use the same
        # file offset position from the first instruction in the sequence.
        newInsn = MovScratchDisp8([0]*INSN_DESC_SIZE, i1.Pos)
        # Save the register number into the new instruction.
        newInsn.Disp8 = mr
        # Use the same VM instruction key as for the first instruction.
        newInsn.Key = i1.Key
        # Add the new VM instruction to the output list.

If one of the two patterns match, the code above generates a new VM instruction to replace the existing two. Since each VM instruction is uniquely associated with a position and key, we decided to copy those attributes from the first VM instruction within the matched pattern, and to delete the second VM instruction outright. (I worried that I might encounter situations where other VM instructions would reference the second and now-deleted VM instruction by its key or EIP, and in fact I had to account for this in Part #3, Phase #4.)

The new VM bytecode disassembly listing can be found here.
(Note that I originally wrote these pattern replacements in OCaml, since ML-family languages are intentionally designed with robust facilities for these sorts of tasks. My first implementation had the Python program print out an OCaml representation of the FinSpy VM program, which an OCaml program then simplified via pattern-substitutions. However, given that only a few such pattern-replacements were necessary to deobfuscate FinSpy VM bytecode programs, I decided that the cognitive and programmatic overhead of serializing between the OCaml and Python programs was unnecessarily complicated, and so I re-wrote the pattern replacement code in pure Python, despite the tedium.)

4.1 The "MOV SCRATCH, REG32" / "PUSH SCRATCH" Pattern

Next, I continued to inspect the VM bytecode disassembly listing and look for further patterns involving the SCRATCH register. I found my next candidate shortly after the previous one. In particular:

0x000120: MOV SCRATCH, ECX
0x000138: PUSH SCRATCH

This VM instruction sequence obviously replaces the x86 instruction "push ecx". As with the previous step, I decided to codify this into a pattern simplification. Any time we see the VM instruction sequence:


We want to replace that with a single VM instruction representing "push reg32". Mechanically, this works nearly identically to the previous step; we use Python's isinstance() function to find occurrences of this VM instruction pattern. The complete code can be found in the "SecondSimplify" function in the Python source code (search for the function name). For ease of reference, here is how we identify "MOV SCRATCH, REG32" instructions:

def IsMovScratchReg(insn):
    if isinstance(insn,MovScratchDisp32):
        return insn.Disp32
    if isinstance(insn,MovScratchDisp8):
        return insn.Disp8
    return None

To replace these two-instruction sequences, we generate a FinSpy VM "Raw x86" instruction which contains the x86 machine code for the x86 PUSH operation. I use my x86 library to create the Python object for the replacement x86 instruction "push reg32": in this example, the Python object for "push ecx" can be constructed as X86.Instruction([], XM.Push, X86.Gd(mr,True)).

After generating the replacement x86 instruction object, I wrote a function to generate a FinSpy VM instruction containing the raw machine code. The function "MakeRawX86" from the simplification source code (search for the function name) is reproduced here for ease of presentation:

def MakeRawX86(Pos, Key, x86Insn):
    # Create a FinSpy VM "Raw X86" instruction with dummy
    # content at the correct position (specified by Pos)
    newInsn = RawX86StraightLine([0]*INSN_DESC_SIZE, Pos)

    # Set the key to be the key from the first of the two
    # instructions of the two-instruction matched sequence
    newInsn.Key = Key

    # Encode the x86 instruction into machine code, store
    # the bytes in the FinSpy VM instruction
    newInsn.Remainder = EncodeInstruction(x86Insn)
    # Cache the length of the x86 instruction's machine code
    newInsn.DataLen = len(newInsn.Remainder)

    # Cache the textual disassembly for that instruction
    newInsn.X86 = str(x86Insn)
    # Return the FinSpy VM instruction just constructed
    return newInsn

The FinSpy VM "Raw X86" instruction object returned by the function above is used as the replacement for the two-VM instruction sequence "MOV SCRATCH, REG32 / PUSH SCRATCH". As the code shows, its two uniquely-identifying characteristics (VM instruction position and VM instruction key) are duplicated from the first of the two VM instructions in the pattern.

After applying this substitution, the new VM bytecode disassembly listing can be found here.

4.2 The "MOV SCRATCH, REG32" / "MOV REG32, SCRATCH" Pattern

Shortly thereafter was another, similar VM instruction pattern:

0x000228: MOV SCRATCH, ESP
0x000240: MOV ESI, SCRATCH

Clearly, this sequence virtualizes the x86 instruction "mov esi, esp". More generally, when we see two adjacent VM instructions of the form:


We can replace this with an x86 instruction "mov reg32_2, reg32_1". As in the previous case, after using Python isinstance() checks to locate instances of this pattern, we generate a Python object representing the x86 instruction "mov esi, esp", and call the MakeRawX86 function (detailed in the previous section) to generate a new "Raw x86" VM instruction to replace the two VM instructions comprising the pattern instance. This process can be seen completely in the function "ThirdSimplify" from the simplification code (search for the function name).

After applying this substitution, the new VM bytecode disassembly listing can be found here.

4.3 The "MOV SCRATCH, 0" / "ADD SCRATCH, IMM32" Pattern

Continuing to look for patterns in the VM bytecode program, I saw the following:

0x0037b0: MOV SCRATCH, 0
0x0037c8: ADD SCRATCH, 0x420344

This sequence can obviously be replaced with the single-instruction sequence "MOV SCRATCH, 0x420344". The function "FourthSimplify" in the simplification code (search for the function name) implements the detection and substitution. Again, it's trivial to write this pattern-recognition and replacement code, though unlike the previously-described pattern-substitutions, the replacement code is a Group #2 "SCRATCH register" FinSpy VM instruction rather than a Group #3 "Raw x86" instruction, since the output involves the scratch register.

After applying this substitution, the new VM bytecode disassembly listing can be found here.

4.4 The "MOV SCRATCH, IMM32" / "PUSH SCRATCH" Pattern

More perusal of the VM bytecode disassembly turned up the following sequence:

0x006900: MOV SCRATCH, 0x000003
0x006918: PUSH SCRATCH

As with previous examples, we can replace this two-instruction sequence with the x86 instruction "push 3", by using a "Raw x86" VM instruction. The function "FifthSimplify" implements this -- it identifies "MOV SCRATCH, IMM32 / PUSH SCRATCH" sequences, and replaces them with a "Raw x86" FinSpy VM instruction object containing the machine code for an x86 "push imm32" instruction, again using identical techniques to those discussed previously.

After applying this substitution, the new VM bytecode disassembly listing can be found here.

4.5 Memory Address Patterns

After performing the aforementioned replacements, we start to see a lot less variety in the remaining Group #2 instructions, and their further purpose quickly became obvious. Namely, each remaining cluster of VM bytecode instructions using the SCRATCH register consisted of two consecutive components: 1) a sequence of FinSpy VM instructions to load a memory address into the SCRATCH register, using what I've called a "memory address pattern"; followed by 2) a sequence of FinSpy VM instructions utilizing using the memory address just generated, for a purpose such as reading from the address or writing to it, using what I've termed a "memory access pattern". We will discuss both such patterns in this section and the subsequent one.

For an example of both a memory address pattern and a memory access pattern, here is the very beginning of the VM bytecode program after all the prior replacements have executed:

0x000000: MOV SCRATCH, EBP                 ; Memory !address!
0x000030: ADD SCRATCH, 0x000008            ; Memory !address!
0x000048: MOV SCRATCH, DWORD PTR [SCRATCH] ; Memory !access!
0x000060: MOV EAX, SCRATCH                 ; Memory !access!

The first two VM instructions set the SCRATCH register to EBP+0x8. These two instructions comprise the "memory address pattern".

The last two VM instructions read a DWORD from the memory address contained in the SCRATCH register, and store the result into EAX. These two instructions comprise the "memory access pattern".

Clearly, these four VM instructions can be replaced by the single x86 instruction "mov eax, [ebp+8]" -- which looks very much like something we'd expect to see near the beginning of a function (suitably, given that the four VM instructions just shown are the first VM instructions in the bytecode program, and are situated toward the beginning of an x86 function). 

For a more complicated example:

0x03c270: MOV SCRATCH, EDX      ; Memory !address!
0x03c2a0: SHL SCRATCH, 0x000003 ; Memory !address!
0x03c2b8: ADD SCRATCH, EAX      ; Memory !address!
0x03c2d0: ADD SCRATCH, 0x000010 ; Memory !address!
0x03c2e8: MOV EAX, SCRATCH      ; Memory !access!

Let's trace the value that is written to EAX at the end. The first two VM instructions shift EDX left by 3; the third VM instruction adds EAX, and the fourth VM instruction then adds 0x10. Thus, the expression written to EAX is "EDX<<3 + EAX + 0x10". EDX<<3 is the same thing as EDX*8, so our expression is "EDX*8 + EAX + 0x10". The format of the memory expression should look familiar -- it can be encoded as the legal x86 ModRM/32 memory expression "[EAX+EDX*8+0x10]". These first four VM instructions comprise a "memory address pattern". The fifth and final VM instruction is the "memory access pattern" instance -- rather than reading from or writing to this memory location, we simply store the memory address itself into EAX. Thus this VM instruction sequence performs the equivalent of the x86 instruction "lea eax, [eax+edx*8+10h]".

Indeed, all of the memory address patterns appear to be creating memory addresses that would be legal to encode using X86 ModRM memory expressions. X86 ModRM memory expressions contain one or more of the following elements added together:

  • A 32-bit base register
  • A 32-bit scale register, optionally multiplied by 2, 4, or 8
  • A 32-bit displacement (an immediate DWORD)

By inspecting the VM bytecode program, all memory address sequences have an identical layout.

  • If an index register is being used, the first VM instruction of the sequence sets the SCRATCH register to the value of the base register. 
  • If the index register is multiplied by a scale factor, the next VM instruction is SHL SCRATCH, IMM32, where IMM32 is 1, 2, or 3 (to multiply the index register by *2, *4, or *8, respectively).
  • If a base register is used, the next instruction in the sequence adds the base register.
  • If a 32-bit displacement is used, the next VM instruction is an ADD SCRATCH, IMM32 instruction.

All of these elements are optional, though at least one must be present. So for example, if the memory address is a raw 32-bit value (e.g. dword ptr [401234h]), only the ADD SCRATCH, IMM32 VM instruction will be present. If the memory expression consists of only a register (e.g., dword ptr [eax]), then only the MOV SCRATCH, REG32 VM instruction will be present. Memory expressions that use more than one element will be virtualized by combining the VM instructions for the elements present.

4.6 Memory Access Patterns

After a memory address pattern just described creates a memory expression in the SCRATCH register, the next one or two VM instructions make up the memory access pattern, and dictate how the memory address is used through how they access the SCRATCH register. By examining the VM bytecode program, I found that there were four distinct VM instruction sequences used for memory access patterns.

4.6.1 Memory Access Case #1: Memory Read, Store Into Register

The first case reads from the memory address, and stores the result in a register:

; ... memory address pattern before this ...
0x000060: MOV EAX, SCRATCH

This corresponds to "mov eax, dword ptr [memExpression]", with the details of "memExpression" being dictated by the memory address pattern.

4.6.2 Memory Access Case #2: Memory Read, Push Result

The second case reads from the memory address, and then pushes the result on the stack:

; ... memory address pattern before this ...
0x004968: PUSH SCRATCH

This corresponds to "push dword ptr [memExpression]", with the details of "memExpression" being dictated by the memory address pattern.

4.6.3 Memory Access Case #3: Memory Write, Value Taken from Register

The third case stores a register into the memory location calculated by the memory address sequence:

; ... memory address pattern before this ...

This corresponds to "mov dword ptr [memExpression], eax", with the details of "memExpression" being dictated by the memory address pattern.

4.6.4 Memory Access Case #4: Store Address into Register

The fourth case does not dereference the memory location, but rather saves the result into a register:

; ... memory address pattern before this ...
0x000e28: MOV EAX, SCRATCH

This corresponds to "lea eax, dword ptr [memExpression]", with the details of "memExpression" being dictated by the memory address pattern.

4.7 All Together: De-Virtualizing Memory References

After having fully analyzed the remaining references to the SCRATCH register as described above, I wrote code to detect the memory address patterns and the following memory access patterns, analyze them, and convert them back into x86 machine code. The function "SixthSimplify" in the simplification code recognizes memory address patterns, inspects the following instructions to determine the memory access patterns, and combines these two pieces of information to determine which x86 instruction the two memory address/access sequences have replaced. Next, the code reconstructs an x86 instruction -- as a FinSpy VM "Raw X86" instruction -- with which it finally replaces the VM instructions comprising the memory address and memory access sequences. As before, the replacement "Raw x86" FinSpy VM instruction uses the key and position attributes from the first VM instruction as the unique attributes for the replacement instruction.

First, "SixthSimplify" invokes the function "DecodeAddressSequence". That function determines if a given position within a list of FinSpy VM instructions begins a memory address sequence. If not, the function returns None. If so, DecodeAddressSequence extracts details from the memory address sequence -- the base register, the index register and optional scale factor, and the 32-bit displacement -- and uses it to create a Mem32 object (the class used by my Python x86 library to represent 32-bit memory expressions), which it then returns along with the number of VM instructions comprising the memory address pattern. That code is shown below. 

# Given:
# * A list of FinSpy VM instructions in insnArr
# * An index within that list in idx
# Determine if insnArr[idx] contains a memory address sequence.
# If not, return None. If so, create an x86 memory expression
# operand using my x86 library, and return it along with the
# number of instructions in the address sequence.

def DecodeAddressSequence(insnArr, idx):
    # Save position of index within insnArr
    oldIdx = idx
    # The first VM instruction in the sequence is usually "MOV SCRATCH, REG32".
    r1 = IsMovScratchReg(insnArr[idx])

    # Was it?
    if r1 is not None:

        # Yes, it was, so increment the current index
        idx += 1

        # Is the next VM instruction "SHL REG, [1/2/3]"?
        if isinstance(insnArr[idx],ShlScratchImm32):
            # Yes, copy the scale factor
            scaleFac = insnArr[idx].Imm32
            assert(scaleFac == 1 or scaleFac == 2 or scaleFac == 3)
            # Increment the current index
            idx += 1
        # Otherwise, there is no scale factor
            scaleFac = None

        # Is the next VM instruction "ADD SCRATCH, REG32"?
        r2 = IsAddScratchReg(insnArr[idx])
        if r2 is not None:
            # Yes, increment the current index
            idx += 1

        # Is the next VM instruction "ADD SCRATCH, IMM32"?
        disp32 = IsAddScratchImm32(insnArr[idx])
        if disp32 is not None:
            # Yes, increment the current index
            idx += 1
        # Make a memory expression from the parts, and return the length
        # of the memory address sequence
        return (idx-oldIdx, MakeMemExpr(r2, r1, scaleFac, disp32))
    # The second possibility is that the memory expression is a raw address.
    imm = IsMovScratchImm32(insnArr[idx])
    # Was it?
    if imm is not None:
        # Yes: make a memory expression from the address, and return the
        # length of the memory address sequence (namely, 1).
        return (1, MakeMemExpr(None,None,None,imm))

    # If we are here, neither of the memory address patterns matched, so
    # signal match failure.
    return None

Next, a similar function called "DecodeAccessSequence" uses pattern-matching to determine which of the four memory access patterns -- mov reg32, [memExpr] / push [memExpr] / mov [memExpr], reg32 / lea reg32, [memExpr] -- follows the access sequence. The code is similar to the code just shown for decoding memory address sequences, and so is not duplicated here. See the source code for the complete details.

Finally, after recognizing a memory address sequence followed by a memory access sequence, we use the machinery we previously developed for regenerating x86 machine code -- namely, the function "MakeRawX86" -- to create a FinSpy VM instruction containing the machine code for the raw x86 instruction. We use this to replace the memory address and access sequences just recognized.

After running this simplification pass, and all prior ones described for Group #2 instructions, over the VM bytecode program, all Group #2 VM instructions disappear, and consequently the SCRATCH register disappears from the VM bytecode disassembly listing. The final disassembly listing of the simplified FinSpy VM program can be seen in its entirety here.

5. Step #3: Correcting a Small Error

Having dispensed with the Group #2 VM instructions, Groups #1 and #3 remained. Before addressing them, however, there was one VM instruction that didn't fit into either one of those groups. Namely, in producing my initial VM bytecode disassembly listing, I had incorrectly identified the FinSpy VM unconditional jump instruction as being an instruction to cause a deliberate crash. I disassembled this as "CRASH". My FinSpy VM bytecode disassembly listing was confusing with how frequently the CRASH instruction occurred:

0x005bb0: JNZ VM[0x005c88] (fallthrough VM[0x005bc8])
0x005bc8: X86 xor eax, eax
0x005be0: CRASH
0x005bf8: X86 mov eax, dword ptr [esi+4h]
0x005c70: CRASH
0x005c88: X86 push ebx

I discussed this issue at length in part two -- the implementation for unconditional jumps was considerably different from conditional jumps, and involved dynamic code generation, which was only used otherwise for Group #3 instructions. Closer analysis revealed that the true purpose of this instruction was as an unconditional jump and not a deliberate crash. Upon learning this, I updated my disassembler component to rename the "CRASH" instruction to "JMP", and to print the branch targets of these newly-renamed JMP instructions. The resulting VM bytecode disassembly listing was clearer:

0x005bb0: JNZ VM[0x005c88] (fallthrough VM[0x005bc8])
0x005bc8: X86 xor eax, eax
0x005be0: JMP VM[0x008df0]
0x005bf8: X86 mov eax, dword ptr [esi+4h]
0x005c70: JMP VM[0x008df0]
0x005c88: X86 push ebx

The entire VM bytecode disassembly listing after this modification can be found here.

6. Conclusion

In this Phase #1, we inspected our sample's initial FinSpy VM bytecode disassembly listing. We improved the listing by including x86 disassembly for the x86 machine code embedded in the Group #3 instructions. We inspected how the VM bytecode program used Group #2 instructions. A half-dozen simple pattern-replacement schema were sufficient to entirely remove the Group #2 instructions. Afterwards, we corrected the output by observing that the "CRASH" instructions were really unconditional jumps.

In the next Part #3, Phase #2, we are ready to attempt our first devirtualization for the FinSpy VM bytecode program. Our initial attempts wind up being insufficient. We study the deficiencies in Part #3, Phase #3, and remedy them in our second, final, devirtualization attempt in Part #3, Phase #4.

FinSpy VM Unpacking Tutorial Part 3: Devirtualization

1. Overview

This is the third and final part in my series on statically unpacking the FinSpy VM. After having deobfuscated the x86 implementation of FinSpy in part one and after having analyzed the VM and written a disassembler for the bytecode format for the particular sample in question in part two, we are now tasked with reconstructing x86 code corresponding to the pre-virtualized code. As before, the files are in the GitHub repository here. Of note:

  • The IDB with the devirtualized code added, and mostly analyzed: here
  • The devirtualization program: here
  • FinSpy VM bytecode disassembly listings with various simplifications: here
  • Binary files for the devirtualized code: here

Because the FinSpy VM is a weak and ineffective software protection, unpacking it is not very difficult. Over half of the VM instructions in the bytecode program for the sample we're analyzing already contain raw x86 machine code. It turns out that FinSpy only truly virtualizes a handful of x86 instruction types, and that simple pattern-matching is effective in reconstructing the original code. It's perhaps the easiest software protection I know of worthy of being called a "virtualization obfuscator" -- and that makes it good practice for the difficult ones. 

As with part one, I am attempting to write this in a walk-through tutorial style. We start from first principles in inspecting the FinSpy VM bytecode disassembly listing, and show all of the steps (along with the code I wrote) along the way. Hopefully I've managed to capture the process of observation, evaluation of design considerations, actions taken including mis-steps, and above all, the iterative, trial-and-error nature of the affair.

2. Organizational Overview

This part ended up being longer and more difficult to write than expected. In fact, writing this document as a tutorial was considerably more difficult than doing the work in the first place. I hope the effort pays off in terms of educational value. Because I personally don't enjoy huge documents whose sections aren't well-segregated from one another, I have decided to split the devirtualization process, and also this document, into a number of "phases", each in its own individual blog entry. All entries are currently online and available for reading; links to them are immediately below. Each individual part links to the code and binary artifacts used in that phase.

Here are the contents of each phase in brief, along with links to the individual parts:

  1. Part #3, Phase #1: analyzing and deobfuscating FinSpy VM bytecode programs. In part two, we analyzed the FinSpy VM and its instruction set, and ultimately wrote a disassembler for FinSpy VM programs. This phase begins by reviewing the FinSpy VM instruction set. Next, we work from first principles in analyzing the FinSpy VM bytecode program. First we add a useful feature to our FinSpy VM disassembler. Next, we analyze a set of VM instructions -- group #2, in the parlance -- used by the VM bytecode program for obfuscation purposes. We determine several patterns, and write code to detect and simplify instances of those patterns within FinSpy VM bytecode programs. After one more small tweak to the disassembler, the FinSpy VM bytecode program is now ready to be devirtualized back into x86 machine code.
  2. Part #3, Phase #2: initial devirtualization. The previous phase left the FinSpy VM bytecode program in a suitable form for devirtualization. This phase begins by discussing the theory behind devirtualizing the individual FinSpy VM instructions. Next, we write a tool to devirtualize FinSpy VM bytecode programs -- this tool takes as input the disassembly of a FinSpy VM program, and produces as output a blob of x86 machine code that no longer relies upon the VM. After producing output, we then inspect our initial results in devirtualization to determine a few deficiencies and remaining tasks before our devirtualization can be considered complete. This phase ends by fixing one of the issues revealed by inspecting the devirtualized output.
  3. Part #3, Phase #3: devirtualizing function calls. The devirtualization generated by the previous stage still has a few deficiencies that need to be fixed before the output is fully usable. As it turns out, all of these deficiencies relate to how the FinSpy VM deals with function calls and function pointers. We discuss the source of the difficulties and determine what needs to be done to fix the issues. Then we discuss a few strategies that we might employ to solve the problems, including a somewhat sophisticated one involving x86 emulation. We settle for something less than that. 
  4. Part #3, Phase #4: second devirtualization. The previous phase collected information about virtualized functions in preparation for a second attempt at devirtualization. This phase incorporates that information into the devirtualization process, and then re-inspects the devirtualized code for defects. After fixing one more remaining major issue and two small ones, we now have our complete devirtualized blob of x86 machine code for the FinSpy VM program in our sample. We trick IDA into loading our devirtualized code, and now our devirtualization journey is complete. 

After phase #4 was complete, I next completely analyzed the devirtualized FinSpy dropper binary. I may publish some techniques I saw that weren't widely documented, and I may publish a complete analysis of the dropper. However, that writing remains to be done, and is not part of this blog series on devirtualization.


FinSpy VM Part 2: VM Analysis and Bytecode Disassembly

1. Overview

This is part two of my three-part series on analyzing and de-virtualizing the FinSpy virtual machine. Part one was a tutorial focused on removing the obfuscation from the x86 implementation of the FinSpy VM to facilitate analysis. This part describes the actual analysis of the FinSpy VM, both the results of it as well as the process of analyzing it. The third part will provide tools to devirtualize FinSpy VM-protected programs.

The same GitHub repository as last time has been updated to contain:

2. Mea Culpa

I endeavored to write a tutorial-style walkthrough for the entire process of analyzing the FinSpy VM, instead of the more traditional "results-oriented" presentations that are so common in the virtualization deobfuscation literature and security in general. In my opinion, I succeeded at this for the first part, and the third part is also amenable to that style of presentation. The second part, however, proved more difficult to write in that style. Ultimately, the non-linearity of static reverse engineering makes it difficult to present textually. Static reverse engineering involves working with incomplete information, and gradually refining it into complete information. This involves a lot of jumping around and writing notes that are cryptic at first, yet become clear after multiple passes of the same territory. Writing this into a textual document would most likely be incoherent and extremely lengthy (even moreso than currently).

I should have known better: when I teach static reverse engineering, I do so with long hands-on walkthroughs where I start at the beginning of some functionality, and describe in a step-by-step fashion what insights I obtain from the code, and how I translate that into annotations of the IDA database. In lieu of that, a video would probably be the second-best option. (Videos aren't interactive like classroom discussions, but at least they would show the process with mistakes and all.)

I've gone through a few drafts of this document trying to hybridize the standard results-oriented presentations of virtual machine deobfuscation with a hands-on style. After being dissatisfied with my drafts, I've settled on a compromise between the results-oriented and walk-through styles. Namely, I am including both. The first part of the document describes the conclusions I reached after analyzing the FinSpy VM. The second part shows the analysis as best as I can present it, including interstitial and externally-linked notes, assembly language snippets for every part of the VM, and decompilations of the assembly language.

I don't know if I've succeeded in writing something accessible. If you find this document too heady, I hope you will return for part three, which will be once again written like a standard tutorial. And since publishing part one, Filip Kafka has also published an analysis of the FinSpy VM, and two researchers from Microsoft also published an analysis of FinSpy. Hopefully between these sources, anyone interested in reverse engineering the FinSpy VM can follow along.

Part One: High-Level Analysis Summary

3. Basics of VM Architectures

Virtual machine software protections are interpreters. Namely, a virtual machine software protection translates the original x86 code into its own proprietary bytecode language (a process called "virtualization"). At run-time, whenever the original pre-virtualized x86 code would have run, the virtual machine interpreter instead executes the bytecode into which the pre-virtualized x86 was translated. The benefit of this protection technique is that the analyst can no longer look directly at the original x86 instructions, with which they are presumably familiar. Instead, they must first analyze the interpreter, and then use the knowledge gained to understand the bytecode into which the x86 was translated. This process is usually considerably complicated through a number of anti-analysis techniques:

  • Obfuscating the interpreter on the x86 level
  • Obfuscating the virtualized program
  • Using a different language for each sample, with the goal of requiring manual work for each
    • Randomizing the map between opcode bytes and instructions
    • Including polymorphic or metamorphic elements into the instructions

FinSpy VM is at the low end of the difficulty spectrum.

3.1 Virtual Machines as Interpreters

Since virtual machine software protections are interpreters, they have a lot in common with ordinary interpreters for more typical programming languages.

  • Instruction encoding. Interpreters need their own standardized ways to represent instructions. For bytecode languages, this is a binary encoding of the instruction and its necessary associated data (such as the instruction's operands). In practice, interpreters usually define encoded instructions in terms of data structures or classes.
  • VM context. Interpreters always have some state associated with them, also known as the "VM context structure". For example, interpreted programs might have access to registers specific to the language, and/or perhaps a stack or data area. (These are common examples; in practice, the state may contain only some of these elements, and/or other elements specific to the language.) The interpreter state is usually packaged up into a structure or object to allow multiple copies of the interpreter to exist throughout the program's execution (e.g., allowing multiple threads to run VM instances).
  • Interpretation. The implementations for the interpreter's instructions, then:
    • Take as input a pointer to the interpreter's context structure; 
    • Query the structure to obtain whatever information it needs to execute (e.g. if the instruction makes use of a value of a register, and the registers are held in the context structure, the instruction must retrieve the register's value from the context structure);
    • Perform whatever action is specified by the instruction and its operands;
    • Update the context structure with the results, for example, storing some value into a register, into the data section, or onto the context's stack.

Putting it all together, then, the skeleton of an interpreter generally looks like this:

void Interpret(VMContext *vmContext)
      // Instruction: add r0, r1, r2 [r0 <- r1+r2]
      case VMOpcode_AddRegReg:
        // Updates vmContext->pCurrInsn and vmContext->bShouldExit

      // Instruction: add r0, r1, imm32 [r0 <- r1+imm32]
      case VMOpcode_AddRegImm:
      /* ... more instructions ... */

An example instruction handler may look something like this:

void VMAddRegReg(VMContext *vmContext)
  // Fetch register values
  DWORD lhsVal = vmContext->regVals[vmContext->pCurrInsn->lhsReg];
  DWORD rhsVal = vmContext->regVals[vmContext->pCurrInsn->rhsReg];

  // Perform addition, update context
  vmContext->regVals[vmContext->pCurrInsn->destReg] = lhsVal + rhsVal;

  // Update VMEip to next instruction
  vmContext->pCurrInsn += sizeof(VMInstruction);

  // Don't signal VM exit
  vmContext->bShouldExit = false;

3.2 Virtualization Obfuscator Architecture Archetype

Virtual machines usually stick quite closely to the archtetype just described, though a few variants exist in practice. The following figure shows the basic model for virtual machine execution. Not all VMs follow this schema exactly. For example, some of them have a pre-initialization phase, and some don't; some of them use a "direct-threaded" architecture where each instruction transfers control to the next one, without having a centalized dispatcher invoking function pointers in a loop.


In brief, here are the roles of the phases illustrated in the diagram, and the considerations in analyzing them:

  • Save CPU state. Every VM begins by saving the registers and flags before the VM executes. This usually involves pushing them onto the stack, but depending on architecture, may involve copying them to a fixed data area. Analyzing this phase most often provides little information.
  • Pre-initialization. Before executing for the first time, the VM might have to allocate data structures (either globally and/or on a per-thread basis), or fix up pointers specified as relative virtual addresses (RVAs) by adding the process' image base address. This phase is usually guarded by a check against a static data item, such as a pointer to allocated memory that is initially set to NULL, or a boolean initialized to false, before pre-initialization has executed for the first time. After pre-initialization, the static pointer/boolean is set to a non-zero value. Analyzing this phase can provide insight into the data structures used by the VM, as well as where within the binary specific parts of the VM interpreter's code and data are located.
  • Initialization. After pre-initialization, the VM must prepare its context structure to begin executing the VM instructions specified by the pointer-type value passed to the VM entrypoint. The details of this phase are highly dependent upon low-level implementation choices for the VM. Analyzing this phase usually provides insight into how the VM passes the registers and flags from the host CPU to the virtual CPU, and how the VM interpreter translates the VM bytecode pseudo-pointers into actual pointers to VM instructions (if applicable). It may also provide a wealth of insight into the VM context structure.
  • Fetch. After preparing the context to execute, the VM must load the details of its first instruction. Analyzing this phase provides insight into how the instructions are stored, and how the VM locates them.
  • Decode. After locating the first instruction, the VM might have to perform special processing to obtain real instructions. The most common example is when the instructions are stored in an encrypted format, and must be decrypted prior to execution. Other examples include modifications to constant operand values encoded within an instruction. The constants may be encrypted as just discussed, or there may be other mechanisms for transforming constants (e.g. converting RVAs into virtual addresses, converting "keys" into values, etc). Analyzing this phase provides crucial, yet usually incomplete, information into how instructions are decoded, which is necessary to understand before writing a disassembler for the VM bytecode.
  • Dispatch. Finally, the VM must determine which bytecode operation is specified by the current instruction, and transfer control to x86 code responsible for performing the specified operation. This is most often accomplished by extracting a one-byte opcode from the instruction and using it as an index into a function pointer table. Analyzing this phase provides insight as to which part of the instruction specifies the operation, as well as illuminating how the VM translates that information into a pointer to code to execute (i.e., we learn where the function pointer table for the VM opcode implementations is held). More heavily-obfuscated VMs such as VMProtect may not store raw pointers in a table; those pointers may be encrypted somehow, and must be decrypted prior to dispatch.
  • Instruction execution. Once the VM has transferred control to x86 code that implements a given VM instruction, that x86 code then must perform all necessary operations to query and update the state so as to affect the execution of the virtual instruction. Analyzing this phase is when the details finally all come together. Quite often we will observe modifications to the VM state throughout the initialization phase that we are unable to decipher because we don't know how the instructions will use and modify those parts of the VM state. Analyzing the instructions will comprehensively test and extend our understanding of the VM architecture. This is also the most crucial phase for understanding what functionality is actually offered by the VM -- without this information, we can't write a disassembler, as we won't know what any of the instructions actually do.
  • VM continuation and/or exit. After an instruction has executed, it might be necessary to execute the next instruction sequentially within the VM bytecode stream. Or, it might be necessary to branch to another location within the VM due to a conditional or unconditional jump (which may be relatively addressed, or absolutely addressed). Or, if the VM implements function calls, we might have to save a return address (or something else that enables continuation of execution) somewhere before branching. Finally, all VMs have the ability to stop executing them and switch back to normal x86 execution. Analysis of this phases stresses understanding of the VM instruction pointer and addressing mode, as well as providing insight into how the VM communicates with x86 code outside of the VM. Re-entry after a VM exit is usually particularly interesting.

4. The FinSpy VM

The FinSpy VM follows the archetype just presented fairly closely. The entire VM Context structure can be seen here.

  • The FinSpy VM context structure contains only one dedicated register, used to virtualize 32-bit memory operands within certain x86 instructions.
  • The FinSpy VM does not copy the x86 state -- the registers and flags at the time of VM entry -- into the VM context structure. Instead, it saves these onto the host x86 stack, saves the value of the ESP register into the context structure, and when it needs to use or modify a register, it indexes the saved ESP value to read or write the register value.
  • The FinSpy VM has a stack area, and it points the host ESP value to this stack area upon entering the VM. I.e. any x86 instruction such as "push eax" executed as part of one of the VM instruction handlers will modify the stack data area within the VM, as opposed to the x86 stack as it exists outside the VM.
  • The FinSpy VM also has a specialized stack that it uses for dynamic code generation. About 50% of the instructions in the VM bytecode program for this sample result in x86 machine code being generated on the fly and executed. This is unusual as far as virtualization obfuscators go, and will be discussed more subsequently.

4.1 Pre-Initialization and Initialization

The FinSpy VM is designed to allow multiple threads to utilize the VM at the same time. In fact, the FinSpy VM maintains a global array of pointers, one per thread ID, each one pointing to a VM Context structure for that particular thread. Upon entering the VM, FinSpy checks to see whether a VM Context structure has already been allocated for that thread. If not, it allocates a new one and initializes it.
Namely, if a thread did not already have a VM Context structure allocated within the global array, the pre-initialization phase allocates one, and the two initialization phases fill in values within the VM Context structure:

  • Stores the module's image base into the VM Context structure.
  • Initializes the x86 stack area-pointer to point to the last DWORD within the x86 stack area.
  • Copies the host ESP value into the VM Context structure. Since the initialization phase pushes the flags and the registers onto the stack, VM instructions can access the values of saved registers and flags by indexing off of this saved ESP value.
  • Sets the x86 ESP register to point to the current bottom of the VM x86 stack region.
  • Initializes the dynamic code stack pointer to point to the beginning of that region. (The dynamic code stack grows upwards, unlike the x86 general stack, which grows downward.)
  • Installs five function pointers within the context structure that the VM instructions use to transfer control to other VM instructions. Some of these function pointers just continue executing the next instruction; some of them implement conditional and unconditional jumps; and one of them is used to return after a VM function that was called has finished executing. The use of so many function pointers for VM continuation is another unusual feature of FinSpy.
  • Checks to see whether the VM bytecode program stored in the binary is compressed, and if so, whether it has already been decompressed. If not, it decompresses the VM bytecode program and caches the results.
  • Uses the DWORD pushed before entering the VM to locate the VM instruction within the decompressed VM bytecode program that should execute first, and stores the pointer to the instruction as the VM EIP.

4.2 VM Instruction Encoding

The VM bytecode program is stored as a blob at the end of the program's .text section. The blob may be compressed with the APLib compression library, in which case the instructions are decompressed during pre-initialization. The VM instruction structure has a fixed size: namely, 0x18 bytes. This makes it easy to calculate the subsequent VM instruction, in case the instruction does not involve a control flow transfer: just add 0x18 to the VM instruction pointer.

After decompression, the VM bytecode instructions are still encrypted. There is a fixed XOR key within the .text section, which can change on a per-sample basis. Upon executing an instruction, the FinSpy VM instruction fetch portion copies an 0x18-byte instruction from VM EIP into the VM Context structure, and then XORs each DWORD with the fixed key. (The first DWORD is skipped by this XOR process, as it contains the instruction's key.)
The first 8 bytes of each instruction type has the same layout, while the last 0x10 bytes contains the VM instruction operands, and the format differs depending upon which category the instruction falls into.

struct VMInstruction
  DWORD Key;
  BYTE Opcode;
  BYTE DataLength;
  BYTE RVAPosition1;
  BYTE RVAPosition2;
  BYTE Data[16];

The fields serve the following purposes:            

  • DWORD Key: Each instruction is associated with a 32-bit "key". The VM can look up instructions by key, as it does to locate the first instruction before the VM begins executing (the caller pushes a key on the stack before jumping to the VM entrypoint). The VM instruction set includes function calls, and function call targets can be specified by key.
  • BYTE Opcode: This byte is in the range of 0x00 to 0x21, and specifies one of the 34 VM instruction opcode types.
  • BYTE DataLength: Some instructions hold variable-length data in their operand region; this field describes the length of that data.
  • BYTE RVAPosition1 and BYTE RVAPosition2: Some instructions specify locations within the x86 binary. Since the binary's base address may change when loaded as a module by the operating system, these locations may need to be recomputed to reflect the new base address. FinSpy VM side-steps this issue by specifying the locations within the binary as relative virtual addresses (RVAs), and then adding the base address to the RVA to obtain the actual virtual addresses within the executing module. If either of these bytes is not zero, the FinSpy VM treats it as an index into the instruction's data area at which 32-bit RVAs are stored, and fixes the RVA up by adding the base address to the RVA.
  • BYTE Data[16]: The instruction's operands, if it has any, are stored here. Each instruction may interpret the contents of this array differently.

4.3 VM Instruction Set

The 34 instructions fall into three groups.

4.3.1 Group #1: Conditional and Unconditional Jumps. 

FinSpy virtualizes all of the standard conditional jump instructions from X86, i.e. JZ, JNZ, JB, JO, JNP, etc. There are 17 of these VM opcodes in total, including the unconditional jump. These VM instructions are implementing by checking the respective conditions in the EFLAGS DWORD saved at the bottom of the host stack. I.e., ZF is stored in bit 0x40 within EFLAGS; the JZ VM instruction's implementation checks this bit within the saved EFLAGS, and either falls through to the next VM instruction if ZF is not set, or branches to a different instruction if it is set.

Technically, the implementations for the conditional branch instructions allow the branch target to be specified two different ways. The first is as a displacement off of VM EIP, in which case the next VM EIP is calculated by VMEIP + Displacement (with the displacement stored at &Data[1]). The second method, if the displacement at &Data[1] was 0, is to use the DWORD at &Data[5] as a raw X86 location specified as an RVA (i.e., something other than a VM instruction, causing a VM exit). However, in practice, no VM bytecode instructions within this sample used the latter method.

The VM's unconditional jump instruction is implemented in a slightly more obscure fashion involving dynamic code generation. It technically also allows the branch target to be specified as either a relative displacement against VM EIP or an x86 RVA, and in practice, only the former method was used.

4.3.2 Group #2: Operations on the Dedicated Register

As mentioned previously, the VM has one dedicated register, which we shall call "scratch". The second group of VM instructions involves this register. None of these instructions have unusual control flow; they all transfer control to the physically next instruction after the current VM EIP. Here are those instructions and their operands:

  • mov scratch, 0
  • mov scratch, imm32 [Operand: imm32 at &Data[0]]
  • shl scratch, imm32 [Operand: imm32 at &Data[0]]
  • add scratch, imm32 [Operand: imm32 at &Data[0]]
  • mov scratch, savedReg32 [Operand: index of savedReg32 at &Data[0]]
  • add scratch, savedReg32 [Operand: index of savedReg32 at &Data[0]]
  • mov savedReg32, scratch [Operand: index of savedReg32 at &Data[0]]
  • mov dword ptr [scratch], savedReg32 [Operand: index of savedReg32 at &Data[0]]
  • mov scratch, dword ptr [scratch]
  • mov dword ptr [scratch], imm32 [Operand: imm32 at &Data[0]]
  • mov dword ptr [imm32], scratch [Operand: imm32 at &Data[0]]
  • push scratch

As mentioned, the scratch register is used to virtualize 32-bit x86 memory operands. Here is an example (from the VM bytecode disassembly) of how the instruction "lea eax, dword ptr [ebp+eax*2+0FFFFFDE4h]" is translated into VM instructions:

0x035d78: MOV SCRATCH, 0
0x035d90: MOV SCRATCH, EAX          ; scratch := EAX
0x035da8: SHL SCRATCH, 0x000001     ; scratch := EAX*2
0x035dc0: ADD SCRATCH, EBP          ; scratch := EBP+EAX*2
0x035dd8: ADD SCRATCH, 0xfffffde4   ; scratch := EBP+EAX*2+0FFFFFDE4h
0x035df0: MOV EAX, SCRATCH          ; EAX := EBP+EAX*2+0FFFFFDE4h

4.3.3 Group #3: X86-Related Instructions

Perhaps the only interesting feature of the FinSpy VM is how it handles non-virtualized x86 instructions. Every x86-virtualizing software protection needs a mechanism to execute non-virtualized instructions. Quite often this is implemented by exiting the VM, executing the non-virtualized x86 instruction stored somewhere in one of the program's executable sections, and then transferring control back into the VM. The FinSpy VM takes the approach of including instructions within its instruction set that contain raw machine code for X86 instructions, which are copied into RWX memory and executed on demand as the VM executes. This is somewhat more stealthy than the approach just described, since in that approach, the unencrypted x86 instructions must appear literally somewhere within the binary. For the FinSpy VM, since the VM instructions are encrypted, this allows the non-virtualized x86 instructions to remain encrypted most of the time.

There are four instructions in this group, although two of them are instruction-for-instruction identical to one another. Also, this is the only group of instructions to make use of the "pDynamicCodeStack" member of the VMContext structure. This point emphasizes how the VM context structure and VM instructions are intertwined and cannot be understood in isolation from one another.

Raw X86

The simplest of these instructions simply executes one non-control-flow x86 instruction. It generates the following sequence of code on the dynamic code stack, and then executes it:

9D             popf
61             popa
??    ...   ?? [raw X86 machine code copied from VM instruction's data at &Data[0]]
68 ?? ?? ?? ?? push offset fpVMReEntryStateInRegs
C3             ret

I.e., this sequence restores the saved flags and registers, then executes a single x86 instruction copied out of the VM instruction's data area, and then re-enters the VM at the initialization sequence.

X86 Callout, Direct Address

This instruction is used to virtualize call instructions to fixed targets. I.e., "call sub_1234" will be implemented via this VM instruction, where "call eax" will not. It generates a sequence of code on the dynamic code generation stack, adds 0x30 to the dynamic code generation stack pointer, and then executes the code it generated:

9D             popf
61             popa
68 ?? ?? ?? ?? push offset @RESUME
68 ?? ?? ?? ?? push (Data[4]+ImageBase)
C3             ret
68 ?? ?? ?? ?? push offset NextVMInstruction
60             pusha
9C             pushf
68 ?? ?? ?? ?? push offset VMContext
5B             pop ebx
68 ?? ?? ?? ?? push offset fpVMReEntryReturn
C3             ret

This sequence begins as before by restoring the saved flags and registers. Next, it pushes the address of some subsequently-generated code, which is used as the return address for the function. Next, it pushes an address within the .text section, fixed up from an RVA into a VA by adding the base address, and then returns. These three instructions together simulate a direct call instruction.
The location labeled @RESUME -- the return location for the called function -- begins by pushing the address of the following VM instruction. (This is computed by adding 0x18 -- sizeof(VMInstruction) -- to the current VM instruction pointer from within the VM Context structure). Next it saves the registers and the flags, sets EBX to the pointer to the VM Context structure for this thread, and then re-enters the VM at a special entrypoint. This special entrypoint sets VM EIP to the value just pushed on the stack before entering the ordinary initialization phase, and subtracts 0x30 from the dynamic code generation stack pointer.
Note that I have been referring to the "dynamic code generation stack", with emphasis on the "stack" aspect. The first X86-type VM instruction did not need to make use of the "stack" aspect, since it was only used to virtualize instructions without control flow. The VM instruction that we are currently discussing does need to utilize the stack features, because if a VM instruction executed by the called function made use of the dynamic code generation feature, and generated code at the same location as the code we've just generated, it would overwrite the sequence of code used for the function's return. Hence, after writing the dynamic code just discussed onto the dynamic code generation stack, it must also increase the stack so as to prevent these kinds of overwrites. In concert, the special VM re-entry location for returns must decrement the dynamic code generation stack pointer.

X86 Callout, Indirect Address

The final X86-related VM instruction is rather similar to the one we just discussed, except it is used to virtualize indirect call instructions, i.e., calls where the destination is not known at the time of virtualization. I.e., an instruction like "call eax" would be virtualized by this VM instruction. It generates the following code on the dynamic code generation stack, increments the dynamic code generation stack pointer, and executes it:

9D             popf
61             popa
68 ?? ?? ?? ?? push offset @RESUME
??    ...   ?? [raw X86 machine code copied from VM instruction's data at &Data[4]]
68 ?? ?? ?? ?? push offset NextVMInstruction
60             pusha
9C             pushf
68 ?? ?? ?? ?? push offset VMContext
5B             pop ebx
68 ?? ?? ?? ?? push offset fpVMReEntryReturn
C3             ret

The generated code is nearly identical to the one from the previous instruction, and so not much commentary is needed. The only difference lies in what happens after pushing the return address for the call. The previous VM instruction type pushed an address in the .text section onto the stack, and returned to it; this VM instruction, in contrast, executes a raw x86 instruction copied out of the VM bytecode instruction's data region.

5. Disassembling FinSpy VM Programs

Now we have everything we need to write a disassembler for this sample. We know what each VM instruction does, and how each encodes its operands -- and this information is not necessarily specific to this particular sample; it can potentially be used across many samples. 

There are a number of elements that might easily vary between samples; we know them for this particular sample, at least.

  • We have obtained the raw contents of the VM bytecode program for this sample.
  • We know the XOR key used to decrypt the instructions.
  • We know which opcode corresponds to which VM instruction for this sample.

Since I only had one sample to work with, I chose to hard-code all of these elements in my disassembler. I.e., for any other samples, you will need to determine the three pieces of information just described. This is how I generally approach deobfuscation projects: I start with one sample and write code to break it. Then, I obtain another sample, and determine what needs to change to adapt my previous code to the new sample. If this reveals any sample-specific information, then I need to devise a procedure for extracting the information from a given sample. I repeat this process until my tool can break any sample from the given family.

5.1 Disassembler Architecture

I decided to write my FinSpy VM disassembler in Python, since I figured I might need to make use of my x86 library (the most maintained version of which is also written in Python). You can find the FinSpy VM disassembler here. At several points along the way I wished I'd chosen OCaml, since OCaml is very well-suited to writing programs that manipulate other programs, and has strong type-checking to prevent common errors. In fact I even originally wrote functions to output the disassembled instructions as OCaml datatypes, but I abandoned this idea after I contemplated interoperating between Python and OCaml.

First, I wrote a function to read VM instructions, i.e. 0x18-byte chunks, from the user-specified VM bytecode program binary, and to XOR them with the sample-specific XOR value.

Next, I designed a class hierarchy to represent VM instructions. Since each instruction has a fixed-length encoding of 0x18 bytes, and the first 8 bytes are treated the same by each, it made sense to make a GenericInsn class to handle the common functionality. That class implements a function called Init which takes the decrypted byte array and the position of the instruction within the file, and decodes the common elements in the first 8 bytes.

From there, I derived a few other abstract classes from GenericInsn, to handle families of instructions whose operands were encoded identically. Those classes contain common functions invoked by the constructor to decode the instruction operands, as well as some common functions to assist in printing. These classes are:

  • ConditionalBranch, for the 17 branching instructions
  • StackDisp32, for some of the dedicated register instructions
  • StackDisp8, for other dedicated register instructions
  • Imm32, for other dedicated register instructions

Finally, I derived classes for each of the VM instruction types from the classes just mentioned. Those classes invoke the decoding routines from their base classes in their constructors, and provide support for printing the instructions as strings. Not much is noteworthy about printing the instructions, except that when the dedicated register instructions reference x86 registers by numeric index, I emit the name of the register instead of the index. Also for instruction types with embedded x86 machine code, I invoke my disassembler library to print the machine code as human-readable x86 assembly language.

The last piece was to devise a mechanism for creating the correct Python object for a given decrypted instruction encoded as an 0x18-byte array. For this, I created a dictionary mapping the opcode byte to an enumeration element specifying which VM instruction it represented. Then, I created a dictionary mapping those enumeration elements to lambda functions that constructed the correct object type.

Finally I combined the Python object-creation mechanism just described with the raw instruction reading mechanism from the beginning, resulting in a generator that yielded Python instructions. From there, disassembling the program is simply a matter of printing the instruction objects:

for x in insn_objects_from_file("Tmp/dec.bin"):
    print x

You can see the complete disassembly for this sample's VM bytecode program here.

Part Two: Low-Level Analysis Details

6. Analysis of FinSpy VM Pre-Initialization

Here is a link to the assembly language for the pre-initialization phase, as well as a rough decompilation of that code. We will reproduce snippets of code from the decompilation in explaining how it operates. Feel free to open up the binary and follow along at the instructions shown in the first link showing the assembly language.

6.1 Saving the Host Registers and Flags

The first thing that any VM does upon entry is to save the contents of the registers and the flags. FinSpy VM is no different:

.text:00401BF6  push    eax ; PUSHFD sequence, in order
.text:00401D5C  push    ecx
.text:004019D2  push    edx
.text:004020A4  push    ebx
.text:00401CEC  push    ebx ; EBX in place of ESP
.text:00401DA7  push    ebp
.text:00401F41  push    esi
.text:00401EFF  push    edi
.text:00401E9D  pushf

6.2 A Word on Position-Independent Code (PIC)

Like many obfuscated programs (and some non-obfuscated ones), the x86 implementation of the FinSpy VM uses position-independent code (PIC). PIC is employed in non-obfuscated contexts to reduce the number of pointers within the program that require relocation should the imagebase change. PIC is employed in obfuscated contexts for the same reason, as well as to obscure which addresses are actually being accessed by a particular memory operation. The next two instructions establish EBP as a PIC base:

.text:00401F60  call    $+5 ; Establish position-independent code (PIC) base
.text:00401F65  pop     ebp ; EBP = 00401F65

Any subsequent memory reference involving EBP (before EBP is subsequently modified) will be computed relative to the address shown. For example:

.text:004020D7  mov     eax, [ebp+0D75h] ; == [401F65h+0D75h] == [402CDAh]
.text:00402CDA  dd 0C4CDh ; <- this is the referenced address and its contents

6.3 Allocating the Global Per-Thread VM Context Structure Array

The FinSpy VM is architected in such a way where multiple threads can execute VM code at the same time. As a result, each thread must have its own VM context structure, and not much of the VM state can be held in global data items. FinSpy handles this by allocating a global array that holds pointers to each thread's VM context structure, where each pointer is initially NULL.

The first thing the pre-initialization function does is to check a global pointer to see whether we have already allocated the thread-specific array, and if not, allocate it. The following code snippets from the decompilation illustrate:

DWORD *  gp_VMContext_Thread_Array = NULL;
DWORD **gpp_VMContext_Thread_Array = &gp_VMContext_Thread_Array;

DWORD *GetVMContextThreadArray()
  return *gpp_VMContext_Thread_Array;

void SetVMContextThreadArray(DWORD *pNew)
  *gpp_VMContext_Thread_Array = pNew;

void VMPreInitialize(DWORD encodedVMEip)
  if(GetVMContextThreadArray() == NULL)

6.4 Allocating Thread-Specific VM Context Structures

After the previous step, FinSpy knows the global array of VM Context structure pointers has been allocated, so it queries whether a VM Context structure has been allocated for the thread in which it is currently executing. It uses the current thread ID, shifted right by 2 to mask off the low 2 bits, as an index into the array. If not already allocated, it must then allocate a context structure for the current thread.

DWORD GetThreadIdx()
  // If we're executing in user-mode...
  if(g_Mode != X86_KERNELMODE)
    return GetCurrentThreadId() >> 2;
  // Kernel-mode code has not been decompiled here: it relies upon an RVA
  // that has been set to 0 in my sample, so can't be analyzed

// ... Continuing inside of void VMPreInitialize(DWORD encodedVMEip) ...
DWORD *dwGlobal   = GetVMContextThreadArray();
DWORD dwThreadIdx = GetThreadIdx();
VMContext *thContext = dwGlobal[dwThreadIdx];
if(thContext == NULL)
  thContext = Allocate(0x10000);
  dwGlobal[dwThreadIdx] = thContext;

6.5 Pre-Initializing Thread-Specific VM Context Structures

If the VM Context structure for the current thread was not already allocated in the previous step, FinSpy performs some preliminary initialization of the VM context structure. This is the first point in the code that we begin to understand the layout of the VM Context structure, whose full layout is shown here.

// Base address for currently-running executable
thContext->dwBaseAddress = g_BaseAddress;

// Last DWORD in VM Context structure
thContext->dwEnd = (DWORD)thContext + 0xFFFC;

// Initialize pointer to data section
thContext->pData = &thContext->Data[1];

While analyzing this and subsequent code, the assembly language contained many structure references such as the following:

.text:00401B19  mov     ecx, [ebp+0D79h] ; ecx = .text:00402CDE     BASE_ADDRESS dd 400000h
.text:00401FA9  mov     [ebx+28h], ecx   ; <- structure reference: field_28
.text:00401A2F  mov     eax, ebx
.text:00401BCC  add     eax, 0FFFCh
.text:00401AAD  mov     [ebx+4], eax     ; <- structure reference: field_4
.text:00401EC2  lea     eax, [ebx+50h]   ; <- structure reference: field_50
.text:00401C56  add     eax, 4
.text:00401998  mov     [ebx+50h], eax   ; <- structure reference: field_50

I knew that I would later need to know where and how these structure fields had been initialized, so I opened up a new window in my text editor and took notes on each structure field access. Later I turned those notes into an actual structure in IDA. The full notes can be found here. Here were there notes I took for the fields referenced in the assembly snippet above:

  • field_4:  00401AAD: pointer to last DWORD in this structure (0xFFFC offset)
  • field_28: 00401FA9: dwBaseAddress
  • field_50: 00401998: initialized to address of field_54

6.6 Results of Analyzing the Decompression Code

FinSpy keeps its VM instructions in a large array of VMInstruction structures. This array may be compressed, in which case, the first time the next phase of pre-initialization executes, the instructions will be decompressed. Subsequent thread-specific pre-initializations of VM Context structures need not decompress the instructions after they have been decompressed for the first time.

// If dwInstructionsDecompressedSize was 0, then the instructions aren't
// compressed, so just copy the raw pointer to the instructions to the
// pointer to the decompressed instructions
if(dwInstructionsDecompressedSize == 0)
  *dwInstructionsDecompressed = *dwInstructionsCompressed;

// If the decompressed size is not zero, this signifies that the
// instructions are compressed. Allocate that much memory, decompress the
// instruction data there, and then cache the decompression result
  // If we've already decompressed the instructions, don't do it again
  if(*dwInstructionsDecompressed == NULL)
    // Allocate RWX memory for the obfuscated, encrypted APLib decompression
    // code
    void *pDecompressionStub = Allocate(dwDecompressionStubSize);
    memcpy(pDecompressionStub, dwDecompressionStubSize, &x86DecompressionStub);
    // Decrypt the decompression code (which is still obfuscated)
    XorDecrypt(pDecompressionStub, dwDecompressionStubSize, dwXorKey);
    // Allocate memory for decompressed instructions; decompress
    VMInstructions *pDecInsns = Allocate(dwInstructionsDecompressedSize);
    fpDecompress Decompress = (fpDecompress)pDecompressionStub;
    Decompress(*dwInstructionsCompressed, pDecInsns);
    // Update global pointer to hold decompressed instructions
    *dwInstructionsDecompressed = pDecInsns;
// Store the pointer to decompressed instructions in the context structure
thContext->pInstructions = *dwInstructionsDecompressed;

// Locate the first instruction by encoded VM EIP
thContext->pCurrInsn = FindInstructionByKey(thContext, encodedVMEip);

6.7 Process of Analyzing the Decompression Code

This section describes how the results in the previous section were obtained. After the previous phases of pre-initialization, I observed FinSpy decrypting some obfuscated code stored within its .text section, and then executing it. Seeing it for the first time, I didn't know what the decrypted code did; it turned out to be responsible responsible for decompressing the VM bytecode program. Here's how I figured that out.

For extra stealth, the decompression code is itself encrypted via XOR, and is also obfuscated using the same conditional jump-pair technique seen in the previous entry. When analyzing this part of pre-initialization for the first time, I first noticed it allocating a small amount of RWX memory, copying data from the .text section, and decrypting it:

.text:004020D7  mov     eax, [ebp+0D75h] ; eax = .text:00402CDA  dd 0C4CDh
.text:00401E1A  add     eax, [ebp+0D79h] ; eax = 0C4CDh + .text:00402CDE BASE_ADDRESS dd 400000h
                                         ; i.e. eax contains 0x40C4CD, an address in .text
.text:00401A1F  push    eax              ; eax = 0x40C4CD
.text:00401CDE  push    ecx              ; ecx = .text:00402CD6 dd 60798h
.text:00401E8A  mov     eax, [ebp+0D6Dh] ; eax = .text:00402CD2 dd 67Dh

.text:00401B94  push    eax
.text:00401B95  call    ALLOCATE_MEMORY  ; Allocate 0x67D bytes of RWX memory
.text:00401B9A  mov     edi, eax
.text:00401C18  push    edi              ; Save address of allocated memory
.text:00401D45  mov     esi, [ebp+0D61h] ; esi = .text:00402CC6 dd offset dword_40BE50

.text:00401CA5  cld                      ; Clear direction flag for rep movsb below
.text:00401FF9  mov     ecx, [ebp+0D6Dh] ; .text:00402CD2 dd 67Dh

.text:00401E54  push    ecx              
.text:00402064  rep movsb                ; Copy 0x67D bytes from dword_40BE50 to allocated RWX memory
.text:00401E41  pop     ecx              ; Restore ecx = 60798h
.text:00401A86  pop     edi              ; Restore edi = address of allocated region
.text:00401C2C  mov     eax, [ebp+0D7Dh] ; .text:00402CE2 XOR_VALUE dd 2AAD591Dh

.text:00401D7C  push    eax              ; eax = value to XOR against allocated memory
.text:00401D7D  push    ecx              ; ecx = size of allocated region
.text:00401D7E  push    edi              ; edi = beginning of allocated region
.text:00401D7F  call    loc_40250B       ; XOR_DWORD_ARRAY: XOR DWORDs starting at edi+4, to edi+(ecx>>2)

Seeing the code being decrypted, I wrote a small IDC script that replicated the action of the function at loc_40250B, a/k/a XOR_DWORD_ARRAY. It can be found here and is reproduced here for ease of presentation.

static DecryptXOR(encAddr, encLen, xorVal)
  auto encNum;
  auto cursor;
  auto i;

  // Skip first DWORD
  cursor = encAddr + 4;
  encNum = (encLen - 4) >> 2;
  for(i = 0; i < encNum; i = i + 1)
    cursor = cursor + 4;

And then I invoked it to decrypt the region of code being referenced in the snippet above: DecryptXor(0x40BE50, 0x67D, 0x2AAD591D).

Continuing to follow the assembly code where the last snippet left off, we see:

.text:00401CB9  pop     ecx              ; Restore ecx = .text:00402CD6 dd 60798h
.text:00402046  pop     eax              ; Restore eax = 0x40C4CD
.text:00401F79  mov     esi, eax         ; esi = 0x40C4CD
.text:00401FBE  push    ecx              ; Allocate 60798h bytes of memory
.text:00401FBF  call    ALLOCATE_MEMORY
.text:00401FC4  mov     edx, edi         ; edx = 0x67D-byte region of RWX memory, decrypted previously
.text:00401AD6  mov     edi, eax         ; edi = 60798h bytes of memory just allocated
.text:00401A4C  push    edi              ; Second argument = edi (60798h-byte region)
.text:00401A4D  push    esi              ; First argument = 0x40C4CD
.text:00401A4E  call    edx              ; Execute decrypted code in smaller region

Now I knew the decrypted region contained code copied from 0x40C4CD, so I began to analyze it. It was obfuscated with the same conditional jump-pairs as we saw in part one. As somebody who has been reverse engineering malware for 15 years, though is somewhat rusty due to different focuses lately, I immediately experienced a sense of deja vu. I knew I'd seen the following function a million times before:

.text:0040C320     add     dl, dl
.text:0040C1E6     jnz     short @@RETURN
.text:0040BF24     mov     dl, [esi]        
.text:0040C01F     inc     esi    
.text:0040C100     adc     dl, dl    
.text:0040BFE8     retn    

Specifically, the "adc dl, dl" instruction jumped out at me. Were I more regularly engaged in malware analysis, I would have immediately recognized this as part of the APLib decompression code. Instead, I wound up searching Google with queries like 'x86 decompression "adc dl, dl"'. Eventually my queries led me to other malware analysis write-ups which identified APLib as the culprit decompression library.

Once I recognized that the code I'd decrypted probably implemented APLib, I decided to try to unpack it and see what happened. If I'd gotten gibberish as a result, I may have needed to dive deeper into the code to see whether it had implemented any custom modifications. Such is a bane of static reverse engineering -- had I been dynamically analyzing this code, I could have merely dumped it out of memory.

I began by using an IDC one-liner to dump the compressed blob out of memory: savefile(fopen("e:\\work\\Malware\\blah.bin", "wb"), 0, 0x40C4CD, 0x60798); If you want the compressed blob, you can find it here.

Now that I knew I had an APLib-compressed blob, I needed to decrypt it. First I went to the APLib website and downloaded the APLib package. Next I simply tried to invoke the stand-alone decompression tool on my blob, but it failed with a non-descriptive error message. I began reading the source code and tracing the decompression tool in a debugger, and realized that the failure arose from APLib expecting a header before the encrypted blob. Fair enough. I set about re-creating a header for my data, but failing to do so within 20 minutes and becoming impatient, I started looking for another way.

I came across the open-source Kabopan project which purported to offer "a python library of various encryption and compression algorithms, cleanly re-implemented from scratch." After a few minutes reading the examples, I wrote a small script which did the trick:

from kbp.comp.aplib import decompress

with open("compressed.bin", "rb") as g:
    dec,decSize = decompress(g.read()).do()
    print decSize
with open("decompressed.bin", "wb") as f:

Kabopan didn't complain, and my script printed the expected decompressed size, so for all I knew, it had worked. The decompressed result can be found here. I loaded the decompressed binary into IDA, not knowing what to expect. The data had some regularly-recurring patterns in it, but its format was still unknown to me at this point. I would later analyze it more fully.

7. Analysis of FinSpy VM Initialization

This section resumes immediately after the pre-initialization phase from the last section. The assembly language for the initialization phase is shown here; feel free to follow along. For clarity of exposition, most of the following shall use the decompilation, shown in full here. To summarize the last section, FinSpy VM's pre-initialization phase is clearly denoted by two attempts to access previously-allocated structures. If the structures aren't present, it performs the allocation and pre-initialization phases as described in the previous section. If they are present, it skips those phases and branches directly to the initialization phase, mocked up in pseudocode as such:

// From pre-initialization phase, shown previously
// Try to get the global VM Context * array
if(GetVMContextThreadArray() == NULL)

// Once allocated, retrieve the global pointer
DWORD *dwGlobal   = GetVMContextThreadArray();

// Try to get the per-thread VMContext *
DWORD dwThreadIdx = GetThreadIdx();
VMContext *thContext = dwGlobal[dwThreadIdx];
if(thContext == NULL)
  // We detailed this phase in the previous section:
  // Allocate and pre-initialize the VMContext * if not present

// Beginning of initialization phase
// Once allocated, retrieve the thread-specific pointer
DWORD *dwGlobal   = GetVMContextThreadArray();
DWORD dwThreadIdx = GetThreadIdx();
VMContext *thContext = dwGlobal[dwThreadIdx];

// ... execution resumes here after both pointers are allocated ...

7.1 Initializing VM Function Pointers

Most of the initialization phase computes the value of function pointers using a position-independent code idiom. Subsequent analysis clarified the purpose of these function pointers, which were somewhat difficult to understand in isolation. They will be explained in a subsequent section, in a more natural context (section 9).

Like the the pre-initialization code, this section contains many structure references. I continued to update my document describing what I knew about each structure reference. Again, those notes can be found here

thContext->pPICBase = (void *)0x00402131;

// Points to the vm_initialize label above
thContext->fpFastReEntry = (fpReEntryNoState)0x0040247C;
thContext->fpLookupInstruction = &FindInstructionByKey;

if(g_Mode != X86_KERNELMODE)
  thContext->fpEnterFunctionPointer = (fpReEntryHasState)0x00402B92;
  thContext->fpEnterFunctionPointer = (fpReEntryHasState)0x00402BC6;

thContext->fpExecuteVMInstruction = (fpReEnterToInstruction)0x00402C39;
thContext->fpRelativeJump = (fpReEntryHasState)0x00402B74;
thContext->fpNext = (fpReEntryHasState)0x00402B68;

One thing I did notice quickly was that the "fpFastReEntry" function pointer pointed to the VM initialization code. Recalling the diagram from section 3.2 showing generic VM organization, note that VM instructions need to locate the instruction fetch/decode code within the VM entrypoint. I guessed that the code at that location might serve that purpose; that was mostly correct (actually some initialization is performed before instruction fetch, namely the initialization shown in the pseudocode above).

7.2 Host <=> VM State Transfer

After installing the function pointers as per the previous section, the VM then saved the host's ESP into the VM context structure, twice, and then set the host's ESP to a point near the end of the VM context structure. Hence it was clear that the VM state included a stack, physically located at the end of the context structure.

thContext->pSavedEsp1 = ESP;
thContext->pSavedEsp2 = ESP;
ESP = thContext->pDataEnd;

8. Analysis of FinSpy VM Instruction Fetch, Decode, and Dispatch

At this point in the analysis, I was still hazy about many of the VM Context structure entries, though fortunately the notes I'd taken allowed me to efficiently piece things together. The subsequent code dealt with loading instructions from the decompressed memory region, optionally transforming them, and then transferring control to the relevant x86 code responsible for implementing them.

8.1 FinSpy VM Instruction Fetch

The pseudocode decompilation for the instruction fetch phase can be found here, and the commented assembly listing can be found here.

After initializing the context structure, the FinSpy VM then makes use of some fields we've already analyzed. Thanks to the ongoing notes I'd taken, it was easy to understand what was happening. Recall these two statements from the pre-initialization phase when decompressing the instructions:

thContext->pInstructions = *dwInstructionsDecompressed;
thContext->pCurrInsn = FindInstructionByKey(thContext, encodedVMEip);

I.e. the pCurrInsn member of the VM Context structure already contains a pointer somewhere within the pInstructions data blob. Returning to the present, the next part of the FinSpy VM entrypoint makes use of these fields:

VMInstruction *Fetch(VMContext *thContext)
  memcpy(&thContext->Instruction,thContext->pCurrInsn, sizeof(VMInstruction));
  XorDecrypt(&thContext->Instruction, sizeof(VMInstruction), dwXorKey);
  return &thContext->Instruction;

From here we learn the size of each entry in the Instruction structure member -- namely, 0x18 -- and we also see that the instructions are decrypted using the same function we saw previously to decrypt the decompression code. We will return to talk more about these points in Section 8.4.

Note again the following line from pre-initialization:

thContext->pCurrInsn = FindInstructionByKey(thContext, encodedVMEip);

To refresh your memory from part one, each VM entrypoint pushed a value onto the stack before transferring control to the VM entrypoint. An example is reproduced from part one:

.text:004065A4     push    5A403Dh                  ; <- encodedVMEip
.text:004065A9     push    ecx                      ; 
.text:004065AA     sub     ecx, ecx                 ; 
.text:004065AC     pop     ecx                      ; 
.text:004065AD     jz      loc_401950               ; VM entrypoint

This "encodedVMEip" parameter is used to look up the first VM instruction that should be executed beginning at that VM entrypoint. Namely, the FindInstructionByKey function takes the encodedVMEip parameter, iterates through the decompressed instructions, and tries to find one whose first DWORD matches that value:

VMInstruction *FindInstructionByKey(VMContext *vmc, DWORD Key)
  VMInstruction *insns = vmc->pInstructions;
  while(insns->Key != Key)
  return insns;

So we have learned that:

  • Each VM instruction is associated with a key.
  • The key is the first element in the VM instruction structure.
  • The VM may look up instructions by their keys.

8.2 FinSpy VM Instruction Decode

The pseudocode decompilation for the instruction decode phase can be found here, and the commented assembly listing can be found here.

After copying an instruction out of the global instructions array and decrypting it, the VM entrypoint then calls a function that will optionally fix-up DWORDs within the instructions by adding the current module's imagebase to them. This functionality allows the instructions to reference addresses within the module's image without having to care where the module is loaded. This functionality effectively performs the same job as the operating system's loader.

void Decode(VMContext *thContext)
  VMInstruction *ins = &thContext->Instruction;
    *(DWORD *)&ins->Data[ins->Fixup1 & 0x7F] += thContext->dwImageBase;
    *(DWORD *)&ins->Data[ins->Fixup2 & 0x7F] += thContext->dwImageBase;

8.3 FinSpy VM Instruction Dispatch

The pseudocode decompilation for the instruction dispatch phase can be found here, and the commented assembly listing can be found here.

Finally, after the instructions have been decrypted and optionally fixed up, the VM then transfers control to the x86 code that implements the instruction. In decompilation:

// Locate the handler table in the .text section
fpInstruction *handlers = (fpInstruction *)vmc->pPICBase + 0x9AF;
// Invoke the relevant function in the handler table

In x86:

.text:00402311  movzx   ecx, byte ptr [ebx+3Ch] ; Fetch the opcode byte
.text:0040248D  mov     eax, [ebx+24h]          ; Grab the PIC base
.text:004023FC  add     eax, 9AFh               ; Locate the VM handler table
.text:0040218C  jmp     dword ptr [eax+ecx*4]   ; Transfer control to the opcode handler

8.4 A Cursory Look at the VM Bytecode

In this chapter on instruction fetch / decode / dispatch, we learned that the instructions are 0x18-bytes apiece, and that each instruction is encrypted via a DWORD-level XOR. Since I'd previously decompressed the instruction blob, I could now decrypt it and take a look. I used the same script I had previously written during my analysis of the pre-initialization decompression code, linked again for convenience. I just wrote a script to call that function in a loop, and executed it within the IDB for the decompressed blob:

auto i;
for(i = 0; i < 0x00060797; i = i + 0x18)
  DecryptXOR(i, 0x18, 0x2AAD591D);

After running that script, the data format was still mostly unknown. The data was mostly zeroes, but clearly very structured. Since I have written an x86 disassembler before, some of the non-zero data jumped out at me as looking plausibly like x86 machine code. I turned a few of them into code and IDA confirmed:

seg000:00000078   dd 5A1461h
seg000:0000007C   db 1Bh ; VM opcode: 0x1B
seg000:0000007D   db 6   ; x86 instruction length
seg000:0000007E   db 0   ; fixup1 location
seg000:0000007F   db 0   ; fixup2 location
seg000:00000080 ; ---------------------------------------------------------------------------
seg000:00000080   mov     dword ptr [eax], 1D7h
seg000:00000080 ; ---------------------------------------------------------------------------
seg000:00000086   db 0Ah dup(0)

seg000:00000090   dd 5A1468h
seg000:00000094   db 1Bh ; VM opcode: 0x1B       
seg000:00000095   db 5   ; x86 instruction length
seg000:00000096   db 1   ; fixup1 location       
seg000:00000097   db 0   ; fixup2 location       
seg000:00000098 ; ---------------------------------------------------------------------------
seg000:00000098   mov     eax, offset unk_1F000
seg000:00000098 ; ---------------------------------------------------------------------------
seg000:0000009D   db 0Bh dup(0)

seg000:000000A8   dd 5A146Ch
seg000:000000AC   db 1Bh ; VM opcode: 0x1B       
seg000:000000AD   db 1   ; x86 instruction length
seg000:000000AE   db 0   ; fixup1 location       
seg000:000000AF   db 0   ; fixup2 location       
seg000:000000B0 ; ---------------------------------------------------------------------------
seg000:000000B0   leave
seg000:000000B0 ; ---------------------------------------------------------------------------
seg000:000000B1   db 0Fh dup(0)

seg000:000000C0   dd 5A146Dh
seg000:000000C4   db 1Bh ; VM opcode: 0x1B       
seg000:000000C5   db 3   ; x86 instruction length
seg000:000000C6   db 0   ; fixup1 location       
seg000:000000C7   db 0   ; fixup2 location       
seg000:000000C8 ; ---------------------------------------------------------------------------
seg000:000000C8   retn    4
seg000:000000C8 ; ---------------------------------------------------------------------------
seg000:000000CB   db 0Dh dup(0)

This was curious to say the least. Why were there single, isolated x86 instructions contained in data blobs? Does the VM really contain raw x86 instructions? Though not all of the instructions are like this, it turns out that, yes, indeed, about half of the VM instructions contain literal x86 machine code within them. I figured that if I could convert the remaining non-raw-x86 VM instructions into x86, it would probably be easy to revert the entire VM bytecode program back into x86.

Anyway, talk about recovering the x86 program is premature at the moment. Now that we have decrypted the VM bytecode program, it will be easier for us to reverse engineer the VM instruction set. If we are ever confused about how an x86 instruction handler for a VM opcode is using parts of the VM instruction structure, we can simply look up real examples of those instructions for clarification without having to guess.

9. Analysis of FinSpy VM Exit and Re-Entry

Before we discuss the VM instruction set, let's return to the five function pointers found within the VM context structure pertaining to VM exit and re-entry. All of these disassembly listings can be found here, though they are reproduced herein.

9.1 Re-Enter VM, State in Registers

The first function pointer for returning to the VM, called fpFastReEntry in the VM Context structure, simply contains the address of the initialization phase within the VM entrypoint. It assumes that the registers and flags contain meaningful values, and that they have not already been placed on the stack.

9.2 Re-Enter VM, State on Stack

The next return sequence assumes that EAX contains the address of the function pointer from the previous section, and that the flags and registers are saved on the stack. It simply pops the flags and registers off the stack, and transfers control to the function pointer in EAX.

.text:00402BB2     mov     [esp-4], eax        ; Save VM initialization location from EAX onto stack
.text:00402BAB     popf                        ; Restore flags
.text:00402BBD     popa                        ; Restore registers
.text:00402B9C     jmp     dword ptr [esp-28h] ; Branch to VM initialization location

9.3 Branch Taken, State on Stack

This return sequence is used by the VM's conditional and unconditional branch instructions when the branch should be taken. It has the capability to add a displacement to VM EIP -- very similar to how processors like x86 implement relative branches. It also has the capability to branch to an x86 location specified as an RVA, but this code was never used in the sample that I inspected.

.text:00402B74 VM__RelativeJump__Execute:
.text:00402B74     mov     eax, [ecx+1]                         ; Instruction data at DWORD +1 contains
.text:00402B74                                                  ; a displacement for the VM EIP
.text:00402B77     test    eax, eax                             ; Was it non-zero?
.text:00402B79     jz      short jump_to_x86                    ; No -- branch to x86 code at [ecx+5]
.text:00402B7B     add     [ebx+VMContext.pCurrInsn], eax       ; Add the displacement to VM EIP
.text:00402B7D     mov     eax, [ebx+VMContext.fpVMEntry]       ; eax := state-in-registers VM entry
.text:00402B80     mov     esp, [ebx+VMContext.SavedESP1]       ; esp := restore host ESP
.text:00402B83     jmp     [ebx+VMContext.fpVMReEntryToFuncPtr] ; branch to state-on-stack re-entry
.text:00402B86 ; ---------------------------------------------------------------------------
.text:00402B86 jump_to_x86:                             ; CODE XREF: ...
.text:00402B86     mov     eax, [ecx+5]                         ; Otherwise, use [ecx+5] as x86 RVA
.text:00402B89     add     eax, [ebx+VMContext.dwBaseAddress]   ; Convert RVA to VA
.text:00402B8C     mov     esp, [ebx+VMContext.SavedESP1]       ; esp := restore host ESP
.text:00402B8F     jmp     [ebx+VMContext.fpVMReEntryToFuncPtr] ; branch to state-on-stack re-entry

9.4 Branch Not Taken, State on Stack

This return sequence is used by the VM's conditional and unconditional branch instructions when the branch is not taken. It simply adds 0x18, the size of a VM instruction, to the current VM instruction pointer, and re-enters the VM via the "state on stack" handler.

.text:00402B68 VM__NextInsn__Execute:
.text:00402B68     add     [ebx+VMContext.pCurrInsn], size VMInstruction ; Point VM EIP to next instruction
.text:00402B6B     mov     eax, [ebx+VMContext.fpVMEntry]                ; EAX := ordinary VM re-entry location
.text:00402B6E     mov     esp, [ebx+VMContext.SavedESP1]                ; Restore host ESP
.text:00402B71     jmp     [ebx+VMContext.fpVMReEntryToFuncPtr]          ; branch to state-on-stack re-entry

9.5 Function Return, State on Stack

Finally, this VM re-entry function pointer is invoked by dynamically-generated code after a function call returns. The code must pop the last frame off of the dynamically-generated code stack, and update the VMContext->pCurrInsn entry to the VM instruction after the one that issued the function call. Since the VM instruction pointer is saved on the stack above the registers and flags, this code shifts the registers and flags up by one DWORD, and adds 4 to the stack pointer. Finally, it re-enters the VM using the state-on-stack function pointer.

.text:00402C6F     mov     eax, [esp+24h] ; Get saved VMInstruction * from stack
.text:00402CB0     lea     esi, [esp+20h] ; esi = address of registers begin
.text:00402C4E     lea     edi, [esp+24h] ; edi = address of VMInstruction *
.text:00402C40     mov     ecx, 9
.text:00402C66     std                    ; Move saved registers/flags up by
.text:00402C93     rep movsd              ; one DWORD, erasing the VMInstruction *
.text:00402CAA     cld
.text:00402C79     add     esp, 4         ; Adjust stack to remove superfluous DWORD
.text:00402C8B     mov     [ebx+VMContext.pCurrInsn], eax ; Set next instruction
.text:00402C58     sub     [ebx+VMContext.pScratch], 30h  ; Pop function stack
.text:00402C83     mov     eax, [ebx+VMContext.fpVMEntry] ; Get VM entry pointer
.text:00402C9F     jmp     [ebx+VMContext.fpVMReEntryToFuncPtr] ; Use state-on-stack return

10. Analysis of FinSpy VM Instruction Set

The VM instruction set, 34 opcodes in total, can be divided neatly into three groups.

  • Conditional and unconditional jumps
  • Operations involving the dedicated register
  • X86-related instructions

10.1 Group #1: Conditional and Unconditional Jumps

Annotated disassembly listings for these instructions can be found here.

The conditional jump instructions are all fairly simple, and nearly identical to one another. Note and recall that at the time the VM entrypoint transfers control to any instruction, the host stack contains the saved flags at the bottom, and the saved registers immediately above that. The conditional jump instructions all load the saved flags off of the bottom of the host stack, and then test the desired combination of bits by using the "test" instruction to isolate individual bits from the saved EFLAGS register. To make my job easier, I created an enumeration in IDA to give names to the symbolic constants:

FFFFFFFF ; enum X86Flags, mappedto_291
FFFFFFFF X86Flags_CF  = 1
FFFFFFFF X86Flags_PF  = 4
FFFFFFFF X86Flags_AF  = 10h
FFFFFFFF X86Flags_ZF  = 40h
FFFFFFFF X86Flags_SF  = 80h
FFFFFFFF X86Flags_OF  = 800h

Now we can easily apply this enumeration to the disassembly listing to tell which bits are being tested. Here is the x86 code for the VM "JL" instruction:

.text:00403007     lea     ecx, [ebx+VMContext.CurrentInsn.InsnData]
.text:00402FD2     mov     eax, [ebx+VMContext.SavedESP1]
.text:00402FE8     mov     eax, [eax] ; Grab saved flags from the host stack
.text:0040302D     test    eax, X86Flags_OF ; 800h
.text:00402FFA     setnz   dl
.text:0040303C     test    eax, X86Flags_SF ; 80h
.text:0040304C     setnz   al
.text:00403014     cmp     al, dl
.text:00403021     jnz     short loc_402FDD

; JNZ path
    .text:00402FDD     jmp     [ebx+VMContext.fpVMRelativeJump]

; JZ path
    .text:00402FC9     jmp     [ebx+VMContext.fpVMNext]

This sequence tests the overflow flag (OF) and sign flag (SF) within EFLAGS, and performs a relative jump if they are different from one another. (That's the "JNZ path" in the code above.) If they are the same, execution is instead transferred to the physically next VM instruction (the "JZ path" above) -- exactly the same way that a real conditional jump works on a real x86 processor.

You might not have the condition codes for the various conditional jumps memorized. I.e. you might be confused, when encountering the code above, about which x86 conditional jump instruction corresponds to the situation where the branch is taken when OF!=SF. I usually forget them, too; I reference the code from my X86-to-IR translator when I need a reminder. Here's a table so you don't have to look them up.

Branch | Condition
JS       SF
JNS      !SF
JO       OF
JNO      !OF
JZ       ZF
JNZ      !ZF
JP       PF
JNP      !PF
JB       CF
JAE      !CF
JA       !CF && !ZF
JBE      CF || ZF
JL       SF != OF
JGE      SF == OF
JG       SF == OF && !ZF
JLE      SF != OF || ZF

The unconditional jump instruction uses some trickery involving dynamically-generated x86 code. It seems suitable to delay that instruction until we discuss dynamic code generating-VM instructions in Section 10.3.

10.2 Group #2: "Dedicated Register" Instructions

Annotated disassembly listings for these instructions can be found here.

Analyzing the "dedicated register" instructions confused me at first. As you can see in the code snippet below, each of these contains a memory reference that looks like "[ecx+eax*4+4]", where ECX points to the saved ESP before VM entry, and EAX is computed as 7-Data[0]. 

.text:00403982     lea     ecx, [ebx+VMContext.CurrentInsn.InsnData] ; point ECX to the data region
.text:004039B0     movzx   ecx, byte ptr [ecx]                       ; get the first byte
.text:00403995     mov     eax, 7
.text:0040396C     sub     eax, ecx                                  ; EAX := 7-data[0]
.text:00403947     mov     ecx, [ebx+VMContext.SavedESP1]            ; ECX := host ESP
.text:0040395E     mov     eax, [ecx+eax*4+4]                        ; Grab saved register
.text:00403951     mov     [ebx+VMContext.dwScratch], eax            ; Store loaded register into SCRATCH
.text:00403978     add     [ebx+VMContext.pCurrInsn], size VMInstruction
.text:004039B8     mov     eax, [ebx+VMContext.fpVMEntry] 
.text:004039A2     mov     esp, [ebx+VMContext.SavedESP1] 
.text:0040398D     jmp     [ebx+VMContext.fpVMReEntryToFuncPtr]

I eventually realized that the host stack contains the saved flags followed by the saved registers. Thus, the "+4" part of the memory operand indexes past the saved flags. Then, the value of EAX in "eax*4" is used to select one of the saved registers. If the byte in Data[0] was 0, then this instruction would select the 7th element in the saved register array -- i.e. the first register pushed, which was EAX. If Data[0] was 1, then this instruction would select the 6th saved register -- i.e. the second register pushed, which was ECX.

Since I've written a disassembler before, I realized the connection between Data[0] and the register chosen. x86 machine code assigns a number to its 32-bit registers in the following order: EAX, ECX, EDX, EBX, ESP, EBP, ESI, EDI. Thus, Data[0] selects a register in that order. Therefore, it is a simple matter to figure out which register is accessed by one of these instructions -- just keep an array with the register names in that order.

I was also worred about what happens when Data[0] was not in the range of [0,7]. Negative numbers could result in indexing below the host stack, and numbers 8 or above would result in indexing above the saved registers. I wasn't sure how I would represent and devirtualize these possibilities. I coded limited support for these situations in my disassembler, figuring I would deal with the problem if it ever materialized. Fortunately, it didn't.

10.3 Group #3: X86-Related Instructions

Annotated disassembly listings for these instructions can be found here.

The high-level summary already described the purpose of these instructions: to execute x86 instructions whose machine code are provided within the relevant VM instructions' data regions. The only real difficulty in analyzing them was making sense of which instructions they were writing to the dynamic code region. As you can see in the snippet below, the values written are obfuscated in the disassembly listing:

.text:004040F1     mov     edi, [ebx+VMContext.pScratch]
.text:004040BF     mov     eax, [ebx+VMContext.SavedESP1]
.text:004040DD     and     eax, 2Fh
.text:00404110     add     edi, eax
.text:0040412E     push    edi
.text:00404119     mov     word ptr [edi], 6ACh         ; <- Write 0x6AC
.text:0040408F     xor     word ptr [edi], 6731h        ; <- XOR with 0x6731
.text:00404139     add     edi, 2
.text:004040B1     lea     esi, [ebx+VMContext.CurrentInsn.InsnData]
.text:00404086     movzx   ecx, byte ptr [ebx+VMContext.CurrentInsn.DataLength]
.text:004040A2     rep movsb
.text:0040414F     mov     byte ptr [edi], 0E9h         ; <- Write 0xE9
.text:00404165     xor     byte ptr [edi], 81h          ; <- XOR with 0x81
.text:0040405E     mov     eax, [ebx+VMContext.fpVMEntry]
.text:004040CA     mov     [edi+1], eax
.text:00404068     mov     byte ptr [edi+5], 6Ch        ; <- Write 0x6C
.text:004040D3     xor     byte ptr [edi+5], 0AFh       ; <- XOR with 0xAF
.text:004040E5     add     dword ptr [ebx+VMContext.pCurrInsn], size VMInstruction
.text:0040415B     pop     eax                         
.text:004040FD     mov     esp, [ebx+VMContext.SavedESP1]
.text:00404071     jmp     eax

For example, these two lines:

.text:00404119     mov     word ptr [edi], 6ACh         ; <- Write 0x6AC
.text:0040408F     xor     word ptr [edi], 6731h        ; <- XOR with 0x6731

Write 0x06AC, and then XOR that value with 0x6731. Fortunately it's easy enough to use IDA's calculator feature (press '?' in the disassembly listing) to perform these computations. I.e., that snippet writes 0x619D, or in bytes, 9D 61. From here you can look the bytes up in the Intel processor opcode maps, or if you're lazy, find some alignment bytes in the binary, use IDA's Edit->Patch Program->Change Bytes.. feature, write the bytes, and then turn them into code to have IDA automatically disassemble them for you.

10.3.1 Unconditional Jumps

The unconditional branch instruction also makes use of dynamically-generated code, despite not needing to do so. Specifically, it writes the following bytes to the VM stack area: E9 02 F8 B0 F9 B8 [0x004033C9] FF E0, where the value in brackets is a sample-specific value that refers to an address in the .text section that contains code.

If you were to disassemble the sequence of x86 machine code just given, you would find that the first five bytes are an x86 long relative unconditional jump. Then you would stop and scratch your head, since if you write a relative jump with a fixed displacement to an unpredictable address (as the FinSpy VM stack base address is dynamically allocated, thus can change across runs, and is hence unpredictable), the result is also an unpredictable address. Surely, then, this instruction must crash? My original disassembly for this instruction was "CRASH", until I saw how frequently it occurred in the listing of the disassembled program, which forced me to take a closer look.

.text:00403398     lea     ecx, [ebx+VMContext.CurrentInsn.InsnData] ; ECX := &Instruction->Data[0]
.text:004033D1     mov     edi, [ebx+VMContext.pScratch]         ; EDI := dynamic code generation stack
.text:00403237     mov     eax, [ebx+VMContext.SavedESP1]        ; EAX := host ESP
.text:00403219     and     eax, 2Fh
.text:00403350     lea     edi, [eax+edi-4AEDE7CFh]              ; Stupid obfuscation on the value of EDI
.text:00403244     mov     dword ptr [edi+4AEDE7CFh], 0B0F802E9h ; [EDI]    := E9 02 F8 B0
.text:00403266     mov     word ptr [edi+4AEDE7D3h], 0B8F9h      ; [EDI+4]  := F9 B8
.text:00403382     call    $+5
.text:00403387     pop     eax
.text:00403253     lea     eax, [eax+42h] ; 0x004033C9
.text:00403305     mov     [edi+4AEDE7D5h], eax                  ; [EDI+6]  := 0x004033C9
.text:00403229     mov     word ptr [edi+4AEDE7D9h], 0E0FFh      ; [EDI+10] := FF E0
.text:004032A1     mov     al, [ecx]                             ; al = first byte of instruction data
.text:004032CA     mov     [edi+4AEDE7CFh], al                   ; [EDI] := al, overwriting E9
.text:00403412     mov     eax, [ebx+VMContext.SavedESP1]        ; EAX := host ESP
.text:004033B6     lea     edi, [edi+4AEDE7CFh]                  ; EDI := beginning of code we just wrote
.text:0040332F     push    dword ptr [eax]                       ; Push saved flags
.text:004032E2     popf                                          ; Restore flags
.text:0040331E     jmp     edi                                   ; Jump to the code we just generated

Upon closer inspection, the value of the first byte -- previously described as E9 -- is overwritten on line 004032CA with the first byte from the VMInstruction's data array. So I found an instance of this instruction within the decompressed VM bytecode -- opcode 0x06 in the sample -- and looked at the value of the first byte within the instruction's data, which was 0xEB. (In fact, every VM instruction with an opcode of 0x06 had 0xEB as its first data byte.) Then, substituting EB in place of E9, we end up with a more sensible disassembly listing:

.text:0041EFBC EB 02              jmp     short loc_41EFC0
.text:0041EFBC                ; --------------------------------
.text:0041EFBE F8                 db 0F8h ; °
.text:0041EFBF B0                 db 0B0h ; ¦
.text:0041EFC0                ; --------------------------------
.text:0041EFC0                loc_41EFC0:
.text:0041EFC0 F9                 stc     ; Set carry flag
.text:0041EFC1 B8 C9 33 40 00     mov     eax, offset loc_4033C9
.text:0041EFC6 FF E0              jmp     eax

Now we can analyze the code at the location referenced in the snippet above, namely loc_4033C9:

.text:00403371         jb      short loc_4032FD ; JB will be taken, since carry flag was set previously

; JB path
    ; The code here is identical to the VM conditional branch taken re-entry sequence

; JNB path
    ; No real point in analyzing this, as the JB will always be taken
    .text:0040327C         add     [ebx+VMContext.pCurrInsn], size VMInstruction
    .text:00403426         mov     eax, [ebx+VMContext.fpVMEntry] 
    .text:004033E6         mov     esp, [ebx+VMContext.SavedESP1] 
    .text:00403345         jmp     [ebx+VMContext.fpVMReEntryToFuncPtr]