Automation Techniques in C++ Reverse Engineering

Here are the slides for my recent presentation at RECON, entitled "Automation Techniques in C++ Reverse Engineering". As usual, my slides use in-frame animations, so in brief, navigate them using contiguous mode (one slide at a time, advance/retreat with left/right) instead of continuous mode (where parts of multiple slides are visible at the same time, scrolling with up and down). A video of the talk was recorded, but RECON has not yet released it; I will update this post when the video is available, and likely tweet about it also.

(By the way, it’s become obvious in conversation that seemingly nobody reads the slides for my conference presentations, whereas more people do actually read my blog posts. Somebody suggested that this might be because conference slides in computer security are frequently thin in content and difficult to follow without an accompanying video, so people don't bother reading slides at all. I agree and I often don't read slides myself for the same reason. However, this is unfortunate, as I put a lot of effort into creating diagrams and animations for my presentations; I try to make them self-contained publications for which a video and/or accompanying white paper is not required. So if you were inclined not to read the slides on the expectation they'd be hard to follow standalone, I'd suggest giving them a look.)

In summary, I noticed that when I reverse engineer C++ programs, I spend almost all of my time recreating structures, applying structure offsets in the disassembly listing, and applying structure types to Hex-Rays variables and function arguments. I focused on automating these activities. I devised two techniques based on DLL injection for collecting structure-related data from programs at runtime. The first, inspired by academic work such as Howard, collects information about accesses to allocated memory. The other technique determines when dynamically-allocated objects are passed as arguments to functions. To process the data generated by these tools, I wrote an IDA/Hex-Rays plugin to automatically reconstruct structures from the data, automatically apply structure offsets, and automatically set types for function arguments and Hex-Rays variables. There are a number of interactive features that attempt to make reverse engineers' lives easier when dealing with structure recovery.

About the code, the short version is that it isn't ready for release yet. I did put a few days' worth of effort into commenting it and cleaning it up, but it needs more than that. Furthermore, I confess that I quickly found myself distracted by new research directions and implementing new features, which I found a lot more interesting than, for example, thwarting ridicule by writing proper text parsers for the input formats rather than calling eval() on untrusted input. I want people to actually be able to use the tool when it's released, whereas in its present state, I'm probably the only person who could use it. Since you only get one chance at a first impression, I think it's more important to "release something useful" than to merely "release something for the sake of releasing something", and so the release will come at a later date, when it's ready.

I have added some new features since presenting at RECON (which likely won't make sense unless you have read the presentation). First, I created an analysis that uses the Hex-Rays decompiler machinery to locate additional structure accesses that were not observed at runtime. Beyond that, most of what I've been working on lately is whole-program-scale recovery of every observed structure. The presentation touched on a few reasons why this is difficult (nested structures being one of them). The problem is more difficult than simply recovering a unique structure with unique substructures for every allocation site. In brief, to exploit the data to its fullest extent, one needs to determine one single type for every access location in the program -- even if that location is observed accessing different allocations, or different locations within the same allocation. This necessitates discovering when two allocation sites allocate identical structures, determining identical substructure types contained within different structures, and determining nesting/inheritance relationships -- all the while battling with the noise in the data (as described in the presentation), having to explicate information only present implicitly, and contending with a very large state space that daunts direct computation of the optimal solution. Progress is good and tantalizing, but it isn't finished yet.

An Abstract Interpretation-Based Deobfuscation Plugin for Ghidra

This blog entry announces the release of an abstract interpretation-based Ghidra plugin for deobfuscation. The code can be found here (see the ‘Releases’ tab for a binary release). As the picture below illustrates, the static analysis described herein is designed to take graphs like the one on the left, detect and remove the "opaque predicates" (fake "conditional" branches that only go in one direction), and produce graphs like the one on the right.

HeaderImg.png

Though I have previously published the details of the analysis -- once as a blog post in 2011, once as part of my keynote speech at RECON 2012, entitled "The Case For Semantics-Based Methods in Reverse Engineering" -- the code that I had released at the time was for educational purposes only. It relied upon my private program analysis framework, which was never made public. The current release is fully functional: you can use it immediately upon installation as an extension in Ghidra. The release ships with a number of examples, which shall be detailed in the rest of this entry. The reader is encouraged to reproduce the results for themselves using the provided scripts and binaries. The code is mostly meant to feed information into larger automated deobfuscation tools, but I have put a good deal of effort into creating user interactivity and reusable code. I welcome feedback and pull requests for improving my understanding of Ghidra, as well as of Java development.

Although I prefer my blog entries to be self-contained, given that I've published the details of the analysis already, I don't want to reproduce those details in full. Thus, I shall only sketch the analysis in this entry, and refer the reader to the two links in the previous paragraph should they wish to learn more. (And if you would like to know more about abstract interpretation in general, I published a lengthy introduction to the subject.)

Motivation: Example #1

This example is the obfuscation scenario that led me to develop this analysis in the first place. Consider the following x86 code, and the annotations off to the right which I created manually:

ObfTFAnnotated.png

Through a contorted process, this code pushes the flags onto the stack, and isolates the "TF" bit, also known as the "trap flag", which is set when single-stepping in a debugger. Using a few bitwise operators (AND, OR, and rotations), the code moves the TF bit into the position that the "ZF" bit (the "zero flag") normally occupies, and then loads the flags register with this modified value. Finally, it performs a "JZ" instruction. Since the TF bit has been moved into the ZF position, the JZ is actually testing whether the trap flag had been previously set. If so, the obfuscator corrupts the internal state of the program, leading to a crash later on. This is an anti-debugging mechanism.

The comments to the right demonstrate how I ultimately convinced myself that this was the behavior of the snippet. In essence, we follow the TF bit through the data flow of the code, and ultimately observe that the TF bit governs whether the JZ at the end of the block is taken. We do not know the values of any of the bits labeled "?", but as you can see, we don't need to. Importantly, as with the two AND instructions, we can sometimes determine the values of bits within quantities, even if some of the relevant input bits were unknown to begin with (e.g., notice all the 0 bits produced by the first AND operation above, despite one of the inputs containing 31 "?" bits).

These observations led me to consider a generic, automated static analysis tool based on the type of analysis just pictured and described. I.e., if I can perform this analysis manually, why can't I write a program to do it for me automatically?

Automating the Analysis

To accomplish that, we will need two ingredients. First, we need to be able to represent quantities within the program (such as the EDX register and the [ESP] memory location) as collections of bits, with the added possibility that we don't know the values of certain bits (namely, the ones marked '?' above). That is to say, we will need to represent bits with three possibilities instead of the normal two: "0", meaning that the bit is known to be zero; "1", similarly with one; and "?" (also written "1/2" below), meaning that the bit is unknown. We will call these "three-valued" bits. Furthermore, we will need data structures grouping together fixed-sized collections of three-valued bits, as well as a model of memory containing such quantities.

The second piece that we'll need is the ability to represent the ordinary bitvector operations (such as AND, OR, the rotation operators above, ADD, MUL, IDIV, etc.) for three-valued bits, and do so faithfully with regards to the ordinary two-valued results. That is to say, when the bits are known to be constant, our three-valued operators must behave identically to the two-valued ones. For unknown input bits, the outputs must faithfully capture the results for both possible input values. To illustrate, let's consider the simple example of adapting the ordinary AND operator to the three-valued world. First, recall the ordinary truth table for two-valued AND:

3VLAnd2.png

How should AND behave in the presence of unknown bits? That is, what is the value of "0 AND 1/2", "1 AND 1/2", and "1/2 AND 1/2"?

  • "0 AND 1/2" is easy -- AND only returns 1 if both inputs were 1, so "0 AND 1/2" must be 0.

  • The "1 AND 1/2" case isn't as nice. Since the "1/2" could represent either 0 or 1, the operation could be either "1 AND 0" (namely, 0) or "1 AND 1" (namely, 1), so we have to say that the result is unknown -- that is to say, "1 AND 1/2" must yield "1/2".

  • For the same reason, "1/2 AND 1/2" yields "1/2".

3VLAnd3.png

As such, the truth table for our three-valued AND is as above. Note that the black values in the corners correspond to the cases when both input bits are known, and that furthermore, they match the two-valued truth table.

In the world of abstract interpretation, an analysis is said to be "correct" if it encompasses all legal data values that may actually occur at runtime. (All abstract interpretation-based analyses are designed to be "correct" by construction.) Since "1/2" refers to any possibility for a given bit, it is always correct for an analysis to decide that a particular output bit is "1/2". Therefore, the red values in the table above could all have been "1/2" (in fact, the black ones could have been "1/2" as well). However, another property of a correct analysis is its "precision". Though harder to quantify, one correct analysis is "more precise" than another if it models fewer data values. Therefore, producing "0" for the "0 AND 1/2" case is both correct (by the analysis above) and more precise than if we'd modeled that case by producing "1/2". Note further in looking at the annotated x86 code above that our choice of modeling 0 AND 1/2 as 0 was crucial in our ability to link the behavior of the JZ instruction at the end to the TF bit at the beginning. Had we modeled 0 AND 1/2 as 1/2 instead, the analysis would have failed to establish that connection.

Similar results may be derived for the other standard bitwise operations:

3VL-LargerTable.png

The OR case works out similarly to the AND case; if an input bit is known to be one, we can use "1" for the output, even if the other bit is unknown. NOT also works out nicely with little imprecision. XOR and == are not so nice; any time an input bit is unknown, the output bit is also unknown.
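
To make these operators concrete, here is a minimal Python sketch of three-valued bits and the bitwise operators just discussed (AND, OR, NOT, and XOR). The representation (the integers 0 and 1 for known bits, plus a HALF marker for the unknown value) is an assumption of this sketch, not the representation used by the released Java code.

    # Three-valued bits: 0 and 1 are known values; HALF stands for the unknown value "1/2".
    ZERO, ONE, HALF = 0, 1, 2

    def tvl_and(a, b):
        # Precise case: 0 AND anything is 0, even if the other bit is unknown.
        if a == ZERO or b == ZERO:
            return ZERO
        if a == ONE and b == ONE:
            return ONE
        return HALF              # 1 AND 1/2, or 1/2 AND 1/2: the result is unknown

    def tvl_or(a, b):
        # Dual of AND: 1 OR anything is 1.
        if a == ONE or b == ONE:
            return ONE
        if a == ZERO and b == ZERO:
            return ZERO
        return HALF

    def tvl_not(a):
        return HALF if a == HALF else 1 - a

    def tvl_xor(a, b):
        # Any unknown input makes the output unknown.
        return HALF if HALF in (a, b) else a ^ b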

It may be less obvious how to deal with more complicated operations such as addition. The trick lies in realizing that there must be bitwise formulae for computing this operation (since, after all, processors implement these calculations), and we can just use the operators defined above to implement their three-valued counterparts. For example, addition of two quantities "z = x + y" is defined bit-by-bit by the following formulae, where the "c" bits are the carry bits produced at each step in the computation.

3VL-Add2.png

Since we have defined XOR, AND, and OR above, we can simply use the three-valued operators to implement three-valued addition.

3VL-Add3.png
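
Continuing the hypothetical Python sketch from above, a ripple-carry adder over lists of three-valued bits (least-significant bit first, which is an assumption of the sketch rather than the layout used by the released code) can be built directly from those operators:

    def tvl_add(xs, ys):
        # Ripple-carry addition of two equal-length bit lists, least-significant bit first.
        assert len(xs) == len(ys)
        result, carry = [], ZERO
        for x, y in zip(xs, ys):
            # z_i = x_i XOR y_i XOR c_i
            result.append(tvl_xor(tvl_xor(x, y), carry))
            # c_(i+1) = (x_i AND y_i) OR (x_i AND c_i) OR (y_i AND c_i)
            carry = tvl_or(tvl_and(x, y),
                           tvl_or(tvl_and(x, carry), tvl_and(y, carry)))
        return result

    # Example: adding 1 to a value whose two high bits are unknown still yields
    # fully-known low bits: [0, 1, HALF, HALF] + [1, 0, 0, 0] = [1, 1, HALF, HALF].
    print(tvl_add([ZERO, ONE, HALF, HALF], [ONE, ZERO, ZERO, ZERO]))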

The rest of the ordinary bitvector operations -- such as comparisons, shifts (including by unknown amounts), multiplication, and unsigned/signed division and remainder -- can be derived in a similar fashion. If you would like to see more derivations, consult the presentation mentioned previously, and/or read the source code.

Given that the example above writes to (and reads from) the stack, our analysis must also handle memory accesses. We use a crude model wherein only fully-known memory addresses are tracked. Any writes to addresses that are not fully known result in the entire memory state being set to 1/2 (since a write to an unknown destination could overwrite anything). This model is cruder than necessary, and it causes a few complications.
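
A small Python sketch of that memory model, in the same illustrative style as above (again, not the data structures from the released plugin):

    def bits_to_int(bits):
        # Convert a fully-known bit list (least-significant bit first) to an integer.
        return sum(b << i for i, b in enumerate(bits))

    class TVLMemory(object):
        # Crude memory model: only fully-known byte addresses are tracked.
        def __init__(self):
            self.mem = {}                    # address -> list of 8 three-valued bits

        def store_byte(self, addr_bits, value_bits):
            if HALF in addr_bits:
                self.mem.clear()             # write to an unknown address: forget everything
            else:
                self.mem[bits_to_int(addr_bits)] = value_bits

        def load_byte(self, addr_bits):
            if HALF in addr_bits:
                return [HALF] * 8            # read from an unknown address: unknown value
            return self.mem.get(bits_to_int(addr_bits), [HALF] * 8)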

Implementing the Analysis in Ghidra

Since I had previously implemented the analysis for my private framework, my immediate task was simply to port it to Java, and more specifically adapt it to Ghidra's pcode (whereas my prior implementation had targeted my own intermediate representation). First, I needed to familiarize myself with Ghidra pcode. You can enable its display from the Ghidra user interface. After clicking the "Jenga" block in the top-right of the "listing window", right click on "PCode" and select "Enable Field", as below:

GhidraX86EnablePcodeProcedure.png

Now the listing will display the pcode translations for the assembly instructions:

GhidraX86PcodeEnabled.png

From here, you can start to get a sense of the pcode language. The Ghidra distribution ships with complete documentation of the pcode language; you can find it in your installation directory under "./docs/languages/index.html".

The initial port was straightforward and took a single day. Noteworthy components of that functionality are:

  • TVLBitVector.java: implements a class for aggregating fixed numbers of three-valued bits.

  • TVLBitVectorUtil.java: implements the three-valued version of all supported integer operators, such as AND, addition, etc.

  • TVLAbstractMemory.java: implements the memory model, associating addresses with TVLBitVector objects.

  • TVLAbstractGhidraState.java: models the entire state of a Ghidra pcode program: register values, "unique" values, and memory spaces.

  • TVLAbstractInterpreter.java: the core of the abstract analysis. Given Ghidra pcode objects, query the TVLAbstractGhidraState to obtain the TVLBitVector representations of the inputs, decide which pcode operation is taking place, invoke the relevant method in TVLBitVectorUtil to perform the abstract operation, and finally update the TVLAbstractGhidraState with the result.
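
As a rough illustration of how those pieces fit together, here is a hypothetical Ghidra Python (Jython) sketch of the core dispatch loop. The PcodeOp opcode constants and accessors are real Ghidra API, but the state object and its read/write methods are stand-ins for the actual Java classes, and tvl_and/tvl_add reuse the earlier sketches.

    from ghidra.program.model.pcode import PcodeOp

    def interpret(pcode_ops, state):
        # Abstractly execute a straight-line sequence of pcode operations.
        for op in pcode_ops:
            opcode = op.getOpcode()
            if opcode == PcodeOp.INT_AND:
                a = state.read(op.getInput(0))     # hypothetical: varnode -> three-valued bits
                b = state.read(op.getInput(1))
                state.write(op.getOutput(), [tvl_and(x, y) for x, y in zip(a, b)])
            elif opcode == PcodeOp.INT_ADD:
                a = state.read(op.getInput(0))
                b = state.read(op.getInput(1))
                state.write(op.getOutput(), tvl_add(a, b))
            elif opcode == PcodeOp.COPY:
                state.write(op.getOutput(), state.read(op.getInput(0)))
            # ... the remaining opcodes (INT_OR, INT_XOR, LOAD, STORE, etc.) are handled similarly ...
            elif op.getOutput() is not None:
                # Unmodeled operation: the output becomes entirely unknown.
                state.write(op.getOutput(), [HALF] * (8 * op.getOutput().getSize()))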

Example #1, Continued

From the analysis above, we expect that the JZ instruction at the end will be taken, or not taken, depending respectively upon whether the initial value of the TF register is 1 or 0. We can use the abstract interpretation-based analysis described above to investigate.

First, within a new or existing Ghidra project, import the binary "3vl-test.bin" (which ships with the code I've released, under the "data" directory in the root).

GhidraImportMenu.png

Next, when the "Import" dialog pops up, select "x86 / default / 32-bit / little-endian / Visual Studio" from the "Language" listing:

Ghidra3VLSelectLanguage.png

Upon loading the binary, Ghidra will ask if you wish to analyze the binary, as in the figure below. This is optional -- the analysis will work even with no code defined in the database -- but you might as well press 'Yes' to make it easier to interpret the results.

GhidraAutoAnalyze.png

Finally, bring up "Window->Script Manager" and select the folder "Deobfuscation", then double-click the script "Example1TF.java". It's a good idea to read the script to get a sense of how you should invoke the relevant library code. The script analyzes the code in the range of addresses 0x0 to 0x1C. It performs three separate analyses: one when the TF bit is initially set to 0, one where the TF bit is set to 1, and another where the TF bit is not set (i.e., is considered unknown, or 1/2, initially).

The output is not very sexy, but it does confirm that the analysis was able to discover the relationship between the initial value of the TF bit and the ultimate decision as to whether the JZ instruction was taken.

Ghidra3VLScriptOutput.png

The later examples will demonstrate more capabilities of the analysis, and more options for visualizing the analysis results.

Example #2: An Obfuscating Compiler on ARM

Last year I found myself looking at an ARM binary produced by an obfuscating compiler. I realized that I could break certain aspects of it with the same techniques as described above. However, my binary program analysis framework did not support ARM. Thus, in order to make progress with the obfuscated binary, I needed to first add ARM support to my framework -- a task that I realistically expected would take weeks or months, based on my experience writing the x86 support. Rather than adding ARM support, I remorsefully abandoned the project instead. Now that I have ported the analysis to Ghidra, and because Ghidra supports pcode translation on ARM, my code just works, automatically. Those weeks or months of agonizing tedium have been shaved off of my research budget.

Here's a snippet of the obfuscated ARM code:

ARMObfAnnotated.png

The comments off to the side indicate the values of the registers and memory locations as the code executes. Basically, through a convoluted series of arithmetic operations, R0 ultimately obtains the value 0x8E8C, which is then moved into the program counter (PC register) on the final line. All of this code is effectively just a branch to loc_8E8C. However, due to the obfuscated implementation, most disassembler tools won't be able to determine the branch destination, and hence will simply stop disassembling after the line ending in -88. The Hex-Rays decompiler can successfully resolve this branch, but chokes on the next one:

ARMHRFailure.png

(In fairness to Hex-Rays, if the user manually adds a cross-reference from the final instruction to the resolved destination, it will be able to proceed with decompiling the code at the target address. The analysis described herein can furnish such branch targets.)

This is another form of obfuscation that can be handled by the analysis described by this entry. Similarly to the steps given previously, import the binary "ARM-LE-32-v8-Default.bin" into your Ghidra project. This time, for the "Language", select "ARM / v8 / 32-bit / little-endian / default", as in:

GhirdaImportARM.png

As before, upon loading, Ghidra will prompt you whether you would like to analyze the binary. Select 'Yes'. Next, under the Script Manager window, in the “Deobfuscation” directory, run "Example2ARM.java". By default, the script will add comments to the disassembly listing when it resolves conditional or indirect branches. In particular, running the script should add the comment shown in the figure below:

ARMAnalysisResolvedBranchComments.png

Let's also take a look at some other output options. For both this example and the previous one, I have shown assembly snippets with manually-written comments suggesting what the written values look like. In fact, given that the analysis automatically computes such information, we don't have to manually create comments like these. Towards the bottom of Example2ARM.java, you will see four calls to "runWithAnalysisOptions", where the first one is enabled and the final three are commented out. Uncomment the second one (the one with "ValueComments" in the code) and run the script again. After adjusting Ghidra's listing fields in the Jenga display (namely, enabling the "PCode" field, and dragging the "Post-Comments" field next to it), we get a nice listing showing the three-valued quantities derived by the analysis after every pcode operation, as such:

ARMAnalysisValuesOutput.png

Notice that many values have been determined to be fully constant. In fact, every pcode operation in this example either produces a value that is fully known (i.e., all constant bits), or fully unknown (i.e., all unknown bits). For example, take a look at the output for the pcode in the first instruction, which I've reproduced below for easier discussion:

ARMAnalysisValuesOutputCropped.png

The second pcode line computes whether the register "r0" is signed less-than 0, and stores the boolean result in a variable named "tmpNG". Since the value of r0 was determined to be 0, the result of this comparison is known: 0 is in fact not less than 0, so the value written to tmpNG is 0. Similarly, the subsequent line compares r0 against 0 for equality, and sets the variable "tmpZR" based on the result. Since r0 contains 0, tmpZR will always be 1 after this comparison.

In compiler theory, an analysis known as "constant propagation" determines when variables are known to contain constant values at a given point in the program, and replaces the variable with the constant. A related analysis, "constant folding", determines when all of the inputs to a given operator are constant values (such as the signed less-than operator, or equality comparison operator from the code just discussed), computes the constant result, and replaces the operation with its result. Our analysis performs both constant propagation and constant folding as a side-effect of its main goal. Also, the code optionally has the ability to modify the pcode along those same lines.

Modify the "Example2ARM.java" script again, this time uncommenting the third call to runWithAnalysisOptions (the one with "PcodeComments" in the code), and commenting out the others. Now, run the script. Below is an example of its output for the same truncated example above. I have manually indicated with red dots where the analysis has changed the pcode -- namely, the computations for "tmpNG" and "tmpZR" both become "COPY" operations, instead of their original "INT_SLESS" and "INT_EQUAL" operations.

ARMAnalysisPcodeOutputCropped.png

(The output may be a bit difficult to read -- it uses the so-called "raw" output for the pcode, which differs from the output that Ghidra uses by default.) For completeness, here are the analysis results for the entire sequence.

ARMAnalysisPcodeOutput.png

Example #3: Virtualization Obfuscation

The previous two examples were mostly for proof of concept. We isolated two representative examples of the opaque predicates generated by the respective obfuscators, and demonstrated that the analysis is capable of resolving the correct branch destinations. Though both examples came from real-world scenarios, they were curated for the sake of demonstrating the analysis. For this example, we will use a real-world sample, unmodified, in its natural habitat.

The sample in question is a purportedly malicious binary, obtained from VirusTotal, which is protected by a virtualization obfuscator. (I haven't analyzed the sample and can't speak about its malicious activities.) Rather than examining sample snippets as in the previous two examples, we shall compute an analysis across entire control flow graphs, taking branches into account. The control flow graph-based analysis is computed very much in the same style as so-called "global intraprocedural dataflow analyses" from compiler theory.

The sample can be found in the same directory as the previous examples. It is contained in a .ZIP file with the password “infected”. Extract this archive to a location that won’t be intercepted by your anti-virus software.

The VM uses a similar type of obfuscation as we've already seen -- fake branches based upon constant values -- though the implementation is distinct from the previous examples. Here's a representative example of one of its opaque predicates:

VMIDAOpaque.png

The middle basic block begins by setting CH to 55h. Next, the code subtracts 0Ch from CH. After an unimportant unconditional branch, it then performs a "JO" conditional branch, which is taken if the OF flag was set by the subtraction. Since the values just described are all constants, the branch has to evaluate the same way every time it executes -- it will either always be taken, or never be taken. (In fact, it can never be taken: 55h minus 0Ch is 49h, and that subtraction of two small positive values cannot produce signed overflow, so OF is always clear.)

The analysis library contains one final result visualization technique not described by the examples above. When the analysis determines that a branch always goes in one direction, as long as the code down the not-taken path is not referenced elsewhere, we can color the not-taken instructions in red. Then, in concert with Ghidra's "Window->Function Graph" visualizer, we can easily visualize which parts of the code were not executed. In the figure below, the blue nodes were executed, but the red nodes were not. You can reproduce this for yourself by running the Example3VM.java script.

VMGhidraGraphColored.png
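
If you want to do that kind of coloring from your own scripts, the mechanism itself needs nothing beyond the stock GhidraScript API; the sketch below assumes you already have the set of addresses the analysis proved unreachable (the dead_addrs variable is hypothetical).

    # Ghidra Python script snippet: color instructions proven unreachable.
    from java.awt import Color

    DEAD_COLOR = Color(255, 192, 192)          # light red

    for addr in dead_addrs:                    # hypothetical: addresses marked not-taken by the analysis
        setBackgroundColor(addr, DEAD_COLOR)   # GhidraScript API; visible in listing and graph views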

Problem: Parity Flag

At this point in the blog entry, it should not be surprising that we can detect these opaque predicates and remove them. I did run into one snag trying to use Ghidra on this sample. Namely, some of the opaque predicates are based on the x86 parity flag:

VMPF.png

However, by default, Ghidra's x86 language model does not model the updates to this flag. That is to say, even if an x86 instruction modifies the parity flag, Ghidra's pcode translations do not include updates to this register. For example, both the AND and OR instructions in the snippet below do in fact set the parity flag when executed on an x86 processor. However, as you can see, the pcode does not contain updates to the parity flag.

VMPFNoModel.png

Although this is a sensible design choice for many binary program analyses, we find ourselves in need of such information. With most binary analysis tools, we would be somewhere between screwed with no way to proceed at one extreme, and facing a non-trivial development effort to model the parity flag at the other. With Ghidra, I managed to solve the issue within about half an hour. More specifically, under the Ghidra installation, the file ./Ghidra/Processors/x86/data/languages/ia.sinc contains the following snippet:

ExistingIaSinc.png

I was able to simply insert code to properly model the parity flag (the diff is here). After reloading the binary, now the pcode contains proper updates to the parity flags:

VMPFModel.png

I was very impressed at how easy it was to enable this change. Between that and the fact that my analysis worked out of the box on ARM, Ghidra has proven itself to be very capable in the binary program analysis department, not a curiosity or a toy.

Control Flow Graphs

The virtualization obfuscator employed by this binary obfuscates the VM instruction handlers with opaque predicates as demonstrated above. Most unobfuscated handlers, in reality, consist of a single basic block with no control flow. However, a few of the handlers do contain legitimate control flow, including one with a loop. We would like to remove bogus control flow, while at the same time respecting the legitimate control flow.

Rather than hacking together a proof of concept that would only serve as a demonstration, I decided to solve this problem the proper way. In particular, I implemented a library for representing and creating control flow graphs, which is configurable in a variety of ways. Then, I implemented the classical compiler algorithm for solving forward data flow analysis instances. (At present, this code is specialized for the analysis described hereinbefore, but I will likely generalize it to support arbitrary forward and backward data flow problems on pcode control flow graphs). This involved a non-trivial development effort that accounts for over 30% of the code at present.
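
For readers who haven't seen it, the classical algorithm amounts to a worklist iteration to a fixed point. The generic Python sketch below uses placeholder join and transfer functions and assumes a simple CFG interface (blocks, predecessors, successors); it is not a reflection of the library's actual classes.

    def forward_dataflow(cfg, entry, entry_state, bottom, join, transfer):
        # Classic forward data-flow analysis: iterate block transfer functions to a fixed point.
        # Assumes join(bottom, x) == x, i.e. bottom is the identity element of join.
        out_states = {b: bottom for b in cfg.blocks()}
        worklist = list(cfg.blocks())
        while worklist:
            block = worklist.pop()
            in_state = entry_state if block == entry else bottom
            for pred in cfg.predecessors(block):
                in_state = join(in_state, out_states[pred])
            new_out = transfer(block, in_state)      # abstractly execute the block's pcode
            if new_out != out_states[block]:
                out_states[block] = new_out
                worklist.extend(cfg.successors(block))
        return out_states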

A False Negative

As one final curiosity, I'll present an example where the analysis was not sophisticated enough to resolve one particular opaque predicate. Take a look at the graph below.

VMGhidraFalseNegative.png

The first assembly language instruction in the first block actually expands into a loop in the pcode control flow graph, which results in complete loss of information about the memory contents. The second, red-underlined line adds a value from memory to the ESP register. Given that the memory state is unknown at this point, the value being added to ESP is unknown, hence the entire register ESP is unknown. Therefore, the subsequent instructions that access the stack are doing so at unknown addresses, so we can't model their effects precisely. As a result, although the third basic block implements a typical opaque predicate that we would ordinarily be able to resolve, because it involves the stack, our analysis cannot resolve it.

So be it. Actually, this is the only opaque predicate in the whole binary that the analysis is unable to resolve.

Conclusion

This blog entry served to announce the release of an open-source deobfuscation plugin for Ghidra. In actuality, this is part of a larger effort, unimaginatively termed "GhidraPAL", short for Ghidra Program Analysis Library. Of the four packages in the distribution at present, three of them are generic and are not tied to the analysis described in this entry. I plan to implement more standard functionality in the future (such as integration with SMT solvers), as well as any other analyses I find useful in binary analysis, and of course weird research efforts that may or may not be useful.

I still have one more thing to tackle before I feel comfortable recommending Ghidra wholeheartedly for automated binary program analysis. If it turns out that, as usual, I am just being neurotic and anticipating problems that will in fact not materialize, then I'll give it my fullest blessing. For now, my experience with this project was almost entirely painless. Ghidra's API is, on the whole, absurdly well-documented. At times I found myself griping that I couldn't find things because there was too much documentation. (Imagine that as a complaint about a software system.) Beyond the documentation, it's also sensibly designed, and includes a lot of very useful functionality in the standard distribution. For classical data flow analysis and abstract interpretation, I highly recommend it, though the jury is still out on SMT integration.

Removing an Annoying Compiler Optimization with a Hex-Rays Microcode Plugin

As part of my reverse engineering work, I wrote a small plugin to deal with an optimization that had been irritating me. The plugin isn't very sophisticated or interesting, but it is useful, and it's fully automatic, requiring no user interaction. You can find the code here, and my recent article on the Hex-Rays microcode API provides all of the necessary background.

In brief, testing individual bits within a quantity is a common pattern in C and C++ programming. For example, in the Windows API, after you have retrieved the attributes for a file, you might check to see if the file is a directory with an expression such as "(fileAttributes & FILE_ATTRIBUTE_DIRECTORY) != 0". The most obvious assembly-language implementation is simply with a "test" instruction against the constant:

ASM-unoptimized-impl.png

However, in the binaries I've been analyzing lately, I frequently see a pattern such as the following, involving shifting the bit in question to the lowest position and performing an "and" or a "test":

ASM-optimized-impl.png

Honestly, I am not sure why this is an optimization. My first thought was that the shift/test sequence would be smaller, but they are actually the same size. The pattern manifests itself in the Hex-Rays decompilation as follows:

decomp-no-NOT.png

Or, an even uglier variant involving a bitwise NOT:

decomp-NOT.png

Because this optimization gets rid of the original constant value, we can't make use of Hex-Rays' nice features to convert constant values into enumeration elements. That means every time I see this pattern, I have to look up which enumeration element is being used, and leave the information in a comment. Obviously, it would be preferable to simply be able to use enumerations.
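
To make the pattern concrete, here is a small Python illustration of the two equivalent forms; FILE_ATTRIBUTE_DIRECTORY really is 0x10 in the Windows headers, but the attrs value is made up for the example.

    FILE_ATTRIBUTE_DIRECTORY = 0x10          # bit 4

    attrs = 0x30                             # hypothetical attribute value with bit 4 set

    # The readable form: the named constant survives, so it can be shown as an enum member.
    is_dir_1 = (attrs & FILE_ATTRIBUTE_DIRECTORY) != 0

    # The compiled form described above: shift the bit into position 0, then mask with 1.
    # The named constant disappears, leaving only the shift amount and the mask.
    is_dir_2 = (attrs >> 4) & 1

    assert is_dir_1 == bool(is_dir_2)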

The problem stayed on my to-do list long enough, and I finally got bored enough, that I did something about it. I wrote a small Hex-Rays Microcode API plugin to detect instances of the pattern and rewrite them using the proper constants, as in:

decomp-altered-no-NOT.png

And, for the bitwise NOT variant:

decomp-altered-NOT.png

Much better. It took some extra work, but I also added the ability to change the constant to an enumeration. (The way Hex-Rays stores information about constants from the disassembly in the decompilation listing, and the rules under which you can convert them to enumeration elements, are somewhat complicated. The source code discusses them in more detail, along with some other information that might benefit Microcode plugin developers.) Here's an example:

decomp-enum.png

Mission accomplished. In retrospect, it would have been easier and faster to write this as a CTREE plugin rather than a Microcode plugin, but I suppose all that matters is that it works. Maybe somebody else will find it useful.

A Quick Solution to an Ugly Reverse Engineering Problem

Reverse engineering tools tend to be developed against fundamental assumptions, for example, that binaries will more or less conform to the standard patterns generated by compilers; that instructions will not jump into other instructions; perhaps that symbols are available, etc. As any reverse engineer knows, your day can get worse if the assumptions are violated. Your tools may work worse than usual, or even stop working entirely. This blog post is about one such minor irritation, and the cheap workaround that I used to fix it.

The binary I was analyzing -- one function in particular -- made uncommon use of an ordinary malware subterfuge technique, which wound up violating ordinary assumptions about the sizes of functions. Malware authors quite often build data that they need -- strings, most commonly -- in a dynamic fashion, so as to obscure the data from analysts using tools such as "strings" or a hex editor. (Malware also commonly enciphers its strings somehow, though that is not the feature that I'll focus on in this entry.) As such, we see a lot of the following in the function in question.

ManyStackWrites.png

What made this binary's use of the technique unusual was the scale at which it was applied. Typically the technique is used to obscure strings, usually no more than a few tens of bytes apiece. This binary, on the other hand, used the technique to build two embedded executables, totaling about 16KB in data -- hence, there are about 16,000 writes like the one in the previous figure, each implemented by a 7-byte instruction. The function pictured above comprises about 118KB of code -- over 25% of the total size of the binary. The function would have been large even without this extra subterfuge, as it has about 7KB of compiled code apart from the instructions above.

The Hex-Rays decompilation for this function is about 32,500 lines. The bulk of this comes from two sources: first, the declaration of one stack local variable per written stack byte:

HR-ManyStackVariables.png

Second, one assignment statement per write to a stack variable:

HR-ManyStackWrites.png

To IDA's credit, it handles this function just fine; there is no noticeable slowdown in using IDA to analyze this function. Hex-Rays, however, has a harder time with it. (I don't necessarily blame Hex-Rays for this; the function is 118KB, after all, and Hex-Rays has much more work to do than IDA does in dealing with it.) First, I had to alter the Hex-Rays decompiler options in order to even decompile the function at all:

HR-DecompilerOptions.png

After making this change, Hex-Rays was very slow in processing the function, maxing out one of my CPU cores for about five minutes every time I wound up decompiling it. This is suboptimal for several reasons:

  • I often use the File->Produce file->Create .c file... menu command more than once while reverse engineering a particular binary. This function turns every such command into a cigarette break.

  • Some plugins, such as Referee, are best used in conjunction with the command just mentioned.

  • When using the decompiler on this function in an interactive fashion (such as by renaming variables or adding comments), the UI becomes slow and unresponsive.

  • Randomly looking at the cross-references to or from a given function becomes a game of Russian Roulette instead of a normally snappy and breezy part of my reverse engineering process. Decompile the wrong function and you end up having to wait for the decompiler to finish.

Thus, it was clear that it was worth 15 minutes of my time to solve this problem. Clearly, the slowdowns all resulted from the presence of these 16,000 write instructions. I decided to simply get rid of them, with the following high-level plan:

  • Extract the two .bin files written onto the stack by the corresponding 112KB of compiled code

  • Patch those .bin files into the database

  • Replace the 112KB worth of instructions with one patched call to memcpy()

  • Patch the function's code to branch over the 112KB worth of stack writes

The first thing I did was copy and paste the Hex-Rays decompilation of the stack writes into its own text file. After a few quick sanity checks to make sure all the writes took place in order, I used a few regular expression search-and-replace operations and a tiny bit of manual editing to clean the data up into a format that I could use in Python.

PythonCleanedUp.png

Next, a few more lines of Python to save the data as a binary file:

PythonSaveBinary.png
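
The actual scripts are only shown as screenshots above, but for the curious, the whole extraction fits in a few lines; the sketch below assumes the copied Hex-Rays lines look like "buf[0x12] = 0x4D;", which is an assumption about the listing format rather than the exact script I used.

    import re

    # Assumed input format: one Hex-Rays assignment per line, e.g. "buf[0x12] = 0x4D;".
    PATTERN = re.compile(r"\[(0x[0-9A-Fa-f]+|\d+)\]\s*=\s*(0x[0-9A-Fa-f]+|\d+);")

    writes = {}
    with open("stack_writes.txt", "r") as f:
        for line in f:
            m = PATTERN.search(line)
            if m:
                writes[int(m.group(1), 0)] = int(m.group(2), 0) & 0xFF

    # Emit the bytes in offset order, filling any gaps with zeroes.
    blob = bytearray(max(writes) + 1)
    for offset, value in writes.items():
        blob[offset] = value

    with open("extracted.bin", "wb") as f:
        f.write(bytes(blob))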

From there, I used IDA's Edit->Patch program->Assemble... command to write a small patch into the corresponding function:

IDA-AssembleCommand.png

After a bit of fiddling and manual hex-editing the results, my patch was installed:

Patched.png

And then I used a two-line IDC script to load the binary files as data in the proper location:

IDC.png

Afterwards, the navigation bar showed that about 31% of the text section had been converted into data:

IDA-NavBarAfter.png

And now the problem is fixed. The function takes approximately two seconds to decompile, more in line with what we'd expect for a 7kb function. Hooray; no more endless waiting, all for the time cost of about three accidental decompilations of this function.

This example shows that, if you know your tools well enough to know what causes them problems, you can sometimes work your way around those problems. Always stay curious, experiment, and don't simply settle for a suboptimal reverse engineering experience without exploring whether there might be an easier solution.

Hex-Rays CTREE API Scripting: Automated Contextual Function Renaming

One thing I teach my students in my static reverse engineering training classes is to exploit information that programmers have left in the binaries for debugging purposes. The most obvious example of this is when the program contains a debug logging function, where one of the parameters is the function's name. In this case, the reverse engineer can write a script to extract the function names and automatically rename the calling function, thus saving valuable time and making their lives easier. (I teach several other example scenarios.)

Inspired by my recent forays into the Hex-Rays API, today I found a new method for exploiting such ad hoc debug-type information. This entry will demonstrate how to use the Hex-Rays CTREE API to extract information and automatically rename functions as a result, which would otherwise be more cumbersome and error-prone to do with the ordinary IDA SDK. The code is available here.

In particular, today I was reverse engineering the Hex-Rays decompiler itself. The Hex-Rays API is unusual in that the user-accessible API functions are not defined as exports from hexrays.dll. (In fact, hexrays.dll exports only the standard DLL entrypoint, as well as the standard plugin_t structure required of all IDA plugins.) Instead, the "exported" API methods are all declared as small inline functions that each invoke the same function, namely "hexdsp", which serves as a central dispatcher to the public Hex-Rays API functions. The first parameter to hexdsp is an enumeration element of type "hexcall_t", which tells the Hex-Rays kernel which API is being called, and the subsequent parameters (of which there are a variable number) are specific to the called API. (Internally to hexrays.dll, the hexdsp function performs a switch over the hexcall_t parameter, and then invokes the specified function.)

To be concrete about it, the last 3500 lines of hexrays.hpp all look roughly like this:

HexDsp.png

If the API functions had been exported from the DLL, IDA would have automatically renamed them for us, which would have made trivial the task of finding the APIs that we wanted to reverse engineer. Instead, nearly none of the functions in the database have meaningful names.

The lack of meaningful function names does not prove to be a huge impediment to reverse engineering Hex-Rays API functions. We simply need to locate the hexdsp function in hexrays.dll, and locate the switch case that corresponds to the hexcall_t enumeration element that interests us. Most of these switch cases are one-liners that immediately invoke the relevant API function.

To find hexdsp, we can textually search the disassembly listing for the word "switch", which IDA inserts automatically as a comment upon encountering a compiled switch construct.

TextSearch.png

Next, we can filter the search by the word "cases" to obtain the beginning of each switch construct, as well as the number of cases that the switch contains.

FilteredSearch.png

Finally, we can count the number of enumeration elements in hexcall_t, and look for a switch with roughly that many cases. (For this, I simply selected the contents of the hexcall_t enumeration in my text editor, and looked at the "lines selected" display at the bottom.) hexcall_t contains 532 enumeration elements, and one of the switches has exactly 532 cases. That must be hexdsp.

Now, to reverse engineer any particular API function, we just need to find the value of the associated constant in the hexcall_t enumeration, and examine that particular switch case. IDA can offer us some more assistance here, also. I begin by copying and pasting the hexcall_t enumeration into its own text file and saving it:

hexcall_t_C.png

Next, I use IDA's C header parsing functionality to automatically create an enumeration that I can use inside of IDA. If successful, we get a message box telling us so.

ParseCHeaderFile.png

In IDA's Enumerations window, now we need to add the enumeration we just imported. Press insert to bring up the "Add enum type" window, and then use the "Add standard enum by enum name" option near the bottom. It will bring up a list of all currently-known enumerations; the one we just imported will be at the bottom.

LoadEnum.png

We can use this enumeration to make the decompilation listing prettier. The decompiled switch statement currently shows the hexcall_t enumeration elements as integers. We can ask Hex-Rays to display them as enumeration element names instead, by placing our cursor on one of the integers and pressing 'M'.

HRApplyEnum.png

Select the hexcall_t entry from the dialog that pops up and press enter. Voila, now we see the enumerated names in the decompilation listing:

HREnumApplied.png

At this point, locating particular Hex-Rays API functions is pretty easy. The switch now uses human-readable names for the enumeration elements; we just find the one we're interested in, and look at the function called in that switch case. To wit, here's gen_microcode(), the one I had been after in the first place:

SwitchPrettyNames.png

Of course, finding the correct switch case is still tedious, given that there are 532 of them. I thought that the nicest solution would be to rename the functions after the name of the hexcall_t enumeration element in the corresponding switch case. Of course, doing this by hand for all 532 hexcall_t enumeration elements would be a waste of time, and we should automate it instead.

For this task, I decided to use the Hex-Rays CTREE API via IDAPython, which I discussed somewhat in my recent blog entry about the Hex-Rays Microcode API. My observation was that we could extract the case number from the case statements, get the name of the pertinent enumeration element, extract the address of the called function from the first line of code in the case body (if the first line does call a function), and then rename the called address after the enumeration element. Using the Hex-Rays API saves us from having to write nasty, brittle code to search the disassembly listing instruction-by-instruction, and also makes it convenient for us to access the switch case numbers and their associated case statement bodies.

Our script will make use of so-called "CTREE visitor" objects. Our CTREE visitor classes contain methods such as "visit_insn" and "visit_expr", which Hex-Rays will invoke for every element in the CTREE listing.

The first visitor, which I called SwitchExaminer, exists merely to locate the switch statement within hexdsp's decompilation listing. Once found, it iterates over each switch case. For each case, it first extracts the case number. It then extracts the first line of code in the body of the case statement, and passes it to another visitor. The second visitor, called a CallExtractor, extracts the addresses of all called functions (if any).

Finally, some glue code at the bottom of the script applies the visitors to the decompilation of hexdsp, and then renames the called functions after the hexcall_t enumerated element numbers that were collected from the switch statement.
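
The call-extraction half of that is small enough to sketch in IDAPython; the code below is a simplified, hypothetical version rather than the released script, though ctree_visitor_t, visit_expr, and cot_call are the real Hex-Rays API names.

    import idaapi

    class CallExtractor(idaapi.ctree_visitor_t):
        # Collect the addresses of all directly-called functions in a decompiled function.
        def __init__(self):
            idaapi.ctree_visitor_t.__init__(self, idaapi.CV_FAST)
            self.calls = []

        def visit_expr(self, e):
            # cot_call is a call expression; when the callee is a cot_obj, obj_ea is its address.
            if e.op == idaapi.cot_call and e.x.op == idaapi.cot_obj:
                self.calls.append(e.x.obj_ea)
            return 0                                  # 0 means "continue the traversal"

    cfunc = idaapi.decompile(idaapi.get_screen_ea())  # decompile the function under the cursor
    extractor = CallExtractor()
    extractor.apply_to(cfunc.body, None)
    print(["%x" % ea for ea in extractor.calls])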

That's all; after about 100 lines of heavily-commented IDAPython code, the script renames 436 functions for us automatically, nearly 10% of the non-library, non-thunk functions in the binary.

DecompRenamed.png

You can find the code here. A small note is that the code takes advantage of an IDAPython wrapper function regarding switch case values, which exists in IDA 7.2 beta, but does not exist in IDA 7.1 or below. In particular, you will not be able to run the script as-is until IDA 7.2 is released. If you would like to do something similar with a different use case that does not require you to extract switch case label values, you might be able to adapt the existing code for that purpose in the meantime.

Weekend Project: A Custom IDA Loader Module for the Hidden Bee Malware Family

Here's a half-day project that I did this weekend for my own edification. Perhaps someone will benefit from the source code in the future.

While reading hasherezade's research on the Hidden Bee malware family's custom file format (samples here), I was struck with the thought that this use-case seemed particularly well-suited for an IDA custom loader module. The IDA loader module approach has a few advantages over the previous approach: it's fully automated, requiring no additional programs, plugins, or scripts; the imports have proper names and type information, allowing IDA's ordinary P.I.T. algorithms to propagate the information; and the user can relocate the database to an arbitrary base address.

LoaderScreen.png

Given that custom loaders are the only variety of IDA plugin that I haven't yet written, this seemed like a nice small-scope project for the weekend to round out my knowledge. My very minor contribution with this entry is the IDA custom loader for the Hidden Bee format, which can be found on my GitHub. The IDAPython code requires that Ero Carrera's pefile module be installed, say via pip. 

Hidden Bee

In brief, the Hidden Bee malware family distributes payloads in a customized file format, which is a majorly stripped-down version of the PE file format. You can see all of the details in hasherezade's write-up. I did no original malware analysis for this project; I merely read her blog entry, figured out how to convert the details into a loader plugin, and then debugged it against the sample links she gave. As usual, Chris Eagle's The IDA Pro Book, 2nd Edition was useful. Some details about the loader API have changed with the IDA 7.x API port, but Hex-Rays' porting guide was informative, and the loader examples in the IDA 7.1 SDK have also been ported to the newest API.

IDA Loader Modules in Brief

An IDA loader module is simply an IDA plugin with a well-defined interface. IDA loader modules will be called when loading any file into IDA. They have two primary responsibilities: 

  1. Given access to the bytes of a file, determine whether the file is of a format that the loader module can handle. Every IDA loader module must export a function named accept_file for this purpose. This function returns 0 if it can't recognize the file format, or a non-zero value if it can.
  2. If the file type can be loaded by the module, and the user chooses to use this module to load the file, perform the actual loading process e.g. creating segments within the IDB, copying bytes out of the file into the segments, processing relocations, parsing imports, adding entrypoints, and so on. Every IDA loader module must export a function named load_file for this purpose.

Both of these functions take as input an "linput_t *" object that behaves like a C FILE * object, which supports seeking to specified positions, reading byte arrays out of the file, and so on. Since Hidden Bee's format includes relocations, I chose to implement a third, optional IDA loader module function: move_segm. This function will be called by the IDA kernel when the user requests that the database be relocated to another address.
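
To make the shape of such a module concrete, here is a minimal, hypothetical IDAPython loader skeleton; it is a generic example rather than the Hidden Bee loader itself, and the magic value, base address, and segment layout are all made up.

    import idaapi
    import ida_bytes

    FORMAT_NAME = "Hypothetical custom format"
    MAGIC = b"XYZ1"                              # made-up magic value for illustration

    def accept_file(li, filename):
        li.seek(0)
        if li.read(4) == MAGIC:
            return FORMAT_NAME                   # non-zero: we recognize this file
        return 0                                 # zero: not our format

    def load_file(li, neflags, format):
        li.seek(0)
        data = li.read(li.size())
        base = 0x400000                          # made-up image base
        # One flat segment; a real loader would also select the processor, parse headers,
        # create imports, process relocations, and add proper entrypoints.
        idaapi.add_segm(0, base, base + len(data), ".text", "CODE")
        ida_bytes.put_bytes(base, data)
        idaapi.add_entry(base, base, "start", True)
        return 1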

Writing a Loader Module for Hidden Bee

After reading the aforementioned write-up, I figured that the only difficulties in loading Hidden Bee images in IDA would be A) that the Hidden Bee customized header specifies API imports via hash rather than by name, and B) that it includes relocation information. Relocations and import lookup via hash are simple enough conceptually, but the precise details about how best to integrate them with IDA are not obvious. Sadly, I did not feel confident in these tasks even after reading the loader module examples in the SDK. Four of the five hours I spent on this project went to reverse engineering %IDADIR%\loaders\pe.dll -- the loader module for the PE file format -- focusing in particular on its handling of relocations and imports. As expected, the results are idiosyncratic and I don't expect them to generalize well.

Imports

For dealing with the imports by hash, hasherezade's toolchain ultimately generates a textual file with the addresses of the import hashes and their corresponding plaintext API strings. Then, she uses one of her other plugins to create repeating comments at the addresses of the import hash DWORDs. Instead, I wanted IDA to show me the import information the same way it would in a normal binary -- i.e., I wanted IDA to set the proper type signature on each import. I figured this might be difficult, but after a few hours reverse engineering the virtual functions for the pe_import_visitor_t class (partially documented in %IDASDK%\ldr\pe\common.hpp), it turns out that all you have to do to trigger this functionality is simply to set the name of the DWORD to something from a loaded type library.

Here's a screenshot showing IDA successfully applying the type information to the APIs:

ImportNames.png

Relocations

Each of the IMAGE_REL_BASED_HIGHLOW relocations common in PE files can ultimately be processed via a straightforward translation of the relocation information into IDA's fixup_data_t data structures, which are then passed to the set_fixup API. The SDK examples did not give a straightforward idea of what I needed to do to handle PE IMAGE_REL_BASED_HIGHLOW relocations properly, so I reverse engineered pe.dll to figure out exactly what needed to happen with the relocations. (Fortunately, reverse engineering IDA is trivial due to the availability of its SDK.) If you wish, you can see the results in the do_reloc function. Don't ask me to explain why it works; however, it does work.

Here's a before and after comparison of rebasing the database from base address 0x0 to base address 0x12340000. Note particularly that the red underlined bytes change. Before:

RebaseBefore.png

After:

RebaseAfter.png

The Atredis BlackHat 2018 CTF Challenge

This post covers my solution to the Atredis BlackHat 2018 challenge, for which I won second place and a ticket to BlackHat. I'd like to express my gratitude to the author, the increasingly-reclusive Dionysus Blazakis, as well as Atredis for running the contest.

Initial Recon

As you can see from the screenshot in the tweet linked above, and reproduced below, once you connect to the network service on arkos.atredis.com:4444, the system prints information about the firmware and hardware version, as well as the memory map. Following that is the message-of-the-day, which contains a reference to 8-bit computing. 

AtredisChall.jpg

It also prints the contents of an email, which makes reference to an attachment having been written to disk. It seems like the immediate goal of the challenge is clear: retrieve the attachment from disk.

AtredisMail.png

At about 8:30PM, I logged into the system for the first time. The challenge had been released at 2PM and I figured many people would have already solved it. Not knowing what to do at the prompt, I typed "help". The system replied by printing information about two commands: "help" and "read", with an example of how to use "read":

'read F400' reads the byte at $F400

I ran that command and it came back with the value 31, or '1' in ASCII. I ran the command for a few adjacent addresses and this location seemed to contain the hardware version string "1.0.3 rev A", part of what was printed initially in the first messages upon connecting to the service.

Dumping the Firmware

At first blush, there wasn't much more to do than read bytes off of the system at specified addresses. However, this challenge was not my first rodeo in the embedded reverse engineering space. I was immediately reminded of an exploit I once wrote for a SOHO router, where, once I obtained its memory map, I used an arbitrary read vulnerability to dump its memory contents so I could further analyze the software in IDA. I decided to do something very similar here, minus the need for an arbitrary read vulnerability.

Although I don't like Python much as a programming language, I do have to credit it for having an absurdly large standard library. In particular, while previously writing the aforementioned exploit, I made use of Python's telnetlib module to automate interaction with the router. Nothing seemed to be stopping me from doing the same thing in this situation, so I spent about 10 minutes writing a 30-or-so-line Python script to log into the device, repeatedly send "read" commands, parse the resulting output, and save the results to a binary file. That functionality, combined with the memory map printed by the device upon connection, was all that was needed. You can find the dumped memory as .bin files here.
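
That dump script isn't reproduced in this post, but the idea is simple enough to sketch with telnetlib; the prompt handling and reply format below are assumptions rather than the device's actual output.

    import telnetlib

    HOST, PORT = "arkos.atredis.com", 4444
    START, END = 0x4000, 0xF000                 # one region from the printed memory map

    tn = telnetlib.Telnet(HOST, PORT)
    with open("region_4000.bin", "wb") as f:
        for addr in range(START, END):
            tn.write(b"read %04X\n" % addr)
            reply = tn.read_until(b"\n")        # assumed: one line of output per read command
            value = int(reply.split()[-1], 16)  # assumed: the byte value is the last token, in hex
            f.write(bytes([value]))
    tn.close()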

My script took nearly four hours to dump the memory. I don't know how much of that was related to the crappy Wi-Fi at my hotel, and how much had to do with other contestants hammering the server. Nevertheless, by the time I had the memory dump, it was 12:15AM and I had a business engagement in the morning. I needed to finish quickly so I could sleep.

Inspecting the Firmware

I began by briefly inspecting the dumped memory contents in a hex editor. The three regions which, in total, encompassed the range 0x0000-0x1FFF, were entirely zero. Only a few bytes within 0xF000-0xFEFF were non-zero. In the range 0xFF00-0xFFFF, only the final three words were non-zero.

The most interesting dumped memory range was 0x4000-0xEFFF. It began with the strings that the server printed upon connection, as well as others that had not been printed. The most interesting strings were "write" and "call", which seem like they might have been commands that the user could execute on the system. After these strings, the memory dump had zeroes until address 0x8000.

Strings.png

At address 0x8000, there were 0x542 bytes of unknown binary data, with the remainder of the region being zeroes. Having now inspected the entire memory space, I concluded that if this thing had any code in it at all, it had to be the bytes at 0x8000-0x8542. The only other non-zero, unknown data was the few sporadic, isolated bytes previously mentioned in the range 0xF000-0xFFFF.

I connected to the system again and tried executing the "call" command I had discovered in the strings. I provided it the address 8000, which seemed to be the beginning of the code in the memory image. The thing printed:

JSR $8000

Apart from that, nothing happened. Next, I executed "call 7FFF" and the system reset. I took that as a positive sign.

Determining the Architecture

At this point, I did not know what architecture I was dealing with. I had two hints. First, the message-of-the-day string made reference to 8-bit computing. Second, the "call" command had printed the string "JSR", which on some architectures is a mnemonic for "jump to subroutine" (i.e., the same functionality known as "call" on x86). The best guess, then, was that I was dealing with an 8-bit architecture whose mnemonic for the call instruction is "JSR".

I wish I could tell you I used a systematic procedure in searching for such an architecture, but I would be lying. In retrospect, I could have loaded %IDASDK%\include\allins.hpp and searched for "jsr". The string occurs 32 times in that file, which would have given me a pile of possibilities:

  • Angstrem KR1878
  • DEC ALPHA
  • DEC PDP-11
  • Freescale M68HC12 or M68HC16 or HCS12 
  • Java bytecode
  • Mitsubishi 740 or 7700 or 7900
  • MOS M65 [e.g. 6502] or M65816
  • Motorola 6809 or 68000
  • Motorola DSP56000
  • Panasonic MN102
  • Renesas H8, H8500, SuperH, or M16C
  • Rockwell C39

Instead what I ended up doing was searching Google for things like "assembly jsr". For any architecture suggested by one of my search results, I loaded the .bin file for the memory at 0x4000, changed the processor type to the one I was testing, and tried to disassemble the file using the relevant processor module. I ran through a few theories, among them System/360 and 68K. For each of my theories, IDA either refused to disassemble most of the code, or the resulting disassembly was gibberish. For example, when loaded as 68000, IDA refused to disassemble much of the code:

68000-gibberish-disasm.png

I almost gave up after about 15 minutes, since I had to be somewhere in the morning. But a final Google search for "8-bit assembly jsr" brought me to a Wikibooks page on 6502 assembly language, which also has a "jsr" instruction. I loaded the binary as 6502, and lo and behold, the resulting disassembly listing looked legitimate. There were, for example, several loads followed by compares, followed by relatively-encoded short jumps to the same nearby location; loads of 0 or 1 immediately before an RTS (return) instruction; and so on.

M6502-legit-looking-disasm.png

It looked like I had found my winner.

Loading the Binary Properly

Now that I seemed to know the architecture, I needed to load all of the memory dumps into a single IDA database for analysis. This is easy if you know what you're doing. If you do, you may as well skip this section. If you don't know how to do this, read on.

First, I loaded the file for memory range 0x4000-0xEFFF into IDA and changed the processor type to M6502.

LoadAsM6502.png

Next I had to change the loading offset and the region that IDA considered as containing the ROM.

LoadConfig.png

From there, I loaded each of the remaining memory dumps as additional binary files.

LoadBinaryFile.png

For each of those, I had to change their loading offset so they'd be represented at the proper location within the database.

LoadBinaryFileConfig.png

That's all; now the IDB has all of the memory ranges properly loaded. The clean IDB can be found here.

Reverse Engineering the Binary

I only have a few notes on the process of statically reverse engineering this particular binary. That's because, for the most part, I used the same process as I would while statically reverse engineering any binary on any architecture. In particular, I followed the same procedure that I teach in my static reverse engineering training classes. The whole process took 30-40 minutes. You can find the final IDB here.

Reset Vector

It's often useful to know where execution begins when a system or program executes. For embedded devices, the "system entrypoint" is often held as a pointer in the system's reset vector. For example, a 16-bit x86 processor begins executing at memory location F000:FFF0 upon reset, whereas on many other architectures the reset vector instead holds the address to which the processor should branch.

For this binary, I noticed that the final three words in the memory dump -- at addresses 0xFFFA, 0xFFFC, and 0xFFFE -- all contained the value 0x8530, which happens to be near the end of the region with the code in it. It seems likely that this is our entrypoint. I renamed this location BOOTLOC and began reading.

Bootloc.png
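
As a quick sanity check (a sketch only; the filename is illustrative), the three words can be pulled straight out of the dumped region -- on the 6502, $FFFA/$FFFC/$FFFE hold the NMI, RESET, and IRQ vectors:

import struct

# Read the dump of 0xFF00-0xFFFF and unpack the last three
# little-endian words; all three contain 0x8530 here.
data = open("FF00-FFFF.bin", "rb").read()
nmi, reset, irq = struct.unpack_from("<3H", data, 0xFA)
print(hex(nmi), hex(reset), hex(irq))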

6502 Documentation and Auto Comments

My job now was to analyze and document 1300 bytes of 6502 code. A slight problem was that I didn't know anything in particular about 6502 beyond some tutorials I had read years ago. Nevertheless, this was not a major impediment to reverse engineering this binary. Two things helped. First, the Wikibooks web page I had found previously (which had tipped me off that the binary might be 6502) was a reasonably good layman's guide to the instruction set. Whenever I wanted to know more about a particular mnemonic, I searched for it on that page. Most of my questions were answered immediately; a few required me to correlate information from a few different parts of the document. It was good enough that I didn't have to consult any outside sources.

The second feature that helped me was IDA's "auto comments" feature, under Options->General->Auto Comments.

AutoComments.png

Once enabled, IDA leaves brief one-line descriptions of the mnemonics in the same color that it uses for repeating comments. You can see an example in the screenshot below. Although the comments may confuse you if you don't know 6502, the Wikibooks link above was enough to fill in the missing details, at which point the auto comments assisted in rapidly disassembling the binary.

AutoCommentsDisasm.png

Disjointed References

One slightly unusual feature of this binary -- and probably of 6502 binaries in general -- is that the instructions usually operate on 8-bit values, whereas memory addresses are 16 bits wide. Thus, when the code needs to write a value into memory that is more than 8 bits, it does so one byte at a time. The following screenshot illustrates:

StringRefsDisasmCode.png

Lines 0x817D-0x8183 write the constant 0x405F into memory (at location 0x80) as a little-endian word. Lines 0x8188-0x818E write the constant 0x406C to the same location. Lines 0x8193-0x8199 write the constant 0x4081 to the same location.

Those constants stuck out to me as being in the area where the strings are located. Indeed, here's the view of that location:

DisasmStrings.png

Thus, the addresses written by the code previously are those of the first three strings making up the textual description of the memory map that is printed when connecting to the system. (The string references also tip us off that the called function likely prints strings.)

Although splitting larger constants into smaller ones is not unique to the 6502 -- compiled binaries for most processors with fixed-width instructions do this -- as far as I know, IDA's support for recognizing and recombining these constructs into macroinstructions must be implemented per processor (as part of the processor module). Evidently, the 6502 processor module does not support fusing such writes into pseudo-operations, so we have to resolve these references manually while reverse engineering. Fortunately, this is not difficult (and we only have about 1300 bytes of code to analyze). Simply use the "G" keyboard shortcut to jump to the referenced locations when you see them. For strings, just copy and paste the string at the referencing location.

Memory-Mapped I/O

When the system prints its memory map upon startup, the region from 0xF000-0xFF00 is labeled "MMIO", which presumably is short for memory-mapped I/O. Memory-mapped I/O addresses behave like global variables stored in memory. When you write to a memory-mapped I/O location, the data becomes accessible to other components such as peripherals (e.g., a screen controller or a disk controller). Similarly, peripherals can share data with the 6502 via memory-mapped I/O, such as keyboard input or the results of disk I/O.

Reverse engineering the variables in this range is effectively the same as reverse engineering accesses to any global variable, although you must keep in mind that MMIO output locations might never be read, and MMIO input locations might never be written.

System Description

Once familiar with the basics of 6502, and the particular points described above, I statically reverse engineered the binary's scant 25 functions.

  • Upon boot, the system prints the diagnostic messages we see in the screenshot at the top of this post. 
  • Next, it loads the message-of-the-day off of the file system (from the file /etc/motd) and prints it. 
  • Next, it checks mail (under the file /var/mail/spool/atredis) and prints it. 
  • Finally, it enters into an infinite loop processing user input and dispatching commands.

Reading the code for the final step -- command dispatch -- we can see that "write" and "call" are indeed valid, undocumented commands. The "write" command is ultimately very straightforward: it converts the address and value arguments from hexadecimal, and then writes the value to the address. The "call" command is also straightforward, but I found it neat. It creates 6502 code at runtime for a JSR and an RTS instruction, which it then invokes. My daily work does not usually involve examining JIT code generation on archaic platforms.

CallImpl.png
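
For illustration, here's my sketch of the thunk the "call" handler appears to assemble at runtime (on the 6502, JSR absolute is opcode 0x20 with a little-endian operand, and RTS is 0x60):

def build_call_thunk(target):
    # 6502: JSR $target (0x20, lo, hi) followed by RTS (0x60)
    return bytes([0x20, target & 0xFF, (target >> 8) & 0xFF, 0x60])

build_call_thunk(0x8000)   # -> b'\x20\x00\x80\x60', i.e. JSR $8000 / RTS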

File I/O

It's been a while since we mentioned it, but recall from the introduction that when we connected to the system, it printed the most recent mail, and dropped a hint about the attachment having been written to disk. Our ultimate goal in reverse engineering this binary is to find and exfiltrate this mystery file from the disk. The message did not actually give us a filename, or anything of that sort. Let's take a closer look at how the system loads and prints files off of the disk, such as the message-of-the-day and the most recent email.

The full details can be found in the IDB, but the process generally works like this (a rough Python sketch follows the list):

  • PrintMOTD() writes a pointer to "/etc/motd" into a global variable X, then calls Load()
  • Load() reads, in a loop, each of the 0x40 disk sectors into a global array Y and compares the first bytes of Y against X
  • Once found, Load() returns 1. The contents of the disk sector are still in Y.
  • If Load() succeeded, PrintMOTD() calls PrintDiskSectors().
  • At this point, the global buffer Y contains not only the name of the file, but also two words; let's call them Z and W. Z indicates the number of the first disk sector which contains the file's contents, and W is the number of sectors that the file occupies.
  • PrintDiskSectors() then consults Z and W from within the global array. Beginning at sector Z, it prints the raw contents of the file onto the screen, and repeats for W sectors.

(My analysis at the time was slightly sloppier than the above. I did not fully understand the role of the values I've called Z and W.)
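
Here's a rough Python model of that logic -- my reconstruction for exposition, not code from the device; the parse_entry_words helper and the exact layout of Z and W within the directory sector are stand-ins:

def load(disk_sectors, filename):
    # Load(): scan each of the 0x40 sectors for one whose first
    # bytes match the requested filename.
    for sector in disk_sectors:
        if sector.startswith(filename):
            return sector              # the global buffer "Y" in the prose
    return None

def print_disk_sectors(disk_sectors, entry):
    # Z: first sector holding the file's contents; W: sector count.
    z, w = parse_entry_words(entry)    # hypothetical helper
    for n in range(z, z + w):
        print(disk_sectors[n])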

Enumerating the Disk Contents, Finishing the Challenge

I now had a rough understanding of the mechanism of how files were read from disk and printed to the screen. My understanding also indicated that I could dump the underlying sectors of the disk without needing to know the names of the files comprising those sectors.

In particular: PrintDiskSectors(), at address 0x822C, reads its two arguments from global memory. Those arguments are: 1) Z, i.e., which sector to begin dumping data from, and 2) W, i.e., how many sectors to dump.

And so it was immediately obvious that I could use the undocumented "write" and "call" commands to write whatever I wanted into Z and W and then invoke PrintDiskSectors(). I tried it out in netcat and it worked on the first try -- I was able to dump sector #0.

Thus, I incorporated this functionality into the telnetlib-based script that I had previously used to dump the memory. Since my understanding at the time was a bit off, I ended up with a loop that executed 0x40 times (the number of sectors), writing 0x0000 into Z (i.e., start reading at sector 0) and, on each iteration, an increasing value into W, starting with 1 and ending with 0x3F. The script dumped the returned data to a binary file, as well as printing it onto the screen. You can find the script here, and its output here.
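
In sketch form, the sector-dumping loop looked something like the following. Z_ADDR and W_ADDR stand in for the addresses of the global words Z and W (I'm not reproducing the real values here), and I'm assuming the undocumented "write" command takes "write ADDR VALUE", mirroring "read ADDR":

Z_ADDR = 0x0000   # placeholder for the real address of the global word Z
W_ADDR = 0x0002   # placeholder for the real address of the global word W

def send_line(tn, line, wait=2.0):
    # Hypothetical helper: send one command and return whatever the
    # system prints before the timeout (the sentinel never matches,
    # so read_until simply returns what arrived within 'wait' seconds).
    tn.write((line + "\n").encode("ascii"))
    return tn.read_until(b"\xff\xff\xff", wait)

def write_word(tn, addr, value):
    # The 6502 is little-endian, so store the low byte first.
    send_line(tn, "write %04X %02X" % (addr, value & 0xFF))
    send_line(tn, "write %04X %02X" % (addr + 1, (value >> 8) & 0xFF))

def dump_sectors(tn, outfile):
    with open(outfile, "wb") as f:
        for w in range(1, 0x40):
            write_word(tn, Z_ADDR, 0)      # Z = 0: start at sector 0
            write_word(tn, W_ADDR, w)      # W = w: dump w sectors
            # Invoke PrintDiskSectors() at $822C and save the output.
            f.write(send_line(tn, "call 822C", wait=10.0))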

I let my script run, and once it got to the final iteration of the loop, there was a message congratulating me and telling me where to send an email, as well as what its contents should be. I immediately sent an email at 1:35AM, about 80 minutes after I'd first dumped the memory off of the device. Shortly thereafter, I received a personalized email congratulating me on my second-place finish.

Concrete and Abstract Interpretation, Explained through Chess

I've decided to release my presentation (two slide decks) on the theoretical foundations of abstract interpretation, illustrated through the game of chess. It has been collecting dust on my hard drive for five years, so I figured I may as well give it a proper burial.

This was the first thing I wrote as part of what eventually became my SMT-Based Program Analysis training, back when I planned for that course to cover abstract interpretation as well as logical analysis. The idea to use chess as a medium came to me in a nightmare. I spent about six weeks working on it, polishing it, and extending my original ideas to cover broader areas of abstract interpretation than what my nightmare originally entailed. Writing it was a useful exercise; it forced me to muddle through all of the formalism you see in the preliminaries section of an abstract interpretation paper, and digest them well enough that I thought I could present them to other people. And I personally liked what I wrote, especially the detailed figures.

Next, I sent it off to proofreaders. Two of them -- people with graduate degrees in program analysis and mathematics -- also liked it. The other 18 or so of them -- bright people, mostly professional security researchers -- seemed to utterly despise it. I think they were angry at me for sending it to them with a straight face and expecting a response. You know who you are, and I'm sorry for putting you through that. There were some more changes I wanted to make, but the response made me finally realize that you need at least an undergraduate education in mathematics to understand abstract interpretation, no matter how hard I might try to make it friendly. The experience ultimately convinced me that my time was better spent in teaching people about SMT solvers.

Now it's my turn to preemptively offer my apologies, and best wishes for luck, to anybody else who would like to read this. Chances are very good that you are not going to like it, either. The second deck could use some more polishing, but that's the least of your worries. The bigger problem is that the audience for this document is virtually non-existent.

After that compelling interlude, I present:

P.S. if you find any corrections, don't bother sending them since I have no intention of working on this anymore.

Bonus picture, a preview of what awaits you.

ai-lobster.png

FinSpy VM Unpacking Tutorial Part 3: Devirtualization. Phase #4: Second Attempt at Devirtualization

[Note: if you've been linked here without context, the introduction to Part #3 describing its four phases can be found here.]

1. Introduction

In Part #3, Phase #1, we deobfuscated the FinSpy VM bytecode program by removing the Group #2 instructions. In Part #3, Phase #2, we made a first attempt to devirtualize the FinSpy VM bytecode program back into x86 code. This was mostly successful, except for a few issues pertaining to functions and function calls, which we examined in Part #3, Phase #3.

Now we are ready to take a second stab at devirtualizing our FinSpy VM sample. We need to incorporate the information from Part #3, Phase #3 into our devirtualization of X86CALLOUT instructions. After having done so, we will take a second look at the devirtualized program to see whether any issues remain. After addressing one more major observation and a small one, our devirtualization will be complete.

2. Devirtualization, Take Two

We are finally ready to choose an address in the original FinSpy sample at which to insert the devirtualized code, devirtualize the FinSpy VM program, and copy the devirtualized machine code into the original binary. I chose the address 0x500000, for no particular reason other than that it lies after all of the existing sections in the binary.

If everything we've done so far has worked correctly, we now have all of the information we need to generate proper functions in our devirtualized program. We have a set containing the non-virtualized functions called by the FinSpy VM program. For virtualized function targets, we have a list of tuples containing the function's address, the VM instruction key corresponding to the first virtualized instruction in the function, and the prologue bytes to prepend before the devirtualization of that first instruction.

We derive two dictionaries from the virtualized function information. 

  1. The dictionary named X86_VMENTRY_TO_KEY_DICT maps an X86CALLOUT target to the VM instruction key corresponding to the beginning of the virtualized function body. 
  2. The dictionary named KEY_TO_PROLOGUE_BYTES_DICT maps the VM instruction key to the copied x86 prologue machine code bytes for the function beginning at the VM instruction with that key.

Now we make two changes to our instruction-by-instruction devirtualization process:

  • In the loop that iterates over all VM bytecode instructions and produces the devirtualized output, consult KEY_TO_PROLOGUE_BYTES_DICT to see if the instruction corresponds to the beginning of a virtualized function. If so, insert the prologue bytes before devirtualizing the instruction.
  • When devirtualizing X86CALLOUT instructions, look up the address of the target in the NOT_VIRTUALIZED set. 
    • If the target is in the set, then nothing special needs to be done to devirtualize the X86CALLOUT instruction; emit an x86 CALL instruction with a dummy displacement DWORD of 0x0, and a fixup to later replace the 0x0 value with the proper distance from the source to the target (similarly to how we devirtualized VM jump instructions).
    • If the target is not in the set, then we need to generate an x86 CALL instruction to the devirtualized address of the target's VM instruction key. Emit a dummy x86 CALL instruction as before. Also generate a fixup specifying the offset of the dummy 0x0 displacement DWORD in the x86 CALL instruction, and the target of the X86CALLOUT instruction.

After the instruction-by-instruction devirtualization process, we need to process the fixups generated for the two varieties of X86CALLOUT instructions mentioned above (i.e., based on whether the destination is virtualized or not).

Here is partial code from the second attempt at devirtualization.

# This is the same devirtualization function from before, but
# modified to devirtualize X86CALLOUT instructions and insert
# function prologues where applicable.
# It has a new argument: "newImageBase", the location in the
# FinSpy-virtualized binary at which we emit our devirtualized
# code.
def RebuildX86(insns, newImageBase):

    # These are the same as before:
    mcArr  = [] # Machine code array into which we generate code
    locsDict = dict() # VM location -> x86 position dictionary
    keysDict = dict() # VM key -> x86 position dictionary
    locFixups = []    # List of fixup locations for jumps

    # New: fixup locations for calls to virtualized functions
    keyFixups = []

    # New: fixup locations for calls to non-virtualized functions
    binaryRelativeFixups = []

    # Same as before: iterate over all instructions
    for i in insns:

        # Same as before: memorize VM position/key to x86 mapping
        currLen = len(mcArr)
        locsDict[i.Pos] = currLen
        keysDict[i.Key] = currLen

        # New: length of prologue instructions inserted before
        # devirtualized FinSpy VM instruction. Only obtains a
        # non-zero value if this instruction corresponds to the
        # beginning of a virtualized function.   
        prologueLen = 0

        # New: is this VM instruction the beginning of a
        # virtualized function?
        if i.Key in KEY_TO_PROLOGUE_BYTES_DICT:

            # Get the prologue bytes that should be inserted
            # before this VM instruction.
            prologueBytes = KEY_TO_PROLOGUE_BYTES_DICT[i.Key]

            # Increase the length of the instruction.
            prologueLen += len(prologueBytes)

            # Copy the raw x86 machine code for the prologue
            # into the mcArr array before devirtualizing the
            # instruction.
            mcArr.extend(prologueBytes)

        # Now devirtualize the instruction. Handling of
        # "Raw X86", "Indirect Call", and jumps are identical
        # to before, so the code is not duplicated here.

        # Is this an "X86CALLOUT" ("Direct Call")?
        if isinstance(i,RawX86Callout):

            # New: emit 0xE8 (x86 CALL disp32)
            mcArr.append(0xE8)

            # Was the target a non-virtualized function?
            if i.X86Target in NOT_VIRTUALIZED:

                # Emit a fixup pointing to the raw x86 target
                binaryRelativeFixups.append((i.Pos,prologueLen+1,i.X86Target))

            # Otherwise, the target was virtualized
            else:
                # Emit a fixup to the devirtualized function body
                # specified by the key of the destination
                keyFixups.append((i.Pos,prologueLen+1,i.X86Target))

            # Write the dummy destination DWORD in the x86 CALL
            # instruction that we just generated. This will be
            # fixed-up later.
            mcArr.extend([0x00, 0x00, 0x00, 0x00])

The Python code above generates additional fixup information for devirtualized X86CALLOUT instructions. The two cases of the destination being virtualized or not are handled similarly, though they are placed in two different lists ("keyFixups" for virtualized targets, and "binaryRelativeFixups" for non-virtualized targets). After the main devirtualization loop shown above, we must process the fixups just generated. Applying them is nearly identical to what we did for the jump instructions, except that for virtualized targets, we need to determine the VM instruction key corresponding to the x86 address of the X86CALLOUT target. Here is the code for fixing up calls to virtualized functions:

# Fixups contain:
# * srcBegin: beginning of devirtualized CALL instruction
# * srcFixup: distance into devirtualized CALL instruction
#             where displacement DWORD is located
# * dst:      the X86CALLOUT target address
for srcBegin, srcFixup, dst in keyFixups:

    # Find the machine code address for the source
    mcSrc = locsDict[srcBegin]

    # Lookup the x86 address of the target in the information
    # we extracted for virtualized functions. Extract the key 
    # given the function's starting address.
    klDst = X86_VMENTRY_TO_KEY_DICT[dst]
    
    # Find the machine code address for the destination
    mcDst = keysDict[klDst]

    # Set the displacement DWORD within x86 CALL instruction
    StoreDword(mcArr, mcSrc+srcFixup, mcDst-(mcSrc+srcFixup+4))

Next, and more simply, here is the code for fixing up calls to non-virtualized functions:

# Same comments as above
for srcBegin, srcFixup, dst in binaryRelativeFixups:

    # Find the machine code address for the source
    mcSrc = locsDict[srcBegin]
    
    # Compute the distance between the end of the x86
    # CALL instruction (at the address at which it will
    # be stored when inserted back into the binary) and
    # the raw x86 address of the X86CALLOUT target
    fixup = dst-(newImageBase+mcSrc+srcFixup+4)
    
    # Set the displacement DWORD within x86 CALL instruction
    StoreDword(mcArr, mcSrc+srcFixup, fixup)

3. Inspecting the Devirtualization

Now we are in a similar place to where we were after our initial devirtualization attempt in Part #3, Phase #2; let's look at the devirtualized code in IDA and see if anything jumps out as being obviously incorrect.

IDA's navigation bar shows a few things.

take2NavBar.png
  • The first third of the binary -- in the transparently-colored regions -- contains data defined as arrays. IDA has not identified code in these regions.
  • The red-colored areas have properly been identified as code, but they don't have any incoming references and therefore have not been defined as functions.

These two issues are related: if the regions currently marked as data are actually code, and if they make function calls to the code in red, then perhaps IDA can tell us that the red regions do have incoming references and should be defined as functions. I selected the entire text section, undefined it by pressing 'U', and then selected it again and pressed 'C' to turn it into code. The result was much more pleasing:

take2NavBar-2.png

Now the whole devirtualized blob is defined as code. There is still an obvious cluster of functions that don't have incoming references.

Next we list the remaining issues that appear when inspecting the new devirtualization.

3.1 Some Functions Don't Have Incoming References

As we just saw from the navigation bar, there is a cluster of functions with no incoming references. Furthermore, inspecting these functions shows that they all lack prologues, as we noticed originally for all functions in our first devirtualization attempt. If we turn them into functions, IDA makes its objections well-known with its black-on-red text:

noReferenceNoPrologue.png

So apparently, our prologue extraction scripts have missed these functions. We'll have to figure out why.

3.2 Many Call Instructions Have Invalid Destinations

IDA's "Problems" window (View->Open Subviews->Problems) helpfully points us to another category of errors. Many function calls have unresolved addresses, which IDA highlights as black-on-red text.

nonResolvedCallTargets.png

These issues have an innocuous explanation. In this Phase #4, we chose the address 0x500000 as the base address at which to install the devirtualized code within the original binary. The x86 CALL instructions targeting non-virtualized functions are thus computed relative to an address in that region of the binary. Since we are currently inspecting the .bin file on its own, its base address is 0x0, not the 0x500000 it will become once we insert the devirtualized code into the original binary. The x86 CALL displacements are indeed nonsensical at the moment, but we'll double-check them after we've inserted the devirtualized code back into the binary.

3.3 One Call in Particular is Weird

All of the x86 CALL instructions described in the previous issue have displacements that begin with the nibbles 0xFFF....., indicating that the destination of those CALL instructions lies at an address located physically before the CALL instruction. However, one x86 CALL instruction at the beginning of a devirtualized function has a positive displacement, also colored black-on-red:

seg000:00000E70 sub_E70 proc near
seg000:00000E70     push    0B6Ch
seg000:00000E75     push    40B770h
seg000:00000E7A     call    near ptr 6D85h ; <- bogus destination

I looked at the corresponding function in the original binary from which this prologue had been copied, and the situation became clear.

.text:00404F77         push    0B6Ch
.text:00404F7C         push    offset stru_40B770
.text:00404F81         call    __SEH_prolog
.text:00404F86         xor     esi, esi
.text:00404F88         mov     [ebp-20h], esi
.text:00404F8B         push    edi        ; Save obfuscation register #1
.text:00404F8C         push    ebp        ; Save obfuscation register #2
.text:00404F8D         mov     ebp, offset word_411A6E ; Junk obfuscation
.text:00404F92         shrd    edi, esi, 0Eh           ; Junk obfuscation

The prologue for the pre-virtualized function installed an exception handler by calling __SEH_prolog. Our prologue extraction script simply copied the raw bytes for the prologue. Since x86 CALL instructions are encoded relative to their source and destination addresses, we can't simply copy a CALL instruction somewhere else without updating the destination DWORD; if we don't, the destination will be incorrect.
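
As a quick illustration of the problem (this isn't part of my devirtualization tooling), E8 encodes its target as a displacement relative to the end of the CALL instruction, so identical bytes decode to different targets at different addresses:

def call_target(call_addr, disp32):
    # E8 rel32: target = address of CALL + 5 + displacement
    return (call_addr + 5 + disp32) & 0xFFFFFFFF

def call_displacement(call_addr, target):
    return (target - (call_addr + 5)) & 0xFFFFFFFF

# The displacement baked into the copied prologue bytes...
disp = call_displacement(0xE7A, 0x6D85)
# ...decodes to __SEH_prolog only at the CALL's original address:
print(hex(call_target(0x404F81, disp)))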

Since this issue appeared only once, instead of re-architecting the prologue extraction functionality to deal with this special case, I decided to just manually byte-patch the devirtualized code once I'd copied it into the original binary. If I wanted to write a fully-automated FinSpy devirtualization tool, I would tackle this issue more robustly.

3.4 What are These Function Pointers?

The second devirtualization contains many pointers that reference hard-coded addresses within the original FinSpy binary from which we extracted the VM bytecode. For example, the following snippet references a function pointer and an address in the .text section:

seg000:00004BB9     push    0
seg000:00004BBE     push    0
seg000:00004BC3     push    400h
seg000:00004BC8     push    0FFFFh
seg000:00004BCD     call    dword ptr ds:401088h
seg000:00004BD3     mov     eax, ds:41FF38h
seg000:00004BD8     push    0
seg000:00004BDD     call    dword ptr [eax+38h]

Since we are examining the devirtualized code in isolation from the original binary, IDA cannot currently provide us meaningful information about the addresses in question. We can check the addresses in the FinSpy sample IDB to see if they make any sense; for example, here's the address referenced by the CALL: 

.idata:00401088     ; LRESULT __stdcall SendMessageW(HWND hWnd, UINT Msg, WPARAM wParam, LPARAM lParam)
.idata:00401088     extrn SendMessageW:dword

Things look good; we see four arguments pushed in the code above, and the function pointer references a function with four arguments. Once we've inserted our devirtualization back into the original binary, IDA will resolve the references seamlessly, and allow us to make full use of its normal facilities for cross-referencing, naming, type inference, and parameter tracking.

I also noticed that some of the instructions made reference to items within the original binary's .text section. 

seg000:0000333E     mov     dword ptr [esi+4], 4055D5h
seg000:00003345     mov     dword ptr [esi], 40581Eh

; ... later ... 

seg000:000034C4     mov     dword ptr [esi+4], 40593Ch
seg000:000034CB     mov     dword ptr [esi], 4055FEh

; ... later ...

seg000:000034DC     mov     dword ptr [esi+4], 40593Ch
seg000:000034E3     mov     dword ptr [esi], 405972h

; ... more like the above ...

Looking at these addresses in the original binary, I found that they corresponded to virtualized functions in the .text section. For example, here are the contents of the first pointer from the snippet above -- 0x4055D5 -- within the original binary:

.text:004055D5     mov     edi, edi                ; Original prologue
.text:004055D7     push    ebp                     ; Original prologue
.text:004055D8     mov     ebp, esp                ; Original prologue
.text:004055DA     push    ebp                     ; Push obfuscation register #1
.text:004055DB     push    esi                     ; Push obfuscation register #2
.text:004055DC     mov     esi, offset word_41CCBA ; Junk obfuscation
.text:004055E1     mov     ebp, 7C9E085h           ; Junk obfuscation
.text:004055E6     bswap   ebp                     ; Junk obfuscation
.text:004055E8     pop     esi                     ; Pop obfuscation register #2
.text:004055E9     pop     ebp                     ; Pop obfuscation register #1
.text:004055EA     push    5A329Bh                 ; Push VM instruction entry key
.text:004055EF     push    ecx                     ; Obfuscated JMP
.text:004055F0     sub     ecx, ecx                ; Obfuscated JMP
.text:004055F2     pop     ecx                     ; Obfuscated JMP
.text:004055F3     jz      GLOBAL__Dispatcher      ; Enter FinSpy VM

And it turns out that the VM key pushed by this sequence, namely 0x5A329B, references one of the functions in the devirtualized binary which otherwise did not have an incoming reference. Great! We would like to extract the addresses of the pointed-to functions so that we can process them with the scripts we developed in Part #3, Phase #3 in order to extract their prologues. We'd also like to alter the raw x86 instructions that reference the function pointers to make them point to their devirtualized targets within the devirtualized blob instead.

4. Next Step: Function Pointers

At this point, only two issues remain. First, we noticed that some devirtualized functions still don't have prologues. The explanation must be that the addresses of their virtualized function stubs were never passed to the scripts. If we had provided those virtualized functions' addresses, our scripts would have found something for their prologues, even if it was an incorrect prologue. Yet the scripts found nothing.

Secondly, we noticed that the devirtualized code contained function pointers referencing the addresses of the prologue-lacking functions from the previous paragraph. We would like to replace the raw function pointers within the x86 instructions with the addresses of the corresponding functions in the devirtualized code.

4.1 Extracting Function Pointers from the VM Bytecode Disassembly

It seems like the first step in resolving both issues is to locate the function pointers within the FinSpy VM. I took a look at the raw FinSpy VM instructions from the snippet above with the function pointer references.

Here's the first VM instruction:

0x030630: X86 mov dword ptr [esi+4h], 4055D5h

Here are the raw bytes that encode that VM instruction:

seg000:00030630 dd 5A5C54h ; <- VM instruction key
seg000:00030634 db  1Bh    ; <- Opcode: raw x86
seg000:00030635 db    7    ; <- Length of x86 instruction: 7
seg000:00030636 db    3    ; <- Fixup offset: #3
seg000:00030637 db    0    ; <- Unused
seg000:00030638 db 0C7h    ; <- x86: mov dword ptr [esi+4], 55D5h
seg000:00030639 db  46h    
seg000:0003063A db    4
seg000:0003063B db 0D5h    ; <- 3: offset of 0x55D5 in x86 instruction
seg000:0003063C db  55h
seg000:0003063D db    0
seg000:0003063E db    0

The important thing to notice is that the x86 machine code contained within this instruction disassembles to:

mov dword ptr [esi+4], 55D5h

Whereas the x86 instruction shown in the VM bytecode disassembly listing is, instead:

mov dword ptr [esi+4h], 4055D5h

The difference is in the DWORD value (55D5h in the raw VM bytecode versus 4055D5h in the VM bytecode disassembly). 

The reason for this difference lies in the line labeled "Fixup offset: #3". You may recall from part two that all FinSpy VM instructions have two byte fields at offsets +6 and +7 into the VM instruction structure that were named "RVAPosition1" and "RVAPosition2". To quote the description of those fields from part two:

"Some instructions specify locations within the x86 binary. Since the binary's base address may change when loaded as a module by the operating system, these locations may need to be recomputed to reflect the new base address. FinSpy VM side-steps this issue by specifying the locations within the binary as relative virtual addresses (RVAs), and then adding the base address to the RVA to obtain the actual virtual addresses within the executing module. If either of [RVAPosition1 or RVAPosition2] is not zero, the FinSpy VM treats it as an index into the instruction's data area at which 32-bit RVAs are stored, and fixes the RVA up by adding the base address to the RVA."

In a bit of happy serendipity, when I was writing my FinSpy VM bytecode disassembler, I made a Python class called "GenericInsn" that served as the base class for all other Python representations of FinSpy VM instruction types. Its Init() method is called in the constructor for every VM instruction type. In particular, Init() includes the following code:

if self.Op1Fixup:
    ApplyFixup(self.Remainder, self.Op1Fixup & 0x7F, self.Pos)
if self.Op2Fixup:
    ApplyFixup(self.Remainder, self.Op2Fixup & 0x7F, self.Pos)

Thus, we are in the fortunate position where FinSpy VM helpfully tags all pointers to items in the original binary by setting these RVAPosition1 and RVAPosition2 fields. And furthermore, our existing function "ApplyFixup" already receives all of these values when we disassemble a FinSpy VM bytecode program. Thus, all we need to do to extract the function pointers is to include some logic inside of ApplyFixup that detects when one of these embedded RVAs refers to a function pointer, and if it does, to store the virtual address of the function pointer into a global set. The logic I used to determine function pointers was simply checking whether the virtual address was between the beginning of the first function in the .text section, and the last address in the .text section.

To wit, I changed my implementation of ApplyFixup as follows:

# New: constants describing the first function in the
# .text section and the end of the .text section.
TEXT_FUNCTION_BEGIN = 0x401340
TEXT_FUNCTION_END = 0x41EFC6

# New: a global dictionary whose keys are fixed-up
# virtual addresses, and whose values are lists of 
# VM instruction positions whose bodies reference
# those virtual addresses.
from collections import defaultdict
ALL_FIXED_UP_DWORDS = defaultdict(list)

# Existing ApplyFixup function
def ApplyFixup(arr, FixupPos, InsnPos):
    # New: Python scoping statement
    global ALL_FIXED_UP_DWORDS
    
    # Existing ApplyFixup logic
    OriginalDword = ExtractDword(arr, FixupPos)
    FixedDword = OriginalDword + IMAGEBASE_FIXUP
    StoreDword(arr, FixupPos, FixedDword)
    
    # New: if the fixed-up DWORD is in the .text 
    # section, save it in ALL_FIXED_UP_DWORDS and
    # add the VM instruction position (InsnPos) to
    # the list of positions referencing that DWORD.
    if FixedDword >= TEXT_FUNCTION_BEGIN and FixedDword <= TEXT_FUNCTION_END:
        ALL_FIXED_UP_DWORDS[FixedDword].append(InsnPos)

4.2 Extracting Prologues from Virtualized Function Pointers

Next, I also wanted to treat these virtual addresses as though they were the beginnings of virtualized functions, so that my existing machinery for extracting function prologues would incorporate them. In Part #3, Phase #3, I had written a function called "ExtractCalloutTargets" that scanned the FinSpy VM instructions looking for X86CALLOUT instructions and extracted their target addresses. This was then passed to the function prologue extraction scripts to collect the data that was used in this Phase #4 to devirtualize X86CALLOUT instructions and insert the function prologues from virtualized functions into the devirtualization.

It seemed natural to modify ExtractCalloutTargets to incorporate the virtual addresses we collected in the previous subsection. To wit, I modified that function as follows:

# Existing ExtractCalloutTargets function
def ExtractCalloutTargets(insns, vmEntrypoint):
    # New: Python scoping statement
    global ALL_FIXED_UP_DWORDS
    
    # New: initialize the set to the function pointer 
    # addresses collected in ApplyFixup
    calloutTargets = set(ALL_FIXED_UP_DWORDS.keys())
    
    # Existing: add vmEntrypoint to set
    calloutTargets.add(vmEntrypoint)
    
    # Existing: extract X86CALLOUT targets
    for i in insns:
        if isinstance(i,RawX86Callout):
            if i.X86Target not in NOT_VIRTUALIZED:
                calloutTargets.add(i.X86Target)
    
    # Existing: return list of targets
    return list(calloutTargets)

Now I ran the function prologue extraction scripts from Part #3, Phase #3 again to re-generate the virtualized function prologues and VM instruction entry keys for the function pointers, in addition to the existing data for the X86CALLOUT targets. I then pasted the output data back into the second devirtualization program we wrote in this Phase #4 (remembering to copy in the data I'd generated manually for those virtualized functions without junk obfuscation sequences), and ran the devirtualization again. This time, the unreferenced functions had proper prologues, though they were still unreferenced.

4.3 Fixing the Function Pointers in the Devirtualized x86 Code

The last remaining issue is that the devirtualized x86 instructions which reference the function pointers still use the addresses of the virtualized functions in the .text section, whereas we want to modify them to point to their devirtualized equivalents instead. This is implemented in the RebuildX86 devirtualization function after the machine code array for the devirtualized program has been fully generated.

Fortunately for us, we already know which VM instructions reference the function pointers -- we collected that information when we modified ApplyFixup() to locate and log virtualized function pointers. Not only did we log the virtual addresses of the purported function pointers, but we also logged a list of VM instruction positions referencing each such function pointer in the ALL_FIXED_UP_DWORDS dictionary.

4.3.1 A Slight Complication

A slight complication led me to a solution that perhaps could have been more elegant. Namely, we collect the positions of the VM instructions referencing function pointers within ApplyFixup() at the time that we disassemble the VM bytecode program. However, the simplifications in Part #3, Phase #1 can merge instructions together when collapsing patterns of VM instructions into smaller sequences. Therefore, it might be the case that the VM instruction positions we collected no longer refer to valid locations in the VM program after the simplifications have been applied. However, we'd still expect the function pointers to appear in the machine code for the VM instructions into which the removed instructions were merged.

To work around this issue, I made use of the locsDict dictionary that we generated during devirtualization, which records the offset within the devirtualized x86 blob of each VM instruction processed in the main devirtualization loop. We find the devirtualized offset of the nearest prior VM instruction that has an entry in locsDict, and the devirtualized offset of the next VM instruction that has an entry in locsDict. This gives us a range of bytes within the devirtualized machine code to search for the byte pattern corresponding to the pointer to the virtualized function. Once found, we can replace the raw bytes with the address of the devirtualized function body for that virtualized function.

4.3.2 Locating the Function Pointers in the Devirtualized Blob

Here is the code for locating function pointers as just described; if it is still unclear, read the prose remaining in this subsection.

# dword: the virtual address of a virtualized function
# posList: the list of VM instruction positions 
# referencing the value of dword
for dword, posList in ALL_FIXED_UP_DWORDS.items():
    
    # For each position referencing dword:
    for pos in posList:
        
        # Set the low and high offset within the
        # devirtualized blob to None
        lowPos,highPos = None,None
        
        # posSearchLow is the backwards iterator
        posSearchLow = pos
        
        # Continue while we haven't located a prior
        # instruction with a devirtualization offset
        while lowPos is None:
            
            # Does posSearchLow correspond to a 
            # devirtualized instruction? I.e., not
            # something eliminated by a pattern
            # substitution.
            if posSearchLow in locsDict:

                # Yes: get the position and quit
                lowPos = locsDict[posSearchLow]

            else:
                # No: move to the previous instruction
                posSearchLow -= INSN_DESC_SIZE

        # Now search for the next higher VM position
        # with a devirtualization offset
        posSearchHigh = pos+INSN_DESC_SIZE

        # Continue while we haven't located a later
        # instruction with a devirtualization offset
        while highPos is None:

            # Does posSearchHigh correspond to a
            # devirtualized instruction? I.e., not
            # something eliminated by a pattern
            # substitution.
            if posSearchHigh in locsDict:
            
                # Yes: get the position and quit
                highPos = locsDict[posSearchHigh]
            else:
                # No: move to the next instruction
                posSearchHigh += INSN_DESC_SIZE

For each instruction position X that references one of the pointers to virtualized functions, I locate the last VM instruction at or before X that has an entry in the locsDict dictionary. This is implemented as a loop that tries to find X in locsDict. If locsDict[X] exists, we save that value -- the offset of the corresponding devirtualized instruction within the devirtualized blob. If locsDict[X] does not exist, then the VM instruction must have been removed by one of the pattern simplifications, so we move on to the prior VM instruction by subtracting the size of an instruction -- 0x18 -- from X. We repeat until we find an X that has been devirtualized; if X reaches 0x0, we have hit the beginning of the VM instructions, which corresponds to devirtualized offset 0x0.

We do much the same thing to find the next VM instruction with a corresponding devirtualized offset: add the size of a VM instruction -- 0x18 -- to X and look it up in locsDict. If it's not a member of locsDict, add 0x18 and try again. If X ever exceeds the last legal VM location, we set the offset to the end of the machine code array. Once we've found the next VM instruction's devirtualized position, we record it and stop searching.
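
For completeness, here is how the same search might look with those boundary cases folded in -- a sketch only, where LAST_VM_POSITION stands in for the position of the final VM instruction and mcArrLen for the length of the machine code array:

def find_devirt_range(pos, locsDict, mcArrLen):
    # Search backwards for the nearest devirtualized instruction at
    # or before 'pos'; position 0x0 always maps to offset 0x0.
    low = pos
    while low > 0 and low not in locsDict:
        low -= INSN_DESC_SIZE
    lowPos = locsDict.get(low, 0)

    # Search forwards for the next devirtualized instruction after
    # 'pos'; fall back to the end of the machine code array.
    high = pos + INSN_DESC_SIZE
    while high <= LAST_VM_POSITION and high not in locsDict:
        high += INSN_DESC_SIZE
    highPos = locsDict.get(high, mcArrLen)

    return (lowPos, highPos)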

4.3.3 Rewriting the Function Pointers

At the point immediately after the code just shown, we have found a range within the x86 machine code array that ought to contain the raw bytes of the virtual address of a virtualized function referenced via pointer. Next we simply byte-search this portion of the array for that address and, once found, replace it with the address of the corresponding devirtualized function body. There is nothing especially complicated about this: we consult the book-keeping metadata gathered during devirtualization to locate the devirtualized offset of the function that the pointer referenced, add the new image base at which we are inserting the devirtualized blob into the binary, and store the resulting DWORD at the position found within the machine code array.
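
In sketch form -- eliding the details of my actual implementation, and assuming the function-pointer addresses were folded into X86_VMENTRY_TO_KEY_DICT when the prologue-extraction scripts were re-run -- that step looks something like this:

def ReplaceFunctionPointer(mcArr, lowPos, highPos, dword, keysDict, newImageBase):
    # Devirtualized address of the function that 'dword' pointed to
    # in the original .text section.
    devirtAddress = newImageBase + keysDict[X86_VMENTRY_TO_KEY_DICT[dword]]

    # Scan the candidate range for the raw little-endian DWORD equal
    # to the original virtualized function pointer...
    for idx in range(lowPos, highPos - 3):
        if ExtractDword(mcArr, idx) == dword:
            # ...and overwrite it with the devirtualized address.
            StoreDword(mcArr, idx, devirtAddress)
            break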

4.3.4 Done

After writing the code just described, now the pointers to virtualized functions have been modified to point to their devirtualized counterparts. Here again are two references to virtualized functions before the modifications just described:

seg000:00003555     mov     dword ptr [esi+4], 4055D5h
seg000:0000355C     mov     dword ptr [esi], 40581Eh

And, the same code after modification:

seg000:00003555     mov     dword ptr [esi+4], 50154Ah
seg000:0000355C     mov     dword ptr [esi], 5017B3h

5. Inserting the Devirtualized Code back Into the Binary

The last step before we can analyze the FinSpy dropper is to re-insert our devirtualized blob back into the binary. We have already chosen an address for it: 0x500000, which was important in generating the devirtualized code.

At this point I struggled to load the devirtualized code with IDA's File->Load File->Additional binary file... and Edit->Segments->Create Segment menu selections. Although both of these methods allowed me to load the raw devirtualized machine code into the database, I experienced weird issues with both. Namely, the cross-section data references and/or function cross-references were broken. IDA might display the correct addresses for data items, and allow you to follow cross-references by pressing "Enter" over an address, but it would not show symbolic names or add cross-references. For example, we might see something like this:

.data:00504C3C call    dword ptr ds:401088h

Rather than what we see when things are working properly:

.data:00504C3C call    ds:SendMessageW

I tried screwing with every option in the two dialogs mentioned, especially the segment attributes ("CODE" instead of "DATA"). For some attempts, the code references worked properly but the data references didn't; and for other attempts, the opposite was true. More often neither would work. Igor Skochinsky from Hex-Rays has always been very helpful to me, but this time he was away from his keyboard and did not hear my cries of anguish until it was too late. (Thanks anyway, Igor.)

That being the case, although it wasn't my first choice, I ended up enlarging the .data section via Edit->Segments->Edit Segment and then loading the binary contents with a one-liner in IDC:

loadfile(fopen("mc-take2.bin", "rb"), 0, 0x500000, 26840);

And this time, everything worked. You can see the IDB with the devirtualized code here.

From there I reverse engineered the devirtualized FinSpy dropper program. Whereas I was not impressed with the FinSpy VM, the dropper was more sophisticated than I was expecting. You can see a mostly-complete analysis in the linked IDB; start reading from address 0x50207E, the address of WinMain() within the devirtualized code. I've tried to comment most of the assembly language, but a lot of the action is inside of Hex-Rays (i.e., look at those functions inside of Hex-Rays to see a lot of comments that aren't in the assembly language view).

6. Conclusion 

FinSpy VM is weak. It's closer in difficulty to a crackme than to a commercial-grade VM. (I suppose it is slightly more polished than your average VM crackme.) Having a "Raw x86" instruction in the FinSpy VM instruction set, where those instructions make up about 50% of the VM bytecode program, makes devirtualization trivial. 

I almost didn't publish this series because I personally didn't find anything interesting about the FinSpy VM. But hopefully, through all the tedium, I've managed to capture the trials and tribulations of writing a deobfuscator from scratch. If you're still reading at this point, I hope you found something interesting in it, and that this slog wasn't all for nothing. And the next time you come across a FinSpy-protected sample, perhaps this series will help you make short work of it.