Automation Techniques in C++ Reverse Engineering

Here are the slides for my recent presentation at RECON, entitled "Automation Techniques in C++ Reverse Engineering". As usual, my slides use in-frame animations, so in brief, navigate them using contiguous mode (one slide at a time, advance/retreat with left/right) instead of continuous mode (where parts of multiple slides are visible at the same time, scrolling with up and down). A video of the talk was recorded, but RECON has not yet released it; I will update this post when the video is available, and likely tweet about it also.

(By the way, it’s become obvious in conversation that seemingly nobody reads the slides for my conference presentations, whereas more people do actually read my blog posts. Somebody suggested that this might be because conference slides in computer security are frequently thin in content and difficult to follow without an accompanying video, so people don't bother reading slides at all. I agree and I often don't read slides myself for the same reason. However, this is unfortunate, as I put a lot of effort into creating diagrams and animations for my presentations; I try to make them self-contained publications for which a video and/or accompanying white paper is not required. So if you were inclined not to read the slides on the expectation they'd be hard to follow standalone, I'd suggest giving them a look.)

In summary, I noticed that when I reverse engineer C++ programs, I spend almost all of my time recreating structures, applying structure offsets in the disassembly listing, and applying structure types to Hex-Rays variables and function arguments. I focused on automating these activities. I devised two techniques based on DLL injection for collecting structure-related data from programs at runtime. The first, inspired by academic work such as Howard, collects information about accesses to allocated memory. The other technique determines when dynamically-allocated objects are passed as arguments to functions. To process the data generated by these tools, I wrote an IDA/Hex-Rays plugin to automatically reconstruct structures from the data, automatically apply structure offsets, and automatically set types for function arguments and Hex-Rays variables. There are a number of interactive features that attempt to make reverse engineer's lives easier when dealing with structure recovery.

About the code, the short version is that it isn't ready for release yet. I did put a few days worth of effort into commenting it and cleaning it up, but it needs more than that. Furthermore, I confess that I quickly found myself distracted by new research directions and implementing new features, which I found a lot more interesting than, for example, thwarting ridicule by writing proper text parsers for the input formats rather than calling eval() on untrusted input. I want people to actually be able to use the tool when it's released, whereas in its present state, I'm probably the only person who could use it. Since you only get one chance at a first impression, I think it's more important to "release something useful" than to merely "release something for the sake of releasing something", and so the release will be come at a later date, when it's ready.

I have added some new features since presenting at RECON (which, likely, won't make sense unless you have read the presentation). First, I created an analysis that uses the Hex-Rays decompiler machinery to locate additional structure accesses that were not observed at runtime. Beyond that, most of what I've been working on lately is whole-program-scale recovery of every observed structure. The presentation touched on a few reasons why this is difficult (nested structures being one of them). The problem is more difficult than simply recovering a unique structure with unique substructures for every allocation site. In brief, to exploit the data to its fullest extent, one needs to determine one single type for every access location in the program -- even if it that location is observed accessing different allocations, or different locations within the same allocation. This necessitates discovering when two allocation sites allocate identical structures, determining identical substructure types contained within different structures, and determining nesting/inheritance relationships -- all the while battling with the noise in the data (as described in the presentation), having to explicate information only present implicitly, and contending with a very large state space that daunts direct computation of the optimal solution. Progress is good and tantalizing, but it isn't finished yet.