Originally published on January 21st, 2008 on OpenRCE

Here's an old and simple trick that I use extensively while recovering large structures in C++ reversing. Briefly, the challenges in structure recovery are to determine:

The size of the structure;
The inheritance hierarchy that relates this structure to others;
The location of the members within the structure;
The data types of the members;
All of the locations in the code where the structure is used;
The holistic picture: the overall purpose of the structure and the contributions of each data member to that.

This entry is concerned with point #5. Let's assume that we know (#3) where a particular member within a structure is situated. In order to figure out (#4) its data type and (#6) its purpose, we should inspect (#5) the locations at which this data member is used. We might get lucky; maybe we'll find something like this:

mov eax, [esi+Structure.field_XYZ]
push eax
push offset fmtstr ; "%s:Loading into memory for emulation\n"
call LoggingFunction
add esp, 8

From this we can infer both the data type (char *) and functionality (it's a pointer to the name of the file that is about to be emulated), and draw a conclusion about the overall structure (that it's probably related to emulation).  Perhaps we won't get as lucky as this scenario, but maybe a more subtle clue is revealed by one of the references.  So, how do we find other locations at which this structure member is being used?

The obvious answer would be to text-search for the phrase "[reg32+0XYZh]", but this method has a few drawbacks:

A)  It's slow;
B)  It relies upon the disassembler properly distinguishing code from data, which is in fact impossible to solve generally due to equivalence with the halting problem (a result of indirect addressing, which is the bread and butter of C++'s implementation of polymorphism via function pointers);
C)  Finding the string above just tells us that field_XYZ in *some* structure is being used, not necessarily our particular structure of interest.

Point C is critical and bears closer inspection; how can we be sure that the results we are finding actually refer to the structure that interests us?  Let's examine the situation for some specific values of XYZ:

Q:  How many structures contain a member defined at XYZ = +0?
A:  All of them.  Therefore if we were to search for [reg32], we could make no guarantees about which structure is actually being used (or even that a structure is being used, period).

Q:  How many structures contain a member defined at XYZ = +4?
A:  Most of them.  The same comment from above applies.

Q:  How many structures contain a member defined at XYZ = +40?
A:  Few of them.  In my experience a program generally contains proportionally very many structures that have size 0x40 or less, and proportionally very few structures larger than that.

Q:  How many structures contain a member defined at XYZ = +X, where X >= 0x80?  X >= 0x100?  X >= 0x1000?  X >= 0x10000?  X >= 0x80 and X is not dword-boundary-aligned?  X >= 0x80 and X is not a multiple of a high power of two?
A:  The larger the structure, the better the chance that the location of its data members are unique, or if not unique, then at least that the structures were derived from a common base class.

The first lesson is that the higher the offset within the structure, the fewer structures are going to have data members defined at that offset, which means that offset searching begins to become feasible for these high-offset data members.  Point C from above is addressed.

On point B, let's briefly look at some characteristics of instruction encoding on x86. Below are some typical structure references:

8B 16 mov edx, [esi] ; notice that the +0 is not present in the encoding
66 89 42 36 mov [edx+36h], ax ; notice that the +36h is present as a byte in the encoding
8B 8E AC 5F 00 00 mov ecx, [esi+5FACh] ; notice that the +5FACh is present as a dword in the encoding

Displacements off of a register that fit into a single signed byte, e.g. [esi-80h] ... [esi+7Fh], are represented with a single signed byte in the instruction's encoding, e.g. the 36h from the above. But searching for a byte is no good; a single byte could appear in any context. Displacements off of a register that are outside of this small window, e.g. [esi+80h], are represented with a dword in the instruction's encoding, e.g. the AC 5F 00 00 from the above. Therefore, any time one of these high-offset structure members is accessed directly, we're going to see an entire dword in the instruction stream that corresponds to the offset of the structure member. Searching for an entire dword gives much more precise results than searching for a byte.

Now all of the machinery is in place for the real point of this entry. Suppose we can't figure out field_5FAC's data type or functionality, and we would like to see other references to that member to see if they provide any clues. We could text search for the regular expression [.*+5FACh], and we would be reasonably sure that we were finding references to our structure of interest, or at least structures from the same family, but it would be slow, and would only find references that were defined as code.

This is where IDA's "binary search" feature, alt-B or Search->Sequence of bytes..., comes in handy. Enter AC 5F 00 00 into the window. IDA instantaneously brings up a window with sixty-nine lines of code, all of which have the form "mov reg32, [reg32+5FACh]" or "mov [reg32+5FACh], reg32". There is one additional result of the form "db 0ACh", which, when the surrounding bytes are turned into code, is revealed to be a structure reference of the aforementioned variety. None of the results are false positives.

The point of this blog entry was to say that, the larger a structure becomes, the more "unique" the addresses of the members within the structure become, and due to the instruction encoding on x86, we can find all direct references to the high-offset structure members quickly, easily, and with few to no false positives using IDA's binary search feature.