IDA-Python - Locate a function independently from its offset

Introduction

Analyzing samples statically is not always an easy task, especially when you deal with heavily obfuscated or encrypted data. Because of that, it’s useful to find the decryption/deobfuscation routine and write a little script that makes sense of strings or function names before starting to analyze the whole sample.

The idea behind this technique came from the endless tweaking I had to do to run IDA-Python scripts that depended heavily on machine-specific settings. It is very simple and is based on pattern matching, as with YARA rules. The goal is to create a kind of signature for the function we would like to locate (e.g., a deobfuscation routine), find where it is called, and apply the routine to its parameters, so that the script simplifies our analysis without hard-coding values such as the function offset. I have already had good results experimenting with malware samples (e.g., Qbot), custom binaries from CTFs, and other samples that used some sort of obfuscation. Because of that, I decided to write this post and share an approach that could be useful for writing better scripts.

Generally speaking, such samples can be approached from multiple perspectives (e.g., starting with dynamic analysis and working backwards). For the purpose of this blog post, however, I’m going to take a custom sample that I wrote, where strings are partially obfuscated with a custom algorithm, and show an easy but effective way to write an IDA-Python script that locates and deobfuscates them.

Practical example

First things first. In order to deobfuscate a piece of code, we still need to analyze the function that performs this task and replicate its behavior in our script. That step is out of scope for this post; instead, I’m going to analyze the assembly code directly to work out how to write an IDA-Python script that collects all the obfuscated variables and applies the proper routine to them.

The steps we are going to follow are roughly these:

Find out the deobfuscation function;
Create a “signature” to retrieve all its references;
Collect the obfuscated parameters;
Apply the deobfuscation routine;

Creating function signature

Since this sample is actually very simple, finding the right function is not tricky. In fact, we can see a DWORD being passed to a function called “DeobfuscateString”. What really matters here, however, is creating a signature of this function so that we can locate it in multiple samples, independently of the offset at which it is loaded.

Figure 1 - Main function overview

To do so, we can open the text view and explore the function to collect a few bytes that are useful as a signature. I have chosen the bytes around the call to strcspn, which is part of the deobfuscation routine. More precisely, I have gathered 14 bytes (8D 55 E0 52 8D 45 E4 50 FF 15 CC 20 40 00).

Figure 2 - Deobfuscation routine
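One thing worth noticing about these bytes: FF 15 is the x86 encoding of an indirect call through an absolute address, so the last four bytes (CC 20 40 00) are the address of an import slot and may differ in another build of the same code. The sketch below is plain Python that runs outside IDA and only illustrates the matching idea. The names compile_signature and find_signature, the ?? wildcard convention and the two byte buffers are assumptions made for this example and are not part of the original script. IDA's own binary search should accept ? for unknown bytes as well, but the exact syntax is worth checking against the documentation of your IDA version.

```python
def compile_signature(text):
    """Turn 'FF 15 ?? ??' into a list of ints, using None for wildcard bytes."""
    return [None if tok in ("?", "??") else int(tok, 16) for tok in text.split()]


def find_signature(data, signature, start=0):
    """Return the offset of the first match at or after `start`, or -1."""
    sig = compile_signature(signature)
    last = len(data) - len(sig)
    for pos in range(start, last + 1):
        # A wildcard (None) matches any byte; every other byte must be equal.
        if all(b is None or data[pos + i] == b for i, b in enumerate(sig)):
            return pos
    return -1


# The 14 bytes from the post, and a variant that ignores the import address.
EXACT = "8D 55 E0 52 8D 45 E4 50 FF 15 CC 20 40 00"
LOOSE = "8D 55 E0 52 8D 45 E4 50 FF 15 ?? ?? ?? ??"

# Hypothetical buffers: the same instructions at different offsets, and in
# build_b the import slot lives at a different address.
build_a = bytes.fromhex("90 90 90 " + EXACT)
build_b = bytes.fromhex("90 90 90 90 90 8D 55 E0 52 8D 45 E4 50 FF 15 10 30 40 00")

print(find_signature(build_a, EXACT))
print(find_signature(build_b, EXACT))
print(find_signature(build_b, LOOSE))
```

The exact signature only matches the first buffer, while the wildcard variant matches both regardless of where the code sits, which is the property we want from an offset-independent signature.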

Now we can check whether the collected bytes are enough to identify this function. Let’s write a little script that checks this.

Searching for the function characteristics

The first thing to do is to get the .text segment and then search it for our signature. One of the easiest ways to get the segment is to iterate over all segments until idc.get_segm_name returns “.text”:

import idautils
import idc

def get_text_segment(seg_name=".text"):
    # Walk every segment and return the bounds of the one named seg_name.
    for s in idautils.Segments():
        start = idc.get_segm_start(s)
        end = idc.get_segm_end(s)
        name = idc.get_segm_name(s)
        if name == seg_name:
            return start, end

Then we need a function that is able to locate our signature:

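import idaapi

def patternSearch(address_start, address_end):
    # Vector that receives the compiled form of the pattern.
    pattern = idaapi.compiled_binpat_vec_t()
    res = idaapi.parse_binpat_str(
        pattern,
        address_start,
        "8D 55 E0 52 8D 45 E4 50 FF 15 CC 20 40 00",
        16,  # radix: the pattern text is hexadecimal
        0)
    if res is None:
        print("[!] Could not parse the pattern")
        return None
    ea = idaapi.bin_search(address_start, address_end, pattern, idaapi.BIN_SEARCH_CASE)
    # bin_search reports a failed search with BADADDR.
    if ea == idaapi.BADADDR:
        print("[!] Pattern not Found")
        return None
    print("[+] Pattern found at: {0}".format(hex(ea)))
    return ea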

At first glance this code may look messy, but it’s easier than expected. Let’s take a closer look. The parameters address_start and address_end are the bounds of the .text segment, since we search for the signature across the whole segment. To understand the parameters passed to idaapi.parse_binpat_str, we can refer to its documentation. For convenience, its signature is given here:

parse_binpat_str(out, ea, _in, radix, strlits_encoding=0) -> str

The least obvious parameter is out: it is a vector of compiled binary patterns that is later passed to bin_search(). The radix argument is the numeric base of the pattern text, so 16 here means hexadecimal, not a byte count. Running these functions together, we can see a few references to the deobfuscation routine within the .text segment; if we look up those references in IDA, we can see that those few bytes were enough to get everything we need to move on and collect the obfuscated parameters.

Figure 3 - Code reference of deobfuscation routine

Collecting parameters

We have now verified that our script identifies the function within the .text segment. Before collecting its parameters, however, we also need to know where it is used. To do so, we can pass the value returned by bin_search to idaapi.get_func(), which returns the function that contains the matched bytes, i.e. our deobfuscation routine, together with its start address. As a final step, to locate all the references to that function we can use idautils.XrefsTo as follows:

def locate_decrypt_function(function_address):
    xref_list = []
    for xref in idautils.XrefsTo(function_address):
        if xref.frm not in xref_list:
            xref_list.append(xref.frm)
    return list(xref_list)

With this method I’ve built a list that contains every reference to the function. From that list, we can iterate over each function call and retrieve the obfuscated string.

Figure 4 - finding function code and its usage

Now, in order to find the parameters, we need to look at the pattern this sample uses to pass the obfuscated string. Looking at all the occurrences retrieved, we can write a function that finds the push instruction (opcode 68) whose operand is shown as “offset”.

def gather_string_offset(xref_address):
    function_start = idc.get_func_attr(xref_address, idc.FUNCATTR_START)
    new_address = xref_address
    while True:
        prev_address = idc.prev_head(new_address)
        if prev_address <= function_start:
            break
        if idc.print_insn_mnem(prev_address) == "push":
            if idc.print_operand(prev_address,0).split()[0] == "offset": 
                return idc.get_operand_value(prev_address, 0) 
        new_address = prev_address
    print("[!] Offset not found.")
    return None

The function presented here is pretty straightforward. It takes xref_address, the address of a call DeobfuscateString instruction, and computes function_start, the start of the function that contains that call. Since the code is all written within main, function_start will be the first instruction of main; in other samples the call could be part of another subroutine, in which case function_start will point to the first instruction of that routine. Then, starting from the call DeobfuscateString instruction, we move backward to find the offset parameter that points to our obfuscated string. It’s worth mentioning that the check prev_address <= function_start is there to avoid walking past the boundary of the containing function (it doesn’t make sense to go back beyond that point, since the code we are looking for is inside it).
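To make the backward walk easier to follow outside IDA, here is a minimal plain-Python model of it. The instruction list, its (mnemonic, operand text, operand value) layout, the symbol names and the helper find_pushed_offset are all made up for illustration; in the real script the same information comes from idc.prev_head, idc.print_insn_mnem, idc.print_operand and idc.get_operand_value.

```python
def find_pushed_offset(instructions, call_index):
    """Walk backwards from a call and return the value of the nearest
    `push offset ...` operand, or None if there is none.

    `instructions` is a list of (mnemonic, operand_text, operand_value)
    tuples describing one function, in address order.
    """
    # Index 0 (the first instruction of the function) is never inspected,
    # mirroring the `prev_address <= function_start` check in the script.
    for index in range(call_index - 1, 0, -1):
        mnemonic, operand_text, operand_value = instructions[index]
        if mnemonic == "push" and operand_text.split()[:1] == ["offset"]:
            return operand_value
    return None


# Hypothetical disassembly of a function with two calls to the routine.
main = [
    ("push", "ebp", 0),
    ("mov", "ebp, esp", 0),
    ("push", "offset aObfuscated1", 0x403000),
    ("call", "DeobfuscateString", 0x401000),
    ("add", "esp, 4", 0),
    ("push", "eax", 0),
    ("push", "offset aObfuscated2", 0x403010),
    ("push", "ecx", 0),
    ("call", "DeobfuscateString", 0x401000),
]

print(hex(find_pushed_offset(main, 3)))
print(hex(find_pushed_offset(main, 8)))
```

Note that the walk does not stop at an earlier call: if a call site has no push offset of its own, the search keeps going and may return the argument of a previous call. For this sample that is not a problem, but it is something to keep in mind on other binaries.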

Now that we have collected the string reference for each call, we still need the whole obfuscated string. So far our code gives us the address of the first byte of each string, but we need the complete string in order to pass it to the deobfuscation routine:

def retrive_string(offset):
    obfuscated_string = ""
    counter = 0
    while chr(idaapi.get_byte(offset + counter)) != "\x00":
        obfuscated_string += chr(idaapi.get_byte(offset + counter))
        counter +=1
    return obfuscated_string

This function starts from the first byte of the string and moves forward until it finds a NULL byte (in this case, the NULL byte marks the end of our string).
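The same loop can be tried outside IDA on an ordinary bytes object. The sketch below is an assumption-laden variant rather than the code used in the final script: read_c_string, its max_length guard and the sample buffer are hypothetical, and it returns bytes instead of str because obfuscated data is not guaranteed to be printable text.

```python
def read_c_string(memory, offset, max_length=4096):
    """Collect bytes from `offset` until a NUL byte, the end of the buffer,
    or `max_length` bytes, whichever comes first."""
    collected = bytearray()
    while offset + len(collected) < len(memory) and len(collected) < max_length:
        byte = memory[offset + len(collected)]
        if byte == 0:
            break  # NUL terminator: end of the string
        collected.append(byte)
    return bytes(collected)


# Hypothetical data section holding two NUL-terminated strings.
memory = b"\x00\x00Khoor\x00Zruog\x00"

print(read_c_string(memory, 2))
print(read_c_string(memory, 8))
```

The length guard is a design choice for this sketch: if the offset recovered in the previous step is wrong, reading stops after a bounded number of bytes instead of running until a zero byte happens to show up.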

Deobfuscation function

After retrieving the string, it’s a matter of implementing the deobfuscation routine and applying it to the collected strings. Implementation details are left as an exercise for curious reversers who would like to work it out on their own. Since the routine of this specific sample is of little use in a real-world scenario, it is skipped in this article and implemented directly in the final IDA-Python script.
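Whatever the routine turns out to be, the glue code that applies it has roughly the same shape. The following is a plain-Python sketch of that last step, not the implementation found in fSearch.py: deobfuscate_all is a hypothetical helper, the dictionaries stand in for the IDA-backed functions (gather_string_offset and retrive_string in this post), and reversing the bytes is only a placeholder that has nothing to do with the sample's real algorithm.

```python
def deobfuscate_all(call_sites, find_argument, read_string, deobfuscate):
    """Map each call site address to its deobfuscated string.

    Call sites whose argument cannot be resolved are skipped.
    """
    results = {}
    for site in call_sites:
        offset = find_argument(site)
        if offset is None:
            continue
        results[site] = deobfuscate(read_string(offset))
    return results


# Hypothetical stand-ins for the data that IDA would provide.
fake_arguments = {0x401050: 0x403000, 0x401080: 0x403010, 0x4010B0: None}
fake_strings = {0x403000: b"olleh", 0x403010: b"dlrow"}

recovered = deobfuscate_all(
    call_sites=sorted(fake_arguments),
    find_argument=fake_arguments.get,
    read_string=fake_strings.__getitem__,
    deobfuscate=lambda data: data[::-1],  # placeholder, NOT the sample's routine
)

for site, text in recovered.items():
    print(hex(site), text)
```

Passing the helpers in as arguments keeps the loop independent of IDA, so the deobfuscation routine can be developed and checked on its own before it is plugged into the script.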

Script Testing

To run our script, go to File > Script file and select the IDA-Python script. Testing the script in IDA, we can see that all the obfuscated strings have been correctly deobfuscated and that a short comment has been added next to each string reference.

Figure 5 - Variable deobfuscated successfully

Conclusion

This post illustrates an effective and quite useful technique for locating interesting functions within a binary and for writing a custom IDA-Python script that finds all occurrences of a function. This approach also helps avoid configuration dependencies, so that our script can be shared without tweaking parameters that are specific to one IDA instance.

References

IDA-Python script: fSearch.py
Source code: obfuscation.c
Compiled sample: obfuscation.exe