Threat Analysis Unit

fn_fuzzy: Fast Multiple Binary Diffing Triage with IDA

Summary

This week at HITBSecConf, Takahiro Haruyama, a Senior Threat Researcher for the CB Threat Analysis Unit (TAU), presented his work on fn_fuzzy, a tool that aims to help researchers and reverse engineers triage samples more quickly. This blog post details the motivation for the tool and its current state. The hope is that Takahiro’s work can help further advance the security community.

Takahiro can be reached on Twitter at @cci_forensics.

Details

Motivation

IDA Pro has long been the de facto disassembler for malware reverse engineers. It saves an analyst’s findings, such as function names and notes, into a corresponding database file (IDB). When analyzing a new malware variant, those findings can be imported by diffing the new sample against previously analyzed IDBs, allowing analysts to focus on functions that have not been analyzed before.

However, with multiple IDBs, importing the findings is not as straightforward. Experienced reverse engineers often have hundreds if not thousands of IDBs and typically don’t remember code they analyzed a few years ago. A tool that quickly identifies the most similar and most thoroughly analyzed IDBs is therefore needed.

JPCERT’s tool impfuzzy for Neo4j is handy for this kind of quick malware identification in large sample sets. The tool visualizes the results of malware clustering based on impfuzzy values to determine which malware family a target sample belongs to. However, its capability is limited to Windows PE executables, and it does not determine which sample IDB is the most analyzed. That’s why TAU created fn_fuzzy, which performs function-level binary diffing against large sets of IDBs.

Basic Concept

fn_fuzzy calculates two kinds of fuzzy hashes for each function located in the sample’s IDB.

  • Ssdeep hash value of code bytes

Relocation (fixup) bytes, direct memory reference data and other ignorable variable code are excluded from the calculation.

  • Machoc hash value of call flow graph

The Machoc value is used to correct the ssdeep result when the function code bytes are small or generated polymorphically.
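As a rough illustration of the first hash input, the sketch below drops ignorable byte ranges from a function’s code before it would be handed to ssdeep. The function name extract_hashable_bytes and the (offset, length) range format are assumptions made for this example, not fn_fuzzy’s actual code, and the sketch simply removes the bytes, which is only one possible reading of “excluded”.

```python
def extract_hashable_bytes(code, ignored_ranges):
    """Return `code` with every (offset, length) range removed.

    `ignored_ranges` stands in for the fixup/memory-reference locations
    that a disassembler would report (hypothetical input format).
    """
    out = bytearray()
    pos = 0
    for start, length in sorted(ignored_ranges):
        end = min(start + length, len(code))
        if start > pos:
            out += code[pos:start]   # keep the stable bytes before the range
        pos = max(pos, end)          # skip the ignorable bytes (handles overlap)
    out += code[pos:]                # keep whatever follows the last range
    return bytes(out)


# call rel32 ; ret  -> the 4-byte call target changes between builds
code = bytes.fromhex("e811223344c3")
stable = extract_hashable_bytes(code, [(1, 4)])
print(stable.hex())
```

Only the opcode bytes survive in this example, so two builds that differ solely in the call target would produce the same input for the fuzzy hash.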

All hashes are then saved into one database file, which is later used for comparison. This also allows us to import function names and prototypes from numerous IDBs into the target at once.
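To make the “one database file” idea concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The table name, column names and the placeholder hash strings are all hypothetical; the real fn_fuzzy schema may differ.

```python
import sqlite3

# Hypothetical schema: one row per function, keyed by sample hash + address.
SCHEMA = """
CREATE TABLE IF NOT EXISTS function (
    sha256   TEXT NOT NULL,      -- SHA256 of the input binary
    fva      INTEGER NOT NULL,   -- function start address
    fname    TEXT,               -- function name in the IDB
    ssdeep   TEXT,               -- fuzzy hash of the code bytes
    machoc   TEXT,               -- fuzzy hash of the flow graph
    analyzed INTEGER DEFAULT 0,  -- 1 if the function was renamed by an analyst
    PRIMARY KEY (sha256, fva)
)
"""


def open_db(path):
    """Open the database, creating the table if it does not exist yet."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store_function(conn, sha256, fva, fname, ssdeep, machoc, analyzed=False):
    """Insert a function record, overwriting any existing one for this key."""
    conn.execute(
        "INSERT OR REPLACE INTO function VALUES (?, ?, ?, ?, ?, ?)",
        (sha256, fva, fname, ssdeep, machoc, int(analyzed)),
    )
    conn.commit()


conn = open_db(":memory:")
# Placeholder values only; real hashes come from ssdeep and Machoc.
store_function(conn, "ab" * 32, 0x401000, "fn_decode", "3:a:b", "deadbeef", True)
print(conn.execute("SELECT COUNT(*) FROM function").fetchone()[0])
```

Keeping every sample in a single table is what lets one query compare a target function against functions from many IDBs at once.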

How to Use

fn_fuzzy requires two Python packages: mmh3 and python-idb. Additionally, fn_fuzzy currently supports only IDBs generated by IDA 6.9 or later, because it requires the ida_nalt.retrieve_input_file_sha256() API. You can check the version of IDA that first generated the IDB by running the following Python command (don’t use get_inf_attr(INF_VERSION)):

ida_netnode.cvar.root_node.supstr(ida_nalt.RIDX_IDA_VERSION)

Exporting hash values

The first thing we do is export hashes from IDBs. fn_fuzzy provides two methods for doing this:

  • Run fn_fuzzy.py in IDA to export the function hashes of the opened IDB
  • Run cli_export.py via the Windows Command Prompt to export multiple IDBs at once

Export by fn_fuzzy.py

When we execute fn_fuzzy.py on IDA, the following dialog will pop up. The dialog displays various execution options. The options are separated into four sections: General Options, Commands, Export Options and Compare Options.

Figure 1: fn_fuzzy.py execution option dialog

General Options

DB file path

Specify a SQLite database path for exported results.

minimum function code size

Hashes are calculated only for functions whose extracted code is at least this many bytes (default is 0x10).

exclude library/thunk functions

This is an exclusion option for library/thunk functions based on IDA FLIRT signatures.

Commands

Choose “Export” when exporting hashes.

Export Options

update the DB records

This option overwrites the database records corresponding to the sample.

store flags as analyzed functions

analyzed function name prefix/suffix (regex)

If this checkbox is enabled, fn_fuzzy assumes that functions renamed with the specified prefix/suffix string are already analyzed and stores a flag in each such function record. The flag is used by the “Compare” command to limit the comparison to analyzed functions only.
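The prefix/suffix check can be pictured as a single regular-expression test on each function name. In the sketch below, the helper is_analyzed and the patterns “^fn_” and “_x$” are illustrative assumptions; use whatever naming convention your own IDBs follow.

```python
import re


def is_analyzed(name, pattern=r"^fn_"):
    """Return True if the function name matches the analyst's naming pattern.

    The default pattern is a hypothetical convention in which analysts
    prefix every function they have reviewed with "fn_".
    """
    return re.search(pattern, name) is not None


for name in ("fn_decrypt_config", "sub_401000", "decode_x"):
    print(name, is_analyzed(name), is_analyzed(name, r"_x$"))
```

Auto-generated names such as sub_401000 never match, so only functions an analyst deliberately renamed end up flagged.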

Compare Options

These options will be explained later in this post.

Press Run in the dialog, and the function hashes are calculated and exported. If the DB doesn’t exist, it will be created.

Figure 2: Output window after Export command

Export by cli_export.py

In practice, you will more often want to export hashes from multiple IDBs at once. cli_export.py is a wrapper script that calls the fn_fuzzy.py “Export” command.

Figure 3: command-line options of cli_export.py

The options are similar to those of the “Export” command of fn_fuzzy.py. In addition, the script can export IDBs recursively from a specified folder (the -r option).

Figure 4: multiple IDB functions export by cli_export.py
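The folder handling can be sketched as a directory walk that collects database files, descending into subfolders only when recursion is requested. This is not cli_export.py’s code: the helper find_idbs and the choice of the .idb and .i64 extensions are assumptions for illustration.

```python
import os

# Assumed extensions for 32-bit and 64-bit IDA databases.
IDB_EXTENSIONS = (".idb", ".i64")


def find_idbs(folder, recursive=False):
    """Return sorted paths of IDA databases under `folder`."""
    found = []
    for root, _dirs, files in os.walk(folder):
        for name in files:
            if name.lower().endswith(IDB_EXTENSIONS):
                found.append(os.path.join(root, name))
        if not recursive:
            break  # the first iteration is the top-level folder itself
    return sorted(found)


for path in find_idbs(".", recursive=True):
    print(path)
```

Each path found this way would then be passed to the export step one at a time.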

Comparing hash values

After exporting the function hash values of previously analyzed IDBs, we can compare an unknown sample with the DB by running fn_fuzzy.py in IDA. The “Compare” command provides several options for performance and similarity thresholds. The performance-related options are below:

compare with only analyzed functions

This option limits the comparison to analyzed functions, based on the flag stored in the DB when the “Export” command was executed.

compare with only IDBs in the specified folder

the folder path

This option restricts a comparison to the functions of the IDBs in the specified folder path.

function code size comparison criteria (0-100)

Each function hash comparison targets only functions of similar size. The default value is 40, which means a function is compared only with functions whose size is 60% to 140% of its own.
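The arithmetic behind the size criterion is easy to check with a small sketch. The helper names are made up for this example, and treating both bounds as inclusive is an assumption.

```python
def size_window(size, criteria=40):
    """Return the (low, high) code sizes allowed for a function of `size` bytes."""
    low = size * (100 - criteria) // 100
    high = size * (100 + criteria) // 100
    return low, high


def is_size_comparable(primary_size, candidate_size, criteria=40):
    """True if the candidate falls inside the primary function's size window."""
    low, high = size_window(primary_size, criteria)
    return low <= candidate_size <= high


print(size_window(200))              # (120, 280)
print(is_size_comparable(200, 100))  # False
```

With the default of 40, a 200-byte function is compared only with functions between 120 and 280 bytes, which skips most of the DB before any fuzzy hash comparison takes place.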

Once the “Compare” command is finished, a summary tab of the comparison result is added to the IDA GUI. For each IDB that contains similar functions, the tab lists the SHA256 hash value of the binary file and its IDB path.

Figure 5: fn_fuzzy summary tab

In order to check the details per IDB, double-click one of the lines. Then an additional tab will be displayed.

Figure 6: similarity details tab for a specific IDB

Here, “primary function” refers to a function in the opened IDB, and “secondary analyzed function” refers to the analyzed function in the double-clicked IDB with the highest ssdeep score against that primary function.

fn_fuzzy reports a function as similar when it matches at least one of the following conditions:

  • function similarity score threshold (0-100) without CFG match

Functions whose ssdeep scores are more than the threshold will be detected even if the CFG (Machoc) hashes are not matched. Default is 50.

  • function similarity score threshold (0-100) with CFG match

Functions whose ssdeep scores are more than the threshold will be detected if the CFG hashes are matched. Default is 10.

  • function code size threshold evaluated by only CFG match

Functions whose code sizes are more than the threshold will be detected if the CFG hashes are matched. Default is 0x100 (256) bytes.

The example below includes matches for all three conditions.

Figure 7: example of detected similar functions
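The three detection conditions listed above can be summarized as one predicate. This is a sketch based only on the descriptions in this post: the function is_similar and its parameter names are hypothetical, and “more than the threshold” is read as a strict comparison.

```python
def is_similar(ssdeep_score, cfg_match, code_size,
               score_no_cfg=50, score_with_cfg=10, size_cfg_only=0x100):
    """Return True if a function pair meets any of the three conditions."""
    # 1. High ssdeep score, regardless of the CFG (Machoc) hash.
    if ssdeep_score > score_no_cfg:
        return True
    # 2. Modest ssdeep score, backed by a matching CFG hash.
    if cfg_match and ssdeep_score > score_with_cfg:
        return True
    # 3. CFG hash match alone, but only for sufficiently large functions.
    if cfg_match and code_size > size_cfg_only:
        return True
    return False


print(is_similar(60, False, 0x20))   # condition 1
print(is_similar(30, True, 0x20))    # condition 2
print(is_similar(0, True, 0x200))    # condition 3
print(is_similar(0, True, 0x80))     # no condition met
```

The third condition is what lets a polymorphic or heavily patched function still be detected when its ssdeep score is zero, while the size threshold keeps tiny functions with trivially identical graphs from flooding the results.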

By right-clicking on a similarity details tab and selecting “Import function name and prototype”, we can import that information into the primary functions.

Figure 8: importing function names and prototypes

If structure types in a function prototype to import are not defined in the opened IDB, fn_fuzzy will ask to import type information from the source (secondary) IDB after a warning dialog.

Figure 9: dialog asking to import type information

After importing, the primary function names will be updated automatically.

Figure 10: imported function names

Function prototypes are imported as well, which is useful when calling functions through IDA Appcall.

Figure 11: imported function prototype

Wrap-up

In this article, TAU described how to use fn_fuzzy, a fast and lightweight function-level binary diffing tool. fn_fuzzy focuses on fast triage for large sets of IDBs and relies on just two kinds of fuzzy hash values (ssdeep and Machoc) to detect similarities, so its accuracy is not as good as that of BinDiff or Diaphora. TAU recommends using fn_fuzzy to narrow down the most similar and most analyzed IDBs before performing one-on-one diffing.