Nowadays, machine learning is routinely used in the detection of network attacks and the identification of malicious programs. In most ML-based approaches, each analysis sample (such as an executable program, an office document, or a network request) is analyzed and a number of features are extracted. For example, in the case of a binary program, one might extract the names of the library functions being invoked, the length of the sections of the executable, and so forth.
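As an illustration of this feature-extraction step, here is a minimal sketch in Python. The sample structure and feature names below are invented for the example; a real pipeline would parse an actual executable format (such as PE or ELF) rather than a dictionary.

```python
# Hypothetical sketch: turning a parsed binary into named features.
# The "sample" structure is made up for illustration; a real system
# would extract this information from the executable itself.

def extract_features(sample):
    """Map a parsed sample to a dict of named feature values."""
    return {
        "num_imports": len(sample["imports"]),
        # Windows network APIs often come from the Ws2_32 library.
        "uses_network_api": int(any(name.startswith("Ws2_") for name in sample["imports"])),
        "total_section_size": sum(sec["size"] for sec in sample["sections"]),
        "num_sections": len(sample["sections"]),
    }

sample = {
    "imports": ["CreateFileA", "Ws2_32.connect", "WriteFile"],
    "sections": [{"name": ".text", "size": 4096}, {"name": ".data", "size": 1024}],
}

features = extract_features(sample)
print(features)
```

Each sample is thus reduced to a fixed set of numbers that a learning algorithm can consume, and the choice of which numbers to compute is exactly the feature engineering discussed below.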
Then, a machine learning algorithm is given as input a set of known benign and known malicious samples (called the ground truth). From the feature values contained in this ground truth dataset, the algorithm creates a model that is able to classify the known samples correctly. If the dataset from which the algorithm has learned is representative of the real-world domain, and if the features are relevant for discriminating between benign and malicious programs, chances are that the learned model will generalize and allow for the detection of previously unseen malicious samples.
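The train-then-classify loop can be sketched with a deliberately simple learner. A nearest-centroid classifier stands in here for whatever algorithm a real detection system would use, and the feature vectors are toy data:

```python
# Sketch: learn from a labeled ground truth, then classify a new sample.
# Nearest-centroid is a stand-in for a real ML algorithm.

def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def train(ground_truth):
    """ground_truth: list of (feature_vector, label), label in {'benign', 'malicious'}."""
    return {
        label: centroid([v for v, lab in ground_truth if lab == label])
        for label in ("benign", "malicious")
    }

def classify(model, vector):
    """Assign the label whose centroid is closest (squared Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(model, key=lambda label: dist(model[label], vector))

ground_truth = [
    ([2, 0.1], "benign"), ([3, 0.2], "benign"),
    ([40, 0.9], "malicious"), ([50, 0.8], "malicious"),
]
model = train(ground_truth)
print(classify(model, [45, 0.85]))  # prints "malicious"
```

The important point is the division of labor: the algorithm learns the decision boundary, but a human still decided what the coordinates of each vector mean.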
The Role of Feature Engineering
Even though the description above is an oversimplification of the actual process, the key point is that some feature engineering is necessary. Namely, (human, expensive) data scientists have to decide which features need to be extracted from each sample, and this decision is guided by their domain knowledge or, more prosaically, by a gut feeling about which features are really useful for detection. But what if they don’t get it right? For example, in a recent experiment, security researchers were able to evade the detection system of a security product by embedding strings associated with a benign video game.
A New Approach
In this research, we collaborated with the University of California, Santa Barbara to explore a novel approach to malware detection that does not require feature engineering. The approach relies on an information-rich representation of programs: namely, the report produced by sandboxing technology. These reports detail the actions performed by a program when executed in a controlled environment (called, aptly, “the sandbox”).
Not all sandboxes are created equal, but they share a common feature: instead of focusing on the static aspects of a program (that is, its code, or the way in which its data is packaged), sandboxes focus on the dynamic aspects of the program’s actual execution (for example, which files were accessed, which processes were created, which network connections were established). In the end, these reports can be seen as lengthy, detailed documents about the actions performed by programs.
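To make the “report as document” idea concrete, here is a sketch that flattens a made-up mini-report into a flat sequence of word tokens. Real sandbox reports contain many more event types, and the exact tokenization used by any given system will differ:

```python
import re

# Hypothetical mini-report; real sandbox reports are far richer.
report = {
    "files_accessed": ["C:\\Windows\\system32\\kernel32.dll"],
    "processes_created": ["cmd.exe /c whoami"],
    "network": [{"host": "203.0.113.7", "port": 443}],
}

def tokenize(obj):
    """Recursively flatten a report into a sequence of lowercase word tokens."""
    if isinstance(obj, dict):
        return [t for k, v in obj.items() for t in (tokenize(k) + tokenize(v))]
    if isinstance(obj, list):
        return [t for item in obj for t in tokenize(item)]
    # Leaf values: split strings/numbers on non-alphanumeric characters.
    return [t.lower() for t in re.split(r"\W+", str(obj)) if t]

tokens = tokenize(report)
print(tokens[:6])  # ['files_accessed', 'c', 'windows', 'system32', 'kernel32', 'dll']
```

The output is just a list of words, which is exactly what standard document-classification machinery expects as input.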
Our approach, called Neurlux, uses these documents as input, and applies deep learning techniques to create a classifier that is able to discriminate between malicious programs and benign ones. More precisely, Neurlux treats these reports as it would treat any other document: as a series of words. These words are transformed into vectors in a process called embedding. Finally, the vectors are given as input to a neural network that combines several techniques: technically speaking, a Convolutional Neural Network (CNN), a Bidirectional Long Short-Term Memory Network (BiLSTM), and an Attention Network.
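The embedding step can be illustrated without a deep learning framework: build a vocabulary over the tokens and look each token up in a matrix of vectors. In a real model these vectors are trained by backpropagation, and the downstream network (CNN, BiLSTM, attention) would be built in a framework such as PyTorch; only the lookup is sketched here, with arbitrary dimensions:

```python
import random

random.seed(0)  # make the random initialization reproducible

# Toy token stream; in Neurlux these come from the sandbox report.
tokens = ["createfile", "connect", "443", "writefile", "connect"]

# 1. Build a vocabulary mapping each distinct word to an integer index.
vocab = {word: idx for idx, word in enumerate(dict.fromkeys(tokens))}

# 2. A randomly initialized embedding table (vocab_size x embedding_dim).
#    During training, these vectors would be adjusted by backpropagation.
embedding_dim = 4
embeddings = [[random.uniform(-1, 1) for _ in range(embedding_dim)] for _ in vocab]

# 3. Embed the document: one vector per token; repeated words share a vector.
embedded = [embeddings[vocab[word]] for word in tokens]
print(len(embedded), len(embedded[0]))  # prints "5 4"
```

Note that the two occurrences of "connect" map to the same vector: the embedding captures the identity of a word, while the neural network layers that follow capture how words combine across the report.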
To determine whether this approach is indeed effective at detecting malware, we compared Neurlux to a state-of-the-art approach that relies on feature engineering (see the technical paper for the details). Neurlux achieved an accuracy of 96.8%, compared to 89.2% for the state-of-the-art approach.
These results show that an approach that does not rely on feature engineering can be extremely effective in real-world settings.
In addition, the fact that no human was involved in determining which specific features had to be extracted and encoded allows the approach to be “future-proof”: if suddenly a new aspect of the execution becomes relevant to the detection of malicious programs, the system does not need to be modified, but simply re-trained.
Given that these systems are operating in continuous training mode to address the ever-changing threat landscape, a feature-less approach provides greater effectiveness without requiring human experts to continually tweak the system.
And we know how overwhelmed our data analysts already are.
Find all the details in the following paper: