Everywhere I look, someone’s talking about machine learning (ML) or artificial intelligence (AI). These two technologies are shaping important conversations in multiple sectors, especially marketing and sales, and are at risk of becoming overused and misunderstood buzzwords, if they haven’t already. The technologies have also drawn the attention of security professionals over the past few years, with some believing that AI is ready to transform information security.
Despite this hype, there’s still a lot of confusion around AI and ML and their utility for information security. In this blog post, I would like to correct some misperceptions. Let’s start by differentiating machine learning from artificial intelligence in general.
Machine Learning vs. Artificial Intelligence
Artificial intelligence is the science of trying to replicate intelligent, human-like behavior. There are multiple ways of achieving this — machine learning is one of them. For example, a type of AI system that does not involve machine learning is an expert system, in which the skills and decision process of an expert are captured through a series of rules and heuristics.
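To make the expert system idea concrete, here is a minimal sketch of a rule-based detector with no learning involved. The rule descriptions, event fields, and thresholds are hypothetical and chosen only for illustration. In a real expert system they would come from a human analyst.

```python
# Hypothetical rules an analyst might write down; nothing here is learned from data.
RULES = [
    ("writes to a system directory and disables logging",
     lambda e: e["writes_system_dir"] and e["disables_logging"]),
    ("many failed logins in a short time",
     lambda e: e["failed_logins"] >= 10),
]

def evaluate(event):
    """Return the description of every rule that the event triggers."""
    return [name for name, predicate in RULES if predicate(event)]

# Illustrative event with made-up field values.
event = {"writes_system_dir": True, "disables_logging": True, "failed_logins": 2}
print(evaluate(event))
```

The system only knows what its rules encode. An event that matches no rule passes through untouched.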
Machine learning is a specific type of AI. An ML system analyzes a large data set in order to learn rules for assigning each data point to a category. To take one example, machine learning can be used to analyze network behavior data and categorize it as normal or anomalous.
Given these definitions, all ML systems are also AI systems. However, not all AI systems use machine learning. It’s similar to saying that while humans are mammals, not all mammals are humans. In practice, though, few of the other AI techniques are in wide use today. If we find ourselves in a situation in which the only AI systems are those that use ML, then the two terms would effectively be synonymous, just as “human” and “mammal” would be synonymous if there were no mammals other than humans.
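As a rough illustration of the network example, the sketch below learns a baseline from past traffic and then labels new observations. The metric (connections per minute), the numbers, and the threshold of three standard deviations are all assumptions made for illustration. A real system would learn from far more data and many more features.

```python
import statistics

def fit_baseline(history):
    """Learn the mean and standard deviation of past, presumed-normal traffic."""
    return statistics.mean(history), statistics.pstdev(history)

def categorize(value, baseline, threshold=3.0):
    """Label a new observation by how far it sits from the learned baseline."""
    mean, stdev = baseline
    if stdev == 0:
        return "normal" if value == mean else "anomalous"
    z_score = abs(value - mean) / stdev
    return "anomalous" if z_score > threshold else "normal"

# Hypothetical connections-per-minute counts for one host.
history = [120, 130, 125, 118, 132, 127, 123, 129]
baseline = fit_baseline(history)

print(categorize(126, baseline))  # close to the baseline
print(categorize(480, baseline))  # far above the baseline
```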
There are two main branches of machine learning: supervised and unsupervised. Supervised ML learns a mapping from input variables to known output labels so that it can make accurate predictions about new data. In terms of threat detection, an ML algorithm could use known suspicious behaviors, each labeled “malicious,” as the ground truth for developing a threat classifier. It can then use that classifier to analyze new samples.
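A minimal sketch of supervised learning follows, using a nearest-centroid classifier as a stand-in for whatever algorithm a real product would use. The two features (registry writes and outbound connections per sample) and all of the training values are hypothetical.

```python
from math import dist

def train(samples):
    """Compute one centroid per label from (feature_vector, label) pairs."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def classify(centroids, features):
    """Assign the label whose centroid is closest to the new sample."""
    return min(centroids, key=lambda label: dist(centroids[label], features))

# Labeled ground truth: (registry writes, outbound connections), label.
training = [
    ((1, 2), "benign"), ((0, 1), "benign"), ((2, 3), "benign"),
    ((9, 12), "malicious"), ((11, 15), "malicious"), ((10, 14), "malicious"),
]
centroids = train(training)

print(classify(centroids, (8, 11)))
print(classify(centroids, (1, 1)))
```

The labels in the training data are what make this supervised: the classifier can only be as good as the ground truth it was given.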
In unsupervised ML, the second branch of machine learning, a system tries to cluster groups of data together, based on the data’s features. In this case, the result is the identification of groups of similar elements, which allows an analyst, for example, to handle a large number of similar samples based on a single decision (e.g., all these emails have similar attachments which are all malicious).
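The email example can be sketched with a simple greedy clustering pass over attachment features. This is only one of many possible clustering approaches, and the feature sets, the Jaccard similarity measure, and the 0.5 threshold are assumptions picked for illustration.

```python
def jaccard(a, b):
    """Similarity of two feature sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def cluster(items, threshold=0.5):
    """Greedy single pass: join the first group whose first member is similar enough."""
    groups = []
    for name, features in items:
        for group in groups:
            if jaccard(features, group[0][1]) >= threshold:
                group.append((name, features))
                break
        else:
            # No existing group was similar enough, so start a new one.
            groups.append([(name, features)])
    return [[name for name, _ in group] for group in groups]

# Hypothetical attachment features extracted from five emails; no labels are given.
emails = [
    ("email1", {"macro", "docm", "invoice"}),
    ("email2", {"macro", "docm", "payment"}),
    ("email3", {"pdf", "newsletter"}),
    ("email4", {"macro", "docm", "invoice", "urgent"}),
    ("email5", {"pdf", "newsletter", "unsubscribe"}),
]
print(cluster(emails))
```

Once an analyst judges one email in a group, the same decision can be applied to the rest of that group.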
There’s also deep learning, a specific type of machine learning that uses multi-layer neural networks rather than classical statistical techniques to analyze data. Deep learning is particularly good at learning classifications from large amounts of data. Its drawback is limited explanatory power: it is often hard to say why something belongs in a particular grouping, such as why an executable is dangerous.
Challenges of ML and AI in Information Security
Machine learning faces a unique challenge in information security: in the effort to take data sets that are representative of malicious behavior and extract knowledge, algorithms must grapple with data that’s attempting to fight back. This setting is known as adversarial learning. The data is deliberately crafted to avoid being classified correctly, typically because it’s something malicious that’s attempting not to be seen as such. Malware authors learn what algorithms are looking for and either tweak their samples until the wrong classification is given or try to re-educate the model with misleading data, so that they can avoid detection and infect more users. In so doing, bad actors use what algorithms have learned against security professionals and, ultimately, users.
To account for this adversarial setting, security professionals need to develop machine learning techniques that look for outliers and false flags. They must be extra cautious about the process they use to source and characterize data. Otherwise, the results could be terrible.
Take the packing of an executable, for example. Lots of malware uses packing as a way to look different and avoid detection by antivirus software, while benign code uses packing far less often (for example, when the authors want to protect their intellectual property, as happens in video games). If you apply machine learning to programs without first unpacking them, the algorithm will learn that packing is bad and flag everything that’s packed as malicious, leading to a large number of false positives.
This example highlights the reality that AI and ML aren’t silver bullets. There are a lot of unrealistic expectations that AI and ML can do anything. But that’s not the case. As illustrated above, these technologies can’t automatically detect outliers and false positives without some form of human input, guidance, decisions, or intervention.
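The packing problem can be reproduced with a toy learner that sees only one feature, whether a program is packed. The sample counts below are invented for illustration and are not measurements of real-world packing rates.

```python
from collections import Counter, defaultdict

def train(samples):
    """Learn the majority label for each value of the is_packed feature."""
    counts = defaultdict(Counter)
    for is_packed, label in samples:
        counts[is_packed][label] += 1
    return {value: c.most_common(1)[0][0] for value, c in counts.items()}

# Made-up training set: most packed samples are malware, most unpacked ones are benign.
training = (
    [(True, "malicious")] * 90 + [(True, "benign")] * 10
    + [(False, "benign")] * 95 + [(False, "malicious")] * 5
)
model = train(training)

# A packed but harmless program, such as a video game, gets the label learned for "packed".
print(model[True])
```

Because packing is the only signal available, the model has no way to tell the protected video game from the malware. Unpacking first gives it features that describe what the code actually does.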
Even more importantly, there’s an ongoing tension between “precision” and “recall” for machine learning and artificial intelligence in information security. Recall, as it relates to information security, is the share of all malicious programs that the system actually identifies, whereas precision is the share of flagged samples that are truly dangerous. Usually, an algorithm tuned for high precision ends up letting a lot of malware through, because it flags only the samples it is most sure about in order not to make too many mistakes. The alternative, which is high recall with low precision, will generate lots of false positives in an attempt to protect against all threats. These problems are characteristic of the imprecise, statistical means of analysis found in machine learning algorithms. There will always be these types of errors. It’s an unsolvable dilemma.
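The tradeoff is easy to see with a small worked example. The sketch below computes precision as true positives divided by everything flagged, and recall as true positives divided by everything that is actually malicious. The scores, labels, and thresholds are hypothetical.

```python
def precision_recall(scores, labels, threshold):
    """Flag every sample scoring at or above the threshold, then measure both metrics."""
    tp = fp = fn = 0
    for score, is_malicious in zip(scores, labels):
        flagged = score >= threshold
        if flagged and is_malicious:
            tp += 1          # malware correctly flagged
        elif flagged:
            fp += 1          # benign sample flagged: a false positive
        elif is_malicious:
            fn += 1          # malware missed: a false negative
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall

# Hypothetical classifier scores and ground truth (True means malicious).
scores = [0.95, 0.90, 0.80, 0.60, 0.55, 0.40, 0.30, 0.10]
labels = [True, True, False, True, False, True, False, False]

print(precision_recall(scores, labels, threshold=0.85))  # strict threshold
print(precision_recall(scores, labels, threshold=0.35))  # lenient threshold
```

With the strict threshold, everything flagged is malicious but half of the malware is missed. With the lenient threshold, all of the malware is caught but a third of the alerts are false positives.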
Other limitations exist for AI and ML in information security. Overall, it’s impossible to encapsulate all the understanding of a human malware analyst and distill it into an AI system. There are just too many variables in the way. At the same time, the world is always changing, so machine learning algorithms need constant re-training and re-learning in order to stay current with the latest threat developments, trends, and capabilities.
How to Address ML and AI Security Challenges
The adversarial setting of AI and ML in information security, not to mention the technological limitations of security-related algorithms discussed above, reveals that artificial intelligence and machine learning aren’t enough to keep organizations safe. To train these technologies, security professionals need to supply them with hundreds of thousands of known samples that non-AI tools like signature-based detection technologies and heuristics utilities have deemed malicious — and continue doing so to keep the models up to date. It therefore makes sense to partner AI and ML with these other methodologies.
That is why NSX Network Detection and Response uses a combination of technologies to detect threats and network breaches. In addition to machine learning (as our preferred artificial intelligence technology), we draw upon the input of anomaly detection and expert systems to analyze millions of samples a day. Through this synthesis of information, NSX Network Detection and Response can provide a user with a complete picture of a breach that’s not distorted by false positives. More than that, our technology can tell users how severe each incident is by bringing seemingly disparate events together for greater context about an attack when it occurs. This is a crucial benefit for security professionals who don’t have time to deal with everything at once and who need to triage security alerts in order to focus on the highest-risk threats. In this way, NSX Network Detection and Response’s implementation of AI can save companies time and money, allowing security professionals to remediate each threat more quickly and completely.
Learn more about NSX Network Detection and Response and its AI-powered NDR solutions.