
Escape the malware blob with modern machine learning

(Image credit: JMiks / Shutterstock)

The growing complexity of files and objects, with an ever wider range of file formats and sizes, presents a significant challenge to modern-day organisations seeking to improve detection and response processes for advanced malware threats. These threats form what might be called a “malware blob”: they are packed deep within data, hidden several layers down and sometimes out of sight of typical detection engines. For the human analysts responsible for tracking and responding to threats, current detection engines offer only a “black box” perspective. In other words, they raise alerts but offer little to no context about what is happening within the “blob,” leaving analysts struggling to understand the risk and act on it effectively. To take down the “blob,” analysts need a more effective way to bridge the gap between detecting malware and understanding what triggered an alert in the first place. Recent innovations in machine learning techniques give security teams hope for better threat explanations and an improved ability to defend against malware’s growing complexity and volume.

The evolution of detection and machine learning

Machine learning and other anomaly detection capabilities were developed to extend malware detection beyond blacklists or databases of known attack signatures. Anomaly-based detection systems observed the network, built a profile of its normal behaviour, and flagged potential new threats whenever behaviour or characteristics deviated from that profile. However, while these systems began to uncover new zero-day threats, their predictions were missing a critical piece of the puzzle: the “WHY” behind the “WHAT.”
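The profiling approach described above can be illustrated with a minimal sketch. It assumes a single hypothetical metric (outbound connections per minute from one host) and a simple z-score test; real anomaly detection systems are far more elaborate. Note that the output is a bare yes or no, with nothing to explain why.

```python
from statistics import mean, stdev

def build_profile(baseline):
    """Profile normal behaviour as the mean and standard deviation of a metric."""
    return mean(baseline), stdev(baseline)

def is_anomalous(value, profile, threshold=3.0):
    """Flag a value whose z-score against the profile exceeds the threshold."""
    mu, sigma = profile
    if sigma == 0:
        # A perfectly flat baseline: any deviation at all is anomalous.
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical baseline: outbound connections per minute from one host.
baseline = [12, 15, 11, 14, 13, 12, 16, 14]
profile = build_profile(baseline)

print(is_anomalous(13, profile))  # within the normal range
print(is_anomalous(90, profile))  # far outside the normal range
```

The threshold of three standard deviations is an assumption chosen for illustration, not a recommended setting.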

While detection vendors produced a binary conviction or a malware classification type, the analyst was never told which characteristics or indicators of the threat were present, and so could not fully understand the conclusion. Quite simply, signature-based, AI-based and machine learning-based threat detection came with little to no context. As a result, analysts spent numerous hours working out why a file had been identified as malicious before they could respond effectively. For most analysts, the same scenario still plays out in today’s security operations centres.

To better understand how to improve machine learning-driven results, we must first understand that machine learning is, in essence, a technology that converts information and object relationships into numbers that try to quantify their properties. The very first step in implementing any such system is the conversion of human experience into a sequence of numbers that a machine understands and can learn from. Machines are built specifically to read and interpret numbers, but the people who are meant to use these models often feel limited and confused by ML/AI systems. The most common question asked of a machine learning expert is, “Why? Why did the machine present such a result?” Or more specifically for those in cybersecurity, “Why was this object detected as malicious?”

To answer the “why,” let’s start from the beginning. As mentioned, the very first step in implementing any such system is the conversion of human experience into a sequence of numbers that a machine understands and can learn from. But what if the first step were instead to develop a system that describes the data, or malware in this case, in a way that both human and machine can understand?

We refer to this approach as explainable machine learning. To succeed, it must be built on a static analysis system that converts objects into human-readable indicators that describe the intent of the code found within them. Regardless of what the analysed object is, whether a simple file or a compound “blob,” static analysis systems can, within just a few milliseconds, go through all its components and describe them in an approachable and easy-to-understand way.
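As a rough illustration of this idea, the sketch below walks a compound object and describes each component in plain language. Everything in it is an assumption made for the example: the `INDICATOR_RULES` table, the `describe` helper and the use of a ZIP archive as the “blob” are hypothetical stand-ins, not the indicators or unpacking logic of any real product.

```python
import io
import zipfile

# Hypothetical rules: a byte pattern mapped to a human-readable indicator.
INDICATOR_RULES = {
    b"URLDownloadToFile": "Downloads a file from the internet",
    b"CreateRemoteThread": "Creates a thread in another process",
    b"RegSetValueEx": "Writes to the Windows registry",
}

def describe(name, data):
    """Return {component path: [indicators]} for an object and all it contains."""
    if zipfile.is_zipfile(io.BytesIO(data)):
        # Compound object: note the container, then describe each member.
        report = {name: ["Is an archive that contains other objects"]}
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            for member in archive.namelist():
                report.update(describe(f"{name}/{member}", archive.read(member)))
        return report
    # Simple object: list every indicator whose pattern appears in the bytes.
    return {name: [text for pattern, text in INDICATOR_RULES.items() if pattern in data]}

# Build a small in-memory "blob" with two components (made-up contents).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("readme.txt", b"nothing to see here")
    archive.writestr("loader.bin", b"\x00URLDownloadToFileA\x00CreateRemoteThread\x00")

report = describe("blob", buffer.getvalue())
for path, indicators in report.items():
    print(path, "->", indicators or ["No indicators found"])
```

Because the output is a list of sentences rather than a vector of numbers, the same description can be read by an analyst and used as input features for a model.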

With a foundation of human-readable indicators, explainable machine learning can detect malware with results that are always interpretable by a human analyst. Quite simply, if a system makes a classification decision, it must be able to defend that decision with a description attached to any malware it detects. The human perspective comes first, and the machine can then serve as the ultimate companion.
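One way to picture a decision that defends itself is a simple additive model in which every indicator carries a weight toward a malware type, so the verdict can be returned together with the indicators that produced it. The sketch below is only an assumption-laden illustration: the `WEIGHTS` table, the `THRESHOLD` value and the `classify` function are hypothetical and do not represent how any particular vendor's model works.

```python
# Hypothetical weights: how strongly each indicator supports each malware type.
WEIGHTS = {
    "downloader": {
        "Downloads a file from the internet": 2.0,
        "Executes a file": 1.0,
    },
    "ransomware": {
        "Encrypts files": 2.5,
        "Deletes shadow copies": 2.0,
        "Enumerates files": 0.5,
    },
}
THRESHOLD = 2.5  # assumed minimum score for a malicious verdict

def classify(indicators):
    """Return (verdict, contributions), with contributions ranked by weight."""
    best_type, best_score, best_parts = "benign", 0.0, []
    for malware_type, weights in WEIGHTS.items():
        # Keep only the indicators that count toward this malware type.
        parts = [(name, weights[name]) for name in indicators if name in weights]
        score = sum(weight for _, weight in parts)
        if score >= THRESHOLD and score > best_score:
            best_type, best_score, best_parts = malware_type, score, parts
    return best_type, sorted(best_parts, key=lambda part: part[1], reverse=True)

observed = [
    "Enumerates files",
    "Encrypts files",
    "Deletes shadow copies",
    "Reads the system clock",
]
verdict, reasons = classify(observed)
print(verdict)
for name, weight in reasons:
    print(f"  {weight:+.1f}  {name}")
```

The ranked list shows which indicators contributed to the verdict and by how much, while indicators that played no part (here, reading the system clock) are left out of the explanation.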

Classifying threats

Today, most machine learning classifiers are built from the top down. Companies that implement them usually start by making simple classifiers that discern good from bad. Data scientists can then create millions of features extracted from millions of objects. Given enough compute power, machine learning models find optimal curves that split these datasets according to their labels. However, the results wind up losing all of their explainability in the process.

Knowing good from bad is certainly the crux of malware detection, but it isn’t the most important answer a detection system must provide. The second question that an analyst will pose to a machine learning expert is, “Exactly what did the system detect?” An analyst’s response to the threat any piece of malware poses is hugely dependent on the answer to this question.

This is why explainable machine learning systems must be built from the bottom up instead. At ReversingLabs we believe these systems must be built on the concept that declaring which malware type has been detected is their most important feature. Combined with human-readable indicators, machine learning explainability means that the result the system provides must be logical. Human analysts must therefore be able to read the list of provided indicators and agree that it correctly describes the functionality of the detected malware type. The same level of transparency is also critical when prioritising indicators, as they are not all created equal: only some of them contribute to the final malware detection. Understanding which indicators are at play is critical to the analyst’s decision-making process. This final piece of the puzzle builds trust in the accuracy of the classification system and underscores the value of exposing a model’s reasoning to human analysts.

Transparency in decision making

With explainable machine learning, interaction with indicators changes drastically. Transparency in the decision-making process highlights the most important malware family properties. That information is key for assessing the organisational impact that a malware infection has, and the starting point from which a response is planned.

Machine learning models are a great choice for the first line of defence. These signatureless heuristic systems do a great job of identifying whether something is malware, and even of pinpointing what type of malware it is. Their detection outcomes are predictive, not reactive, which makes it possible to detect new malware variants. Even brand-new malware families can be detected without models being explicitly trained to do so. In terms of reliability, they also require fewer updates than conventional signatures, and their effective detection rates decay more slowly.

Tomislav Pericin, co-founder and Chief Architect, ReversingLabs

Tomislav Pericin co-founded ReversingLabs in 2009 and serves as Chief Architect leading all aspects of the company's product and services strategy as well as implementation. He has been analyzing and developing software packing and protection methods for the last 15 years. As chief software architect, he has conceived and driven the development of such projects as TiCore, TitanEngine, NyxEngine and RLPack. Recently, he spoke at BlackHat, ReCon, CARO Workshop, SAS and TechnoSecurity conferences.