There are high expectations for Machine Learning (ML) in cybersecurity, and for good reasons. With the help of ML algorithms, we can sift through massive amounts of security events looking for anomalies: deviations from normal behaviour that are often indicative of malicious activity. These findings are then presented to the analyst for review and vetting, and the results of the analyst's determination are fed back into the system for training. As we process more and more data through the system, it evolves: it learns to recognise similar events and, eventually, the underlying traits of the malicious behaviour that we’re trying to detect.
The first part of this process, the anomaly detection, is called Unsupervised Learning. It is inexpensive and can be done at machine speed on large volumes of data, but it’s extremely noisy. The electronic signals we analyse, especially the ones that reflect human activity, fluctuate naturally, resulting in superficial anomalies. Forwarding these for analysis overwhelms analysts, creating alert fatigue and desensitising them to the real anomalies. The best-known example of this is the 2013 breach at Target, where the malware infection was detected by the monitoring software, but the alert was lost amongst the hundreds, if not thousands, of other alerts received by the analysts, leading to the compromise of over 40 million credit and debit cards.
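To make the noise problem concrete, here is a minimal sketch of one simple form of unsupervised anomaly detection: a z-score test against a historical baseline. The function name, the threshold of 3 and the login counts are illustrative assumptions, not part of any particular product.

```python
from statistics import mean, stdev

def zscore_anomaly(history, value, threshold=3.0):
    """Flag `value` if it lies more than `threshold` standard deviations
    from the mean of `history` (a hypothetical per-user baseline)."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        # A perfectly flat baseline: any deviation at all is unusual.
        return value != mu
    return abs(value - mu) / sigma > threshold

# Hypothetical daily login counts for one user (made-up numbers).
daily_logins = [10, 12, 11, 9, 10, 13, 11, 10, 12, 11]

print(zscore_anomaly(daily_logins, 12))  # within normal fluctuation
print(zscore_anomaly(daily_logins, 15))  # modest bump, yet it crosses the threshold
print(zscore_anomaly(daily_logins, 40))  # large deviation
```

With a tight baseline like this one, a busy day of 15 logins is flagged just as readily as 40. That is the kind of superficial anomaly that ends up in the analyst's queue.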
There are several ways to reduce false positives. Cross-domain correlation is used to look at the event from multiple angles: malicious activity might manifest itself through several anomalies, and aggregating them should produce a much stronger signal than any one of them in isolation. This type of analysis is typically done through complex threat modelling that can also correlate temporally separated events. Organising potential threats into the kill chain of the attack facilitates earlier detection through risk amplification along the chain, giving defenders a chance to prevent the later, most damaging stages of the attack. Another way of reducing false positives is peer group analysis. Peer groups are formed based on similar characteristics or activities, under the assumption that such grouping reflects common functionality and, therefore, common normal activities. When an individual's behaviour looks abnormal but stays within the norm for their peer group, the anomalies are likely false positives and can be ignored.
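The peer group check can be sketched in a few lines. This is only an illustration under simple assumptions: a single numeric metric, a z-score test, and hypothetical names and numbers throughout.

```python
from statistics import mean, stdev

def zscore(value, sample):
    """Standard score of `value` relative to `sample` (0.0 for a flat sample)."""
    sigma = stdev(sample)
    return 0.0 if sigma == 0 else (value - mean(sample)) / sigma

def suppressed_by_peers(value, own_history, peer_values, threshold=3.0):
    """True when `value` is anomalous against the user's own baseline
    but within the norm for the peer group, i.e. a likely false positive."""
    unusual_for_user = abs(zscore(value, own_history)) > threshold
    unusual_for_peers = abs(zscore(value, peer_values)) > threshold
    return unusual_for_user and not unusual_for_peers

# Hypothetical metric: privileged commands run per day.
own_history = [2, 3, 2, 4, 3, 2, 3]     # the user's usual level
sysadmin_peers = [25, 32, 28, 35, 30, 27]  # today's values in the new peer group
office_peers = [2, 3, 4, 3, 2, 3]          # today's values in a non-admin group

print(suppressed_by_peers(30, own_history, sysadmin_peers))
print(suppressed_by_peers(30, own_history, office_peers))
```

A user who has just moved into a system administration role looks wildly abnormal against their own history, but not against other administrators, so the alert can be suppressed. The same value measured against an office peer group stays flagged.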
The second part of the process, the training of the ML system, is called Supervised Learning. It requires labelled data: each event has to be labelled good or bad. The most common way to label the data is to have a human analyst vet the events, but it’s also the most expensive one, and it hardly matches the scale of the incoming data. Recent advances in generative modelling use human expertise to create high-level labelling functions, instead of case-by-case labelling, to produce large-scale weak supervision models, but we have yet to see them applied to cybersecurity. Sometimes the knowledge from one domain can be transferred to label a different dataset, for example where hashes of known malware files are used to label a dataset with behavioural characteristics of the malware. In general, though, supervised learning mainly relies on manually labelled datasets. Well-designed threat models can make the labelling process more efficient by reducing the number of false positive cases the analyst has to review. An aggressive training schedule, where an optimal mix of positive and negative cases is served to the analyst, can also expedite the learning process.
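The idea of labelling functions can be illustrated with a small sketch. Note the simplification: weak supervision systems typically fit a generative model over the labelling functions' outputs, whereas this example combines them with a plain majority vote. The event fields, the hash set and the rules themselves are all hypothetical.

```python
from collections import Counter

MALICIOUS, BENIGN, ABSTAIN = 1, 0, -1

# Hypothetical knowledge transferred from another domain: known malware hashes.
KNOWN_BAD_HASHES = {"abc123"}

def lf_known_bad_hash(event):
    return MALICIOUS if event.get("file_hash") in KNOWN_BAD_HASHES else ABSTAIN

def lf_off_hours_bulk_upload(event):
    # Made-up rule: more than 1 GB sent out before 06:00 looks suspicious.
    if event.get("hour", 12) < 6 and event.get("bytes_out", 0) > 1_000_000_000:
        return MALICIOUS
    return ABSTAIN

def lf_signed_binary(event):
    return BENIGN if event.get("signed") else ABSTAIN

LABELLING_FUNCTIONS = [lf_known_bad_hash, lf_off_hours_bulk_upload, lf_signed_binary]

def weak_label(event, labelling_functions=LABELLING_FUNCTIONS):
    """Majority vote over the functions that did not abstain; ties abstain."""
    votes = Counter(lf(event) for lf in labelling_functions)
    if votes[MALICIOUS] == votes[BENIGN]:
        return ABSTAIN  # no votes at all, or a tie
    return MALICIOUS if votes[MALICIOUS] > votes[BENIGN] else BENIGN

event = {"file_hash": "abc123", "hour": 3, "bytes_out": 2_000_000_000, "signed": False}
print(weak_label(event))
```

Each function encodes one piece of analyst expertise once, and can then be applied to every incoming event. Events on which the functions abstain or disagree are natural candidates for manual review.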
ML models have to be periodically updated to address concept drift (change in underlying relationships) and to incorporate new data points. The frequency of updates depends on the rate of data change, the magnitude of concept drift, accuracy requirements, as well as the size of the model and your computational capacity. User behaviour, for example, is fluid, and profiles have to be updated at least daily to capture new trends and reduce false positives. Supervised models that capture analyst feedback might require even more frequent updates, preferably near real-time, to prevent the analyst from having to review many similar cases. These requirements, as well as the volume of data to be analysed, are likely to push you from the comfort zone of batch learning to streaming analytics and online learning models.
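One simple way to keep a behavioural profile current without batch retraining is an exponentially weighted running mean and variance, updated once per observation. The sketch below is an assumption-laden illustration rather than a recommended design: the class name, the smoothing factor `alpha` and the sample stream are all hypothetical.

```python
class OnlineProfile:
    """Exponentially weighted running mean and variance of one metric.

    Older observations fade at a rate set by `alpha`, so the profile
    follows gradual changes in behaviour instead of freezing the past.
    """

    def __init__(self, alpha=0.1):
        self.alpha = alpha
        self.mean = None
        self.var = 0.0

    def update(self, x):
        if self.mean is None:
            self.mean = float(x)  # first observation seeds the profile
            return
        delta = x - self.mean
        self.mean += self.alpha * delta
        self.var = (1 - self.alpha) * (self.var + self.alpha * delta * delta)

    def zscore(self, x):
        if self.mean is None or self.var == 0:
            return 0.0
        return (x - self.mean) / self.var ** 0.5

profile = OnlineProfile(alpha=0.1)
for value in [10] * 50 + [20] * 50:  # behaviour shifts halfway through the stream
    profile.update(value)
print(round(profile.mean, 2))
```

A larger `alpha` adapts faster to new behaviour but also forgets faster, which makes it easier for slow, deliberate changes to slip into the baseline. The right value depends on the rate of change discussed above.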
Adding more training data seems like a sure way to improve model quality, but it works only for as long as the new data increases the diversity of the dataset, adding to the informational content of the model. Along with proper feature engineering, parameter tuning and overfitting control, diversity is a key factor in creating a well-generalised model. To increase diversity, we need to bring in data from different customers, industries, business sizes and regions. Due to the sensitivity of cybersecurity data, we cannot combine these datasets directly, but we can apply Federated Learning, securely averaging the individual model weights, to keep customer data private.
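The averaging step at the heart of Federated Learning can be sketched as follows. This shows only the weighted average of model weights. The secure part (a cryptographic aggregation protocol that hides each individual contribution) is deliberately left out, and all names and numbers are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Average per-customer weight vectors, weighting each customer
    by the number of training examples behind its model."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(n_params)
    ]

# Two hypothetical customers train the same model locally and share
# only the resulting weights, never the underlying security events.
weights_a = [1.0, 2.0]   # trained on 1,000 events
weights_b = [3.0, 4.0]   # trained on 3,000 events

global_weights = federated_average([weights_a, weights_b], [1000, 3000])
print(global_weights)
```

The raw events never leave the customer's environment, and only the weight vectors travel. The global model then benefits from the diversity of both datasets.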
As with any new technology, successful introduction of ML into cybersecurity builds on the credibility of its results. To establish this trust, you have to grow the program gradually from the ground up, progressing from simpler, easier-to-understand behavioural indicators to more complex hierarchical threat models and, finally, to complete kill chains of the attacks. Explaining predictions of complex ML algorithms is a non-trivial exercise: we had to develop an entirely new algorithm to explain predictions of our ensemble learning methods to the analyst.
Key takeaways for maximising the value ML can bring to your cybersecurity program:
- Create as many behavioural indicators as possible so as not to miss any signs of malicious behaviour.
- Use peer group analysis and hierarchical threat models to reduce false positives.
- Design kill chains for known attack scenarios and an anomaly catch-all bucket for unknowns.
- Collect all analyst feedback to label datasets for supervised and weakly supervised learning.
- Update your models in a timely manner to address concept drift.
- Strive towards well-generalised models through dataset diversification.
- Crawl, walk, run: build a transparent, credible and well-understood ML ecosystem.
Igor Baikalov, Chief Scientist, Securonix