
A Gentle Introduction to Apache Spark and Clustering for Anomaly Detection

This article was originally published on Technology.Info.
As part of our continuing strategy for growth, ITProPortal has joined forces with Technology.Info to help us bring you the very best coverage we possibly can.

Viewer Takeaways

  • How clustering can be applied to anomaly detection
  • How to solve common problems in anomaly detection solutions
  • How to build clustering models using Apache Spark
  • How to score new data as anomalous using a model in Spark

There has been an explosion of interest in Apache Spark as a new, alternative computing paradigm for Hadoop. It offers something to interest data scientists of all stripes, from an interactive REPL to distributed functional programming to implementations of standard machine learning techniques. Spark promises big scalability improvements over MapReduce for iterative algorithms like k-means clustering, which can be used to detect anomalous data in a huge data set.

This session will walk through a complete example of anomaly detection using Apache Spark and its MLlib subproject, as applied to the well-known network intrusion detection data set from KDD Cup ’99. It will give a taste of Scala (Spark’s native language), of Spark’s core concepts such as RDDs, and of using MLlib for k-means clustering, in real time on a Hadoop cluster. It will also introduce the concept of k-means clustering and show how a data scientist would iteratively improve a clustering in a session with Spark.
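To make the idea of clustering for anomaly detection concrete, here is a minimal sketch in plain Python rather than Spark or MLlib, so it runs without a cluster. It fits k-means on a handful of made-up points, then scores a new point by its distance to the nearest centroid. The function names, the toy data, and the rule of using the largest training distance as the threshold are all assumptions for illustration, not the session's actual code.

```python
import math
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    best = min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))
    return best, math.sqrt(squared_distance(point, centroids[best]))


def fit_kmeans(points, k, iterations=20, seed=0):
    """Plain k-means: assign points to the nearest centroid, then re-average."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx, _ = nearest(p, centroids)
            clusters[idx].append(p)
        # Move each centroid to the mean of its members; keep it if empty.
        centroids = [
            tuple(sum(dim) / len(members) for dim in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    return centroids


def anomaly_score(point, centroids):
    """Distance to the nearest centroid; larger means more unusual."""
    return nearest(point, centroids)[1]


# Hypothetical toy data: two tight groups of "normal" points.
normal = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centroids = fit_kmeans(normal, k=2)

# Assumed rule: anything farther out than every training point is anomalous.
threshold = max(anomaly_score(p, centroids) for p in normal)

for candidate in [(0.1, 0.1), (10.0, -3.0)]:
    flagged = anomaly_score(candidate, centroids) > threshold
    print(candidate, "anomalous" if flagged else "normal")
```

In this toy setup the point far from both groups should be flagged, while the one sitting inside a group should not. On a real data set the fitting step would be handed to a distributed implementation such as MLlib's k-means, and both the number of clusters and the threshold would need the kind of iterative tuning the session describes.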
