Analysing data analytics

Michael Upchurch talks about the challenges of Big Data, and possible solutions.

1. How big is the Big Data challenge?

In 2008, all the systems in the world generated less than 1 billion terabytes of data. In 2020, it is estimated that 44 billion terabytes will be generated and 260 billion terabytes of data will be stored. That’s a lot of data, but the challenge is not how to store the data; it’s how to use the data.

Companies will struggle to leverage analytics using traditional approaches: too much data, too many systems, not enough processing power (in legacy platforms) and too few people to keep up with demand. It’s not only a big challenge; it’s a complex one.

2. What’s the background to Fuzzy Logix?

Fuzzy Logix was founded in 2007 based on real frustration with traditional analytics platforms. Partha Sen and I were at Bank of America and we’d get questions like “Can you tell me the intra-day value of all our market positions?” or “How do we run analytics on data too big to fit into our SAS server?” We also realised that we spent about 80% of our time moving data from our data warehouse to some kind of analytics server, and only 20% of our time actually analysing it. We knew there had to be a better way. And so Fuzzy Logix was founded, based on the concept of “why move the data to the analytics if you can move the analytics to the data?”

We also saw three emerging trends: the use of analytics in industries such as healthcare and retail; the improvements in the processing power of database platforms; and rapid data growth. Based on this, we decided to build models that would be valuable for multiple industries, but also optimised to take advantage of parallelism and platform capabilities so we could analyse large sets of data efficiently.

3. What’s the traditional approach to data analytics?

Historically, you had to move the data from where it was stored to some other physical analytics platform because the data was in one place and models ran in a different place.  When data was small, that was OK, but as it grew and the questions became more complex (requiring a broader set of data), the penalties for having a multi-tiered environment grew.  

The penalties can be severe: high cost (expensive hardware and software); slow turnaround, because traditional analytics environments are much smaller than the data source; time wasted moving too much data; duplicate data storage and security infrastructure; and high annual recurring costs.

Traditional environments also prevent business users from running their own analytics, because you need specialised knowledge and system access even to call a fully developed production model.

4. What’s different about the way Fuzzy approaches the problem?

We wanted to create a simple yet elegant and scalable solution.  

Simple, in that our analytics models install in your existing database or Hadoop environment in less than 1 day, inherit the security of that environment and eliminate duplicate data.   

Elegant, because you run the models using SQL - the most common language of the data scientist. Also, the models can be embedded into existing BI tools that don’t have analytics, such as Tableau.  

Scalable, because we wrote our models for optimised performance on big data. We leverage both data and computational parallelism, which means processes that take days on traditional analytics platforms take hours, those that take hours take minutes and those that take minutes can now run in seconds.
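To make “moving the analytics to the data” a little more concrete, here is a minimal SQL sketch of what an in-database workflow can look like. The function and table names (build_linreg_model, score_linreg, customer_features) are hypothetical placeholders rather than Fuzzy Logix’s actual API; the point is simply that the model is fitted and scored where the data already lives, so nothing is exported to a separate analytics server.

    -- Hypothetical in-database workflow; function and table names are illustrative only.
    -- 1. Fit a regression model directly against the warehouse table (no data export).
    SELECT build_linreg_model(
             'churn_model',            -- name under which the fitted model is stored
             'customer_features',      -- source table already in the database
             'churned',                -- dependent variable
             'tenure, spend, visits'   -- predictor columns
           );

    -- 2. Score every row in place, letting the database parallelise the work across its nodes.
    SELECT customer_id,
           score_linreg('churn_model', tenure, spend, visits) AS churn_score
    FROM   customer_features;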

5. Will this new approach impact the role of the Data Scientist?

Data scientists will be able to build models using very large amounts of data and many variables. These models will run 10X to 100X faster than traditional analytics, which expedites insight. For example, one of our customers had a table with 486 billion rows and their analytics took 30 hours to complete. Once we moved their process to run in-database, it took 17 minutes. Let’s say you are in the model build stage and are trying a bunch of models and parameter settings. The ability to ask a question every 17 minutes vs. every 30 hours creates a huge leap in model building efficiency.

An additional benefit is that a data scientist can deploy models to business users. Too many data scientists spend too much time running the same models over and over. Our models are run using SQL, so a data scientist could build models and deploy them in existing applications or BI tools and let business users run them. And let’s be clear, almost no business users will ever be able to select a model, but they can run models, which frees up data scientists’ time and gives the business immediate results.
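As a rough illustration of that deployment pattern: because scoring is just SQL, a data scientist could publish a model’s output as an ordinary database view, which a BI tool such as Tableau then queries like any other table. The object and function names below (customer_churn_scores, score_linreg) are hypothetical, carried over from the sketch above, and are meant only to show the shape of such a deployment.

    -- Hypothetical: expose model output as a plain view, so business users never touch the model itself.
    CREATE VIEW customer_churn_scores AS
    SELECT customer_id,
           region,
           score_linreg('churn_model', tenure, spend, visits) AS churn_score
    FROM   customer_features;

    -- A BI tool or business user queries the view with ordinary SQL.
    SELECT region, AVG(churn_score) AS avg_churn_risk
    FROM   customer_churn_scores
    GROUP  BY region;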

6. Tell us about the role of Chief Data Officer

Given the increase in the volume, complexity and types of data, the role of the Chief Data Officer is incredibly important; even more so when data is being used for analytics. First, many data scientists are better at science than data, so anything the CDO can do to cleanse, fuse or organise data helps the data scientist move faster and produce more accurate results. The quality of the data is super important because the data scientist only bears part of the burden of its accuracy (controversial, I know, but seriously, are you going to ask your PhD in statistics to trace data lineage through the 53 source systems that feed the EDW?). The Chief Data Officer and their team are key to the success of your analytics initiatives.

7. Which industries will stand to benefit most?

Healthcare, retail and supply chain all have tremendous opportunities to leverage more data science.  All three industries will see a dramatic rise in information from Internet of Things devices and have historically trailed other industries in the use of analytics.  Imagine having historical health, sales or performance data paired with real-time sensor data from devices such as cell phones, glucose meters and all the devices in the home that have an IP address.  We’re in very early days with tremendous upside ahead.  

8. Are there any specific use cases you can share?

We worked with one of the world’s largest grocery chains to improve their model performance. Previously, it took a week to produce forecasts at the region and product-category level. We moved them from a traditional architecture to in-database processing, and now they can run store-SKU-level models in 3 hours. With a forecast that wasn’t previously possible, they are now saving over £10M annually.

In healthcare, because our customer did a phenomenal job with data management, we were able to find hidden opiate abusers. We started with 742 predictors and reduced the number to 44. We then refined the models further until they were happy with the accuracy. The entire process, end-to-end, was 5 days. The turnaround time to change the model parameters and rerun it is less than 5 minutes.

Both are possible because of the combination of in-database analytics, which prevents data movement, and our parallelised but easy-to-use algorithms for processing.

9. What advances can we expect to see in the analytics space in the next 12 months? 

For analytically mature companies, I’m seeing a move towards more complex models, the use of new kinds of data and the analysis of data with increased granularity.  For example, companies are starting to use complex models for micro-segmentation for tens of millions of customers.  Solving these kinds of problems will require massively powerful solutions that include things like in-database and GPU-based analytics.

I also see an infatuation with machine learning as the next magic solution. I think we’ll see some less-than-optimal results, because no black-box model is capable of solving all problems. Nonetheless, I do believe there will be some successes, and those kinds of tools can help companies get started with analytics.

Michael Upchurch, COO and founder, Fuzzy Logix