Making sense of big data and its role in your business

Will security and compliance issues put big data developments on hold?

Large organisations around the world are working to develop and integrate big data analytical facilities alongside their existing business intelligence structures. These initiatives are motivated in nearly equal parts by the conviction that new insights and opportunities are buried in the avalanche of new data, by the knowledge that conventional business intelligence systems are unequal to the task, and by the fear that competitors will be first to master and exploit the available new data streams.

Because the phenomenon of big data analytics is only a few years old, few standards exist to ensure that these new systems and the analytical activities they support are successfully integrated into the current frameworks that ensure governance, compliance and security. One of those critical policy domains – data security – has the potential to hinder many of these developments and block the realisation of their benefits if not adequately addressed.

This guide presents a comprehensive solution that cost-effectively ensures the security of sensitive information in big data environments without impairing their operational flexibility or computational performance.

Defining big data

Most accounts now distinguish big data from the established domain of enterprise management information by three characteristics first noted by Gartner: volume, velocity and variety.

Volume - Very large data sets – think terabytes and petabytes of information – are not a new phenomenon, but the rise of ecommerce and social media, the global distribution of machine intelligence, business networks and personal electronic devices, and the exponential growth of commercial and scientific sensor networks are making them commonplace. Many organisations now hold volumes of data that exceed the ability of conventional methods to organise, search and analyse in meaningful time intervals.

Velocity - One reason these data sets are so large is their unprecedented growth rate. In a recent Harvard Business Review article, Andrew McAfee and Erik Brynjolfsson report that:

  • As of 2012, approximately 2.5 exabytes of data are created every day, a number that is expected to double roughly every 40 months.
  • More data now crosses the Internet each second than was stored in the entire Internet just 20 years ago.
  • It is estimated that Wal-Mart collects 2.5 petabytes of customer transaction data every hour.

Variety - Big data includes a wide and growing range of data types, many of them new: text messages, social media posts, e-commerce click streams, GPS location traces, machine logs and sensor measurements. Structured, unstructured or semi-structured, much of this data is incompatible with the relational database repositories at the heart of most business intelligence facilities.

Rapid rise, quick commercialisation

Until 2004, the three Vs seemed to put big data beyond the reach of practical commercial analysis. That’s when Jeff Dean and Sanjay Ghemawat published their seminal paper on the MapReduce programming model, developed at Google for parallel, distributed processing of large data sets on scalable commodity clusters. The model was quickly embraced by the open source community, leading to the Apache Hadoop project and the development of a complete software framework for distributed analytical processing. This success promptly launched start-ups like Cloudera and Hortonworks to commercialise the new technologies.
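The MapReduce model the paper describes can be illustrated with the classic word-count example. The sketch below is a single-process toy, not Hadoop: a map phase emits key/value pairs, a shuffle groups them by key, and a reduce phase aggregates each group – the same three stages a real framework distributes across a cluster.

```python
from collections import defaultdict

# Toy, single-process illustration of the MapReduce model.
# Real frameworks (Hadoop, etc.) run these phases in parallel on a cluster.

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in the document."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts collected for a single word."""
    return (key, sum(values))

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["big data moves fast", "big clusters process big data"])
print(counts["big"])   # 3
print(counts["data"])  # 2
```

Because map and reduce are pure functions over independent key groups, the framework can split the input across thousands of commodity nodes and merge the results – which is what makes the approach scale to the data volumes described above.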

The combination of big data analytics based on the MapReduce programming model, open source software, and commodity hardware clusters offers some extremely appealing business benefits for organisations with large data sets at their disposal, including:

  • The ability to derive business insights and competitive advantages from data streams that cannot be addressed with conventional BI tools – the ability to ask questions that were previously unanswerable.
  • The ability to respond more quickly and intelligently to changing business environments and emerging opportunities.
  • A game-changing cost differential of up to 20:1 relative to proprietary business intelligence solutions.

The conspicuous success of online companies like Google, Yahoo and Facebook in using big data techniques to manage and query very large data volumes has stimulated intense interest and accelerating adoption in other industries. While up to 45 per cent of annual investment remains targeted at social media, social network and content analytics, the majority of spending now spans a diverse range of market sectors, including financial services, communications, pharmaceuticals, healthcare and government. Each of these segments handles its own sensitive data types – social security and national ID numbers, payment card account numbers, personal health records – each with its own set of security mandates.

Data security: A sinkhole in the big data roadmap

As sensitive data flows into new big data facilities, many of them still at the pilot stage, security becomes an increasingly urgent problem for business sponsors eager to bring them into production.

Unless these systems can be rendered compliant with the full range of global data security and privacy regulations, their potential business impacts may remain a matter of purely academic interest.

But data security in big data environments is no small challenge. Their processing and storage clusters typically encompass hundreds or thousands of nodes. The software stack is entirely open source, with many alternatives for most key components, most of them still in very active development. Compared to a proprietary business intelligence infrastructure, a big data facility presents a large attack surface with all the vulnerabilities associated with rapid, ongoing change.

The one similarity is that administrators and business users alike are extremely sensitive to any security overhead on query response times.

Existing security solutions: A gap analysis

In these environments, none of the conventional approaches to system and data security are satisfactory or sufficient:

Perimeter security and user access controls are essential starting points but inadequate on their own. Even the best solutions are sometimes defeated by today’s blended, persistent threats.

File-system encryption only protects data at rest. Sensitive data is exposed to theft or compromise as soon as it is decrypted for transmission and use by an application. Decryption on access is required because the encryption process destroys the original data formats, rendering the data useless to applications without extensive recoding. This approach also introduces significant processing overhead for the continuous encryption on write and decryption on read.
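The format-destruction problem is easy to demonstrate. The sketch below uses a deliberately toy XOR "cipher" (not a real encryption algorithm) purely to show the effect: once a 16-digit card number is encrypted, the output is opaque bytes that no longer fit a 16-digit column or pass existing validation logic, so every consumer must decrypt first.

```python
import hashlib

# Toy illustration only – NOT a real cipher. The point is the format change:
# conventional encryption turns a 16-digit card number into opaque bytes,
# so any column, parser or query expecting 16 digits can no longer handle
# the protected value without decrypting it first.

def toy_encrypt(plaintext: str, key: bytes) -> bytes:
    # XOR the data with a hash-derived keystream, purely for illustration.
    keystream = hashlib.sha256(key).digest()
    return bytes(b ^ keystream[i % len(keystream)]
                 for i, b in enumerate(plaintext.encode()))

pan = "4532015112830366"           # example card number
ciphertext = toy_encrypt(pan, b"demo-key")
cipher_hex = ciphertext.hex()

print(pan.isdigit(), len(pan))     # True 16 – fits the original schema
print(len(cipher_hex))             # 32 – no longer fits a 16-digit field
```

A production system would use a vetted algorithm such as AES, but the schema-breaking effect shown here is the same.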

Data masking is typically a one-way conversion technique that destroys the original data values. It is useful for de-identification in testing and development, but problematic in many analytic use cases. For example, if masked data is used in a financial fraud detection application, it may be possible to identify suspicious transactions, but not to quickly recover the relevant user and account identities for corrective action. Data masking also requires the creation and maintenance of large lookup tables, which quickly become a significant management project in their own right.
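A hypothetical sketch makes the lookup-table burden concrete. Here each account number is replaced by a random surrogate; because the substitution is random, the original can only be recovered if the mapping table is retained and protected – the table itself becomes a second sensitive data store to manage.

```python
import secrets

# Hypothetical sketch of deterministic masking backed by a lookup table.
# Masking is one-way: without the retained table, surrogates cannot be
# reversed. Keeping the table secure and synchronised is exactly the
# management burden described above.

class MaskingTable:
    def __init__(self):
        self._forward = {}   # original -> surrogate
        self._reverse = {}   # surrogate -> original (the lookup table)

    def mask(self, account: str) -> str:
        if account not in self._forward:
            surrogate = "".join(secrets.choice("0123456789") for _ in account)
            self._forward[account] = surrogate
            self._reverse[surrogate] = account
        return self._forward[account]

    def unmask(self, surrogate: str) -> str:
        # Only possible because the table was kept; pure masking discards it.
        return self._reverse[surrogate]

table = MaskingTable()
masked = table.mask("4532015112830366")
print(len(masked))                 # 16 – format preserved by construction
print(table.unmask(masked))        # 4532015112830366
```

In the fraud-detection scenario above, an analyst holding only the masked stream has no `unmask` path at all; the table grows with every distinct value and must be replicated wherever recovery is needed.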

Needed: Data security that’s high strength, low impact

What’s needed to ensure the viability of big data analytics is a data-centric solution that:

  • Protects sensitive data wherever it is stored, moved or used, with no exposure between storage, transmission and processing.
  • Enables compliance with most global data security, privacy and data residency mandates.
  • Integrates quickly and affordably with existing infrastructure and adapts flexibly to new analytical applications and data sources.
  • Allows quick policy based retrieval of original data values by properly authorised and authenticated users and applications.
  • Imposes no significant overhead on analytical performance.
  • Preserves the formats and referential integrity of protected data, so that existing analytics and ad hoc queries don’t need to change.
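The requirements above point toward format-preserving, key-based protection. The sketch below is a toy stand-in for that idea – it is not a standardised format-preserving encryption mode such as NIST FF1, and not any vendor's product. It shifts the leading digits of a card number by a keyed pseudo-random offset, leaving the last four digits in the clear as a tweak, so the protected value keeps its length and character set and remains reversible for authorised holders of the key.

```python
import hmac
import hashlib

# Toy sketch of format-preserving protection (NOT a real FPE algorithm
# such as NIST FF1). The protected value keeps its length and digit-only
# character set, so it can flow through existing schemas and queries,
# while key holders can recover the original on demand.

KEY = b"demo-key-for-illustration"   # hypothetical key for the sketch

def _offset(key: bytes, tweak: str, digits: int) -> int:
    """Derive a keyed pseudo-random offset from the in-the-clear tweak."""
    mac = hmac.new(key, tweak.encode(), hashlib.sha256).digest()
    return int.from_bytes(mac[:8], "big") % (10 ** digits)

def protect(pan: str) -> str:
    head, tail = pan[:-4], pan[-4:]          # keep last four in the clear
    shifted = (int(head) + _offset(KEY, tail, len(head))) % (10 ** len(head))
    return str(shifted).zfill(len(head)) + tail

def recover(token: str) -> str:
    head, tail = token[:-4], token[-4:]
    original = (int(head) - _offset(KEY, tail, len(head))) % (10 ** len(head))
    return str(original).zfill(len(head)) + tail

pan = "4532015112830366"
token = protect(pan)
print(len(token) == len(pan), token.isdigit())   # True True
print(recover(token) == pan)                     # True
```

Because the token is the same shape as the original, existing columns, joins and ad hoc queries run unchanged on protected data, and only a policy-gated `recover` call – in a real system, a hardened key-management service – ever exposes the live value.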

Mark Bower is Vice President of Product Management at Voltage Security.