Investing in a data lake? Shore up the Big Data gateway

Where do you store your most important data these days? Does it all fit there?

Because companies manage and use data with greater volume, variety, and velocity than in the past, existing data architecture is evolving beyond traditional databases, data stores, data warehouses, and the like into a more unfiltered repository known as the data lake.

Given that Forrester predicts a heavy convergence of business analytics and big data in 2015 to achieve greater customer insight, and with the big data market projected to grow nearly 20 per cent a year for the next decade, the need for quickly accessible data flows will only grow as well.

This demand for increased agility and accessibility for information analysis drives the data lake movement, and for a number of good reasons. But that’s not to say that SQL databases, enterprise data warehouses, and the like will be immediately replaced by data lakes. Rather, these tools are likely to be augmented by them, as data sources, data sinks, or both.

But the reality is, these “big data” projects aren’t just any projects. These are massive undertakings spanning various complex system infrastructures. Organisations leveraging big data are discovering that integration support for the data exchange requirements (file types, volumes, connectivity, file exchange tools, etc.) needed for such analytics and business intelligence projects is a must-have.

At this scale, traditional ingestion approaches often take months to yield results, while data lakes unlock previously unimagined extraction power. Managing file ingestion from a data lake into Hadoop and other big data ecosystems, and centrally governing the movement and format in the right sequence for the right purpose with a powerful big data gateway, delivers a unique competitive advantage.

Benefits of a Data Lake

By capturing largely unstructured data at low cost and storing various types of data in the same place, a data lake:

  • Breaks down silos and routes information into one navigable structure. Data pours into the lake and lives there until it is needed, when it flows back out again.
  • Gathers all information into one place without first qualifying whether a piece of data is relevant or not.
  • Enables analysts to easily explore new data relationships, unlocking latent value. Data distillation can be performed on demand based on business needs, allowing for identifying new patterns and relationships in existing data.
  • Helps deliver results faster than a traditional data approach. Data lakes provide a platform to utilise heaps of information for business benefits in near real-time.

So in an era where business value is based largely on how quickly and how deeply you can analyse your data, connecting your organisation to a modern data lake facilitates lightning-quick decision-making, advanced predictive analytics, and agile data-based determinations.

Consider this: What if you had to store clothes in your closet based on the type of occasion you would wear each to, and you could only wear the assigned articles for the assigned occasion? You would have to plan for every type of occasion you might need to attend before ever picking up a hanger, pre-designating items for any and all business, social, casual, and other ad hoc scenarios. That would be a pretty rigid model.

The data lake grants you the “schema on read” functionality (more on this below) that we know and love in real life, where we’re able to analyse the occasion and the entire data set (i.e., clothing line) and make a decision on how to proceed. While this decide-the-schema-later approach makes database design a lot easier on implementation, it does impose harder work on the computing resources to compensate in production. But that’s okay: Today’s more powerful computing resources are up to the job.
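The closet analogy maps cleanly onto code. Here is a minimal schema-on-read sketch in plain Python (the field names and the in-memory "lake" are hypothetical stand-ins for a real object store): events are written exactly as they arrive, and each consumer applies its own schema only when it reads.

```python
import json

# The "lake": an append-only store of raw records.
# (A list stands in for files or object storage.)
lake = []

def ingest(raw_event: dict) -> None:
    """Write the event untouched -- no upfront schema, nothing discarded."""
    lake.append(json.dumps(raw_event))

ingest({"user": "u1", "item": "shoes", "price": 59.0, "store": "NYC"})
ingest({"user": "u2", "item": "hat", "price": 19.0, "channel": "mobile"})

def read_with_schema(fields):
    """Schema on read: project only the fields this question needs,
    decided at query time, long after ingestion."""
    for line in lake:
        event = json.loads(line)
        yield {f: event.get(f) for f in fields}

# Two analysts read the same raw data through different "schemas".
sales_view = list(read_with_schema(["item", "price"]))
channel_view = list(read_with_schema(["user", "channel"]))
```

Note that neither view had to exist when the data was ingested; a third question tomorrow simply defines a third projection over the same raw records.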

Data Lake Drivers

An enhanced customer experience commonly drives data lake investment across industries, but benefits of increased analytics specific to certain verticals include:

  • Retail: Aggregated mobile, e-commerce, and point-of-sale transactions can be analysed to help grow revenue via enhanced sales and marketing insights into a company’s expansive customer base. This type of analysis is instantly seen in Amazon’s “People Also Bought” recommendation section, but quickly analysing data across business dimensions (store location, time of year, day of week, customer demographics, etc.) can reap huge benefits for any retailer.
  • Manufacturing: Manufacturers always seek to increase volume, reduce cost, and improve quality. Drug manufacturers, for instance, take the data on one product, whether from the labs, process managers or other sources, to get faster turnarounds on test results.
  • Media/telecommunications: Think about how much user data a company like Netflix would have to manage from its more than 60 million users. From TV, Internet, and mobile viewing patterns to social media updates and customer service records, Netflix can analyse viewer tendencies to increase revenue by providing recommendations, booking additional original programming, and streamlining subscription models.
  • Healthcare: Health systems maintain and analyse millions of records for millions of people to improve ambulatory care and patient outcomes. Quick insight and action on such records also can improve operational efficiency and enable an accountable care organisation.
  • Logistics: Transport companies manage geolocation information to map more fuel-efficient routes and improve employee safety. Airlines and railroads amass various EDI data to deliver goods (and people, for that matter) more efficiently and at a lower cost.
  • Financial services: Batch processing can help monitor peak activity, enhance fraud accuracy, and reduce organisational risk by analysing credit-worthiness from a variety of sources.
  • Law enforcement: Law enforcers can compare MOs across multiple databases (local, state, federal) and case management tools to solve crimes faster.

But some concerns surround the data lake concept, including security, access, and the scalability required to accommodate future streams while retaining all current data for future analysis. Essentially, companies only get out what they put into data management, and an optimised delivery gateway ensures a proper return on data lake investment.

The Requirements

“Purpose-built systems” with carrier-grade scalability, secure data transfer, and connectivity to non-traditional storage repositories (Hadoop, NoSQL, software-defined storage, etc.) can solve the security, access-control, and scalability challenges of data lakes, which are better suited than traditional stores to today’s less structured data.

The modern big data gateway, which varies from traditional ETL (Extract, Transform, Load) architectures, supports the “schema on read” data lake principle, meaning organisations do not need to know how they will use the data when storing it.

Schema-on-read advocates keeping raw, untransformed data. Without transformation on ingestion, companies can move faster, creating new acquisition feeds quickly without worrying about mapping, gaining data agility now while asking the compelling data-use questions later.

Additionally, transformation often discards supposedly worthless information that later turns out to be the dark matter comprising the bulk of your information universe, which makes the data lake’s schema-on-read approach all the more valuable.
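The “dark matter” risk can be made concrete with a short sketch (hypothetical fields, plain Python): a schema-on-write pipeline transforms on ingestion and silently drops a field, so a question asked months later can no longer be answered from the warehouse, while the raw copy in the lake still can.

```python
import json

raw_events = [
    {"order_id": 1, "amount": 40.0, "referrer": "newsletter"},
    {"order_id": 2, "amount": 25.0, "referrer": "search"},
]

# Schema on write: transform on ingestion, keeping only the fields
# the original design deemed relevant. 'referrer' is discarded.
warehouse = [{"order_id": e["order_id"], "amount": e["amount"]}
             for e in raw_events]

# Schema on read: the lake keeps every record exactly as it arrived.
lake = [json.dumps(e) for e in raw_events]

# Months later the business asks: which referrers drive orders?
# The warehouse cannot answer -- that field no longer exists there.
assert "referrer" not in warehouse[0]

# The lake can: parse the raw records and project the field now.
referrers = [json.loads(line)["referrer"] for line in lake]
```

The trade-off, as noted above, is that the parsing work deferred at ingestion must be paid at read time, which today’s computing resources can absorb.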

Some other key big data gateway characteristics to support a data lake strategy include:

  • Elastic scalability: The gateway will need to scale to handle some massive content: think zillions of little messages alongside periodic enormous files. Can one solution handle both extremes?
  • Security and governance: You’ll need to secure and track data feeds originating from both within and beyond your enterprise. If you derive critical business value from the end result, you’ll need governance, tracking, auditing, and alerts on these data flows and feeds to ensure continuity of service.
  • Easy. Easy: Easy to acquire. Easy to implement. Easy to operate. Easy to adapt and add on. Much like the decide-schema-later aspects of your big data initiatives, you’ll need a big data gateway solution that affords you the flexibility to interoperate with any communication protocol, digital signatures, data formats, OS platform, and more so you can decide later how things will be integrated. And it’s got to be easy and quick to implement (without consultants) so you can stay agile and deliver business value faster.
  • Collaborative community management: New technology blurs the lines between traditional data integration and human interaction, and dynamic interaction between content and people must occur at a lot of stages. Collaborative management functionality enables a central platform for easy data routing and expanded workflow capabilities.

Shoring up these functions significantly increases the benefit gain of a data lake investment.

The Solution

Data lakes deliver added ability to monitor and analyse the historical performance of organisations to better achieve future results, but the promise of improved analytics and business agility is broken when data is not easily accessible.

Companies undoubtedly must have that connected data. After all, a data lake with stagnant (or worse – non-existent!) information flows becomes more of a data swamp.

Pave the road for your advanced data initiatives with a big data gateway solution built for the access and control of today’s modern enterprise.

John Thielens, Vice President of Technology, Cleo