How to know when your data is "big"

Most companies today run some level of data operations, and many wonder whether they have crossed the threshold into big data. The question is hard to answer because big data isn't defined by any single variable, and everyone has their own idea of what counts as big data.

Answering this question may be difficult, but it is important nonetheless. Big data platforms, such as Apache Hadoop, boast incredible storage and computing power. Tools of this caliber are far superior to traditional data processing applications. Given the tremendous accessibility of such tools, businesses are taking a closer look at their data needs.

So let's start by looking at a few general characteristics of businesses that use big data.

1. They have large amounts of data.

2. They have a large variety of data.

3. They collect data from many different sources.

4. They hold on to their data for extended periods of time.

A company that uses big data can typically be characterized by three or more of these four attributes. Let's clarify each one further.
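As a rough illustration, the rule of thumb above can be written as a small self-assessment. This is only a sketch: the `DataProfile` class, its field names, and the example company are hypothetical, and deciding what counts as "large" or "many" is still a judgment call for each business.

```python
from dataclasses import dataclass


@dataclass
class DataProfile:
    """Hypothetical yes/no answers to the four characteristics listed above."""
    large_volume: bool      # 1. large amounts of data
    large_variety: bool     # 2. large variety of data
    many_sources: bool      # 3. data collected from many different sources
    long_retention: bool    # 4. data kept for extended periods of time


def matching_attributes(profile: DataProfile) -> int:
    """Count how many of the four characteristics apply."""
    return sum([
        profile.large_volume,
        profile.large_variety,
        profile.many_sources,
        profile.long_retention,
    ])


def looks_like_big_data(profile: DataProfile, threshold: int = 3) -> bool:
    """Apply the rule of thumb: three or more attributes suggest big data."""
    return matching_attributes(profile) >= threshold


# Hypothetical example: a retailer with lots of varied, multi-source data
# that does not keep it for long.
retailer = DataProfile(
    large_volume=True,
    large_variety=True,
    many_sources=True,
    long_retention=False,
)
print(looks_like_big_data(retailer))
```

The threshold is left as a parameter so a stricter or looser reading of the rule is easy to try.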

Large amounts of data

The three most common database sizes talked about today are one gigabyte, one terabyte and one petabyte. These database sizes vary greatly in terms of both volume and functionality.

First, let's look at the difference in volume. If you were to make a visual comparison between the three, one gigabyte would be the size of a 10-inch dinner plate. The next level up, a terabyte, would be the size of two and a half football fields. Finally, a petabyte would be like driving across the state of California. Take a look at this visual that illustrates the striking difference in capacities:

Photo: Relative Scale of Databases from The Executive's Guide to Big Data and Apache Hadoop by Robert D. Schneider

In addition to their storage capacities, each of these database sizes is suited to different kinds of data work.

One gigabyte

A one gigabyte database generally consists of transactional data stored in relational databases. It typically uses SQL to store and retrieve data, and is not a good choice for handling unstructured data.

One terabyte

Data warehouse databases typically start at one terabyte and can grow to a few dozen terabytes. They often consist of an aggregate of many smaller databases. This size is ideal for enterprise analytics and business intelligence applications.

One petabyte

Petabyte databases work with unstructured data on a regular basis, and are cost-effective solutions for managing the sheer volume of data collected and stored by enterprise customers.

In summary, if your database is less than a terabyte, using a Hadoop platform to analyze your data may not be the best solution. If you do have a terabyte or more of data, also consider your needs in terms of data functionality. Your database size could be impacting those capabilities.
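To make the size guidance concrete, here is a minimal sketch that sorts a database size into the three tiers discussed above and applies the one-terabyte rule of thumb. It assumes decimal units (1 TB = 1,000 GB, 1 PB = 1,000 TB), which the article does not specify. The function names and the sample databases are hypothetical, and the hard cutoffs simplify what are really fuzzy boundaries.

```python
# Decimal (SI) unit sizes in bytes. Assumption: the article does not say
# whether it means decimal or binary units.
GB = 10**9
TB = 10**12
PB = 10**15


def size_tier(num_bytes: int) -> str:
    """Map a database size to the rough tier described in the article."""
    if num_bytes < TB:
        return "gigabyte scale"
    if num_bytes < PB:
        return "terabyte scale"
    return "petabyte scale"


def hadoop_worth_considering(num_bytes: int) -> bool:
    """Rule of thumb from the summary: below a terabyte, probably not."""
    return num_bytes >= TB


# Hypothetical databases, used only to exercise the functions.
examples = [
    ("orders database", 40 * GB),
    ("data warehouse", 12 * TB),
    ("clickstream archive", 3 * PB),
]

for label, size in examples:
    tier = size_tier(size)
    consider = hadoop_worth_considering(size)
    print(f"{label}: {tier}, consider Hadoop: {consider}")
```

Size is only the first filter. A terabyte-scale database of purely structured, transactional data may still be better served by a traditional warehouse.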

Large variety of data

Another consideration is the variety of your data. Some companies depend on transactional data for all of their data needs. Others take advantage of a seemingly endless variety of unstructured data. Different types of data can include photos, video, audio, reviews, social conversations, email messages and more. Using a wider variety of data gives you broader context for the analytics you derive from it.

Variety, much like volume, is considered a standard qualifier for big data users. If you are dependent on structured data alone, it's unlikely that you have moved into the big data category.

Many different data sources

One major benefit of using a big data platform is the ability to integrate a wide range of data sources.

Agriculture is one industry that is using big data to transform agricultural practices. Agriculture companies need to understand the impact that soil types, water levels and heat patterns have on their crops. Data sources include aerial surveillance, in-field sensor networks and meteorological data. Computations for these large-scale data sets require a robust big data platform such as Hadoop.

Here's a sample of other data sources used today:

  • Medical devices
  • Location data
  • Machine-to-machine communication
  • Warehouse sensors
  • Satellites
  • Smartphone applications
  • Social media

If your business is pulling in a large variety of data from many different sources, chances are you are also analyzing a large volume of data. In that case, you have a strong argument for using Hadoop for ongoing discovery and data analysis.

Archiving historical data

Why do many companies need to hold on to their data for extended periods of time? There are three main reasons:

Government mandates

In some industries, government regulations require certain types of data to be stored for several years. Examples include consumer protection regulations and criminal investigation data retention laws.

Consumer requests

Sometimes a company chooses to retain data because its product is consumer-driven. Facebook is one example: it has kept vast amounts of user data since it opened its doors over nine years ago.

Historical analytics

Businesses today can benefit by analyzing data from both the present and past, enabling decision-makers to predict the consequences of critical business choices.

Storing vast amounts of data for extended periods of time can be prohibitively expensive with traditional data applications.

If you need to store a large amount of data for an extended period of time, a big data platform such as Hadoop may be a cost-effective solution.

When will your company approach the big data threshold? There's no quick answer; it depends on many factors, including your existing legacy data workloads and management tools. However, if you need to analyze a large amount of varied data from different sources, take a closer look at big data technologies such as Hadoop, which may give you a competitive advantage.

Michele Nemschoff is vice president of corporate marketing at big data platform solutions firm MapR Technologies.