Making decisions with data – Is data prep really preparing you for success?

(Image credit: Shutterstock/alexskopje)

Data preparation is an essential tool for everyone who wants to get value from sources of information. Whether this data is held in internal applications, derived from partners or supplied by third parties, it will need some work before it can be analysed. More importantly, prepared data must be connected and related to other analytic instances in the enterprise, so teams can collaborate and make better decisions with data.  

The data preparation phase can be substantial. Ventana Research estimates that the average business analyst spends about 45 percent of his or her time preparing data for analysis. These analysts are already adept at working with data, but what about those in other areas of the business without this experience? Making data preparation easier – and more importantly, smarter – is therefore an important priority for 2017.

Making data preparation work with self-service

To make analytics accessible to people across the business without depending on a central IT organisation, self-service data preparation capabilities need to be in place. However, extending data preparation outside of IT and making it fully self-service covers a multitude of areas, so it’s important to plan ahead on how to make the most of this technology within your approach. 

A lot of this planning will depend on the complexity of data that needs to be analysed across the organisation, how it will be used and who will be involved in working with it. For companies that have lots of people clamouring for access to analytics, this planning stage can help the initial expansion succeed. By starting with a small control group, it’s possible to expand outwards in a more managed – and potentially more successful – way.

Using three axes – data complexity, data goals and data users – it’s then possible to look at who will be using existing information and who will be enriching it with their own data sets. Internal data for activities such as sales or marketing can be valuable for the departments involved, but it can be used by other teams too. At this point, it’s worth looking at how that data gets prepared and used in context.

Rather than sharing physical reports, which then have to be “prepared” again for desktop analysis, it’s worth looking at how multiple people can access the same virtual set of data in a more governed way. By sharing cloud-based, virtual data from a common analytical network, the need for extensive data feeds can be reduced. The additional benefit is that a lot of the data preparation work will have already been carried out, so no wheels have to be reinvented. This can also ensure that everyone is using the same source of truth for their analysis.

For users who want to bring in their own data sets alongside existing company data, more data preparation will be required before analytics can be carried out. This normally requires extensive skills in cleaning and joining data so that it can be used alongside other data sets. However, modern tools and technologies make it possible to simplify this process, if not automate much of it. This opens up analytics to more business users who don’t have data preparation skills today, but who could benefit from blending their own data with corporate data.
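To make "cleaning and joining" concrete, here is a minimal Python sketch of the kind of work a self-service tool would hide behind its interface. It is an illustration only: the data sets, field names and the `clean_target_row` and `blend` helpers are hypothetical, and a real tool would handle far more cases.

```python
# Hypothetical corporate data that has already been prepared centrally.
corporate_sales = [
    {"region": "North", "revenue": 1200.0},
    {"region": "South", "revenue": 950.0},
    {"region": "East", "revenue": 700.0},
]

# A user's own data set, with inconsistent formatting that needs cleaning.
user_targets = [
    {"Region": " north ", "Target": "1,000"},
    {"Region": "SOUTH", "Target": "1,100"},
    {"Region": "West", "Target": ""},  # missing value
]


def clean_target_row(row):
    """Normalise the region label and parse the target; return None if unusable."""
    region = row["Region"].strip().title()
    raw_target = row["Target"].replace(",", "").strip()
    if not region or not raw_target:
        return None
    return {"region": region, "target": float(raw_target)}


def blend(sales_rows, target_rows):
    """Left-join the corporate rows to the cleaned user rows on region."""
    cleaned = [c for c in map(clean_target_row, target_rows) if c is not None]
    targets_by_region = {c["region"]: c["target"] for c in cleaned}
    blended = []
    for sale in sales_rows:
        target = targets_by_region.get(sale["region"])  # None when no match
        blended.append({**sale, "target": target})
    return blended


blended_rows = blend(corporate_sales, user_targets)
for row in blended_rows:
    print(row)
```

The join keeps every corporate row and leaves the target empty where the user's data has no usable match, so the shared corporate figures are never dropped because of gaps in a personal data set.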

Turning data preparation into a self-service activity requires a business user-friendly experience, particularly through the use of more familiar “drag and drop” interfaces rather than ETL scripting or programming languages. However, this approach should also include ways to track design decisions and why particular data transformations have been chosen. When preparing data, the history or “lineage” of data transformations can be recorded to check for accuracy and ensure that users understand where data comes from and what it means.
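One way to picture recorded lineage is a log that travels with the data and notes each transformation and the reason for it. The Python sketch below is an assumption about how such a record could look, not a description of any particular product; the `PreparedData` class, its fields and the sample rows are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class PreparedData:
    """Rows plus a log of every transformation applied to them (hypothetical)."""
    rows: list
    lineage: list = field(default_factory=list)

    def apply(self, step_name, reason, transform):
        """Run transform over the rows and record what was done and why."""
        new_rows = transform(self.rows)
        entry = {
            "step": step_name,
            "reason": reason,
            "rows_in": len(self.rows),
            "rows_out": len(new_rows),
        }
        # Return a new object so earlier stages and their lineage stay intact.
        return PreparedData(new_rows, self.lineage + [entry])


raw = PreparedData([
    {"customer": "Acme", "spend": "120"},
    {"customer": "", "spend": "80"},
    {"customer": "Globex", "spend": "45"},
])

prepared = raw.apply(
    "drop_blank_customers",
    "rows without a customer cannot be matched to other data",
    lambda rows: [r for r in rows if r["customer"]],
).apply(
    "spend_to_number",
    "spend arrives as text and must be numeric for totals",
    lambda rows: [{**r, "spend": float(r["spend"])} for r in rows],
)

for entry in prepared.lineage:
    print(entry)
```

Because each step stores its reason and its row counts, a colleague picking up the prepared data can see that one row was dropped and why, without having to reverse-engineer the result.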

Supporting self-service data preparation across the business

As part of this, it is worth understanding how conventional self-service data preparation tools work in comparison with cloud-based tools. Most data prep tools live on the desktop as a physical application and work with data sets that are physically present on that desktop too. Cloud-based data prep tools can take data and then work with it in the cloud. Both sets of tools produce analytics results that are ready for people to use, but take different routes to deliver them.

However, there are other elements that should be considered around data preparation tools. The first is how many people will use the results of data preparation activities. For desktop tools, sharing the analytic-ready data can be more difficult as all the files involved are present on one person’s machine. While it’s possible to hand over a bundle of files to another analyst, the prep work may have to be redone before it can be used.

The other element is that everyone has to know that this work has been carried out in the first place. The upshot of this is that it’s possible for multiple people to carry out the same data prep and enrichment tasks independently, essentially re-doing the same work. Moving this phase over to the cloud can therefore help in making it clear what data preparation work has already been completed and what needs to be done. 

Networking these sets of data together can also mean that multiple people can reuse the results as they need them. More importantly, the cloud-based approach should include the results of the analytics and the data itself as a complete package. This provides a degree of transparency into the lineage of data and analytic results over time that shows how the sets of data are transformed.

Sharing is caring - cloud-based analytics versus file-based approaches to data

Moving data into the cloud can help make analytics easier for non-specialists to adopt, as well as reducing the amount of data preparation rework that can otherwise be required. However, this strategy will need some preparation itself on how teams will adopt analytics in their day-to-day activities.

By looking at the goals that exist around how analytics can be used across the organisation, it’s possible to empower teams to enrich their own analytics with the insights of others and vice versa. This more holistic approach relies on teams working with data as part of a network, as well as the IT infrastructure to support this “organic” growth of analytics without sacrificing data governance. By bringing in data sources as part of a connected network, analytics can be made easier and more trusted for all.

To get the most out of their self-service data preparation efforts, users must be able to relate and share their analytic insights as well. The traditional approach has been to export physical spreadsheets or PDF files. These files ultimately act as data silos that compromise trust in the data and undermine the decision-making process.

By using the cloud to share virtual data and corresponding metadata, rather than files, teams can avoid this problem. By taking a user’s prepared data and making it part of a network of analytics, the insights already available in the network can be used to extend the user’s analytics. In addition, everyone can re-use the transformation steps for their own data without re-inventing the wheel. More importantly, it encourages more collaboration around data overall, based on users meeting their own needs rather than relying on specialists.

For experienced data professionals, self-service data preparation cuts down the amount of time that they have to spend on the management of data sources and lets them deliver analytic-ready data much faster. However, self-service has even more value for business users. By making it easier for more people to prepare data and collaborate as part of a network of analytics, companies can transform the way they use data to drive the business forward, extending analytic capabilities outside the IT organisation.

Pedro Arellano, Vice President, Product Strategy, Birst

Pedro Arellano
Pedro Arellano is vice president, product strategy at Birst, leading development around networked data and analytics. Prior to Birst, he led marketing at MicroStrategy and hosted the Stereo Gol radio show.