Why digital businesses should understand their data better

(Image credit: Bbernard / Shutterstock)

The world is creating more data than ever before. It is estimated that 463 exabytes (one exabyte is equivalent to 1 billion gigabytes) will be created each day by 2025. This data is being generated globally across a host of digital services, and businesses are in an arms race to catch up and offer consumers the best digital experience possible. In fact, IDG found that nearly nine in 10 (89 per cent) businesses are seeking to adopt a digital-first approach thanks to the global adoption of the Internet.

With so many businesses jumping into the digital space, the focus of enterprise data strategies has moved to the collection, maintenance and provisioning of data. This has proven to be a significant challenge for the industry, especially for the insurance, healthcare and finance sectors, as on average between 60 per cent and 73 per cent of all data within an enterprise is unused for analytics.

While the efficient use of data is essential for a business in a myriad of ways (from customer acquisition and retention, to driving digital efficiencies, to the creation of new products and much more), data security and privacy remain a critical issue.

From a European perspective, the General Data Protection Regulation (GDPR) has brought a huge overhaul of how businesses process and handle data. Meanwhile, in the UK, the Data Protection Act 1998 was replaced by the Data Protection Act 2018. These regulatory requirements are a response to the digital world we now inhabit. Beyond the handling of data, these newer regulations also require businesses to obtain consent in certain situations in order to process data. In the case of GDPR, much has been made of the potential for sizeable monetary fines (of up to €20 million or 4 per cent of annual global turnover, whichever is higher) that could be imposed on firms that do not comply with the regulation.

In this digital-first business world, with such significant regulatory requirements, how can businesses, especially those within technology, best insulate themselves from bottlenecks in the flow of data and implement a powerful digital data strategy?

Understanding data better

To implement a robust digital data strategy, it is crucial to understand how data used for development and testing flows within the organisation. It is important to understand both the quality of the data being collected and utilised on a daily basis and its ethical aspects.

Traditionally there were two main categories for data: original and anonymous. Original, as its name suggests, is just that. It has personally identifiable information (called PII for short) including names, addresses and transactional details. By contrast, anonymous data (in a general sense) does not include PII, yet does have transactional information. In essence, original data by definition has better quality than anonymous data, but it has its natural, ethical limitations.
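The distinction between the two categories can be sketched in a few lines of Python. This is a minimal illustration only: the field names and the choice of which fields count as PII are hypothetical, and real anonymisation usually involves more than dropping columns.

```python
# Hypothetical field names; which fields count as PII depends on the dataset.
PII_FIELDS = {"name", "address"}


def anonymise(record: dict) -> dict:
    """Return a copy of the record with the PII fields removed."""
    return {key: value for key, value in record.items() if key not in PII_FIELDS}


# An "original" record: PII plus transactional details.
original = {
    "name": "Jane Doe",
    "address": "1 High Street",
    "product": "home insurance",
    "amount": 120.50,
}

# The "anonymous" version keeps only the transactional information.
anonymous = anonymise(original)
print(anonymous)
```

The anonymous record is safer to share, but it has lost the detail that made the original record valuable, which is the trade-off described above.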

Many businesses, most notably those in the insurance space, depend on data to support a number of functions such as informed risk selection, underwriting and claims management. They would like to automate these functions, but it’s just not that simple. Before the processes can be automated, the quality of data needs to be understood and efficient data provisioning methodologies need to be implemented.

A new approach - synthesised data

Thanks to the proliferation of advanced technology, especially artificial intelligence (AI) and machine learning (ML), in which over $70 billion was estimated to have been invested in 2019 alone, a new, better approach to the data conundrum is on offer.

It comes in the form of synthesised data: computer-generated data, underpinned by best-in-class ML technology, that in effect mimics original data.

This approach has proven to be essential for companies’ data strategies for a number of reasons.

Firstly, synthesised data is an accurate version of original data. Yet it is not a pure copy; instead, it is generated from a generalised statistical model of the original data. This ensures that statistical properties are preserved while also making it nearly impossible to tell with the naked eye whether data is synthesised or original. In fact, research originating from MIT revealed that synthesised data can give the same results as real, original data.
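To make the idea of preserving statistical properties concrete, here is a deliberately simple sketch in Python. It is a stand-in for illustration, not the ML technology referred to above: it assumes a single numeric column that is roughly normally distributed, fits only a mean and a standard deviation, and samples new values from that fit. The figures and function names are hypothetical.

```python
import random
import statistics


def fit(values):
    """Summarise a numeric column by its mean and sample standard deviation."""
    return statistics.mean(values), statistics.stdev(values)


def synthesise(values, n, seed=0):
    """Draw n new values from a normal distribution fitted to the originals.

    Assumption: a normal fit is good enough for this column. Real data
    synthesis uses far richer models and covers many columns at once.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    mu, sigma = fit(values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical transaction amounts standing in for original data.
original_amounts = [120.5, 98.0, 143.2, 110.7, 131.9, 105.3, 126.4, 117.8]
synthetic_amounts = synthesise(original_amounts, n=5000)

# The synthetic column is freshly sampled, yet its summary statistics
# stay close to those of the original column.
print(fit(original_amounts))
print(fit(synthetic_amounts))
```

The synthetic values are drawn from the fitted distribution rather than copied from the original column, while the mean and spread remain close to the original's, which is the sense in which statistical properties are preserved.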

Secondly, due to the way it is generated, as outlined above, synthesised data ensures a 0 per cent risk of non-compliance with data regulations such as GDPR. It provides complete peace of mind to technology companies worried about heavy fines and reputational damage from any issues around data security and privacy. The importance of this regulatory compliance is only likely to grow as more and more data is collected and stored in our Internet-first society.

Finally, the technology behind synthesised data is so powerful that generating this data can take as little as 10 minutes. This agility can unlock huge efficiencies internally, particularly when over 10 per cent of staff time can be lost to the time-consuming work of collecting data. This really is a hidden cost to a business, but by utilising synthesised data, a business can free up staff capacity to invest more time in product and service development and help the company grow exponentially.

Synthesised data is now a powerful and credible alternative to the historical methods of data collection and processing. In fact, I would argue that synthesised data actually unleashes the full potential of all the commercial information a technology business has. Given the volume of data such companies are now dealing with, the true potential offered by synthesised data can no longer be ignored.

Dr Nicolai Baldin, CEO and Co-Founder, Synthesized

Nicolai, a former machine learning researcher from the University of Cambridge, is CEO and Co-Founder of Synthesized, a company that has pioneered data synthesis - computer-generated data that mimics real data.