
Synthetic data: Fueling a data-centric approach to AI

(Image credit: Gerd Altmann from Pixabay)

Despite growing interest in and adoption of Artificial Intelligence (AI) and Machine Learning (ML) in the enterprise, studies suggest that 85 to 96 percent of projects never make it to production. That failure rate is all the more striking given the rapid growth of AI and ML in recent years, raising the question: what accounts for it?

Data may be the new “oil,” but it is also often the downfall of AI and ML projects. The process of collecting, cleaning, and organizing raw data is long and arduous, and translating it into accurate, functioning AI adds further complexity. Like oil, data for AI and ML must be refined and fed into a working engine before it is useful.

Model-centric vs. data-centric 

In a series of talks and related articles, prominent researcher, technologist, and entrepreneur Andrew Ng points to the elephant in the room of AI: the data. While data is perhaps the most valuable modern business asset, its quality is often neglected in AI and ML projects. The vast majority of effort goes into the model itself, and as a result hundreds of hours can be wasted tuning a model built on low-quality data. In this model-centric form of AI, the data is cleaned once and held constant, and finding the best algorithm to improve performance becomes the primary focus. It is essentially the “one and done” approach to data. Recently, however, a new line of thinking has emerged about what makes enterprise AI succeed.

Data-centric AI and ML optimization has gained significant traction over the past few months. At its core, it is a paradigm shift in how AI systems are developed, emphasizing good data, rather than the specifics of the model, as the driver of good performance. Ideally, data-centric development is highly iterative, driven in a closed loop by model performance. That is not feasible with today’s human annotation approaches, which require long cycles to collect and prepare data for ML training.

Data labeling: The real elephant in the room 

Data-centric AI is not without its challenges. Collecting and annotating data is a critical component of the data-centric approach, but it has become a growing problem, if not a hard constraint, on AI research and development. Most available datasets today are manually labeled, which is costly in both time and resources.

Moreover, even as the most established approach, hand labeling is not immune to human error. Researchers at MIT found that an average of 3.4 percent of labels were erroneous across the ten most-cited datasets, many of which serve as ground truth for benchmarking machine learning models. In other words, even the reference data itself can be outright incorrect.

Dataset bias is another risk with human-annotated data. An AI or ML model trained on a human-labeled dataset can learn the inherent biases of the labeler. Take Amazon’s recruiting model, which showed bias against women, or Google’s hate speech detection algorithm, which discriminated against people of color. Both examples demonstrate how human biases in data translate into unfair and potentially harmful results, replicated at scale.

Human annotation is also fundamentally limited: humans cannot label attributes such as 3D position or depth, which are essential for use cases like autonomous vehicles, robotics, and AR/VR.

Synthetic data is like owning an oil well

Fortunately, there is a solution to these challenges. Synthetic data, artificially generated and labeled data that models the real world, can be used in place of real-world data to train AI models. It shows promise in filling the gaps in data-centric approaches and delivering comprehensive training data at a fraction of the cost and time of current methods.

By merging technologies from the visual effects industry with generative neural networks, synthetic data provides perfectly labeled, realistic datasets and simulated environments at scale, removing a massive barrier to entry for data scientists. Because the data is generated, information about every pixel is explicitly known, and an expanded set of labels is produced automatically. Systems can be built and tested virtually, and AI developers can iterate orders of magnitude faster because training data is created on demand.
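The "every pixel is explicitly known" point can be made concrete with a minimal sketch. This is not any vendor's pipeline; it is a hypothetical toy renderer that places a square in a 2D image. Because the scene is generated from known geometry, the segmentation mask and bounding box fall out for free, with no annotator involved.

```python
# Toy illustration of synthetic data with "free" labels: the renderer
# knows the geometry, so pixel-perfect annotations come out alongside
# the image itself.

def render_square(h, w, top, left, size):
    """Render a white square on black; return (image, mask, bbox)."""
    image = [[0] * w for _ in range(h)]
    mask = [[0] * w for _ in range(h)]
    for r in range(top, top + size):
        for c in range(left, left + size):
            image[r][c] = 255   # object pixels in the rendered image
            mask[r][c] = 1      # segmentation label, known exactly
    # Bounding box (x_min, y_min, x_max, y_max), also known exactly.
    bbox = (left, top, left + size - 1, top + size - 1)
    return image, mask, bbox

image, mask, bbox = render_square(h=8, w=8, top=2, left=3, size=4)
print("bbox:", bbox)
print("labeled pixels:", sum(sum(row) for row in mask))
```

A real synthetic data pipeline swaps the toy renderer for a physically based one and emits richer labels (depth, 3D pose, surface normals) the same way, since the generator already knows them.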

In addition to streamlining the development process, synthetic data can also help reduce human biases often seen in traditional AI datasets due to non-representative real-world data, enhance privacy and play a pivotal role in democratizing access to AI.  

On-demand synthetic data generation enables rapid iteration and closed-loop data optimization to drive model performance

The burgeoning field of synthetic data generation has grown rapidly in recent years. Gartner predicts that by 2024 most of the data used by AI will be artificially generated, and that it will not be possible to build high-quality AI without synthetic data. The ability of AI to improve itself using synthetic data makes it a uniquely powerful technology, and key to increasing both the quality and quantity of robust data for advanced models and simulations.

The shift to data-centric development will ultimately usher in higher-performing AI models. Synthetic data is a much-needed ally to achieve a proper and successful data-centric paradigm in the development of AI solutions.


Yashar Behzadi, CEO, Synthesis AI

Yashar Behzadi is the CEO of Synthesis AI and has 14 years of experience building data-centric technology companies in Silicon Valley in the AI, medical technology, and IoT markets.