
Can automated feature engineering produce machine learning that finally lives up to its name?


Automated machine learning (ML) sounds like the stuff of business leaders’ dreams.

Take the question, ‘which customers should we focus our marketing budget on this year?’ ML can now deliver robust answers to these types of business questions even faster than before, through greater use of automation.

Data in at one end, seriously useful business insight out of the other.

That is partly why Forbes predicts businesses will be investing $125bn a year in Artificial Intelligence and Machine Learning by 2025.

But even though numerous vendors boast of “AutoML” capabilities, the reality is that the act of developing ML models is still very much driven by humans and requires an awful lot of manual trial and error, performed by (expensive) experts.

Whilst the human element will never completely disappear, new automation techniques will help to reduce the vast amount of labor-intensive work required. Not only will this reduce the overall cost and effort, it should also lower the levels of skill and experience required to build reliable ML models.

Unfortunately, manual effort still accounts for 80 percent of the machine learning development process. The most important part of this manual effort is feature engineering, where different data elements are combined and enriched to generate the most potent formula for predicting future events.

In the case of working out which customers might churn in the next year, for example, the data may include the size of their last discount. But prediction accuracy would improve if further features were engineered such as the time since the last discount, the average time between discounts and how the discount compares to those offered to other customers.
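
As a rough illustration of the churn example above, the sketch below derives those three extra features from a raw discount history. The customer IDs, dates, discount sizes and function names are all made up for illustration, and the peer comparison is assumed to be a simple difference from the average of every customer's latest discount.

```python
from datetime import date
from statistics import mean

# Hypothetical discount history: customer id -> list of (date, discount percent).
discounts = {
    "A": [(date(2023, 1, 10), 5.0), (date(2023, 4, 10), 10.0), (date(2023, 7, 9), 15.0)],
    "B": [(date(2023, 6, 1), 20.0)],
}

def discount_features(history, as_of, peer_mean):
    """Derive the example features from one customer's discount history."""
    history = sorted(history)  # oldest discount first
    dates = [d for d, _ in history]
    last_date, last_size = history[-1]
    # Gaps in days between consecutive discounts.
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    return {
        "last_discount": last_size,
        "days_since_last_discount": (as_of - last_date).days,
        # None when the customer has had only one discount, so no gap exists.
        "avg_days_between_discounts": mean(gaps) if gaps else None,
        # Assumption: "compares to other customers" = difference from the peer average.
        "last_discount_vs_peers": last_size - peer_mean,
    }

as_of = date(2023, 12, 31)
peer_mean = mean(sorted(h)[-1][1] for h in discounts.values())
features = {cid: discount_features(h, as_of, peer_mean) for cid, h in discounts.items()}

for cid, row in features.items():
    print(cid, row)
```

Each of these derived columns would then sit alongside the raw discount size as an input to the churn model.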

The challenge here is that nobody knows for sure whether these feature combinations will work until they have been developed, tested and fully assessed together as part of an ML model.

Specialist knowledge has been essential in these endeavors: you can’t produce a good algorithm without a subject matter expert knowing something about which features may be the most significant, or without experienced data scientists with deep knowledge of the ML process.

This need for expensive experts is one of the factors that have limited the application of ML to the organizations with the skills, patience and deep pockets to indulge lengthy developments, and to low-risk use cases with the clear potential for high levels of return on investment. But this is now starting to change.

One area of data science development that offers the potential to transform this endeavor is automated feature engineering: using a computer to shortcut one of the most manually intensive aspects of ML development.

The challenge of bringing automation to every stage of the ML workflow is one that my company, Peak Indicators, has been exploring for years. From this work, we created Tallinn ML, a platform providing all of the components required to build and deploy predictive models automatically, significantly reducing the reliance on highly-skilled data scientists.

Tallinn ML includes a unique feature-engineering engine that drastically cuts the time taken to develop new predictive algorithms. It generates and tests thousands of different metrics as part of the data engineering, a process of trial and error that can take humans months or even years to complete.
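
The internals of that engine are not described here, so the following is only a generic sketch of the generate-and-test idea, not Tallinn ML's method. It assumes candidate features are pairwise differences and products of raw columns, and that each candidate is scored by its absolute Pearson correlation with the target. The column names and data are invented.

```python
from itertools import combinations
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5 if vx and vy else 0.0

def candidate_features(columns):
    """Yield (name, values) for each raw column plus pairwise differences and products."""
    for name, values in columns.items():
        yield name, values
    for (a, xs), (b, ys) in combinations(columns.items(), 2):
        yield f"{a}-{b}", [x - y for x, y in zip(xs, ys)]
        yield f"{a}*{b}", [x * y for x, y in zip(xs, ys)]

def rank_features(columns, target):
    """Score every candidate against the target and return them best first."""
    scored = [(abs(pearson(v, target)), n) for n, v in candidate_features(columns)]
    return sorted(scored, reverse=True)

# Made-up toy data: the target tracks spend minus refunds better than either column alone.
columns = {"spend": [10, 20, 30, 40, 25], "refunds": [9, 5, 29, 20, 24]}
target = [0, 1, 0, 1, 0]  # 1 = customer retained (hypothetical label)

ranking = rank_features(columns, target)
for score, name in ranking:
    print(f"{name}: {score:.2f}")
```

On this toy data the engineered difference ranks first. A realistic pipeline would generate far more candidates and judge them by how much they improve a model on held-out data rather than by a single correlation.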

Earlier this year we applied it on Kaggle, Google’s online home for the world’s data scientists and machine-learning experts and a kind of Premier League of ML. Kaggle set an unusual challenge: can you develop an algorithm to predict which people were most likely to survive the world’s most infamous shipwreck, the Titanic?

Competitors were given a set of features, such as passenger age or gender, and asked to develop the most powerful algorithm to predict who would survive. Among Kaggle’s 1 million users are some of the world’s best-known researchers and data science teams. Peak’s Tallinn ML algorithm reached the top 5 percent for accuracy.

While other world-class competitors developed their models through manual means, our model was produced automatically. It involved no coding and no manual trial and error. It proved that machine learning has now reached a new level of automation.

The business impact of automated ML

So what difference does this make to business? Well, potentially a huge one.

The insights provided by predictive analytics and machine learning have been seen for some time as potentially revolutionary for business. Suddenly firms are far better able to answer crucial questions like:

  • What are the impacts of a particular marketing campaign likely to be for specific target customers?
  • Which of our employees are likely to leave in the next year?
  • Which transactions in an account are most likely to be fraudulent?
  • What seems to be causing a particular business problem?

Those questions are just the start. Answering them reliably means resources can be put where they are most needed. Inefficiently-used time and money can be reallocated to more productive tasks. Robust new insights into what is needed next appear magically.

But making that promise a reality is difficult. As Gartner highlighted just last year, “doing predictive analytics is tough. Your team needs to possess the right skills, understand business priorities and deal with data accuracy”.

That meant that any business, according to Gartner’s research, previously had to ask an important question: “What’s the likelihood you’ll sink under the weight of your organization’s data or swim to successful results?”

Now that question is no longer so pressing. An automated solution makes it far more likely an organization will swim, because it will eliminate a considerable amount of time and effort in ML projects, and significantly reduce the need for very high-level expertise. The chances of an organization sinking - or treading water - in a sea of data become far smaller.

Problems that previously took months to solve can now be addressed in a matter of hours or days, and it has become economically viable to use ML to solve a much more extensive range of problems. We expect to see more experimentation and innovation using ML across all areas of business, including use cases that previously could not justify the cost of a data science project lasting several months.

Trials of Tallinn ML at a global retail and consumer-goods company produced a predictive model in two hours that was 18 percent more accurate, and delivered 19 times fewer false positives, than one developed over a three-month period by a team of experienced data scientists.

Another trial, at a global financial-services organization, showed that Tallinn ML’s automated feature engineering improved the accuracy of employee-churn predictions by 51 percent.

Beyond these improvements in pace and accuracy, this new approach promises to bring the benefits of ML to a much wider range of organizations. Automating the entire ML workflow democratizes data science to the point that any organization with an IT manager and big data sets to explore can start to derive value from it.

ML, with the ability of its algorithms to improve automatically through experience, has long been recognized for its potential to bring greater intelligence and automation to the world of business. But to date, it has relied on expert humans to set up the machines to do what they do best.

Fully automating the development of ML models means that, for the first time, ML can deliver on its full promise. Efficiency. Productivity. Speed. Precision in prediction. Seriously useful business insight. Genuinely letting the machine take the strain, and freeing up humans to do what they do best.

Antony Heljula, Technical Director, Peak Indicators