Don’t let feature engineering stagnate your ML projects


Have your machine learning (ML) services stalled out? Are stakeholders beginning to doubt their veracity? The application of machine learning is often mistaken for an efficient, intelligent process in which data goes in and precise insights are quickly produced. The fact is, machine learning is still very much driven by humans. It is an iterative process that often takes much longer than enterprises would prefer and can provide less-than-accurate results. This is particularly true when it comes to feature engineering, which is an arduous process for data scientists and can take months to accomplish – all before machine learning and artificial intelligence (AI) are even applied.

This raises the question: why are so many organisations so keen to deploy AI and ML technology if it’s so hard to prepare and use? According to Research and Markets, “the artificial intelligence market is expected to reach USD 190.61 Billion by 2025 from USD 21.46 Billion in 2018.” The firm cited “growth in big data” as a top driver for this explosion in the AI market. Yet one Gartner analyst said that roughly 85 per cent of big data projects fail. Despite the incredible growth in investment in AI and ML solutions to make sense of their data sets, many organisations are clearly approaching their projects incorrectly, which leads to this high rate of failure.

Manual feature engineering – the bane of machine learning

The slow, manual nature of deploying ML is especially evident in the feature engineering stage of any data project. Feature engineering is the process of extracting, from a raw data set, the explanatory variables that are fed into a machine learning model. These features, or variables, are then used to train the model to make predictions. Feature engineering is at the core of applying machine learning, and it is the most tedious, time-consuming and labour-intensive part of training a model. Data scientists must follow precise steps to prepare the data sets for analysis. Most often, data is dispersed across multiple sources and must be consolidated into a single table, with the observations in the rows and the features in the columns.

The tables have to be combined into one that includes the training examples and the explanatory variables, or features. This is called the feature matrix. Feature engineering consists of identifying and extracting predictive features from the data manually, a process that can take several months. The features have to be built one at a time using domain knowledge, and the code for manual feature engineering is problem-dependent and usually must be rewritten to accommodate each new dataset.
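To make the idea of a feature matrix concrete, here is a minimal sketch in plain Python. The customer and order tables, the column names and the `build_feature_matrix` function are all hypothetical and invented purely for illustration; a real project would typically use a dataframe library and far larger data.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw data spread across two sources.
customers = [
    {"customer_id": 1, "region": "north"},
    {"customer_id": 2, "region": "south"},
]
orders = [
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 1, "amount": 40.0},
    {"customer_id": 2, "amount": 15.0},
]


def build_feature_matrix(customers, orders):
    """Return one row per customer, with hand-written features as columns."""
    # Consolidate the order table by customer.
    amounts = defaultdict(list)
    for order in orders:
        amounts[order["customer_id"]].append(order["amount"])

    matrix = []
    for customer in customers:
        spent = amounts.get(customer["customer_id"], [])
        # Each feature below was chosen and coded by hand for this dataset.
        matrix.append({
            "customer_id": customer["customer_id"],
            "is_north": int(customer["region"] == "north"),
            "order_count": len(spent),
            "total_spent": sum(spent),
            "mean_order": mean(spent) if spent else 0.0,
        })
    return matrix


feature_matrix = build_feature_matrix(customers, orders)
```

Every feature here is tied to these particular column names, which is why code of this kind usually has to be rewritten when the dataset changes.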

This common approach illustrates one of the more frustrating challenges in machine learning for data scientists, software engineers, developers and business leaders. It is extremely time-consuming to manually define prediction problems, identify relevant features, collect disparate data and feed it into machine learning models. Taking the time to do this, even thoroughly and methodically, doesn’t guarantee accurate outcomes and is one of the many reasons why data projects seem to fail at such an alarming rate.

Feature engineering also requires data scientists with domain expertise to brainstorm ideas. Then, engineers with technical expertise must implement them. This two-step, multi-team process has created a major bottleneck in many organisations’ overall machine learning processes and is contributing to the previously mentioned statistic about failed projects. In this human-driven process, the ideal team would comprise data scientists who also possess the technical knowledge required not only to identify features for testing, but also to architect and implement them. As more enterprises learn this, we will start seeing the emergence and rise of the “ML architect.”

Until then, automated feature engineering is proving to be a panacea for ML projects, because it can cut the process from potentially months to days and can be used by people who are not data scientists. It is also more efficient and repeatable than performing these steps manually, which allows teams to build better predictive models faster. In addition to speeding up a manual undertaking, automated feature engineering also leads to increased accuracy.
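One simple way to picture the automation is to apply a fixed set of generic aggregation functions to every numeric column, rather than coding each feature by hand. The sketch below is a deliberately simplified illustration in plain Python, not the algorithm used by any particular product; the `PRIMITIVES` table, the `auto_features` function and the sample data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Generic aggregation "primitives" that can be reused on any dataset.
PRIMITIVES = {"count": len, "sum": sum, "mean": mean, "max": max, "min": min}


def is_numeric(value):
    """True for ints and floats, but not booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def auto_features(parents, children, key):
    """Apply every primitive to every numeric child column, per parent row."""
    # Group child values by parent key, then by column name.
    grouped = defaultdict(lambda: defaultdict(list))
    for child in children:
        for column, value in child.items():
            if column != key and is_numeric(value):
                grouped[child[key]][column].append(value)

    numeric_columns = sorted({c for cols in grouped.values() for c in cols})

    matrix = []
    for parent in parents:
        row = {key: parent[key]}
        values_by_column = grouped.get(parent[key], {})
        for column in numeric_columns:
            values = values_by_column.get(column, [])
            for name, func in PRIMITIVES.items():
                # None marks a feature with no underlying data.
                row[f"{name}({column})"] = func(values) if values else None
        matrix.append(row)
    return matrix


# Hypothetical sample data.
customers = [{"customer_id": 1}, {"customer_id": 2}]
orders = [
    {"customer_id": 1, "amount": 20.0, "items": 2},
    {"customer_id": 1, "amount": 40.0, "items": 1},
    {"customer_id": 2, "amount": 15.0, "items": 3},
]
matrix = auto_features(customers, orders, key="customer_id")
```

Because nothing in `auto_features` refers to a specific column, the same code can be pointed at a different pair of tables without being rewritten. The trade-off is that it generates many candidate features, most of which will need to be filtered out afterwards.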

Accenture taps automated feature engineering to accelerate ML

Global management consulting firm Accenture was growing tired of discovering issues with its software projects after the fact, which resulted in tedious and time-consuming investigations. These included an intense, high-effort re-examination of data sets to determine what went wrong, taking resources away from hundreds of other data projects.

Accenture wanted to get out in front of the issue and reduce the volume of problems. It began by identifying patterns in complex volumes of data. Once that was done, the firm built machine learning models and used them to anticipate the occurrence of critical issues. Accenture dubbed the result its “AI Project Manager.” After analysing historic data from various software projects, the solution trained a machine learning-based model to predict, weeks in advance, whether a problem might occur. Through automated feature engineering, the tool identified 40,000 patterns. Domain experts were then able to reduce that number to the 100 showing the most promise. Today, the Accenture AI Project Manager still predicts warnings with an 80 per cent success rate.

This case illustrates a fundamental lesson about machine learning: the biggest problem with machine learning is not that it does not work; it is that companies struggle to use it effectively.

For enterprises and other organisations to gain the most from their data, they must be able to learn from it quickly, even if that means a fast failure. The sheer amount of time it takes to perform feature engineering manually is among the biggest obstacles to gaining actionable business results from an ML project. When organisations can execute this step faster and more accurately, they will get more value from their own data and their ML investments will finally deliver greater returns, moving them out of the 85 per cent of big data projects that fail and into the 15 per cent that succeed.

John Donnelly III, chief operations officer, Feature Labs
Image Credit: Razum / Shutterstock