
Let's use DevOps thinking to make machine learning fit more easily into your IT environment

(Image credit: Geralt / Pixabay)

There’s a problem with Machine Learning (ML), but it’s not the usual concern people raise, that it’s too complicated or a black box. No, the bigger problem is that the right people aren’t yet using an advantageous approach to industrializing it more quickly.

That useful approach is MLOps, a proven method for improving collaboration and communication between data scientists and IT professionals to better manage the ML production lifecycle. Yup, the clue’s in the name: just as DevOps has cemented Agile and rapid app and software development around what the business wants, MLOps can do the same for ML, bridging the gap between model complexity and production.

The heart of the problem is that as ML goes more mainstream, we’re creating more and better models (the outputs of applying ML to your data), but they’re not yet getting to the finish line in the business; Gartner’s Erick Brethenoux estimated last year that less than half (47 percent) are going into production.

Two teams aren’t working together as seamlessly as they could.

That’s not satisfactory. As Machine Learning has started to gain traction, the speed at which data scientists (and, increasingly, citizen data scientists) can build models through AutoML technology has begun to increase, but if the models don’t get used, ML’s not adding value.

Different mindsets

What’s the hold-up? Time was the bottleneck, as was getting hold of data and storing it for ML use, but Big Data and systems like Hadoop solved that for us. That moved the issue to discovering patterns in that data, and we’ve started to solve that problem by building more and more Machine Learning models through AutoML. But now we’ve just moved the latency to letting the business gain value from these models by putting them into production. Even when our models are ready to deploy, the time it takes an organization to move them into production can be weeks or even months.

Our analysis suggests the reason may lie in how models are operationalized. You have the data scientists, who tend to build or create the models, and then you have a production IT team that looks to deploy and manage them. However, the two teams aren’t working together as seamlessly as they could, because:

First, they have completely different mindsets. Data scientists tend to be creative and like to experiment. They work in an R&D culture, so they don’t like process, they don’t follow a structure, and they’re very focused on just building the best model possible. IT plays by different rules, because they have responsibility for production systems and need to make things work; they want to make sure the systems they’re building are robust, available 24/7, and ideally conform to standards and processes. When you put a model into production, the data scientist has created it, but IT has to manage it, and they’re not always as aligned with each other’s approach as they should be.

Secondly, each team has different competencies. Your data science team is focused on the tools needed to build a great, accurate model; they don’t care as much about production environments and code requirements. IT understands production environments and what software can be used in production, but they don’t know the intricate details of how an ML model gets built. They don’t really understand what a Machine Learning algorithm is, nor do they have a clear view of the languages Machine Learning models are built in, such as Python or R.

Lack of problem ownership

Data science teams are now under increasing demand from different parts of the organization to solve more and more business problems. There’s a significant constraint on people resources, and it’s not efficient for them to spend their time understanding IT systems when you’ve already got an IT team doing that. Equally, it doesn’t make sense for your IT team to understand the nuts and bolts of a Machine Learning model. What we want to avoid is any kind of “let’s build a model and throw it over the wall to IT, and they will just sort it out” thinking.

Finally, you have a lack of ownership of the problem. When the data science team is building a credit risk model for the risk team, the risk team is the customer, the data science team builds the model, and the IT team deploys it, so responsibility for solving the problem is spread across three teams with no single owner. There’s also the area of governance: when you put a model into production and it’s making decisions about the business process every second, every minute, every hour, it has become mission-critical, so you need to ensure the asset is fully governed and that only certain people have access to those production models. Only selected people should see it and understand how it’s working, and as you add and retire models, you need to make sure you’re tracking the changes that are going on. Who made the change? When? How often is that change happening? Tracing that process is vital, both for compliance and for troubleshooting.
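To make that audit trail concrete, here is a minimal sketch of the kind of change log a model ops team might keep for each production model. It is written in Python, and the class and field names are illustrative assumptions rather than any particular product’s API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

@dataclass
class ModelChange:
    """One entry in a production model's audit trail (hypothetical schema)."""
    model_name: str
    version: str
    action: str        # e.g. "deployed", "retired", "retrained"
    changed_by: str    # who made the change
    changed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class ModelAuditLog:
    """Tracks who changed which model, when, and how often."""
    def __init__(self) -> None:
        self._entries: List[ModelChange] = []

    def record(self, change: ModelChange) -> None:
        self._entries.append(change)

    def history(self, model_name: str) -> List[ModelChange]:
        return [e for e in self._entries if e.model_name == model_name]

# Example: register a deployment and a later retirement of a credit risk model
log = ModelAuditLog()
log.record(ModelChange("credit_risk", "1.0", "deployed", "data.scientist@bank"))
log.record(ModelChange("credit_risk", "1.0", "retired", "model.ops@bank"))
for entry in log.history("credit_risk"):
    print(entry.changed_at, entry.action, "by", entry.changed_by)
```

Even a record this simple answers the compliance questions above: who, when, and how often.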

Against this set of challenges, MLOps is the best way to scale and govern Machine Learning activity. It allows data science teams and IT to collaborate, and it enables your IT operations team (now, really, your model ops team) to centrally manage the everyday operations needed to keep models healthy, keep them running, and ensure they’re performing.

With this approach, you would only really let the data science team step in when there’s a severe problem with the model, allowing them to focus on building more and more models. You get your new ‘model ops’ team to quickly deploy and manage models, scaling the whole process.

MLOps makes sense because, even with the most sophisticated and complex Machine Learning model, the operational processes look very similar to those for conventional software: testing it to make sure it performs in a specific way, and then deploying it into production. These are all disciplines IT teams are very used to and proficient in.
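As an illustration of how familiar that test-then-deploy step can feel, here is a minimal sketch of an automated pre-deployment gate, assuming Python and scikit-learn; the dataset, model, and 90 percent accuracy threshold are illustrative assumptions, not a prescribed standard.

```python
# A hypothetical acceptance test a model ops team could run before promoting
# a model: train, measure accuracy on held-out data, and block deployment if
# it falls below an agreed threshold.
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

ACCURACY_THRESHOLD = 0.90  # illustrative acceptance criterion

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
accuracy = accuracy_score(y_test, model.predict(X_test))

if accuracy >= ACCURACY_THRESHOLD:
    print(f"Accuracy {accuracy:.3f}: promote this model version to production")
else:
    raise SystemExit(f"Accuracy {accuracy:.3f} is below threshold: block deployment")
```

Wrapped in a CI pipeline, a check like this is exactly the kind of repeatable, standards-based gate IT teams already run for ordinary software releases.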

Six months to get models into production can go down to one

I know that MLOps works in the real world, too, as our customers are benefiting from it. We’re working with a large European financial institution that has built 250 models; getting each into production used to take somewhere between six and nine months, but with an MLOps approach it now takes just a month, an impressive, indeed radical, productivity gain. And as Covid lingers, resources are going to be tight, so the only way to scale up your Machine Learning effort is either to hire more people to do the model building and deployment, or to create frameworks that allow automation.

Done right, a shift to MLOps allows you to automate a lot, freeing you up to do what you really want, which is AI-powered digital transformation: taking data, applying Machine Learning to it, and surrounding it with the right people and processes to make sure it delivers value. And finally, there’s a valuable lesson for the data science side of the equation, which tends to look only at code and so can miss how much more around the edges makes a successful Machine Learning project: model monitoring, making sure production environments are available, and using the right infrastructure to execute those models.
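To show what that day-to-day model monitoring might look like, here is a minimal sketch in Python that compares a model’s recent live accuracy with the level recorded at deployment and raises a flag when it drops; the class name, window size, and tolerance are illustrative assumptions.

```python
# A hypothetical health check for a deployed model: keep a rolling window of
# prediction outcomes and alert when live accuracy falls too far below the
# accuracy measured before deployment.
from collections import deque

class AccuracyMonitor:
    def __init__(self, baseline_accuracy: float, tolerance: float = 0.05, window: int = 500):
        self.baseline = baseline_accuracy      # accuracy measured at deployment time
        self.tolerance = tolerance             # allowed drop before alerting
        self.outcomes = deque(maxlen=window)   # 1 = correct prediction, 0 = wrong

    def record(self, prediction, actual) -> None:
        self.outcomes.append(1 if prediction == actual else 0)

    def is_healthy(self) -> bool:
        if not self.outcomes:
            return True  # nothing observed yet
        live_accuracy = sum(self.outcomes) / len(self.outcomes)
        return live_accuracy >= self.baseline - self.tolerance

# Example: a model deployed at 92 percent accuracy starts drifting in production
monitor = AccuracyMonitor(baseline_accuracy=0.92)
for prediction, actual in [(1, 1), (0, 1), (1, 0), (0, 0), (1, 1)]:
    monitor.record(prediction, actual)
print("healthy" if monitor.is_healthy() else "alert: accuracy below agreed floor")
```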

Ultimately, Machine Learning is just another business software app. So let’s put everything closer together to make it as standardized and easy to get online as your SAP or Salesforce utilities, via Machine Learning Ops thinking.

John Spooner, head of Artificial Intelligence, EMEA, H2O.ai

John Spooner, head of Artificial Intelligence, EMEA, H2O.ai, leads the EMEA technical and data science teams, in advising organisations on democratising artificial intelligence and embedding it into their business decision-making.