For many business leaders, machine learning (ML) has become an integral tool for generating insights and improving results across their businesses. While the reasons for implementing ML vary across businesses, the same challenges tend to impact most companies.
Among these common challenges is the difficulty of collaboration between data engineers and data scientists. While the two often work together on projects, their roles and responsibilities differ, as do their skills, backgrounds and the tools they use.
Data scientists: Expert detectives discovering business insights
One of a data scientist’s primary goals is to help businesses extract insights and improve business decisions. To accomplish this, they work with large quantities of data, applying their expertise in statistics, programming and related disciplines.
Most often educated in mathematics and statistics, data scientists have many responsibilities, including:
- Collecting and processing data, and extracting trends, patterns and other insights
- Working with stakeholders to exploit data to drive better business decisions
- Training and deploying ML models into production
- Identifying the relevancy and accuracy of data sets
- Presenting data science results and statistical insights to key decision makers
Data engineers: Gatekeepers of businesses’ most valuable data
Data engineers’ primary focus is to create and maintain data infrastructures for analytics. They most often have Computer Science and Engineering backgrounds and a wide range of responsibilities:
- Building, maintaining and optimising scalable data architectures
- Developing dataset processes, workflows and modelling pipelines
- Aligning the data infrastructure with business demands
- Solving issues concerning data reliability and quality
- Preparing large datasets for the training and testing of ML models
- Developing analytics tools to gain better insights
- Collaborating with data scientists in the deployment of ML models
The role of reproducibility in enabling efficient collaboration
In order for data scientists and data engineers to collaborate efficiently, they must be able to reproduce each other’s work easily. In software engineering, we’ve seen how GitHub has enabled developers to collaborate with each other by reproducing work locally and then contributing changes back. However, even with version control like Git, software engineers still struggled to collaborate because it remained difficult to reproduce each other’s environments.
Shortcomings in collaboration between and among data scientists and data engineers
Many organisations with big data, e.g. financial institutions, operate in a world where compliance plays an important role. It is imperative for institutions to have a clear path from business decisions back to ML models and the environments that produced them.
Collaboration in data science and data engineering remains an issue, and ML projects must be reproducible for collaboration to be efficient.
In a software engineering project, versioning the code and the environment has led to huge increases in efficiency. ML projects need to track more than that:
- Model training and test data
- Data engineering and model training code
- Model parameters and hyper-parameters
- Data engineering or model training environment
ML data source #1: Model training and test data
Even if working on the same ML problem and implementing identical code, two data scientists can come to widely differing results if there are any differences in the data sets used.
To prevent this, data scientists working on the same ML problem should use the same data and procedures as their colleagues.
This is where data engineers can play a vital role. Not only can data engineers help provide data scientists with initial data sets for exploration, they also help to design appropriate data infrastructure and optimise data pipelines for data ingestion in ML solutions.
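One simple way to check that two people really are working from the same data is to compare a fingerprint of the file. The sketch below is illustrative only: the function name `dataset_fingerprint` is hypothetical, and it assumes the data set lives in a single file on disk rather than in a database or object store.

```python
import hashlib
from pathlib import Path


def dataset_fingerprint(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks.

    Hypothetical helper: two files with identical bytes produce the same
    digest, so colleagues can compare digests instead of whole data sets.
    """
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read in chunks so large data sets do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the digests recorded alongside two training runs differ, the runs did not use byte-identical data, which is worth ruling out before comparing anything else.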
ML data source #2: Data engineering and model training code
ML development involves a lot of code at various stages of the model development lifecycle.
Efficient code collaboration can be achieved by sharing code via a version control system, such as Git, Subversion or Mercurial. To ensure consistency, it’s important that team members use the same code not only when running ML models but also in other phases like data pre-processing.
ML data source #3: Model parameters and hyper-parameters
Despite being part of the code, model parameters and hyper-parameters require additional attention. When training a model, data scientists can waste time looking through previous versions of someone else’s code. In order to maximise efficiency, data engineers can help data scientists gain visibility into which parameters were used in previous training runs so they can pick up from where a colleague left off.
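As a rough illustration of what that visibility could look like, the sketch below appends the parameters of each training run to a shared log file, one JSON record per line. The names `log_run` and `load_runs`, the file layout and the record fields are all assumptions made for this example, not a description of any particular tool.

```python
import json
import time
from pathlib import Path


def log_run(log_path: str, params: dict, note: str = "") -> dict:
    """Append one training run's parameters to a JSON-lines log file."""
    record = {"timestamp": time.time(), "params": params, "note": note}
    with Path(log_path).open("a", encoding="utf-8") as handle:
        # One JSON object per line keeps the log easy to append to and diff.
        handle.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def load_runs(log_path: str) -> list:
    """Read every recorded run back, oldest first."""
    with Path(log_path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

A colleague picking up the work could call `load_runs` and read the last record's `params` rather than searching through earlier versions of the training code.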
Monitoring ML model environments
One reason data scientists encounter difficulties in reproducing the results of other team members is differences in their computing environments. The results of ML models can differ because different versions of libraries are in use, even when the training code and data are otherwise identical.
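A minimal sketch of how such differences might be surfaced is shown below, using only the Python standard library. The function names `environment_snapshot` and `diff_packages` are hypothetical, and the approach assumes that both team members work in Python environments whose installed packages can be listed with `importlib.metadata`.

```python
import platform
from importlib import metadata


def environment_snapshot() -> dict:
    """Record the interpreter version, platform and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip distributions with missing metadata
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


def diff_packages(mine: dict, theirs: dict) -> dict:
    """Return {package: (my_version, their_version)} for every mismatch.

    A version of None means the package is missing on that side.
    """
    names = set(mine) | set(theirs)
    return {
        name: (mine.get(name), theirs.get(name))
        for name in sorted(names)
        if mine.get(name) != theirs.get(name)
    }
```

Saving the snapshot next to each training run, and running `diff_packages` on the `packages` entries of two snapshots, gives a quick list of library versions to check when results disagree.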
Enabling reproducibility with metric tracking
Model training is often a collaborative process managed by a team of data scientists. Teams may find that they come to significantly different results as different team members try varying approaches to solving the same problem. As a result, it’s vital for data scientists to be able to analyse which approaches show promise without needing to re-run each other’s work.
Unfortunately, many data science teams record their metrics using manual tracking tools, rather than by using more efficient tooling and processes. The volume of information is frequently too vast to record manually, and as a result, people often choose to record only the variables deemed most important. This, in turn, eliminates the ability for teams to fully reproduce each other’s work.
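To make the contrast with manual tracking concrete, here is a small sketch of recording metrics in code so that runs can be compared without re-running them. The `Run` class, the `best_run` helper and the field names are hypothetical, chosen for this example; a real team would more likely adopt a dedicated tracking tool.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One training attempt: who ran it, with which parameters, and the results."""
    author: str
    params: dict
    metrics: dict = field(default_factory=dict)

    def log_metric(self, name: str, value: float) -> None:
        # Store the latest value recorded for this metric name.
        self.metrics[name] = value


def best_run(runs, metric: str, higher_is_better: bool = True):
    """Return the run with the best value for `metric`, or None if none logged it."""
    scored = [run for run in runs if metric in run.metrics]
    if not scored:
        return None
    pick = max if higher_is_better else min
    return pick(scored, key=lambda run: run.metrics[metric])
```

Because every run carries its own parameters and metrics, a teammate can ask which approach looks most promising, for example with `best_run(runs, "accuracy")`, without repeating anyone's training.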
What is the cost of inefficient collaboration between data scientists and data engineers, and what can organisations gain by minimising friction?
Shortcomings in collaboration can lead to a wide array of costs for companies:
- Project delays and budget overruns
- Repetitive work
- Less trust in models and less reuse of ML results
- Deterioration of results of ML models
- Diminished trust of management in data science projects
Organisations that manage to improve collaboration between members of ML projects can benefit from reduced costs, time savings, better data-driven business decisions and improved trust in ML and data science solutions.
Luke Marsden, CEO and founder, Dotscience