
Rising to the challenges of remote data science


For data-driven organizations, remote work is the ultimate test of operational resilience, robustness and efficiency. Issues that appear in the office may be at least partially mitigated by informal discussions around the water cooler, followed by face-to-face problem-solving sessions. However, when data scientists are forced to work remotely, problems with data access, collaboration and infrastructure can cause major business disruption.

How can we solve the challenges of remote data science?

A data team set up to be efficient remotely opens new opportunities for productivity; however, many teams may find that they have hurdles to overcome first. Connecting to underlying data systems may be a challenge in a remote working environment, and access to various data sources as well as computational capabilities can also be problematic.

So, what can we do to make remote data science work?

Centralized storage

First and foremost, data teams need a central location to store their work. Many small teams work on AI projects in an ad-hoc fashion: team members store their work locally, have no reproducible processes or workflows, and figure things out along the way. But with more than a few team members and more than one project, or, given current events, with everyone remote, this quickly becomes unmanageable.

Remote working has truly underscored the need for centralized data science, machine learning and AI platforms that are designed to allow people across organizations to access data and work together in a central location. This also encourages good data governance and collaboration practices.

A single access point

One element essential to successful remote data science, yet absent in many situations, is a single access point designed so that data does not need to be moved for processing. This is important because teams should have instant access to the format and schema of their data, regardless of where it is stored: this could be an analytical MPP database, cloud databases, operational databases, NoSQL stores, Hadoop, cloud object storage, or a remote data source.

From a remote work standpoint, it means that the way people work with data stays consistent and secure, regardless of changes in underlying systems and staff.
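
As a rough illustration of the single access point idea, the sketch below routes every schema request through one catalog object, so callers never need to know which backend holds a dataset. All names here (DataCatalog, InMemoryBackend, the dataset names) are hypothetical and not taken from any particular platform; a real system would wrap actual database or object-store connectors rather than in-memory dictionaries.

```python
from dataclasses import dataclass, field
from typing import Dict, Protocol, Tuple


class Backend(Protocol):
    """Hypothetical interface: anything that can describe the tables it holds."""

    def schema(self, table: str) -> Dict[str, str]: ...


@dataclass
class InMemoryBackend:
    """Stand-in for a real connector (SQL database, object store, and so on)."""

    tables: Dict[str, Dict[str, str]]

    def schema(self, table: str) -> Dict[str, str]:
        return self.tables[table]


@dataclass
class DataCatalog:
    """Single access point: maps logical dataset names to (backend, table)."""

    routes: Dict[str, Tuple[Backend, str]] = field(default_factory=dict)

    def register(self, name: str, backend: Backend, table: str) -> None:
        self.routes[name] = (backend, table)

    def schema(self, name: str) -> Dict[str, str]:
        # Look up where the dataset lives, then ask that backend in place.
        backend, table = self.routes[name]
        return backend.schema(table)


# Two pretend storage systems holding different datasets.
warehouse = InMemoryBackend({"orders": {"id": "int", "total": "float"}})
object_store = InMemoryBackend({"clicks.parquet": {"user": "str", "ts": "timestamp"}})

catalog = DataCatalog()
catalog.register("orders", warehouse, "orders")
catalog.register("clicks", object_store, "clicks.parquet")
```

Repointing a dataset to a different backend then only needs a new register call; code that asks for catalog.schema("clicks") stays the same.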

Horizontal and vertical collaboration of data efforts

Due to the high level of operationalization we’re now seeing with data science, machine learning and AI platforms, data science is no longer just the realm of data scientists. Data projects are also not only about data - they also require strong involvement from business teams to build experience, generate buy-in, and validate relevance. They also require data engineering and other teams to help with the operationalization steps.

This means that these platforms must work for both technical and less technical users. They must underpin what is known as horizontal collaboration, in which people with roughly the same skills, toolsets and training work together. They must also support vertical collaboration across teams that might have vastly different responsibilities.

Remote platforms must have full sets of features for all types of users to encourage success of cross-team collaboration, as well as a work-from-home setup that is not disruptive to execution.

A true end-to-end platform

Many tools and platforms today say they are end-to-end, but they actually only handle one or two parts of the data process. Inevitably, businesses find they need to purchase other tools and cobble them together to make the data workflow seamless between them.

An end-to-end system takes care of everything data related: from a unified UI, to connecting to data, to ETL, model creation, operationalization, and model monitoring in production.

The benefits of remote-ready data science

What are the business benefits companies will see if they take the time to put this infrastructure in place?

1. Firstly, having remote-ready data science will open up doors to data talent that is not based where the company is based. According to the 2019 State of Remote Work Survey, 99 percent of respondents said they would like to work remotely, at least some of the time, for the rest of their career. Not offering the possibility of remote work for data professionals eliminates a lot of talent.

2. For many reasons, data teams hardly ever all sit in the same building, much less the same country. Enabling remote work not only helps those who are always fully remote - it also instills best practices that benefit distributed teams and unites data practices across companies.

3. Being remote-ready allows companies to be more agile, which is helpful in a landscape where unforeseen circumstances are the new normal. It means companies can adapt without interruption to data science and machine learning projects. This is especially important as businesses become more mature and more invested in their AI journey. When data projects underpin increasingly essential activities for a company, interruptions to those projects can cause serious business disruption. For example, when retailers have machine learning-based pricing models in production, these models need to be evaluated and maintained continuously. Any interruption to this process can result in a demonstrable loss of business.
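
The continuous evaluation mentioned for pricing models could be as simple as a scheduled check that compares recent prediction error against a baseline. The following is a minimal sketch, assuming mean absolute error as the metric and a hypothetical tolerance of 20 percent; neither choice comes from this article, and the function names are illustrative only.

```python
from statistics import mean
from typing import Sequence


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute gap between observed and predicted values."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def needs_attention(baseline_mae: float, recent_mae: float, tolerance: float = 0.2) -> bool:
    """Flag the model when recent error exceeds the baseline by more than `tolerance`."""
    return recent_mae > baseline_mae * (1 + tolerance)
```

A check like this can run from any shared scheduler, so it keeps working whether or not the team is in the office.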

In looking at unprecedented events that threaten business continuity and collaboration, the lesson to be learned surrounding remote data science is clear: organizations need to do what they can to encourage collaboration within teams, backing it with strong methodology and the right technology. Data projects are rarely one-person jobs, and businesses need to act now in order to avoid siloed execution of projects that could stem from lack of physical proximity.

For many reasons, which range from cultural to regulatory, most data teams rarely sit together in the same building, much less the same country. Enabling remote work not only helps those who are always fully remote; it also instills best practices that can benefit distributed teams and unite data practices across an entire organization. This is set to become more crucial as businesses become more mature in their AI journey, with data projects underpinning business growth and strategy.

Florian Douetteau, CEO, Dataiku