Most of a data scientist's time is spent on tasks other than data science, such as building or setting up ML infrastructure, handling DevOps work, and resolving resource conflicts and dependencies. Ideally, that time would be spent on building algorithms, research, experiments, and monitoring and iterating on models in production. The MLOps discipline was developed as a form of DevOps for machine learning, to reduce the technical complexity and technical debt associated with machine learning infrastructure.
When dealing with datasets for machine learning, data scientists often waste hours pulling datasets from remote storage, then repeat the same pull for different runs of the models (possibly by different team members). It’s not uncommon for large datasets to feed machine learning models, yet those datasets may live far from the compute used for training, such as in the public cloud or in a centralized data lake. Worse, data scientists often pay for each download, driving infrastructure costs steadily upward.
Dataset caching for machine learning
For machine learning, the challenge is that multiple training experiments must run in parallel in different compute environments, each requiring a download of the dataset from the data lake, which is an expensive and time-consuming process. Proximity of the dataset to the compute (especially in a hybrid cloud) is not guaranteed. In addition, other team members who run their own experiments and need the same dataset go through the same lengthy process. Beyond slow data access, challenges include difficulty tracking dataset versions, dataset sharing, collaboration, and reproducibility.
But there is a way to cache the needed datasets (with their associated versions) and ensure they are placed on the storage node attached to the GPU or CPU cluster performing the training. Once cached, the datasets can be used many times by different team members. In this post, I’ll present a solution that allows data science practitioners to build a data hub and create a cache of their datasets in proximity to the compute, wherever it is located. The result is not only high-performance model training, but additional benefits such as collaboration among multiple AI practitioners and the ability to create a dataset version hub.
How does dataset caching for machine learning work?
Dataset caching works by downloading the chosen commit (version) of your dataset onto an NFS drive in your cluster. The cache then maps and mounts local folders on your remote compute, making the data available immediately within the ML workload.
To enable this behavior, create an NFS node within your Kubernetes cluster, using the DevOps tools provided by your cloud or on-premises toolkit. Then, download a copy of your data (from a SQL database, data warehouse, S3, etc.) onto that NFS drive. When you want to use that data in a workload, mount the NFS drive (IP and folder path) as local folders within the job. You can improve efficiency further with ready-built tools for updating the NFS drive (downloading data, updating dataset versions, clearing unused data, etc.) and for easily mounting the local folders.
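The steps above can be sketched in a few lines of Python. This is a minimal, hypothetical sketch, not part of any particular tool: `ensure_cached`, the cache layout, and the injected `download_fn` are all assumptions, and `root` stands for the NFS drive's mount point inside the job.

```python
import os

def cache_path(root: str, dataset: str, commit: str) -> str:
    """Cache layout: one folder per dataset per version (commit)."""
    return os.path.join(root, dataset, commit)

def ensure_cached(root: str, dataset: str, commit: str, download_fn) -> str:
    """Download a dataset version onto the NFS drive only on a cache
    miss; return the local path that jobs mount and read from."""
    path = cache_path(root, dataset, commit)
    if not os.path.isdir(path):                       # cache miss: pull once
        tmp = path + ".partial"
        download_fn(dataset, commit, tmp)             # e.g. pull from S3 / data lake
        os.makedirs(os.path.dirname(path), exist_ok=True)
        os.replace(tmp, path)                         # atomic rename: no half-written cache
    return path                                       # cache hit: ready immediately
```

Keying the cache folder by dataset name and commit is what makes reuse and version tracking work: two jobs asking for the same commit resolve to the same cached folder, while a new commit triggers exactly one fresh download.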
Benefits of caching datasets
Dataset caching allows datasets to be ready to use in seconds rather than hours. Whenever you start a job, the dataset is available immediately and does not need to be copied onto your remote system. This matters because copy time can be a significant bottleneck when running experiments and performing data processing or model training. With a dataset that is terabytes in size feeding a grid search of 100 experiments, the copy time is multiplied by 100, wasting compute and valuable time that could be spent running other jobs or doing other data science work.
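A quick back-of-the-envelope calculation shows the scale of the problem; the dataset size and transfer rate below are illustrative assumptions, not measured numbers.

```python
# Illustrative assumptions: a 2 TB dataset and ~100 MB/s effective transfer rate.
dataset_mb = 2_000_000     # 2 TB expressed in MB
transfer_mb_s = 100        # effective download throughput
experiments = 100          # grid-search runs that need the same data

copy_hours_once = dataset_mb / transfer_mb_s / 3600
hours_without_cache = copy_hours_once * experiments  # every run re-downloads
hours_with_cache = copy_hours_once                   # pull once, reuse 100x

print(f"one copy:            {copy_hours_once:.1f} h")
print(f"100 runs, no cache:  {hours_without_cache:.0f} h")
print(f"100 runs, cached:    {hours_with_cache:.1f} h")
```

At these assumed rates a single copy already takes over five hours, so re-copying for every experiment turns a one-time cost into weeks of wasted wall-clock and compute time.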
Improved sharing and collaboration
If set up properly, cached datasets can be authorized for and used by multiple teams on the same compute cluster connected to the cached data. That way, all collaborating engineers can use the same dataset with the attached compute cluster (with the freedom to pick any node) without additional downloads from the data lake. Which brings us to the next benefit...
Reduced costs
Since caching enables a more collaborative process, teams working on the same datasets can reduce cost by running fewer dataset downloads. Models pull their datasets from the cache rather than from remote storage, which may charge per download; repeated downloads of the same datasets can rack up significant costs over time. With dataset caching, you only need to pull the data once, drastically reducing the number of downloads for a machine learning job.
If done right, dataset caching can drastically accelerate machine learning workflows and increase productivity and efficiency for your data science team. Together with MLOps automations, it can also increase collaboration between your ML engineers and data scientists and reduce costs. A recent technical report from NetApp proposes a possible infrastructure for building dataset caching for machine learning. The benefits you see will depend on the tools you use to build your caching mechanism, but whether you choose a simple NFS setup with the associated DevOps work or a more advanced automation tool, dataset caching will improve your machine learning workflow, your MLOps practice, and your team's productivity.
Leah Kolben, CTO and Co-founder, cnvrg.io