The code for this tutorial can be found here
This introductory tutorial walks you through how to track different versions of your datasets using DVC, and how to track experiments (hyperparameters, results, datasets used, etc.) using MLflow. This walkthrough is designed to give a quick idea of how to use both tools together. However, since DVC and MLflow overlap in some of their features, you can also use just one tool or the other to achieve similar results.
This is based on the YouTube video Data Versioning and Reproducible ML with DVC and MLflow.
Please refer to the DVC and MLflow official tutorials for more info.
This tutorial was tested with Python 3.9.2.
Create virtual environment:
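A minimal sketch using Python’s built-in venv module; the environment name venv is just a placeholder:
python3 -m venv venv
source venv/bin/activate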
Install requirements:
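# Assuming the project provides a requirements.txt (the filename is an assumption)
pip install -r requirements.txt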
This guide assumes you are already inside a git repo. If not, please initialize a git repo by doing git init or some other method.
Run:
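# Initialize DVC inside the git repo
dvc init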
This initializes dvc and also adds some of the newly created files to git staging.
If you run git status, it’ll show something like:
Changes to be committed:
(use "git restore --staged <file>..." to unstage)
new file: .dvc/.gitignore
new file: .dvc/config
new file: .dvc/plots/confusion.json
new file: .dvc/plots/confusion_normalized.json
new file: .dvc/plots/default.json
new file: .dvc/plots/linear.json
new file: .dvc/plots/scatter.json
new file: .dvc/plots/smooth.json
new file: .dvcignore
Reference: dvc add
The remote storage in dvc can be s3, gs, gdrive, etc. For this example we’ll use a local folder ~/tmp/dvc-storage for simplicity.
# We'll add a location inside a local ~/tmp folder for testing
dvc remote add -d dvc-remote ~/tmp/dvc-storage
# You can also add GCP or other remote storage as the remote.
# Please refer to https://dvc.org/doc/command-reference/remote/add
This adds the following to the .dvc/config file
If you would like to add a remote storage like GCP, you can do so by:
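# A sketch for a Google Cloud Storage remote; the bucket name and path
# below are placeholders, replace them with your own.
dvc remote add -d dvc-remote-gs gs://mybucket/dvc-storage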
Let’s now copy some data to a local folder data/.
Copy an example file: for this we’ll copy the wine-quality.csv example file from mlflow into the data/ folder.
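A minimal sketch of this step, assuming you have a local clone of the mlflow repository (the exact path of the example file may differ between mlflow versions):
mkdir -p data
cp /path/to/mlflow/examples/sklearn_elasticnet_wine/wine-quality.csv data/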
Add the file to dvc:
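# Start tracking the file with DVC
dvc add data/wine-quality.csv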
DVC creates a wine-quality.csv.dvc file and also adds a .gitignore file inside the data/ folder.
Let’s also add a git tag to make it easier to track the data versions through git:
# Add and commit:
git add data/wine-quality.csv.dvc data/.gitignore
git commit -m "data: track"
git tag -a 'v1' -m 'raw data'
Now let’s sync our local data with our remote:
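# Upload the DVC-tracked data to the default remote
dvc push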
Let’s now see what’s inside our remote
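One way is a recursive directory listing of the local remote we configured above:
ls -lR ~/tmp/dvc-storage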
This gives:
/home/user/tmp/dvc-storage/:
total 4
drwxrwxr-x 2 user user 4096 Apr 13 14:47 5d
/home/user/tmp/dvc-storage/5d:
total 260
-r--r--r-- 1 user user 264426 Apr 13 14:47 6f24258e3c50bb01a61194b5401f5d
Now, we can remove the local data if required:
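# A sketch: remove the working copy of the data file and, optionally,
# the local DVC cache, so the data only lives in the remote.
rm -f data/wine-quality.csv
rm -rf .dvc/cache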
Since we deleted our data in the section above, we can bring it back from the remote using dvc pull:
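# Download the data described by the .dvc files from the remote
dvc pull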
# Let's do a simple modification to our csv file
sed -i '2,1001d' data/wine-quality.csv
# let's check dvc status
dvc status
The dvc status command shows that our file was changed:
Now let’s add the new data file to dvc:
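dvc add data/wine-quality.csv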
Now let’s do a git commit:
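# The commit message here is just an example
git add data/wine-quality.csv.dvc
git commit -m "data: remove 1000 rows"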
Let’s also add a git tag:
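# The tag name 'v2' matches the data version used later in this tutorial;
# the tag message is just an example.
git tag -a 'v2' -m 'removed 1000 rows'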
Let’s also push our data to remote storage:
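dvc push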
Also remember to push your tag to the remote repo by doing git push --tags.
import dvc.api
import pandas as pd

data_url = dvc.api.get_url(
    path='data/wine-quality.csv',
    repo='.',
    # You can use different values for rev here.
    # This can be any revision such as a branch or tag name, or a commit hash.
    # ref: https://dvc.org/doc/api-reference/get_url
    rev='v2',
)

# Then you can use the data in your favourite tool, e.g. pandas:
data = pd.read_csv(data_url)
Using git checkout combined with dvc checkout is one way to do this. Please refer to the dvc checkout documentation for more info.
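For example, to go back to the first version of the data, one possible workflow (a sketch using the tags created above) is:
# Restore the .dvc file as it was at tag v1, then sync the data to match it
git checkout v1 -- data/wine-quality.csv.dvc
dvc checkout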
For this, we’ll use this file from the mlflow examples.
This file has been downloaded as train.py (the exact version of the file used can be accessed through this link).
This guide shows one of the many ways you can use MLflow and DVC together. However, please note that the two tools overlap in some of their features, which lets you use them in many other ways, either together or independently. Please follow the docs for MLflow and DVC for more info.
MLflow lets you track hyperparameters and performance metrics through a helpful Python package. Values of Python variables can be tracked across different runs by using the log_param function:
import mlflow

# Log the data source, data version, and dataset shape for this run
mlflow.log_param('data_url', data_url)
mlflow.log_param('data_version', VERSION)
mlflow.log_param('input_rows', data.shape[0])
mlflow.log_param('input_cols', data.shape[1])
MLflow runs can be recorded using a few different methods [ref]. For this example we’ll use the local file method. This is the easiest way to get started, but if we want to track and share our experiments with colleagues, using one of the other supported storage methods for experiment data is recommended.
Now let’s try two different experiments with two different versions of data:
open train.py and find the following constants (lines 25 to 28):
# Constants for dvc
PATH = 'data/wine-quality.csv'
REPO = '.' # Path to the Git repo
VERSION = 'v2' # This is the git tag corresponding to the data version
The VERSION constant here defines the git tag associated with the data version we want to use. Run train.py (python train.py) and check the results, then change the constant to VERSION = 'v1' and run train.py again. You can see from the results that a different data version has been used.
You can now open the MLflow UI by running:
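mlflow ui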
And visit http://127.0.0.1:5000/ to view experiments.
This section shows how to create the development environment from scratch.
Create virtual environment:
Install dependencies:
pip install mlflow
# dvc has many installation options such as [all], [s3], [gdrive], etc.
# Please refer to: https://dvc.org/doc/install/linux
pip install dvc[gs]
# Dependencies for the MLFlow, DVC combined tutorial
pip install scikit-learn
Freeze dependencies:
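# Write the exact package versions to requirements.txt
# (the filename is an assumption, matching the install step earlier in this guide)
pip freeze > requirements.txt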
Please refer to this guide to install and set up gsutil, and make sure you are logged in.
Please refer to the relevant GCP section in this guide to see the most up-to-date information about setting up GCP credentials.