Data drift in Azure Machine Learning

Study notes

You train a machine learning model on a historical dataset that is representative at a point in time. Over time, trends may change the profile of the data, making your model less accurate.
Data drift monitoring helps you keep your model's predictions valid over time.

Azure Machine Learning supports data drift monitoring through the use of datasets. You can capture new feature data in a dataset and compare it to the dataset with which the model was trained.

To monitor data drift using registered datasets, you need to register two datasets:
  • A baseline dataset - usually the original training data.
  • A target dataset that will be compared to the baseline based on time intervals. This dataset requires a column for each feature you want to compare, and a timestamp column so the rate of data drift can be measured.
After creating these datasets, you can define a dataset monitor to detect data drift and trigger alerts if the rate of drift exceeds a specified threshold.

from azureml.datadrift import DataDriftDetector

# ws is the workspace; train_ds and new_data_ds are the registered
# baseline and target datasets, passed positionally.
monitor = DataDriftDetector.create_from_datasets(ws, 'dataset-drift-detector',
                                                 train_ds, new_data_ds,
                                                 compute_target='aml-cluster',
                                                 frequency='Week',
                                                 feature_list=['age', 'height', 'bmi'],
                                                 latency=24)  # hours to wait for new data to arrive

After creating the dataset monitor, you can backfill to immediately compare the baseline dataset to existing data in the target dataset.

import datetime as dt

backfill = monitor.backfill(dt.datetime.now() - dt.timedelta(weeks=6), dt.datetime.now())
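
To picture what a six-week backfill with a weekly frequency covers, the sketch below splits a date range into consecutive seven-day windows. It uses only the standard library and does not call Azure. `weekly_windows` is a hypothetical helper written for illustration, and how the service actually slices the range internally is an assumption here.

```python
import datetime as dt


def weekly_windows(start, end):
    """Yield (window_start, window_end) pairs covering [start, end) in 7-day steps."""
    step = dt.timedelta(weeks=1)
    current = start
    while current < end:
        # Clip the last window so it never runs past the end of the range.
        window_end = min(current + step, end)
        yield current, window_end
        current = window_end


# Fixed dates (illustrative only) so the result is reproducible.
end = dt.datetime(2024, 3, 11)
start = end - dt.timedelta(weeks=6)
windows = list(weekly_windows(start, end))

for window_start, window_end in windows:
    print(window_start.date(), "->", window_end.date())
```

Each window would be compared to the baseline separately, which is why the target dataset needs a timestamp column.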

Scheduling alerts
Data drift monitoring works by running a comparison at a scheduled frequency and calculating data drift metrics for the features in the dataset that you want to monitor. You can define a schedule to run every Day, Week, or Month.

Data drift is measured using a calculated magnitude of change in the statistical distribution of feature values over time.
You can define a threshold for data drift magnitude above which you want to be notified, and configure alert notifications by email.

from azureml.datadrift import AlertConfiguration, DataDriftDetector

# Email addresses to notify when drift exceeds the threshold
alert_email = AlertConfiguration(['data_scientists@contoso.com'])
monitor = DataDriftDetector.create_from_datasets(ws, 'dataset-drift-detector',
                                                 baseline_data_set, target_data_set,
                                                 compute_target=cpu_cluster,
                                                 frequency='Week', latency=2,
                                                 drift_threshold=.3,
                                                 alert_config=alert_email)
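
The idea of a "magnitude of change in the statistical distribution" compared against a threshold can be illustrated without Azure. The sketch below computes the one-dimensional Wasserstein distance between baseline and target values for each feature and flags features above a threshold. It is a simplified stand-in, not the computation the service performs: the function names, the sample values, and the threshold (expressed in feature units rather than on the scale of `drift_threshold`) are all assumptions made for illustration.

```python
def wasserstein_1d(baseline, target):
    """Empirical 1-D Wasserstein distance between two equal-sized samples.

    For equal-sized samples this is the mean absolute difference
    between the sorted values.
    """
    if len(baseline) != len(target):
        raise ValueError("this sketch only handles equal-sized samples")
    pairs = zip(sorted(baseline), sorted(target))
    return sum(abs(b - t) for b, t in pairs) / len(baseline)


def drifted_features(baseline, target, threshold):
    """Return {feature: distance} for features whose distance exceeds threshold."""
    distances = {name: wasserstein_1d(baseline[name], target[name]) for name in baseline}
    return {name: d for name, d in distances.items() if d > threshold}


# Made-up sample values: 'age' shifts by ten years, 'bmi' barely changes.
baseline = {"age": [25, 32, 41, 50], "bmi": [21.0, 23.5, 26.0, 30.0]}
target = {"age": [35, 42, 51, 60], "bmi": [21.0, 23.5, 26.0, 30.5]}

alerts = drifted_features(baseline, target, threshold=1.0)
print(alerts)  # {'age': 10.0}
```

In a real monitor the alert step would be an email sent through the alert configuration rather than a printed dictionary.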

References:
Monitor data drift with Azure Machine Learning - Training | Microsoft Learn