Exam DP-100: Designing and Implementing a Data Science Solution on Azure
Great for hands-on practice (labs):
Azure Machine Learning Exercises
and
Azure Databricks Machine Learning Exercises
These are more than just exercises: use them as a starting point to build your own library, your own cases, your own area of expertise.
New to Azure? Create a trial account.
Otherwise, use pay-as-you-go (PAYG) or buy a Visual Studio subscription, which includes $50 or more of monthly Azure credit.
1. Create machine learning models
1. You have a NumPy array with the shape (2,20). What does this tell you about the elements in the array?
- The array is two dimensional, consisting of two arrays each with 20 elements
- The array contains 2 elements, with the values 2 and 20
- The array contains 20 elements, all with the value 2
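For context, a minimal NumPy sketch of what a (2,20) shape means:

```python
import numpy as np

arr = np.arange(40).reshape(2, 20)  # 2 rows of 20 elements each
print(arr.shape)  # (2, 20)
print(arr.ndim)   # 2 -> the array is two dimensional
```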
2. You have a Pandas DataFrame named df_sales containing daily sales data. The DataFrame contains the following columns: year, month, day_of_month, sales_total. You want to find the average sales_total value. Which code should you use?
- df_sales['sales_total'].avg()
- df_sales['sales_total'].mean()
- mean(df_sales['sales_total'])
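A quick sketch with hypothetical sales values:

```python
import pandas as pd

df_sales = pd.DataFrame({'year': [2024, 2024], 'month': [1, 1],
                         'day_of_month': [1, 2], 'sales_total': [100.0, 150.0]})
print(df_sales['sales_total'].mean())  # 125.0 -- a Series has no avg() method
```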
3. You have a DataFrame containing data about daily ice cream sales. You use the corr method to compare the avg_temp and units_sold columns, and get a result of 0.97. What does this result indicate?
- On the day with the maximum units_sold value, the avg_temp value was 0.97
- Days with high avg_temp values tend to coincide with days that have high units_sold values
- The units_sold value is, on average, 97% of the avg_temp value
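A sketch with made-up ice cream data, showing a correlation close to 1:

```python
import pandas as pd

df = pd.DataFrame({'avg_temp': [20, 25, 30, 35],
                   'units_sold': [110, 150, 210, 260]})
# Pearson correlation near 1.0: hot days tend to coincide with high sales
print(df['avg_temp'].corr(df['units_sold']))
```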
1. You are using scikit-learn to train a regression model from a dataset of sales data. You want to be able to evaluate the model to ensure it will predict accurately with new data. What should you do?
- Use all of the data to train the model. Then use all of the data to evaluate it
- Train the model using only the feature columns, and then evaluate it using only the label column
- Split the data randomly into two subsets. Use one subset to train the model, and the other to evaluate it
2. You have created a model object using the scikit-learn LinearRegression class. What should you do to train the model?
- Call the predict() method of the model object, specifying the training feature and label arrays
- Call the fit() method of the model object, specifying the training feature and label arrays
- Call the score() method of the model object, specifying the training feature and test feature arrays
3. You train a regression model using scikit-learn. When you evaluate it with test data, you determine that the model achieves an R-squared metric of 0.95. What does this metric tell you about the model?
- The model explains most of the variance between predicted and actual values.
- The model is 95% accurate
- On average, predictions are 0.95 higher than actual values
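The three questions above fit together in one workflow; a minimal scikit-learn sketch with synthetic data (all names and values are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = np.random.rand(100, 3)                                       # feature columns
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * np.random.rand(100)   # label column

# split randomly: one subset to train, the other to evaluate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

model = LinearRegression().fit(X_train, y_train)   # fit() trains the model
predictions = model.predict(X_test)
print(r2_score(y_test, predictions))               # near 1.0 = most variance explained
```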
1. You plan to use scikit-learn to train a model that predicts credit default risk. The model must predict a value of 0 for loan applications that should be automatically approved, and 1 for applications where there is a risk of default that requires human consideration. What kind of model is required?
- A binary classification model
- A multi-class classification model
- A linear regression model
2. You have trained a classification model using the scikit-learn LogisticRegression class. You want to use the model to return labels for new data in the array x_new. Which code should you use?
- model.predict(x_new)
- model.fit(x_new)
- model.score(x_new, y_new)
3. You train a binary classification model using scikit-learn. When you evaluate it with test data, you determine that the model achieves an overall Recall metric of 0.81. What does this metric indicate?
- The model correctly predicted 81% of the test cases
- 81% of the cases predicted as positive by the model were actually positive
- The model correctly identified 81% of positive cases as positive
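A matching scikit-learn sketch (synthetic data) covering predict() and recall:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, random_state=0)
X_train, x_new, y_train, y_new = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
predictions = model.predict(x_new)        # labels for new data
print(recall_score(y_new, predictions))   # share of actual positives identified
```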
1. K-Means clustering is an example of which kind of machine learning?
- Supervised machine learning
- Unsupervised machine learning
- Reinforcement learning
2. You are using scikit-learn to train a K-Means clustering model that groups observations into three clusters. How should you create the KMeans object to accomplish this goal?
- model = KMeans(n_clusters=3)
- model = KMeans(n_init=3)
- model = KMeans(max_iter=3)
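A minimal KMeans sketch:

```python
import numpy as np
from sklearn.cluster import KMeans

features = np.random.rand(100, 2)          # unlabeled observations
model = KMeans(n_clusters=3, n_init=10)    # n_clusters sets the number of clusters
clusters = model.fit_predict(features)     # each row is assigned 0, 1, or 2
```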
1. You are creating a deep neural network to train a classification model that predicts to which of three classes an observation belongs based on 10 numeric features. Which of the following statements is true of the network architecture?
- The input layer should contain three nodes
- The network should contain three hidden layers
- The output layer should contain three nodes
2. You are training a deep neural network. You configure the training process to use 50 epochs. What effect does this configuration have?
- The entire training dataset is passed through the network 50 times
- The training data is split into 50 subsets, and each subset is passed through the network
- The first 50 rows of data are used to train the model, and the remaining rows are used to validate it
3. You are creating a deep neural network. You increase the Learning Rate parameter. What effect does this setting have?
- More records are included in each batch passed through the network
- Larger adjustments are made to weight values during backpropagation
- More hidden layers are added to the network
4. You are creating a convolutional neural network. You want to reduce the size of the feature maps that are generated by a convolutional layer. What should you do?
- Reduce the size of the filter kernel used in the convolutional layer
- Increase the number of filters in the convolutional layer
- Add a pooling layer after the convolutional layer
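A Keras sketch touching several of the answers above: a pooling layer to shrink feature maps, three output nodes for three classes, a learning rate controlling the weight-update step, and epochs as full passes over the training data (all layer sizes are illustrative):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(128, 128, 3)),
    tf.keras.layers.MaxPooling2D((2, 2)),            # halves each feature-map dimension
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(3, activation='softmax'),  # one output node per class
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='sparse_categorical_crossentropy')
# model.fit(x, y, epochs=50) would pass the full training set through 50 times
```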
2. Microsoft Azure AI Fundamentals: Explore visual tools for machine learning
1. An automobile dealership wants to use historic car sales data to train a machine learning model. The model should predict the price of a pre-owned car based on its make, model, engine size, and mileage. What kind of machine learning model should the dealership use automated machine learning to create?
- Classification
- Regression
- Time series forecasting
2. A bank wants to use historic loan repayment records to categorize loan applications as low-risk or high-risk based on characteristics like the loan amount, the income of the borrower, and the loan period. What kind of machine learning model should the bank use automated machine learning to create?
- Classification
- Regression
- Time series forecasting
3. You want to use automated machine learning to train a regression model with the best possible R2 score. How should you configure the automated machine learning experiment?
- Set the Primary metric to R2 score
- Block all algorithms other than GradientBoosting
- Enable featurization
1. In Azure Machine Learning studio, what can you use to author regression machine learning pipelines using a drag-and-drop interface?
- Notebooks
- Automated machine learning
- Designer
2. You are creating a training pipeline for a regression model. You use a dataset that has multiple numeric columns in which the values are on different scales. You want to transform the numeric columns so that the values are all on a similar scale. You also want the transformation to scale relative to the minimum and maximum values in each column. Which module should you add to the pipeline?
- Select Columns in a Dataset
- Clean Missing Data
- Normalize Data
3. Why do you split data into training and validation sets?
- Data is split into two sets in order to create two models, one model with the training set and a different model with the validation set.
- Splitting data into two sets enables you to compare the labels that the model predicts with the actual known labels in the original dataset.
- Only split data when you use the Azure Machine Learning Designer, not in other machine learning scenarios.
1. You're using Azure Machine Learning designer to create a training pipeline for a binary classification model. You've added a dataset containing features and labels, a Two-Class Decision Forest module, and a Train Model module. You plan to use Score Model and Evaluate Model modules to test the trained model with a subset of the dataset that wasn't used for training. Which other module should you add?
- Join Data
- Split Data
- Select Columns in Dataset
2. You use an Azure Machine Learning designer pipeline to train and test a binary classification model. You review the model's performance metrics in an Evaluate Model module, and note that it has an AUC score of 0.3. What can you conclude about the model?
- The model can explain 30% of the variance between true and predicted labels.
- The model predicts accurately for 70% of test cases.
- The model performs worse than random guessing.
3. You use Azure Machine Learning designer to create a training pipeline for a classification model. What must you do before deploying the model as a service?
- Create an inference pipeline from the training pipeline
- Add an Evaluate Model module to the training pipeline
- Clone the training pipeline with a different name
1. You are using an Azure Machine Learning designer pipeline to train and test a K-Means clustering model. You want your model to assign items to one of three clusters. Which configuration property of the K-Means Clustering module should you set to accomplish this?
- Set Number of Centroids to 3
- Set Random number seed to 3
- Set Iterations to 3
2. You use Azure Machine Learning designer to create a training pipeline for a clustering model. Now you want to use the model in an inference pipeline. Which module should you use to infer cluster predictions from the model?
- Score Model
- Assign Data to Clusters
- Train Clustering Model
3. Build and operate machine learning solutions with Azure Machine Learning
1. Which of the following descriptions accurately describes Azure Machine Learning?
- A Python library that you can use as an alternative to common machine learning frameworks like Scikit-Learn, PyTorch, and Tensorflow.
- A cloud-based platform for operating machine learning solutions at scale.
- An application for Microsoft Windows that enables you to create machine learning models by using a drag and drop interface.
2. You are using the Azure Machine Learning Python SDK to write code for an experiment. You must log metrics from each run of the experiment, and be able to retrieve them easily from each run. What should you do?
- Add print statements to the experiment code to print the metrics.
- Save the experiment data in the outputs folder.
- Use the log* methods of the Run class to record named metrics.
1. You want to use a script-based experiment to train a PyTorch model, setting the batch size and learning rate hyperparameters to specified values each time the experiment runs. What should you do?
- Create multiple script files – one for each batch size and learning rate combination you want to use.
- Set the batch_size and learning_rate properties of the ScriptRunConfig before running the experiment.
- Add arguments for batch size and learning rate to the script, and set them in the arguments property of the ScriptRunConfig.
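A sketch of passing hyperparameters as script arguments (folder, script, and cluster names are illustrative; env and ws are assumed to exist):

```python
from azureml.core import Experiment, ScriptRunConfig

script_config = ScriptRunConfig(source_directory='training_folder',
                                script='train.py',
                                arguments=['--batch-size', 32,
                                           '--learning-rate', 0.01],
                                environment=env,
                                compute_target='aml-cluster')
run = Experiment(workspace=ws, name='train-pytorch').submit(script_config)
# inside train.py, read the values with argparse
```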
2. You have run an experiment to train a model. You want the model to be stored in the workspace, and available to other experiments and published services. What should you do?
- Register the model in the workspace.
- Save the model as a file in a Compute Instance.
- Save the experiment script as a notebook.
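A sketch of registering the trained model from a completed run (names are illustrative):

```python
# register the model so other experiments and services can use it
run.register_model(model_name='sales_model',
                   model_path='outputs/model.pkl',  # path within the run outputs
                   description='Trained regression model')
```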
1. You have a reference to a Workspace named ws. Which code retrieves the default datastore for the workspace?
- default_ds = Datastore.get(ws, 'default')
- default_ds = ws.Datastores[0]
- default_ds = ws.get_default_datastore()
2. A datastore contains a CSV file of structured data that you want to use as a Pandas dataframe. Which kind of object should you create to make it easy to do this?
- A datastore.
- A tabular dataset.
- A file dataset.
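A sketch combining both answers (the CSV path is hypothetical; ws is an existing Workspace):

```python
from azureml.core import Dataset

default_ds = ws.get_default_datastore()

# a tabular dataset makes structured data easy to load as a dataframe
tab_ds = Dataset.Tabular.from_delimited_files(path=(default_ds, 'data/sales.csv'))
df = tab_ds.to_pandas_dataframe()
```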
3. You want a script to stream data directly from a file dataset. Which mode should you use?
- as_mount()
- as_download()
- as_upload()
1. You're using the Azure Machine Learning Python SDK to run experiments. You need to create an environment from a Conda configuration (.yml) file. Which method of the Environment class should you use?
- create
- from_conda_specification
- from_existing_conda_environment
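The method in use (environment name and file path are assumed):

```python
from azureml.core import Environment

env = Environment.from_conda_specification(name='training_env',
                                           file_path='./conda.yml')
```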
2. You must create a compute target for training experiments that require a graphical processing unit (GPU). You want to be able to scale the compute so that multiple nodes are started automatically as required. Which kind of compute target should you create?
- Compute Instance
- Compute Cluster
- Inference Cluster
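A sketch of provisioning an auto-scaling GPU cluster (VM size, node counts, and name are illustrative):

```python
from azureml.core.compute import AmlCompute, ComputeTarget

compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_NC6',
                                                       min_nodes=0,  # scales down to zero
                                                       max_nodes=4)  # scales out on demand
gpu_cluster = ComputeTarget.create(ws, 'gpu-cluster', compute_config)
gpu_cluster.wait_for_completion(show_output=True)
```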
1. You're creating a pipeline that includes two steps. Step 1 preprocesses some data, and step 2 uses the preprocessed data to train a model. What type of object should you use to pass data from step 1 to step 2 and create a dependency between these steps?
- Datastore
- OutputFileDatasetConfig
- Dataset
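A sketch of passing data between steps (script and folder names are hypothetical; run configuration details omitted for brevity):

```python
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep

prepped = OutputFileDatasetConfig('prepped_data')

step1 = PythonScriptStep(name='prepare data', source_directory='scripts',
                         script_name='prep.py',
                         arguments=['--out-folder', prepped],  # step 1 writes here
                         compute_target='aml-cluster')
step2 = PythonScriptStep(name='train model', source_directory='scripts',
                         script_name='train.py',
                         arguments=['--in-folder', prepped.as_input()],  # step 2 reads it
                         compute_target='aml-cluster')
```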
2. You've published a pipeline that you want to run every week. You plan to use the Schedule.create method to create the schedule. What kind of object must you create first to configure how frequently the pipeline runs?
- Datastore
- PipelineParameter
- ScheduleRecurrence
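A sketch of a weekly schedule (names are assumed; published_pipeline is an existing PublishedPipeline):

```python
from azureml.pipeline.core import Schedule, ScheduleRecurrence

recurrence = ScheduleRecurrence(frequency='Week', interval=1)  # run every week
schedule = Schedule.create(ws, name='weekly-training',
                           pipeline_id=published_pipeline.id,
                           experiment_name='training-pipeline',
                           recurrence=recurrence)
```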
1. You've trained a model using the Python SDK for Azure Machine Learning. You want to deploy the model as a containerized real-time service with high scalability and security. What kind of compute should you create to host the service?
- An Azure Kubernetes Services (AKS) inferencing cluster
- A compute instance with GPUs
- A training cluster with multiple nodes.
2. You're deploying a model as a real-time inferencing service. What functions must the entry script for the service include?
- main() and score()
- base() and train()
- init() and run()
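The typical shape of an entry (scoring) script, with the model name assumed:

```python
import json
import joblib
import numpy as np
from azureml.core.model import Model

def init():
    # runs once when the service starts: load the registered model
    global model
    model = joblib.load(Model.get_model_path('sales_model'))

def run(raw_data):
    # runs for every scoring request
    data = np.array(json.loads(raw_data)['data'])
    return json.dumps(model.predict(data).tolist())
```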
1. You are creating a batch inferencing pipeline that you want to use to predict new values for a large volume of data files. You want the pipeline to run the scoring script on multiple nodes and collate the results. What kind of step should you include in the pipeline?
- PythonScriptStep
- ParallelRunStep
- AdlaStep
2. You have configured the step in your batch inferencing pipeline with an output_action="append_row" property. In which file should you look for the batch inferencing results?
- output.txt
- stdoutlogs.txt
- parallel_run_step.txt
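A sketch of both answers together (script, dataset, environment, and cluster variables are assumed):

```python
from azureml.pipeline.steps import ParallelRunConfig, ParallelRunStep

parallel_run_config = ParallelRunConfig(
    source_directory='batch_scripts',
    entry_script='batch_scoring.py',
    mini_batch_size='5',
    error_threshold=10,
    output_action='append_row',   # results are collated into parallel_run_step.txt
    environment=batch_env,
    compute_target=aml_cluster,
    node_count=4)                 # scoring runs across multiple nodes

step = ParallelRunStep(name='batch-score',
                       parallel_run_config=parallel_run_config,
                       inputs=[batch_ds.as_named_input('batch_data')],
                       output=output_dir,
                       arguments=[])
```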
1. You plan to use hyperparameter tuning to find optimal discrete values for a set of hyperparameters. You want to try every possible combination of a set of specified discrete values. Which kind of sampling should you use?
- Random Sampling
- Grid Sampling
- Bayesian Sampling
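Grid sampling works over discrete (choice) values; a sketch with illustrative hyperparameters:

```python
from azureml.train.hyperdrive import GridParameterSampling, choice

param_sampling = GridParameterSampling({
    '--batch_size': choice(16, 32, 64),
    '--learning_rate': choice(0.001, 0.01, 0.1)
})  # every one of the 9 combinations is tried
```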
2. You are using hyperparameter tuning to train an optimal model based on a target metric named "AUC". What should you do in your training script?
- Import the logging package and use a logging.info() statement to log the AUC
- Include a print() statement to write the AUC value to the standard output stream
- Get a reference to the Azure ML run context and use a run.log() statement to write the AUC value to the run log
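Inside the training script (auc is assumed to have been computed earlier):

```python
from azureml.core import Run

run = Run.get_context()   # reference to the current Azure ML run
run.log('AUC', auc)       # hyperdrive reads the target metric from the run log
```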
1. You are using automated machine learning to train a model that predicts the species of a penguin based on its bill and flipper measurements. Which kind of task should you specify for automated machine learning?
- Regression
- Forecasting
- Classification
2. You want to use automated machine learning to find the model with the best AUC_weighted metric. Which parameter of the AutoMLConfig object should you set?
- task='AUC_weighted'
- label_column_name= 'AUC_weighted'
- primary_metric='AUC_weighted'
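A sketch of the configuration (dataset, column, and cluster names are illustrative):

```python
from azureml.train.automl import AutoMLConfig

automl_config = AutoMLConfig(task='classification',          # penguin species
                             primary_metric='AUC_weighted',  # metric to optimize
                             training_data=penguin_ds,
                             label_column_name='species',
                             compute_target='aml-cluster')
```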
1. How does differential privacy work?
- All numeric values in the dataset are encrypted and cannot be used in analysis.
- Noise is added to the data during analysis so that aggregations are statistically consistent with the data distribution but non-deterministic.
- All numeric column values in the dataset are converted to the mean value for the column. Analyses of the data use the mean values instead of the actual values.
2. In a differential privacy solution, what is the effect of setting an epsilon parameter?
- A lower epsilon reduces the impact of an individual's data on aggregated results, increasing privacy and reducing accuracy
- A lower epsilon reduces the amount of noise added to the data, increasing accuracy and reducing privacy
- Setting epsilon to 1 enables differential privacy. Setting it to 0 disables differential privacy.
1. You have trained a classification model, and you want to quantify the influence of each feature on a specific individual prediction. What should you examine?
- Global feature importance
- Local feature importance
- Recall and Precision
2. Which explainer uses an architecture-appropriate SHAP algorithm to interpret a model?
- PFIExplainer
- MimicExplainer
- TabularExplainer
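A sketch using the interpret-community package (model, data, and label variables are assumed):

```python
from interpret.ext.blackbox import TabularExplainer

# TabularExplainer selects a SHAP algorithm suited to the model architecture
explainer = TabularExplainer(model, X_train,
                             features=feature_names, classes=class_names)

global_exp = explainer.explain_global(X_test)     # overall feature importance
local_exp = explainer.explain_local(X_test[0:1])  # influence on one prediction
```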
1. You are training a binary classification model to support admission approval decisions for a college degree program. How can you evaluate if the model is fair, and doesn't discriminate based on ethnicity?
- Evaluate each trained model with a validation dataset, and use the model with the highest accuracy score. An accurate model is inherently fair.
- Remove the ethnicity feature from the training dataset.
- Compare disparity between selection rates and performance metrics across ethnicities.
2. You have used Fairlearn to evaluate a model in a notebook. You register the model in your Azure Machine Learning workspace. You want to be able to select the model in Azure Machine Learning studio and from there view its fairness dashboard to compare disparity for performance metrics. What should you do?
- Run an experiment in which you upload the dashboard metrics for the model.
- Save the notebook in your Azure Machine Learning workspace.
- Use the selection_rate_group_summary function to get the fairness data, and save it as a file dataset in your Azure Machine Learning workspace.
3. You plan to use the Grid Search mitigation technique to find an optimal model for a binary classifier that predicts whether or not a candidate will be successful in an employment role. You want to ensure that the model selects an equal number of candidates from each category in the Gender feature. Which parity constraint should you use?
- Demographic parity.
- Error rate parity.
- Bounded group loss.
1. You have deployed a model as a real-time inferencing service in an Azure Kubernetes Service (AKS) cluster. What must you do to capture and analyze telemetry for this service?
- Redeploy the model as an ACI service.
- Enable application insights.
- Move the AKS cluster to the same region as the Azure Machine Learning workspace.
2. You want to include custom information in the telemetry for your inferencing service, and analyze it using Application Insights. What must you do in your service's entry script?
- Use the Run.log method to log the custom metrics.
- Save the custom metrics in the ./outputs folder.
- Use a print statement to write the metrics in the STDOUT log.
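A sketch of enabling telemetry capture on an existing deployment (service name assumed); values written to STDOUT with print in the entry script's run() function then surface in Application Insights as trace messages:

```python
service = ws.webservices['my-aks-service']   # hypothetical service name
service.update(enable_app_insights=True)     # start capturing telemetry
```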
1. You have trained a model using a dataset containing data that was collected last year. As this year progresses, you will collect new data. You want to track any changing data trends that might affect the performance of the model. What should you do?
- Collect the new data in a new version of the existing training dataset, and profile both datasets.
- Collect the new data in a separate dataset and create a Data Drift Monitor with the training dataset as a baseline and the new dataset as a target.
- Replace the training dataset with a new dataset that contains both the original training data and the new data.
2. You are creating a data drift monitor. You want to automatically notify the data science team if a significant change in data distribution is detected. What must you do?
- Define an AlertConfiguration and set a drift_threshold value.
- Set the latency of the data drift monitor to allow time for data scientists to review the new data.
- Register the training dataset with the model, including the email address of the data science team as a tag.
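A sketch of both answers (dataset variables, email address, and compute name are assumed):

```python
from azureml.datadrift import AlertConfiguration, DataDriftDetector

alert_config = AlertConfiguration(email_addresses=['data-team@contoso.com'])

monitor = DataDriftDetector.create_from_datasets(
    ws, 'sales-drift',
    baseline_ds, target_ds,       # training data vs. newly collected data
    compute_target='aml-cluster',
    frequency='Week',
    drift_threshold=0.3,          # notification fires above this magnitude
    alert_config=alert_config)
```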
1. What does creating Custom Roles allow you to do?
- Custom roles allow you to customize what users can and can’t access in a workspace
- Custom roles allow you to only perform read-only actions in a workspace
- Custom roles allow you to customize the speaking voice used
2. What would you use the Azure Key Vault for?
- To keep track of your house, work, and car keys
- To hide your training models and data
- To securely store and access secrets
3. Why use a virtual private network (VPN) for machine learning?
- To assign roles
- To secure your training models and data from malicious actors by encrypting your connection
- To work offline
4. Build and operate machine learning solutions with Azure Databricks
1. Alice creates a notebook on Azure Databricks to prepare her datasets before using them with Spark ML. Which of the following languages are supported for doing that in a notebook?
- Java
- Python
- C#
2. You want to train a neural network with TensorFlow. You don't want to install the library manually to avoid extra overhead. What should you do?
- Create a cluster with the Databricks Runtime for Machine Learning.
- Create a single node cluster.
- Create a Python notebook.
3. Which description of DBFS is correct?
- You can upload a file to the DBFS using the UI.
- You can only access data in Azure Databricks if it's stored on DBFS.
- Data uploaded to the DBFS is only stored as long as your cluster is running.
1. Bob has to ingest data in Azure Databricks, coming from multiple data sources. He'll then run some ETLs to move the data into the data lake. Which of his data sources can be mounted under DBFS?
- FTP
- Azure Data Lake Store
- SMB
2. Dataframe df needs to be visualized so that the results can be presented to the business users. Which function should be used to plot the data as a bar chart or other type of graph?
- df.show()
- df.describe()
- df.display()
3. Logan is preparing a dataset, which will be used to train a model. It seems like each record appears twice in the table df. She wants to delete half of the rows to make sure each observation is only included once. Which function should she use?
- df.fillna(2)
- df.dropna()
- df.dropDuplicates()
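A PySpark sketch of these in a Databricks notebook cell:

```python
deduped = df.dropDuplicates()   # keep each observation only once
display(deduped)                # Databricks rendering with built-in chart options
```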
1. John is looking to train his first machine learning model. One of his inputs includes the size of the T-Shirts, with possible values of XS, S, M, L, and XL. What is the best approach John can employ to preprocess the T-Shirt size input feature?
- Standardization
- One-hot Encoding
- Normalization
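A PySpark sketch for a categorical size column (column names are assumed):

```python
from pyspark.ml.feature import OneHotEncoder, StringIndexer

# map XS/S/M/L/XL to numeric indices, then one-hot encode the indices
indexed = StringIndexer(inputCol='size', outputCol='size_index').fit(df).transform(df)
encoded = (OneHotEncoder(inputCols=['size_index'], outputCols=['size_vector'])
           .fit(indexed).transform(indexed))
```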
2. You have a dataset with 1,000 rows. After some exploration, you find that there are some missing values; only 5 records seem to contain no value. What is the best strategy to deal with this missing data?
- Duplicate the records.
- Scale the records.
- Drop the records.
1. Daniel is new to Azure Databricks and wants to use a library for machine learning that is native to Apache Spark. Which library should be used?
- MLLib
- Spark ML
- RDD
2. Which of the following options is a key abstraction for model training in Spark ML?
- Iterator
- Processor
- Transformer
3. Alexandra has trained a linear regression model, which is named lrModel, and now wants to validate it. Which code should be run?
- predictions = lrModel.transform(testData)
- predictions = lrModel.fit(testData)
- predictions = lrModel.fit(trainingData)
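In Spark ML, fit() on the training split returns the model (a transformer), and transform() on held-out data produces predictions; a sketch with assumed dataframes and column names:

```python
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(featuresCol='features', labelCol='label')
lrModel = lr.fit(trainingData)              # estimator: train on the training split

predictions = lrModel.transform(testData)   # transformer: validate on the test split
```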
1. Which of the following options is the name of an MLflow component?
- MLflow Framework
- MLflow Tracking
- MLflow Training
2. An experiment is best defined as which of the following statements?
- A collection of tests
- A collection of runs
- A collection of notebooks
3. What are two necessary assets included in an MLflow project?
- A model and its metrics.
- A Python script and an experiment.
- A Python script and the environment.
1. Lucas has trained a model using Azure Databricks and now wants to serve it for real-time scoring. What should be the next action?
- Version the model.
- Transition the model to the archived stage.
- Register the model.
2. An MLflow run executes a Python script to train a model. The model is logged during the run but should also be registered. What should be included in the code to register the model during the run?
- register_model
- registered_model_name
- log_model
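A sketch using the sklearn flavor (model object and names are illustrative):

```python
import mlflow
import mlflow.sklearn

with mlflow.start_run():
    # registered_model_name logs the model and registers it in one step
    mlflow.sklearn.log_model(model, artifact_path='model',
                             registered_model_name='sales-model')
```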
3. Hanna has Read permissions and wants to transition a model from Staging to Production through the Azure Databricks UI. What can someone with Read permissions do?
- Perform the transition.
- Request the transition.
- She can only read the current stage.
1. What is the correct method to log a model metric, _rmse, in MLflow?
- mlflow.log("RMSE", _rmse)
- mlflow.log_artifact("RMSE", _rmse)
- mlflow.log_metric("RMSE", _rmse)
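In context (assuming _rmse was computed during the run):

```python
import mlflow

with mlflow.start_run():
    # ... train the model and compute _rmse ...
    mlflow.log_metric("RMSE", _rmse)
```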
2. Leon wants to run an Azure Machine Learning experiment in Azure Databricks. An MLflow experiment is configured and Leon is about to run it. He realizes he forgot a step. What should have been done first?
- Register the model in Azure Machine Learning.
- Log the experiment metrics with MLflow.
- Configure MLflow tracking URI to use Azure Machine Learning.
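A sketch of the missing step (a workspace config file is assumed to be present):

```python
import mlflow
from azureml.core import Workspace

ws = Workspace.from_config()
# point MLflow at the Azure ML workspace before starting the run
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())
```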
3. You want to run an Azure Machine Learning Pipeline. The first step of the pipeline is to preprocess the data by running a Python script on Azure Databricks Compute. The configuration is defined and now you need to create the compute. Which type of object should you use?
- DatabricksAttachConfiguration
- ComputeTarget
- DatabricksCompute
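A sketch of attaching Databricks compute (resource names and token are placeholders):

```python
from azureml.core.compute import ComputeTarget, DatabricksCompute

attach_config = DatabricksCompute.attach_configuration(
    resource_group='my-resource-group',
    workspace_name='my-databricks-workspace',
    access_token='<databricks-access-token>')

db_compute = ComputeTarget.attach(ws, 'databricks-compute', attach_config)
db_compute.wait_for_completion(True)
```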
1. To support real-time inferencing in production applications, which is the best choice as a deployment target for the scoring web service?
- Azure Kubernetes Service (AKS).
- Azure Container Instances (ACI).
- Azure Machine Learning Compute Clusters.
2. You want to deploy a model to Azure Container Instances. You have registered the trained model, created an entry script and an environment. You want to combine the entry script and environment. What kind of object should you use?
- InferenceConfig
- WebserviceDeploymentConfiguration
- ComputeTarget
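A sketch combining the pieces (script name plus env and model variables are assumed):

```python
from azureml.core.model import InferenceConfig, Model
from azureml.core.webservice import AciWebservice

# combine the entry script and the environment
inference_config = InferenceConfig(entry_script='score.py', environment=env)
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)

service = Model.deploy(ws, 'sales-service', [model],
                       inference_config, deployment_config)
service.wait_for_deployment(show_output=True)
```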
1. A Python script to train a model is created in Azure Databricks and attached to a cluster with the Databricks Runtime for Machine Learning. To run the code and enable automated MLflow when tuning hyperparameters, which method should be used?
- ParamGridBuilder()
- CrossValidator
- RegressionEvaluator()
2. Which arguments are needed to run the Hyperopt function fmin()?
- The evaluation metric, the model, and the data.
- The objective function, the search space, and the model.
- The objective function, the search space, and the search algorithm.
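A minimal Hyperopt sketch (train_and_evaluate is a hypothetical function returning the loss to minimize):

```python
from hyperopt import fmin, hp, tpe

search_space = {'learning_rate': hp.loguniform('learning_rate', -5, 0)}

best = fmin(fn=train_and_evaluate,   # the objective function
            space=search_space,      # the search space
            algo=tpe.suggest,        # the search algorithm
            max_evals=20)
```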
1. To use HorovodRunner, we have to set the number of processes by specifying the parameter np. What should the value for np be if we want to train on a single node?
- np=2
- np=-1
- np=1
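A sketch, assuming train_model is an existing Horovod training function; per the Databricks docs, a negative np runs the job locally on the driver (single node), while a positive np distributes it across workers:

```python
from sparkdl import HorovodRunner

hr = HorovodRunner(np=-1)  # negative np: local, single-node training
hr.run(train_model)
```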
2. In the Horovod training script, what is the primary purpose of the hvd.callbacks.BroadcastGlobalVariablesCallback(0) callback?
- To ensure consistent initialization of all workers when training is started with random weights or restored from a checkpoint.
- To average metrics among workers at the end of every epoch.
- To reduce the learning rate if training plateaus.
Resource:
Exam DP-100: Designing and Implementing a Data Science Solution on Azure - Certifications | Microsoft Learn