How to train and deploy in Azure ML

Topic: Azure ML

Introduction

My goal for this post is to show you the simplest way to train your model and deploy it in the cloud, using Azure ML. Why would you want to train and deploy in the cloud? Training in the cloud will allow you to handle larger ML models and datasets than you could train on your development machine. And deploying your model in the cloud will allow your system to scale to many more inference requests than a development machine could handle.

I recommend reading my introduction to Azure ML before reading this post, but you should still be able to follow along even if you don’t.

Any task you want to accomplish using Azure ML can be done in three ways: using the Azure ML CLI, the Python SDK, or the Studio UI. This post will cover all three, which I hope will enable you to choose the best approach for each scenario you encounter in the future.

I’ve created two GitHub repositories to accompany this post. The first GitHub repo shows how to train and deploy using the Azure ML CLI, and the second GitHub repo shows how to use the Python SDK. For Azure and project setup, please refer to the README files of each GitHub repo.

Training and inference on your development machine

Your development machine may be your local machine, a GitHub Codespace, or an Azure ML compute instance. I’ll first discuss how to train a machine learning model and use it for inference on your development machine. In later sections, I’ll cover how you can use Azure ML to train and deploy your model at scale in the cloud.

To run the training code on your development machine, select the “Train locally” run configuration in VS Code and press F5. This should take a few minutes.

The data we’ll be using in our example is small, so we’ll run our training and inference code using the complete dataset during development. If your data is large, you may need to use just a subset during development. It’s always a good idea to test your code out on your dev machine first, since that involves less overhead than running in the cloud.
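
If you do need to work with a subset during development, one simple approach is to pick a reproducible sample of row indices. The sketch below is only an illustration and is not part of the accompanying projects: the function name dev_subset_indices and the 10% fraction are hypothetical, and 60,000 is the size of the Fashion MNIST training set.

```python
import random


def dev_subset_indices(dataset_size: int, fraction: float, seed: int = 0) -> list[int]:
    """Returns a reproducible, sorted sample of row indices for development runs."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in the range (0, 1]")
    subset_size = max(1, int(dataset_size * fraction))
    # Use a local generator so the global random state is left untouched.
    rng = random.Random(seed)
    return sorted(rng.sample(range(dataset_size), subset_size))


# Hypothetical usage: keep 10% of the 60,000 training images.
indices = dev_subset_indices(dataset_size=60_000, fraction=0.1)
print(len(indices))
```

With PyTorch, a list of indices like this could be passed to torch.utils.data.Subset to restrict a dataset to the sampled rows.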

The GitHub projects associated with this post contain code that trains a model on the Fashion MNIST dataset. If you’re not familiar with this dataset or with the machine learning code, I recommend reading my introduction to PyTorch article. I won’t describe the training code in detail here, but I do want to call your attention to the portion of the code that saves the model:

https://github.com/bstollnitz/aml_command_cli/blob/master/aml_command_cli/src/train.py 
...
def save_model(model_dir: str, model: nn.Module) -> None:
    """
    Saves the trained model.
    """
    input_schema = Schema(
        [ColSpec(type=DataType.double, name=f"col_{i}") for i in range(784)])
    output_schema = Schema([TensorSpec(np.dtype(np.float32), (-1, 10))])
    signature = ModelSignature(inputs=input_schema, outputs=output_schema)

    code_paths = ["neural_network.py", "utils_train_nn.py"]
    full_code_paths = [
        Path(Path(__file__).parent, code_path) for code_path in code_paths
    ]
    shutil.rmtree(model_dir, ignore_errors=True)
    logging.info("Saving model to %s", model_dir)
    mlflow.pytorch.save_model(pytorch_model=model,
                              path=model_dir,
                              code_paths=full_code_paths,
                              signature=signature)
...

As you can see, this code saves the model using the open-source MLflow framework. This brings us several benefits. First of all, it allows us to visualize any metrics we log during training without having to manually create any graphs. Here’s how I logged the loss and accuracy in the training code:

def train(data_dir: str, model_dir: str, device: str) -> None:
    ...
    for epoch in range(epochs):
        ...
        metrics = {
            "training_loss": training_loss,
            "training_accuracy": training_accuracy,
            "validation_loss": validation_loss,
            "validation_accuracy": validation_accuracy
        }
        mlflow.log_metrics(metrics, step=epoch)

    save_model(model_dir, model)

And here’s how I visualize these metrics locally:

mlflow ui

When I run this command, I get a link that I can click on to see the graphs.

Another benefit of MLflow is that I can invoke the trained model on my dev machine, which helps me to fully test it out before I deploy it in the cloud. MLflow supports test data in CSV and JSON formats, and I include both types of test data in the project, to give you options. Here are the commands I use:

cd aml_command_cli
mlflow models predict --model-uri "model" --input-path "test_data/images.csv" --content-type csv
mlflow models predict --model-uri "model" --input-path "test_data/images.json" --content-type json

You can see the code I wrote to generate the CSV and JSON data in the generate_images.py file.
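
To give a rough idea of what CSV test data for this model could look like, here is a minimal sketch that writes flattened images under the column names declared in the model signature (col_0 through col_783). This is an assumption-laden illustration, not the contents of generate_images.py: the function images_to_csv and the all-zero and all-one placeholder images are hypothetical.

```python
import csv
import io

NUM_PIXELS = 784  # 28 x 28 pixels, matching the 784 columns in the model signature.


def images_to_csv(images: list[list[float]]) -> str:
    """Serializes flattened images to CSV text with headers col_0 ... col_783."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow([f"col_{i}" for i in range(NUM_PIXELS)])
    for image in images:
        if len(image) != NUM_PIXELS:
            raise ValueError(f"expected {NUM_PIXELS} pixel values, got {len(image)}")
        writer.writerow(image)
    return buffer.getvalue()


# Hypothetical placeholder images; real test data would hold actual pixel values.
csv_text = images_to_csv([[0.0] * NUM_PIXELS, [1.0] * NUM_PIXELS])
print(csv_text.splitlines()[0][:23])
```

The actual file in the project may be laid out differently, so treat this only as a way to see how the signature's column names relate to the input data.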

The output of the JSON prediction command looks like this:

[
  {"0": -3.6867828369140625, "1": -5.797521591186523, "2": -3.2098610401153564, "3": -2.2174417972564697, "4": -2.5920114517211914, "5": 3.298574686050415, "6": -0.4601913094520569, "7": 4.433833599090576, "8": 1.1174960136413574, "9": 5.766951560974121}, 
  {"0": 3.5685975551605225, "1": -7.8351311683654785, "2": 12.533431053161621, "3": 1.6915751695632935, "4": 6.009798049926758, "5": -6.79791784286499, "6": 7.569240570068359, "7": -6.589715480804443, "8": -2.000182628631592, "9": -8.283203125}
]

If you’re familiar with the Fashion MNIST dataset, you may have guessed that the keys in each prediction dictionary correspond to clothing items, and the values are the model’s raw scores for each item, where a higher score means the item is a more likely match. In this example, key 9 has the highest value in the first prediction and key 2 has the highest value in the second, which correspond to “Ankle boot” and “Pullover.”

You might be wondering why the model doesn’t simply return the strings “Ankle boot” and “Pullover.” That’s a slightly more advanced scenario, which I’ll cover in a future post. But I think that this basic scenario has value too. For example, if you’re localizing your app to different languages, you might want to translate the prediction to a string in your client app rather than on the server.
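
As a rough sketch of what that client-side translation could look like, the code below picks the key with the highest score in each prediction and maps it to a label. The helper top_label is hypothetical, the scores are the values shown above rounded to two decimals, and the label list is the standard Fashion MNIST class order.

```python
# Standard Fashion MNIST class names, indexed by class id.
LABELS = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def top_label(prediction: dict[str, float], labels: list[str] = LABELS) -> str:
    """Returns the label whose key has the highest score in the prediction."""
    best_key = max(prediction, key=prediction.get)
    return labels[int(best_key)]


# Scores from the prediction output shown earlier, rounded to two decimals.
predictions = [
    {"0": -3.69, "1": -5.80, "2": -3.21, "3": -2.22, "4": -2.59,
     "5": 3.30, "6": -0.46, "7": 4.43, "8": 1.12, "9": 5.77},
    {"0": 3.57, "1": -7.84, "2": 12.53, "3": 1.69, "4": 6.01,
     "5": -6.80, "6": 7.57, "7": -6.59, "8": -2.00, "9": -8.28},
]

for prediction in predictions:
    print(top_label(prediction))
```

A localized client could pass its own translated list as the labels argument while the server keeps returning numeric keys.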

Once you’re able to get a good prediction for your model on your development machine, you’re ready to move training to the cloud.

Training and deploying using the Azure ML CLI

You’ll need to create a few Azure ML resources in order to train and deploy a model using Azure ML. In this section, I’ll show you how to create those resources using the Azure ML CLI. Check out my introductory article for an overview of the major resources supported by Azure ML. Here are the resources we’ll use for this simple scenario:

  • Compute — We’ll create a cluster of CPU machines to run training in the cloud.
  • Data — We’ll copy our Fashion MNIST data to the cloud so that it’s easily accessible to our training job.
  • Job — We’ll create a CommandJob (the simplest type of Job supported by Azure ML) to train the model.
  • Model — Once the training job produces a model, we’ll register it with Azure ML so that we can deploy it as an endpoint.
  • Managed Online Endpoint — We’ll use this particular type of endpoint to make predictions because it’s designed to process smaller requests and give near-immediate responses.
  • Managed Online Deployment — Our endpoint can accommodate one or more deployments; we’ll just use one.

Let’s create our compute resource. We’ll start by defining the details of the compute we want in a YAML file. As you can see below, we name our resource “cluster-cpu” and specify that we want between 0 and 4 machines. We also specify that we want a machine of size Standard_DS4_v2. How do you know which machine size to choose? You can learn more about that in my blog post about compute.

https://github.com/bstollnitz/aml_command_cli/blob/master/aml_command_cli/cloud/cluster-cpu.yml
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: cluster-cpu
type: amlcompute
size: Standard_DS4_v2
location: westus2
min_instances: 0
max_instances: 4

Notice that we also specify a schema, which gives us IntelliSense and warns us of errors when editing this file. If you press “Ctrl + Space” with the cursor on a new line, you’ll see that VS Code tells you all the other properties that can go in this file. And if you type an unsupported property name, VS Code will alert you by underlining the property name with red squiggles.

How do you know the schema URI for each resource? You can find URIs for all resources in the documentation or you can use the Azure ML extension for VS Code. If you have this extension installed, you can go to the left menu in VS Code, click on the symbol for the extension, select your Azure subscription, and pick your desired ML Workspace. You can then browse your existing cloud resources by navigating the tree. Clicking on the “+” icon to the right of a resource name generates a new YAML file for that resource type, populated with the appropriate schema and a few commonly used properties.

Screenshot showing the Azure ML extension to VS Code.

Now that we have the compute details specified, we need to instruct Azure ML to create the resource in the cloud. This can be done by executing the following command in the terminal:

az ml compute create -f cloud/cluster-cpu.yml

You can verify that your resource was created by visiting the Azure ML Studio. Click on “Compute” in the left menu, then “Compute clusters,” and you should see a cluster named “cluster-cpu” listed on that page.

Screenshot showing the "Compute Clusters" page of the Azure ML Studio.

Congratulations! You created your first Azure ML resource! :)

We can follow similar steps to create the data resource. Our YAML configuration file specifies that we want to upload the “data” local folder into the cloud, and register it under the name “data-fashion-mnist”:

https://github.com/bstollnitz/aml_command_cli/blob/master/aml_command_cli/cloud/data.yml
$schema: https://azuremlschemas.azureedge.net/latest/data.schema.json
name: data-fashion-mnist
description: Fashion MNIST Dataset.
path: ../data/
type: uri_folder

We can execute a similar CLI command in the terminal:

az ml data create -f cloud/data.yml

We can then go to the Azure ML Studio, click on “Data,” and verify that a data resource with name “data-fashion-mnist” was created. If you click on the resource name, and then on “Explore,” you’ll see all the Fashion MNIST data files listed there.

Next we’ll create the job resource. In order to train our model, we need to specify the following information:

  • The compute hardware used, which is the compute cluster we defined earlier.
  • The software environment we want installed on that hardware. For more information about environments, check out my blog post on the topic.
  • Where our training code is located, which in our case is within the “src” directory.
  • Inputs to the training code. In our scenario, that’s just the data resource we created earlier.
  • Outputs of the training code. In our scenario, we output the trained model.

You can see that all of this information is specified in the YAML definition file:

https://github.com/bstollnitz/aml_command_cli/blob/master/aml_command_cli/cloud/job.yml
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json

type: command
description: Trains a simple neural network on the Fashion-MNIST dataset.
experiment_name: "aml_command_cli"
compute: azureml:cluster-cpu

inputs:
  fashion_mnist:
    path: azureml:data-fashion-mnist@latest
outputs:
  model:
    type: mlflow_model

code: ../src
environment:
  image: mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest
  conda_file: conda.yml
command: python train.py --data_dir ${{inputs.fashion_mnist}} --model_dir ${{outputs.model}}
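
When the job runs, Azure ML replaces the ${{inputs.fashion_mnist}} and ${{outputs.model}} placeholders in the command with actual paths, so the training script only has to read two ordinary command-line arguments. Here is a minimal sketch of how a script could receive them, assuming argparse; the real train.py in the project may parse its arguments differently, and the /tmp paths are placeholders.

```python
import argparse
from typing import Optional


def parse_args(argv: Optional[list[str]] = None) -> argparse.Namespace:
    """Parses the two arguments passed by the job's command line."""
    parser = argparse.ArgumentParser(description="Trains a Fashion MNIST model.")
    parser.add_argument("--data_dir", required=True,
                        help="Folder containing the training data.")
    parser.add_argument("--model_dir", required=True,
                        help="Folder where the trained model is saved.")
    return parser.parse_args(argv)


# Hypothetical paths standing in for the ones Azure ML substitutes at run time.
args = parse_args(["--data_dir", "/tmp/data", "--model_dir", "/tmp/model"])
print(args.data_dir, args.model_dir)
```

Because the script only sees plain paths, the same code can run locally and in the cloud without changes.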

Creating and starting the job in the cloud should look familiar by now:

az ml job create -f cloud/job.yml

This command works, but it’s not quite enough for our scenario. That’s because once the job finishes training the model, we want to register the model, and that requires a reference to the job instance. One way to get this reference is by running the following command (assuming you’re using bash or zsh):

run_id=$(az ml job create -f cloud/job.yml --query name -o tsv)

The --query parameter specifies that we want to query the JSON returned by the command using the JMESPath query language, as you can see in the documentation for the “az ml job create” command. We specify that we want to extract the value associated with the “name” key from the returned JSON. The -o parameter specifies that we want the output to be formatted using tab-separated values, which for a single string means the bare value without JSON quotes. You can learn more about output formats in the documentation.
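
To make the effect of these two parameters concrete, here is a small Python sketch that mimics them on a made-up response. The JSON below is a hypothetical, heavily shortened stand-in for what the CLI returns, and query_name_as_tsv is an illustrative helper, not part of any Azure tooling. For a top-level key, the JMESPath expression “name” amounts to a plain dictionary lookup.

```python
import json

# Hypothetical, shortened stand-in for the JSON returned by "az ml job create".
job_json = """
{
  "name": "hypothetical_job_name_123",
  "experiment_name": "aml_command_cli",
  "status": "Starting"
}
"""


def query_name_as_tsv(response_text: str) -> str:
    """Mimics '--query name -o tsv': returns the bare value of the 'name' key."""
    response = json.loads(response_text)
    return str(response["name"])


run_id = query_name_as_tsv(job_json)
print(run_id)              # Bare string, ready to store in a shell variable.
print(json.dumps(run_id))  # JSON output would keep the surrounding quotes.
```

The second print shows why the tsv format is convenient here: with JSON output, the quotes would end up inside the shell variable.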

In the Azure ML Studio, you can go to “Jobs” and look for the experiment named “aml_command_cli.” Click on it, and you’ll see all the job instances associated with this experiment. If you execute the CLI command again, it will add another entry to this page. You’ll need to wait a few minutes for the job to complete. Once its status shows that it has completed, training is done and the ML model is ready for use.

When training has completed, you can create an Azure ML resource for the trained model. I could have created another YAML file with the model specification, but I want to show you a different way of creating a resource. Since I only have three properties to set in this case, I can provide the values on the command line:

az ml model create --name model-command-cli --path "azureml://jobs/$run_id/outputs/model" --type mlflow_model

Keep in mind that we need to specify (using --type mlflow_model) that the model was created using MLflow. Also, notice the syntax that I’m using to refer to the output of the job. I like the fact that I can create the model directly from the job output, without having to download it first to my local machine. But you could download the model to use locally if you wanted to, with the following command:

az ml job download --name $run_id --output-name "model"
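
If you prefer to drive the CLI from a Python script, the job-output path used above is easy to assemble from the job name. The sketch below only builds the argument list for the “az ml model create” command shown earlier and prints it; the helper model_create_command and the job name are hypothetical.

```python
import shlex


def model_create_command(run_id: str, model_name: str,
                         output_name: str = "model") -> list[str]:
    """Builds the 'az ml model create' arguments for a job output."""
    path = f"azureml://jobs/{run_id}/outputs/{output_name}"
    return [
        "az", "ml", "model", "create",
        "--name", model_name,
        "--path", path,
        "--type", "mlflow_model",
    ]


# Hypothetical job name; in practice this comes from the job creation step.
command = model_create_command("hypothetical_job_name_123", "model-command-cli")
print(shlex.join(command))
```

The resulting list could be handed to subprocess.run(command, check=True) on a machine where the Azure CLI is installed and logged in.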

You can verify that your model was created correctly by going to the Azure ML Studio, clicking on “Models,” and then looking for the “model-command-cli,” which is the name we specified in the CLI command.

Great! We now have a trained model, and want to create an endpoint that we can use to invoke it. In Azure ML, an endpoint can have several deployments, each of which specifies the compute and model we want to use. This is useful, for example, if you want to direct some percentage of your traffic to one deployment and the rest to another. But we’ll keep it simple here, and use a single deployment that handles all traffic. You can see here the YAML definitions for the endpoint and deployment:

https://github.com/bstollnitz/aml_command_cli/blob/master/aml_command_cli/cloud/endpoint.yml
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: endpoint-command-cli
auth_mode: key

https://github.com/bstollnitz/aml_command_cli/blob/master/aml_command_cli/cloud/deployment.yml
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: endpoint-command-cli
model: azureml:model-command-cli@latest
instance_type: Standard_DS4_v2
instance_count: 1

And here are the commands we can execute to create these resources:

az ml online-endpoint create -f cloud/endpoint.yml
az ml online-deployment create -f cloud/deployment.yml --all-traffic
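
To illustrate the idea of dividing traffic between deployments, here is a purely conceptual simulation in Python. It is not how Azure ML routes requests internally, and the “green” deployment and the 90/10 split are hypothetical; our project has only the “blue” deployment, which receives all traffic.

```python
import random
from collections import Counter


def pick_deployment(traffic: dict[str, int], rng: random.Random) -> str:
    """Picks a deployment name with probability proportional to its traffic share."""
    if sum(traffic.values()) <= 0:
        raise ValueError("at least one deployment must receive traffic")
    names = list(traffic)
    weights = [traffic[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)  # Seeded so the simulation is repeatable.

# Our scenario: a single deployment that handles every request.
single = {"blue": 100}
# Hypothetical scenario: a second deployment receiving 10% of requests.
split = {"blue": 90, "green": 10}

counts = Counter(pick_deployment(split, rng) for _ in range(1000))
print(pick_deployment(single, rng), dict(counts))
```

A split like this is one way to try out a new model version on a small share of requests before sending it all traffic.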

As usual, you can go to the Azure ML Studio, click on “Endpoints,” and see your endpoint and deployment creation in progress. The deployment creation in particular may take several minutes.

Once the endpoint and deployment are created, we’re ready to invoke the endpoint, which we can do with the following command:

az ml online-endpoint invoke --name endpoint-command-cli --request-file test_data/images_azureml.json

You may have noticed that this is not the same JSON file I used to test the model locally using MLflow. The only difference is that this file wraps the JSON we used previously in a dictionary with the key “input_data,” which is currently a requirement for Azure ML. You can look at the test image generation code to see how I generated the images_azureml.json file. Also, keep in mind that the CSV format is not supported by Azure ML at the moment.
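
The wrapping itself is a one-line transformation. Here is a minimal sketch, assuming the local JSON has already been loaded into a Python object; the helper wrap_for_azureml and the tiny placeholder payload are hypothetical and do not reflect the real structure of images.json.

```python
import json
from typing import Any


def wrap_for_azureml(local_payload: Any) -> dict[str, Any]:
    """Wraps the locally used JSON payload under the 'input_data' key."""
    return {"input_data": local_payload}


# Hypothetical placeholder; the real payload holds the test image data.
local_payload = [[0.0, 0.5], [1.0, 0.25]]

wrapped = wrap_for_azureml(local_payload)
print(json.dumps(wrapped))
```

In the project, the generation code writes the wrapped form to images_azureml.json, so you don't need to do this by hand.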

Before we wrap up, you might want to delete the endpoint, to avoid getting charged:

az ml online-endpoint delete --name endpoint-command-cli -y

And that’s all there is to it.

Training and deploying using the Azure ML Python SDK

All the Azure ML CLI steps that I presented in the previous section can also be accomplished using the Azure ML Python SDK. You can look at this GitHub repo to see how, including instructions on how to run it. I won’t delve into the details, except I want to call your attention to how similar the SDK syntax is to the YAML syntax. This makes it very intuitive to port your implementation from one method to the other, and to mix the two methods in the same project.

For example, let’s compare the YAML and Python SDK code used to create the compute cluster:

https://github.com/bstollnitz/aml_command_cli/blob/master/aml_command_cli/cloud/cluster-cpu.yml
$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: cluster-cpu
type: amlcompute
size: Standard_DS4_v2
location: westus2
min_instances: 0
max_instances: 4

https://github.com/bstollnitz/aml_command_sdk/blob/master/aml_command_sdk/cloud/job.py
...
    cluster_cpu = AmlCompute(
        name="cluster-cpu",
        type="amlcompute",
        size="Standard_DS4_v2",
        location="westus2",
        min_instances=0,
        max_instances=4,
    )
    ml_client.begin_create_or_update(cluster_cpu)
...

As you can see, the YAML and Python SDK code are very similar. Once you understand the CLI method well, it should be easy to learn the Python SDK.

I often recommend that Azure ML users learn the CLI method first because it’s much easier to keep the ML-specific files that run locally separate from the cloud-specific files. If in the future you decide to switch to a different ML cloud solution (such as Amazon SageMaker or Google Vertex AI), your code should be organized in a way that makes this switch super easy. When using the CLI method, this separation comes for free because all your ML code is in Python files and your Azure ML code is in YAML files. So if you want to switch to a different provider, you can simply set aside all your YAML files and continue using the rest.

When using the SDK this separation takes a bit more planning, but it’s not hard to achieve. Here’s how I separate the code in the project for this post:

cloud
    common.py
    conda.yml
    delete_endpoint.py
    endpoint.py
    job.py
src
    neural_network.py
    train.py
    utils_train_nn.py

I keep all the files that I need to train my neural network locally in the src directory, and I keep all the code I need to train and deploy on Azure ML in the cloud directory. Since I’m using the SDK to create the job and endpoint, all my cloud files (with the exception of the conda file) are Python files. But because I keep them in a different folder, it’s clear to me which code is cloud-specific. If in the future I decide to switch providers, I can continue using all my code under the src directory.

So, as long as you separate your ML code from your cloud-specific code, the SDK method is as powerful and flexible as the CLI method.

Training and deploying using the Azure ML Studio

Using the Azure ML Studio should also be intuitive once you’re familiar with the CLI method. Instead of creating resources by writing a YAML file and executing a command, you create them directly using the UI. The decisions you need to make when creating each resource are basically the same as the ones we specified in the YAML files.

For example, let’s see how you would create the compute cluster. You would click on “Compute” in the left menu, then “Compute clusters” in the top menu, then “+ New.”

Screenshot of Azure ML Studio UI for creating a compute cluster

A window opens that allows you to choose the compute location and VM size. One advantage of using the UI to create your compute cluster is that you get a lot of help when choosing the VM size, including information about what each machine is optimized for, how many cores you have available in your subscription, and the cost of using each machine per hour. You can select a recommended machine size, or search all machines.

Screenshot of Azure ML Studio UI for creating a compute cluster

After pressing Next, a new window opens that allows you to choose a compute name, the minimum and maximum number of nodes, and a few other settings.

Screenshot of Azure ML Studio UI for creating a compute cluster

You can create other types of resources in a similar way.

The Azure ML Studio is a great option to create resources that you’ll only need to create once, because it offers so much guidance. However, resource creation is not as easily repeatable in the Studio as it is with the CLI or SDK, so it’s not as efficient for resources that you plan to re-create often.

Conclusion

In this post, you learned how to train and deploy a model using three methods: the Azure ML CLI, the Azure ML Python SDK, and the Azure ML Studio. Now you’re well prepared to choose the appropriate method for each scenario you’ll encounter in the future.

Read next: How to train using pipelines and components in Azure ML