## Introduction

Suppose you’ve trained a machine learning model to accomplish some task, and you’d now like to provide that model’s inference capabilities as a service. Maybe you’re writing an application of your own that will rely on this service, or perhaps you want to make the service available to others. This is the purpose of endpoints — they provide a simple web-based API for feeding data to your model and getting back inference results.

Azure currently supports two types of endpoints: batch endpoints and online managed endpoints.

Batch endpoints are designed to handle large requests, working asynchronously and generating results that are held in blob storage. Online endpoints, on the other hand, are designed to quickly process smaller requests and provide near-immediate responses. For more information about the different endpoint types and which one is right for you, check out the documentation.

In this post, I’ll show you how to work with online managed endpoints. We’ll start by training and saving two machine learning models, one using PyTorch and another using TensorFlow. We’ll then write scoring functions that load the models and perform predictions based on user input. After that, we’ll explore several different options for creating online managed endpoints that call our scoring functions. And finally, we’ll demonstrate a couple of ways to invoke our endpoints. The code for this project can be found on GitHub.

Throughout this post, I’ll assume you’re familiar with machine learning concepts like training and prediction, but I won’t assume familiarity with Azure.

## Azure ML prerequisites

Here’s how you can set up Azure ML to follow the steps in this post.

• You need to have an Azure subscription. You can get a free subscription to try it out.
• Create a resource group.
• Create a new machine learning workspace by following the “Create the workspace” section of the documentation. Keep in mind that you’ll be creating a “machine learning workspace” Azure resource, not a “workspace” Azure resource, which is entirely different!
• Install the Azure CLI (command-line interface) on your platform of choice. This post has been tested using the CLI on WSL2 (Windows Subsystem for Linux) with Ubuntu.
• Install the ML extension to the Azure CLI by following the “Installation” section of the documentation.
• Set your default subscription by executing az account set -s "<YOUR_SUBSCRIPTION_NAME_OR_ID>". You can verify your default subscription by executing az account show, or if you have a machine setup similar to mine, by looking at ~/.azure/azureProfile.json.
• Set your default resource group and workspace by executing az configure --defaults group="<YOUR_RESOURCE_GROUP>" workspace="<YOUR_WORKSPACE>". You can verify your defaults by executing az configure --list-defaults or by looking at ~/.azure/config.
• You can now open the Azure Machine Learning studio, where you’ll be able to see and manage all the machine learning resources we’ll be creating.
• Although not essential to run the code in this post, I highly recommend installing the Azure Machine Learning extension for VS Code, if you’re using VS Code.

You’re now ready to start working with Azure ML!

## Training and saving the models

Let’s start by training two machine learning models to classify Fashion MNIST images — one using PyTorch and another using TensorFlow. For a full explanation of the PyTorch training code, check out my PyTorch blog post. For a full explanation of the TensorFlow training code, see my Keras and TensorFlow posts. I’ve included the relevant training code in the pytorch_src/train.py and tf_src/train.py files of the current project.

Here we’re saving just the weights of the model, not the entire model. In our particular scenario the space we save by keeping just the weights is negligible, but it may be different in your scenario.

pytorch_src/train.py

```python
torch.save(model.state_dict(), 'pytorch_model/weights.pth')
```

tf_src/train.py

```python
model.save_weights('tf_model/weights')
```


To keep the Azure portion of this post simple and focused on endpoints, we run the training code locally. If you’d like to train on Azure, you can look at the documentation on how to do that; I also intend to cover this topic in future posts. For this project, I checked in the saved models under the pytorch_model and tf_model folders, so you don’t have to run the training on your machine. If you want to recreate the models yourself:

• Create the pytorch-managed-endpoint-train and tf-managed-endpoint-train conda environments using the files under the conda directory.
• Delete the folders where the current models are located, pytorch_model and tf_model.
• Activate one conda environment at a time and run the corresponding training file, pytorch_src/train.py or tf_src/train.py.

## Creating the models on Azure

Let’s use the weights that we generated locally to create the models on Azure. In our scenario we saved just weights, not the whole model, but we can register these weights as an Azure model as if we had a whole model. There are many different ways to create ML resources on Azure. My preferred way is to have a separate YAML file for each resource, and to use a CLI command to create the resource according to the specifications in the YAML file. This is the method I’ll show in this post.

Let’s start by looking at the YAML files for our models, cloud/pytorch-fashion-weights.yml and cloud/tf-fashion-weights.yml. You can see that these files start by specifying a schema, which is super helpful because it enables VS Code to make suggestions and highlight any mistakes we make. The attributes in this file make it clear that an Azure model consists of a name, a version, and a path to the location where we saved the trained model files locally. The PyTorch model consists of a single file, so we can point directly to it; the TensorFlow model consists of several files, so we point to the directory that contains them.

cloud/pytorch-fashion-weights.yml

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/model.schema.json
name: pytorch-fashion-weights
version: 1
local_path: '../pytorch_model/weights.pth'
```

cloud/tf-fashion-weights.yml

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/model.schema.json
name: tf-fashion-weights
version: 1
local_path: '../tf_model/'
```


How will you select the correct schema when creating a new resource? You can always copy the schemas from my blog or from the documentation, but the easiest way is to use the Azure Machine Learning extension for VS Code. If you have it installed, you can select the Azure icon in VS Code’s left navigation pane, expand your subscription and ML workspace, select “Models”, and click the ”+” button to create a YAML file with the correct model schema and attributes.

Now that we have the YAML files containing our model specifications, we can simply run CLI commands to create these models on Azure:

```shell
az ml model create -f cloud/pytorch-fashion-weights.yml
az ml model create -f cloud/tf-fashion-weights.yml
```


If you go to the Azure ML studio, and use the left navigation to go to the “Models” page, you’ll see our newly created models listed there.

In order to deploy our PyTorch and TensorFlow models as Azure ML endpoints, we’ll use YAML files to specify the details of the endpoint configurations. I’ll show bits and pieces of these YAML files throughout the rest of this post as I present each setting. We’ll create six endpoints with different configurations to help you understand the range of alternatives available to you. If you look at the YAML files used for endpoint creation in this project, you’ll notice that each of them refers to one of these two models. For example:

cloud/endpoint_1/endpoint.yml

```yaml
...
model: azureml:pytorch-fashion-weights:1
...
```

cloud/endpoint_2/endpoint.yml

```yaml
...
model: azureml:tf-fashion-weights:1
...
```


As you can see, the model name is preceded by “azureml:” and followed by a colon and the version number we specified in the model’s YAML file.
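If you ever generate these YAML files programmatically, the reference string is simple to construct. Here’s a tiny illustrative helper — the function name is my own, not part of any Azure ML API:

```python
def model_reference(name: str, version: int) -> str:
    # Builds the "azureml:<name>:<version>" reference string used
    # in endpoint YAML files to point at a registered model.
    return f'azureml:{name}:{version}'

print(model_reference('pytorch-fashion-weights', 1))  # azureml:pytorch-fashion-weights:1
```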

## Creating the scoring files

When invoked, our endpoint will call a scoring file, which we need to provide. This scoring file needs to follow a prescribed structure: it needs to contain an init() function that will be called when the endpoint is created or updated, and a run() function that will be called every time the endpoint is invoked. Let’s look at these in more detail.

First we’ll take a look at the scoring code for the PyTorch model (you’ll find similar TensorFlow code in the post’s project):

pytorch_src/score.py

```python
def init():
    logging.info('Init started')

    global model
    global device

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    logging.info(f'Device: {device}')

    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'weights.pth')

    model = NeuralNetwork().to(device)
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.eval()

    logging.info('Init completed')
```


In our simple scenario, the init() function’s main task is to load the model. Because we saved just the weights, we need to instantiate a new version of the NeuralNetwork class before we can load the saved weights into it. Notice the use of the AZUREML_MODEL_DIR environment variable, which gives us the path to the model root folder on Azure. Notice also that since we’re using PyTorch, we need to ensure that both the loaded weights and the neural network we instantiate are on the same device (GPU or CPU).
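You can mimic this lookup locally by setting the AZUREML_MODEL_DIR variable yourself. Here’s a minimal stdlib-only sketch; the temporary directory stands in for the real model root folder that Azure provides:

```python
import os
import tempfile

# Simulate what Azure does at deployment time: point AZUREML_MODEL_DIR
# at the folder containing the registered model files.
model_dir = tempfile.mkdtemp()
os.environ['AZUREML_MODEL_DIR'] = model_dir

# The same lookup the scoring file performs inside init().
model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'weights.pth')
print(model_path.endswith('weights.pth'))  # True
```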

I find it useful to add logging.info() calls at the beginning and end of the function to make sure that it’s being called as expected. When we cover invoking the endpoint, I’ll show you where to look for the logs. I also like to add a logging.info() call that tells me whether the code is running on GPU or CPU, as a sanity check.

Now let’s look at the run() function:

pytorch_src/score.py

```python
labels_map = {
    0: 'T-Shirt',
    1: 'Trouser',
    2: 'Pullover',
    3: 'Dress',
    4: 'Coat',
    5: 'Sandal',
    6: 'Shirt',
    7: 'Sneaker',
    8: 'Bag',
    9: 'Ankle Boot',
}

def predict(model: nn.Module, X: Tensor) -> torch.Tensor:
    y_prime = model(X)
    probabilities = nn.functional.softmax(y_prime, dim=1)
    predicted_indices = probabilities.argmax(1)
    return predicted_indices

def run(raw_data):
    logging.info('Run started')

    X = json.loads(raw_data)['data']
    X = np.array(X).reshape((1, 1, 28, 28))
    X = torch.from_numpy(X).float().to(device)

    predicted_index = predict(model, X).item()
    predicted_name = labels_map[predicted_index]

    logging.info(f'Predicted name: {predicted_name}')

    logging.info('Run completed')
    return predicted_name
```


Notice that run() takes a raw_data parameter as input, which contains the data we specify when invoking the endpoint. In our scenario, we’ll be passing in a JSON dictionary with a data key corresponding to a matrix containing an image with float pixel values between 0.0 and 1.0. Our run() function loads the JSON, transforms it into a tensor of the format that our predict() function expects, calls the predict() function, converts the predicted int into a human-readable name, and returns that name.
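To make the expected payload concrete, here’s a sketch of the same JSON round trip and reshape using a dummy all-zeros image — no model involved, just the data transformation:

```python
import json

import numpy as np

# Build a request body like the one our endpoint expects: a 28x28 matrix
# of floats between 0.0 and 1.0, stored under the 'data' key.
image = [[0.0] * 28 for _ in range(28)]
raw_data = json.dumps({'data': image})

# The same transformation run() applies before handing the tensor to predict().
X = np.array(json.loads(raw_data)['data']).reshape((1, 1, 28, 28))
print(X.shape)  # (1, 1, 28, 28)
```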

Let’s look at our first two endpoint YAML files, which refer to the PyTorch and TensorFlow scoring files we discussed:

cloud/endpoint_1/endpoint.yml

```yaml
...
code_configuration:
  code:
    local_path: ../../pytorch_src/
  scoring_script: score.py
...
```

cloud/endpoint_2/endpoint.yml

```yaml
...
code_configuration:
  code:
    local_path: ../../tf_src/
  scoring_script: score.py
...
```


## Creating the environments

An Azure Machine Learning environment specifies the runtime where we can run training and prediction code on Azure, along with any additional configuration. In our scenario we’re only running prediction in the cloud, so we’ll focus on inference environments. Azure supports three different types of environments:

1. Environments created from prebuilt Docker images for inference

These prebuilt Docker images are provided by Microsoft, and they’re the easiest to get started with. In addition to Ubuntu and optional GPU support, they include different versions of TensorFlow and PyTorch, as well as many other popular frameworks and packages. I prefer to use prebuilt Docker images over the other two types — they deploy quickly and their pre-installed packages cover most of my needs.

The full list of prebuilt Docker images available for inference can be found in the documentation. The docs show which packages are pre-installed in each Docker image, and two ways of referring to each image: an “MCR path” and a “curated environment.” Using the MCR path requires creating an environment YAML file (a topic we’ll return to when describing the next environment type). Using a curated environment, on the other hand, avoids the need for an environment YAML file — and that’s the approach I’ll show here.

Once I’ve selected a curated environment that has all the packages I need, I just need to refer to it in my endpoint YAML file. Here are the relevant lines from the first and second endpoints in our scenario:

cloud/endpoint_1/endpoint.yml

```yaml
...
environment: azureml:AzureML-pytorch-1.7-ubuntu18.04-py37-cpu-inference:11
...
```

cloud/endpoint_2/endpoint.yml

```yaml
...
environment: azureml:AzureML-tensorflow-2.4-ubuntu18.04-py37-cpu-inference:11
...
```


To determine the version number for a particular curated environment, you can look in Azure ML studio under “Environments” then “Curated environments”:

Or you can use the Azure ML extension for VS Code — click on the Azure icon in the left navigation pane, expand your subscription and ML workspace, then expand “Environments” and “Azure ML Curated Environments”. Right-click on a curated environment and select “View Environment” to see the version number.

For the scenario in this post, we’re able to use curated environments that include all the packages we need to run our code. If your scenario requires additional packages, then you’ll need to extend the environment, which you can do in one of three ways: by specifying the MCR path together with a conda file in an environment YAML file (as described under the next environment type), using dynamic installation, or with pre-installed Python packages.

2. Environments created from base images

These are Docker images provided by Microsoft that contain just the basics: Ubuntu, and optionally CUDA and cuDNN. Keep in mind that these don’t contain Python or any machine learning package you may need, so when using these environments, we typically include an additional conda file. A full list of available base images can be found in this GitHub repo.

I use base images in endpoints 3 and 4. Because they don’t contain Python, PyTorch, or TensorFlow, I had to extend them using conda files. Note that I also added the azureml-defaults package, which is required for inference on Azure. Let’s take a look at the conda files:

cloud/endpoint_3/score-conda.yml

```yaml
name: pytorch-managed-endpoint-score
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - python=3.7
  - pytorch=1.7
  - pip
  - pip:
      - azureml-defaults
```

cloud/endpoint_4/score-conda.yml

```yaml
name: tf-managed-endpoint-score
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.7
  - pip
  - pip:
      - tensorflow==2.4
      - azureml-defaults
```


Next we create YAML files that contain all the information we need to create the environments: a name, a version, the base image, and a conda file.

cloud/endpoint_3/pytorch-gpu-managed-env.yml

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: pytorch-gpu-managed-env
version: 1
docker:
  image: mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.0.3-cudnn8-ubuntu18.04
conda_file: file:score-conda.yml
```

cloud/endpoint_4/tf-gpu-managed-env.yml

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/environment.schema.json
name: tf-gpu-managed-env
version: 1
docker:
  image: mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.0.3-cudnn8-ubuntu18.04
conda_file: file:score-conda.yml
```


In the curated environments section, I chose CPU environments. Here I’m choosing GPU base images, so that you see the range of options available to you.

We can now create the environments using the CLI:

```shell
az ml environment create -f cloud/endpoint_3/pytorch-gpu-managed-env.yml
az ml environment create -f cloud/endpoint_4/tf-gpu-managed-env.yml
```


If you look at the “Environments” section of the Azure ML studio, you should see the newly created environments listed there.

Now that the environments are created, we can refer to them in the endpoint’s YAML file:

cloud/endpoint_3/endpoint.yml

```yaml
...
environment: azureml:pytorch-gpu-managed-env:1
...
```

cloud/endpoint_4/endpoint.yml

```yaml
...
environment: azureml:tf-gpu-managed-env:1
...
```

3. User-managed environments

You can also create your own container and use it as an inference environment. I won’t go into detail on this topic, but you can take a look at the documentation.

## Choosing the instance type

Now we’ll choose the machine where we’ll be deploying the environments and inference code for our endpoints. You can find the list of all VMs (virtual machines) supported for inference in the documentation.

Endpoints 1 and 2 of this project rely on curated environments that run on the CPU, so there’s no point in paying for a VM with a GPU. For these endpoints, I chose a “Standard_DS3_v2” VM because a small size is enough for our purposes. Endpoints 3 and 4 rely on base image environments that require GPU support, so we’ll pair them with a GPU VM — I chose a “Standard_NC6s_v3” VM, which is also small. Our scenario doesn’t require a GPU for scoring, but I decided to show both options here because your scenario might be different.

cloud/endpoint_1/endpoint.yml

```yaml
...
instance_type: Standard_DS3_v2
...
```

cloud/endpoint_3/endpoint.yml

```yaml
...
instance_type: Standard_NC6s_v3
...
```


You should have no problem using a “Standard_DS3_v2” CPU machine, but your subscription may not have enough quota for a “Standard_NC6s_v3” GPU machine. If that’s the case, you’ll see a helpful error message when you try to create the endpoint. In order to increase your quota and get access to machines with GPUs, you’ll need to submit a support request, as is explained in the documentation. For this particular type of machine, you’ll need to ask for an increase in the quota for the “NCSv3” series, as shown in the screenshot below:

The support request also asks how many vCPUs you want access to. The NCSv3 family of machines comes in three flavors: small (Standard_NC6s_v3) which uses 6 vCPUs, medium (Standard_NC12s_v3) which uses 12 vCPUs, and large (Standard_NC24s_v3) which uses 24 vCPUs.

## Choosing the scale settings

The scale settings allow you to choose how many VM instances you want to use for your scoring code, and how you want the scaling to be done. At the moment Azure only supports Manual settings, so leave scale_type set to Manual. The instance_count setting determines how many machines you want running at deployment, which in our scenario we’ll set to one. There are also min_instances and max_instances keys, which will come into play when Azure supports automated scaling.

cloud/endpoint_1/endpoint.yml

```yaml
...
scale_settings:
  scale_type: Manual
  instance_count: 1
  min_instances: 1
  max_instances: 1
...
```


## Choosing the authentication mode

There are two authentication modes you can choose from: key authentication never expires, while aml_token authentication expires after an hour. The project for this post uses key authentication for all of its endpoints except for endpoint 5, which demonstrates how to use aml_token. The authentication mode can be set in the endpoint YAML in the following way:

cloud/endpoint_1/endpoint.yml

```yaml
...
auth_mode: key
...
```

cloud/endpoint_5/endpoint.yml

```yaml
...
auth_mode: aml_token
...
```


The difference between the two will become clear when we invoke the endpoints.

## Ensuring a safe rollout

Let’s imagine a scenario where we’ve deployed our PyTorch model and it’s already in use by clients, who invoke it through an online managed endpoint. Suppose our team then decides to migrate all our machine learning code from PyTorch to TensorFlow, including the prediction code for the live endpoint. We create a new version of the code in TensorFlow, which works fine in our internal testing. But opening it up to all clients is a risky move that may reveal issues and cause instability.

That’s where Azure ML’s safe rollout feature comes in. Instead of making an abrupt switch, we can use a “blue-green” deployment approach, where we roll out the new version of the code to a small subset of clients, and tune the size of that subset as we go. After ensuring that the clients calling the new version of the code encounter no issues for a while, we can increase the percentage of clients, until we’ve completed the switch.

Endpoint 6 in the accompanying project shows how we can do this by specifying two deployments:

cloud/endpoint_6/endpoint.yml

```yaml
...
traffic:
  blue: 90
  green: 10

deployments:
  - name: blue
    model: azureml:pytorch-fashion-weights:1
    ...

  - name: green
    model: azureml:tf-fashion-weights:1
    ...
```
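To build intuition for what the traffic weights mean, here’s a small simulation of how a 90/10 split distributes requests. This is plain Python, not how Azure actually routes traffic:

```python
import random

random.seed(0)
traffic = {'blue': 90, 'green': 10}

# Route 10,000 simulated requests according to the configured weights.
choices = random.choices(list(traffic), weights=traffic.values(), k=10_000)
blue_share = choices.count('blue') / len(choices)
print(round(blue_share, 2))  # roughly 0.9
```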


## Creating the endpoints

Before you create an endpoint, you need to choose its name. Choose any name you’d like, but keep in mind that the name needs to be unique within your region — if it’s not, you’ll get a helpful error when you create the endpoint.

At this point, you’ve learned about every single line of YAML code in all six endpoint specification files of the accompanying project. For example, here is the specification for our first endpoint:

cloud/endpoint_1/endpoint.yml

```yaml
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: fashion-endpoint-1
type: online
auth_mode: key
traffic:
  blue: 100

deployments:
  - name: blue
    model: azureml:pytorch-fashion-weights:1
    code_configuration:
      code:
        local_path: ../../pytorch_src/
      scoring_script: score.py
    environment: azureml:AzureML-pytorch-1.7-ubuntu18.04-py37-cpu-inference:11
    instance_type: Standard_DS3_v2
    scale_settings:
      scale_type: Manual
      instance_count: 1
      min_instances: 1
      max_instances: 1
```


You can create the endpoints using the following CLI commands:

```shell
az ml endpoint create -f cloud/endpoint_1/endpoint.yml
az ml endpoint create -f cloud/endpoint_2/endpoint.yml
az ml endpoint create -f cloud/endpoint_3/endpoint.yml
az ml endpoint create -f cloud/endpoint_4/endpoint.yml
az ml endpoint create -f cloud/endpoint_5/endpoint.yml
az ml endpoint create -f cloud/endpoint_6/endpoint.yml
```


Again, if you get an error indicating that an endpoint name is already used in your region, you’ll need to change the name in the endpoint YAML file.

You can now go to the Azure ML studio, click on “Endpoints”, and in the “Real-time endpoints” page you’ll see the list of endpoints you created.

## Creating the sample request

Before we can invoke the endpoints, we need to create a file containing input data for our prediction code. Recall that in our scenario, the run() function takes in the JSON representation of a single image encoded as a matrix, and returns the class that the image belongs to as a string, such as “Shirt”.

We can easily get an image file from our dataset for testing, but we need to convert it into JSON. You can find code to create a JSON sample request in pytorch_src/create-sample-request.py. This code loads the Fashion MNIST data, gets an image from the dataset, creates a 28 × 28 matrix containing the image’s pixel values, and adds it to a JSON dictionary with key data.

pytorch_src/create-sample-request.py

```python
def create_sample_request() -> None:
    batch_size = 64

    # Load the Fashion MNIST training data and grab the first batch.
    # (The data-loading lines are my reconstruction with torchvision,
    # since the original excerpt elided them; the project's file may differ.)
    training_data = datasets.FashionMNIST(root='data', train=True, download=True,
                                          transform=ToTensor())
    (X_batch, _) = next(iter(DataLoader(training_data, batch_size=batch_size)))

    # Save the first image of the batch as a nested list under the 'data' key.
    X = X_batch[0, 0, :, :].cpu().numpy().tolist()
    with open('sample_request/sample_request.json', 'w') as file:
        json.dump({'data': X}, file)
```


Here’s a bit of the generated sample_request.json file:

sample_request/sample_request.json

```json
{"data": [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.42352941632270813, 0.43921568989753723, 0.46666666865348816, 0.3921568691730499, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.16
...
```


I’ve checked in the sample request JSON, so you only need to run this code if you want to re-generate it.

## Invoking the endpoints

We’re ready to invoke the endpoints!

Let’s first invoke them using the CLI. The only two pieces of information we need to pass to the invocation are the name of the endpoint and the request file, as you can see below. (Replace fashion-endpoint-1 with the name of the endpoint you’d like to invoke.)

```shell
az ml endpoint invoke -n fashion-endpoint-1 --request-file sample_request/sample_request.json

"\"Shirt\""
```


Let’s take a look at the logs for this endpoint, by going to the Azure ML studio, clicking on the endpoint name, and then “Deployment logs”.

If you scroll down a bit, you’ll find the logging we added to the init() function of the scoring file. I invoked the endpoint twice, so I can also see the logging of the run() function printed twice.

We can also invoke the endpoint using the REST (representational state transfer) protocol. Let’s now come back to the two different authentication modes, key and aml_token, and see how we can invoke endpoints created with each of these alternatives.

Let’s first consider the key authentication mode, which we used for endpoint 1. To find the REST scoring URI for this endpoint and its authentication key, we go to the Azure ML studio, select “Endpoints”, click on the name of the endpoint, and then select the “Consume” tab.

In key authentication mode, our key never expires, so we don’t need to worry about refreshing it. We can invoke the endpoint with a POST request, passing the key we copied from the “Consume” tab in the Authorization header:

```shell
curl --location \
  --request POST https://fashion-endpoint-1.westus2.inference.ml.azure.com/score \
  --header "Authorization: Bearer <KEY>" \
  --header "Content-Type: application/json" \
  --data @sample_request/sample_request.json

"Shirt"%
```


Similar to the CLI invocation, we get a “Shirt” string back.
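If you’d rather call the endpoint from Python than from curl, the same request can be sketched with the standard library. I’m only constructing the request here, not sending it, and the URI and key are placeholders:

```python
import json
import urllib.request

scoring_uri = 'https://fashion-endpoint-1.westus2.inference.ml.azure.com/score'
key = '<KEY>'  # copied from the endpoint's "Consume" tab in Azure ML studio

# Same payload shape as sample_request.json: a 28x28 matrix under 'data'.
body = json.dumps({'data': [[0.0] * 28 for _ in range(28)]}).encode('utf-8')

request = urllib.request.Request(
    scoring_uri,
    data=body,
    headers={
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {key}',
    },
    method='POST',
)

# To actually invoke the endpoint, you would send the request:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))

print(request.get_method())  # POST
```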

Now let’s consider endpoint 5, which was created using aml_token authentication mode.

As you can see, just like in the previous endpoint, the Azure ML studio gives us a REST scoring URI. And even though it doesn’t give us a token, it tells us what we need to do to get one. Let’s follow the instructions and execute the following command:

```shell
az ml endpoint get-credentials --name fashion-endpoint-5
```


You’ll get a JSON dictionary with key accessToken and a long string value, which we’ll abbreviate as <TOKEN>. We can now use it to invoke the endpoint:

```shell
curl --location \
  --request POST https://fashion-endpoint-5.westus2.inference.ml.azure.com/score \
  --header "Authorization: Bearer <TOKEN>" \
  --header "Content-Type: application/json" \
  --data @sample_request/sample_request.json

"Shirt"%
```


Tokens expire after one hour, and you can refresh them by executing the same get-credentials call I show above.

## Conclusion

In this post, you’ve seen how to create and invoke online managed endpoints using Azure ML. There are many methods for creating Azure ML resources — here we showed how to use a separate YAML file to specify the details for each resource, and how to use the CLI to create them in the cloud. We then discussed the main concepts you need to know to make the right choices when creating an endpoint YAML file. And finally, we saw different ways to invoke an endpoint. I hope that you learned something new, and that you’ll try these features on your own!

The complete code for this post can be found on GitHub.

Thank you to Sethu Raman and Shivani Sambare from Microsoft for reviewing the content in this post.