Creating managed online endpoints in Azure ML

Topic: Azure ML

Introduction

Suppose you’ve trained a machine learning model to accomplish some task, and you’d now like to provide that model’s inference capabilities as a service. Maybe you’re writing an application of your own that will rely on this service, or perhaps you want to make the service available to others. This is the purpose of endpoints — they provide a simple web-based API for feeding data to your model and getting back inference results.

Azure ML currently supports three types of endpoints: batch endpoints, Kubernetes online endpoints, and managed online endpoints. I’m going to focus on managed online endpoints in this post, but let me start by explaining how the three types differ.

Diagram showing an overview of the types of endpoints.

Batch endpoints are designed to handle large requests, working asynchronously and generating results that are held in blob storage. Because compute resources are only provisioned when the job starts, the latency of the response is higher than using online endpoints. However, that can result in substantially lower costs. Online endpoints, on the other hand, are designed to quickly process smaller requests and provide near-immediate responses. Compute resources are provisioned at the time of deployment, and are always up and running, which depending on your scenario may mean higher costs than batch endpoints. However, you get real-time responses, which is critical to many scenarios. If you want to deploy an online endpoint, you have two options: Kubernetes online endpoints allow you to manage your own compute resources using Kubernetes, while managed online endpoints rely on Azure to manage compute resources, OS updates, scaling, and security. For more information about the different endpoint types and which one is right for you, check out the documentation.

In this post, I’ll show you how to work with managed online endpoints. We’ll start by training and saving two machine learning models, one using PyTorch and another using TensorFlow. We’ll then write scoring functions that load the models and perform predictions based on user input. After that, we’ll explore several different options for creating managed online endpoints that call our scoring functions. And finally, we’ll demonstrate a couple of ways to invoke our endpoints. The code for this project can be found on GitHub.

Throughout this post, I’ll assume you’re familiar with machine learning concepts like training and prediction, but I won’t assume familiarity with Azure.

Azure ML prerequisites

Here’s how you can set up Azure ML to follow the steps in this post.

  • You need to have an Azure subscription. You can get a free subscription to try it out.
  • Create a resource group.
  • Create a new machine learning workspace by following the “Create the workspace” section of the documentation. Keep in mind that you’ll be creating a “machine learning workspace” Azure resource, not a “workspace” Azure resource, which is entirely different!
  • Install the Azure CLI (command-line interface) on your platform of choice. This post has been tested using the CLI on WSL2 (Windows subsystem for Linux) with Ubuntu.
  • Install the ML extension to the Azure CLI by following the “Installation” section of the documentation.
  • Set your default subscription by executing az account set -s "<YOUR_SUBSCRIPTION_NAME_OR_ID>". You can verify your default subscription by executing az account show, or if you have a machine setup similar to mine, by looking at ~/.azure/azureProfile.json.
  • Set your default resource group and workspace by executing az configure --defaults group="<YOUR_RESOURCE_GROUP>" workspace="<YOUR_WORKSPACE>". You can verify your defaults by executing az configure --list-defaults or by looking at ~/.azure/config.
  • You can now open the Azure Machine Learning studio, where you’ll be able to see and manage all the machine learning resources we’ll be creating.
  • Although not essential to run the code in this post, I highly recommend installing the Azure Machine Learning extension for VS Code, if you’re using VS Code.

You’re now ready to start working with Azure ML!
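
If you'd rather check those defaults from a script than by opening the file, here's a minimal sketch. It assumes that ~/.azure/config is an INI-style file with a [defaults] section containing group and workspace keys, which may vary across CLI versions. The read_defaults helper is hypothetical and not part of this post's project.

```python
import configparser
from pathlib import Path


def read_defaults(config_text: str) -> dict:
    """Returns the entries of the [defaults] section, or {} if it's absent."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section('defaults'):
        return {}
    return dict(parser['defaults'])


if __name__ == '__main__':
    # Assumed location of the Azure CLI configuration file.
    config_path = Path.home() / '.azure' / 'config'
    if config_path.exists():
        print(read_defaults(config_path.read_text(encoding='utf-8')))
    else:
        print(f'No config file found at {config_path}')
```

If the printed dictionary is missing group or workspace, rerun the az configure command from the list above.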

Training and saving the models

Let’s start by training two machine learning models to classify Fashion MNIST images — one using PyTorch and another using TensorFlow. For a full explanation of the PyTorch training code, check out my PyTorch blog post. For a full explanation of the TensorFlow training code, see my Keras and TensorFlow posts. I’ve included the relevant training code in the pytorch-src/train.py and tf-src/train.py files of the current project.

Here we’re saving just the weights of the model, not the entire model. In our particular scenario the space we save by keeping just the weights is negligible, but it may be different in your scenario.

pytorch-src/train.py
torch.save(model.state_dict(), WEIGHTS_PATH)
tf-src/train.py
model.save_weights(WEIGHTS_PATH)

To keep the Azure portion of this post simple and focused on endpoints, we run the training code locally. If you’d like to train on Azure, you can look at the documentation on how to do that. I also intend to cover this topic in future posts. For this project, I checked in the saved models under the pytorch-model and tf-model folders, so you don’t have to run the training on your machine. If you want to recreate the models yourself, you first need to create the pytorch-managed-endpoint-train and tf-managed-endpoint-train conda environments using the conda files provided. Then you need to delete the folders where the current models are located, pytorch-model and tf-model. And finally, you need to activate one conda environment at a time and run the corresponding training file, pytorch-src/train.py or tf-src/train.py.

Creating the models on Azure

Let’s use the weights that we generated locally to create the models on Azure. In our scenario we saved just weights, not the whole model, but we can register these weights as an Azure model as if we had a whole model. There are many different ways to create ML resources on Azure. My preferred way is to have a separate YAML file for each resource, and to use a CLI command to create the resource according to the specifications in the YAML file. This is the method I’ll show in this post.

Let’s start by looking at the YAML files for our models, cloud/model-pytorch-fashion.yml and cloud/model-tf-fashion.yml. You can see that these files start by specifying a schema, which is super helpful because it enables VS Code to make suggestions and highlight any mistakes we make. The attributes in this file make it clear that an Azure model consists of a name, a version, and a path to the location where we saved the trained model files locally. The PyTorch model consists of a single file, so we can point directly to it; the TensorFlow model consists of several files, so we point to the directory that contains them.

cloud/model-pytorch-fashion.yml
$schema: https://azuremlschemas.azureedge.net/latest/model.schema.json
name: model-pytorch-fashion
version: 1
local_path: "../pytorch-model/weights.pth"
cloud/model-tf-fashion.yml
$schema: https://azuremlschemas.azureedge.net/latest/model.schema.json
name: model-tf-fashion
version: 1
local_path: '../tf-model/'

How will you select the correct schema when creating a new resource? You can always copy the schemas from my blog or from the documentation, but the easiest way is to use the Azure Machine Learning extension for VS Code. If you have it installed, you can select the Azure icon in VS Code’s left navigation pane, expand your subscription and ML workspace, select “Models”, and click the “+” button to create a YAML file with the correct model schema and attributes.

Screenshot showing how to create a new model using the Azure Machine Learning extension for VS Code.

Now that we have the YAML files containing our model specifications, we can simply run CLI commands to create these models on Azure:

az ml model create -f managed-endpoint/cloud/model-pytorch-fashion.yml
az ml model create -f managed-endpoint/cloud/model-tf-fashion.yml

If you go to the Azure ML studio, and use the left navigation to go to the “Models” page, you’ll see our newly created models listed there.

In order to deploy our PyTorch and TensorFlow models as Azure ML endpoints, we’ll use deployment and endpoint YAML files to specify the details of the endpoint configurations. I’ll show bits and pieces of these YAML files throughout the rest of this post as I present each setting. We’ll create six endpoints with different configurations to help you understand the range of alternatives available to you. If you look at the deployment YAML files in this project, you’ll notice that each of them refers to one of these two models. For example:

cloud/endpoint-1/deployment.yml
...
model: azureml:model-pytorch-fashion:1
...
cloud/endpoint-2/deployment.yml
...
model: azureml:model-tf-fashion:1
...

As you can see, the model name is preceded by “azureml:” and followed by a colon and the version number we specified in the model’s YAML file.
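
To make that format concrete, here's a small sketch that splits such a reference back into its parts. The parse_model_reference helper and the ModelReference type are hypothetical names used for illustration; they aren't part of the Azure CLI or of this project.

```python
from typing import NamedTuple


class ModelReference(NamedTuple):
    name: str
    version: str


def parse_model_reference(reference: str) -> ModelReference:
    """Splits 'azureml:<name>:<version>' into its name and version."""
    prefix, sep, rest = reference.partition(':')
    if prefix != 'azureml' or not sep:
        raise ValueError(f"Expected an 'azureml:' prefix in {reference!r}")
    # Split on the last colon, so the version is whatever follows it.
    name, sep, version = rest.rpartition(':')
    if not sep or not name or not version:
        raise ValueError(f'Expected <name>:<version> in {reference!r}')
    return ModelReference(name, version)


print(parse_model_reference('azureml:model-pytorch-fashion:1'))
```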

Creating the scoring files

When invoked, our endpoint will call a scoring file, which we need to provide. This scoring file needs to follow a prescribed structure: it needs to contain an init() function that will be called when the endpoint is created or updated, and a run(...) function that will be called every time the endpoint is invoked. Let’s look at these in more detail.

First we’ll take a look at the init() function for the PyTorch model (you’ll find similar TensorFlow code in the post’s project):

pytorch-src/score.py
import json
import logging
import os

import numpy as np
import torch
from torch import Tensor, nn

from neural_network import NeuralNetwork

def init():
    logging.info('Init started')

    global model
    global device

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    logging.info('Device: %s', device)

    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'weights.pth')

    model = NeuralNetwork().to(device)
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.eval()

    logging.info('Init completed')

In our simple scenario, the init() function’s main task is to load the model. Because we saved just the weights, we need to create a new instance of the NeuralNetwork class before we can load the saved weights into it. Notice the use of the AZUREML_MODEL_DIR environment variable, which gives us the path to the model root folder on Azure. Notice also that since we’re using PyTorch, we need to ensure that both the loaded weights and the neural network we instantiate are on the same device (GPU or CPU).

I find it useful to add logging.info() calls at the beginning and end of the function to make sure that it’s being called as expected. When we cover invoking the endpoint, I’ll show you where to look for the logs. I also like to add a logging.info(...) call that tells me whether the code is running on GPU or CPU, as a sanity check.
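
The calling order described above (init() once, then run(...) for every request) can be sketched locally as follows. This is only an illustration of the contract, not how Azure's inference server is implemented. The init() and run(...) functions here are stubs standing in for the real ones in score.py, and simulate_endpoint is a hypothetical helper.

```python
import json
import logging
import os
import tempfile

logging.basicConfig(level=logging.INFO)


def init():
    """Stub for score.py's init(): records where the weights would be loaded from."""
    global model_path
    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'weights.pth')
    logging.info('Init completed, model path: %s', model_path)


def run(raw_data):
    """Stub for score.py's run(...): reports the size of the matrix it received."""
    rows = json.loads(raw_data)['data']
    return f'{len(rows)}x{len(rows[0])}'


def simulate_endpoint(requests):
    """Calls init() once, then run(...) once per request."""
    with tempfile.TemporaryDirectory() as model_dir:
        # On Azure this variable is set for us; locally we set it ourselves.
        os.environ['AZUREML_MODEL_DIR'] = model_dir
        init()
        return [run(request) for request in requests]


blank_image = json.dumps({'data': [[0.0] * 28 for _ in range(28)]})
print(simulate_endpoint([blank_image, blank_image]))
```

Running this logs a single "Init completed" line no matter how many requests are passed in, which is the same pattern to look for in the deployment logs.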

Now let’s look at the run(...) function:

pytorch-src/score.py
labels_map = {
    0: 'T-Shirt',
    1: 'Trouser',
    2: 'Pullover',
    3: 'Dress',
    4: 'Coat',
    5: 'Sandal',
    6: 'Shirt',
    7: 'Sneaker',
    8: 'Bag',
    9: 'Ankle Boot',
}

def predict(model: nn.Module, x: Tensor) -> torch.Tensor:
    with torch.no_grad():
        y_prime = model(x)
        probabilities = nn.functional.softmax(y_prime, dim=1)
        predicted_indices = probabilities.argmax(1)
    return predicted_indices

def run(raw_data):
    logging.info('Run started')

    x = json.loads(raw_data)['data']
    x = np.array(x).reshape((1, 1, 28, 28))
    x = torch.from_numpy(x).float().to(device)

    predicted_index = predict(model, x).item()
    predicted_name = labels_map[predicted_index]

    logging.info('Predicted name: %s', predicted_name)

    logging.info('Run completed')
    return predicted_name

Notice that run(...) takes a raw_data parameter as input, which contains the data we specify when invoking the endpoint. In our scenario, we’ll be passing in a JSON dictionary with a data key corresponding to a matrix containing an image with float pixel values between 0.0 and 1.0. Our run(...) function loads the JSON, transforms it into a tensor of the format that our predict(...) function expects, calls the predict(...) function, converts the predicted int into a human-readable name, and returns that name.
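
The run(...) function above trusts its input: a request with the wrong number of pixels only fails once it reaches the reshape call. If you'd like clearer errors, one option is to validate the payload first. Here's a minimal sketch using only the standard library. The parse_image helper is hypothetical and isn't in the project's score.py.

```python
import json


def parse_image(raw_data: str, size: int = 28) -> list:
    """Parses a request body and checks it holds a size x size matrix in [0, 1]."""
    payload = json.loads(raw_data)
    if 'data' not in payload:
        raise ValueError("Request must contain a 'data' key")
    image = payload['data']
    if len(image) != size or any(len(row) != size for row in image):
        raise ValueError(f'Expected a {size}x{size} matrix')
    if any(not 0.0 <= pixel <= 1.0 for row in image for pixel in row):
        raise ValueError('Pixel values must be between 0.0 and 1.0')
    return image
```

A function like this could be called at the top of run(...) in place of the bare json.loads(raw_data)['data'] line.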

Let’s look at our first two deployment YAML files, which refer to the PyTorch and TensorFlow scoring files we discussed:

cloud/endpoint-1/deployment.yml
...
code_configuration:
  code:
    local_path: ../../pytorch-src/
  scoring_script: score.py
...
cloud/endpoint-2/deployment.yml
...
code_configuration:
  code:
    local_path: ../../tf-src/
  scoring_script: score.py
...

Creating the environments

An Azure Machine Learning environment specifies the runtime where we can run training and prediction code on Azure, along with any additional configuration. In our scenario we’re only running prediction in the cloud, so we’ll focus on inference environments. Azure supports three different types of environments:

  1. Environments created from prebuilt Docker images for inference

    These prebuilt Docker images are provided by Microsoft, and they’re the easiest to get started with. In addition to Ubuntu and optional GPU support, they include different versions of TensorFlow and PyTorch, as well as many other popular frameworks and packages. I prefer to use prebuilt Docker images over the other two types — they deploy quickly and their pre-installed packages cover most of my needs.

    The full list of prebuilt Docker images available for inference can be found in the documentation. The docs show which packages are pre-installed in each Docker image, and two ways of referring to each image: an “MCR path” and a “curated environment.” I use a curated environment in this sample because I’m using an image that contains all the packages I need. You would want to use the “MCR path” if you need to extend the image with a conda file — I’ll come back to this later.

    Once I’ve selected a curated environment that has the packages I need, I just need to refer to it in my deployment YAML file. Here are the relevant lines from the first and second endpoints in our scenario:

    cloud/endpoint-1/deployment.yml
    
    ...
    environment: azureml:AzureML-pytorch-1.7-ubuntu18.04-py37-cpu-inference:11
    ...
    
    cloud/endpoint-2/deployment.yml
    
    ...
    environment: azureml:AzureML-tensorflow-2.4-ubuntu18.04-py37-cpu-inference:11
    ...
    

    To determine the version number for a particular curated environment, you can look in Azure ML studio under “Environments” then “Curated environments”:

    Screenshot showing how to get a list of curated environments and their version.

    Or you can use the Azure ML extension for VS Code — click on the Azure icon in the left navigation pane, expand your subscription and ML workspace, then expand “Environments” and “Azure ML Curated Environments”. Right-click on a curated environment and select “View Environment” to see the version number.

    For the scenario in this post, we’re able to use curated environments that include all the packages we need to run our code. If your scenario requires additional packages, then you’ll need to extend the environment, which you can do in one of three ways: by specifying the MCR path together with a conda file in the deployment file (as described under the next environment type), using dynamic installation, or with pre-installed Python packages.

  2. Environments created from base images

    These are Docker images provided by Microsoft that contain just the basics: Ubuntu, and optionally CUDA and cuDNN. Keep in mind that these don’t contain Python or any machine learning package you may need, so when using these environments, we typically include an additional conda file. A full list of available base images can be found in this GitHub repo.

    I use base images in endpoints 3 and 4. Because they don’t contain Python, PyTorch, or TensorFlow, I had to extend them using conda files. Note that I also added the azureml-defaults package, which is required for inference on Azure. Let’s take a look at the conda files:

    cloud/endpoint-3/score-conda.yml
    
    name: pytorch-managed-endpoint-score
    channels:
      - pytorch
      - conda-forge
      - defaults
    dependencies:
      - python=3.7
      - pytorch=1.7
      - pip
      - pip:
        - azureml-defaults
    
    cloud/endpoint-4/score-conda.yml
    
    name: tf-managed-endpoint-score
    channels:
      - conda-forge
      - defaults
    dependencies:
      - python=3.7
      - pip
      - pip:
        - tensorflow==2.4
        - azureml-defaults
    

    Now we can specify the base images and conda files in the YAML deployment files:

    cloud/endpoint-3/deployment.yml
    
    ...
    environment:
      conda_file: score-conda.yml
      image: mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.0.3-cudnn8-ubuntu18.04:latest
    ...
    

    In the curated environments section, I chose CPU environments. Here I’m choosing GPU base images, so that you see the range of options available to you.

    cloud/endpoint-4/deployment.yml
    
    ...
    environment:
      conda_file: score-conda.yml
      image: mcr.microsoft.com/azureml/openmpi4.1.0-cuda11.0.3-cudnn8-ubuntu18.04:latest
    ...
    


  3. User-managed environments

    You can also create your own container and use it as an inference environment. I won’t go into detail on this topic, but you can take a look at the documentation.

Choosing the instance type

Now we’ll choose the machine where we’ll be deploying the environments and inference code for our endpoints. You can find the list of all VMs (virtual machines) supported for inference in the documentation.

Endpoints 1 and 2 of this project rely on curated environments that run on the CPU, so there’s no point in paying for a VM with a GPU. For these endpoints, I chose a “Standard_DS3_v2” VM because a small size is enough for our purposes. Endpoints 3 and 4 rely on base image environments that require GPU support, so we’ll pair them with a GPU VM — I chose a “Standard_NC6s_v3” VM, which is also small. Our scenario doesn’t require a GPU for scoring, but I decided to show both options here because your scenario might be different.

cloud/endpoint-1/deployment.yml
...
instance_type: Standard_DS3_v2
...
cloud/endpoint-3/deployment.yml
...
instance_type: Standard_NC6s_v3
...

You should have no problem using a “Standard_DS3_v2” CPU machine, but your subscription may not have enough quota for a “Standard_NC6s_v3” GPU machine. If that’s the case, you’ll see a helpful error message when you try to create the endpoint. In order to increase your quota and get access to machines with GPUs, you’ll need to submit a support request, as is explained in the documentation. For this particular type of machine, you’ll need to ask for an increase in the quota for the “NCSv3” series, as shown in the screenshot below:

Screenshot showing how to ask for a quota increase for NCSv3 machines.

The support request also asks how many vCPUs you want access to. The NCSv3 family of machines comes in three flavors: small (Standard_NC6s_v3) which uses 6 vCPUs, medium (Standard_NC12s_v3) which uses 12 vCPUs, and large (Standard_NC24s_v3) which uses 24 vCPUs.
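
To estimate how many vCPUs to ask for, you can multiply the vCPUs of each machine size by the number of instances you plan to run. Here's a rough sketch of that arithmetic using the sizes listed above. The required_vcpus helper is hypothetical, and the result is best treated as a lower bound, since your subscription may need headroom beyond it.

```python
# vCPUs per instance for the NCSv3 family, as listed above.
NCSV3_VCPUS = {
    'Standard_NC6s_v3': 6,
    'Standard_NC12s_v3': 12,
    'Standard_NC24s_v3': 24,
}


def required_vcpus(deployments):
    """Sums vCPUs over (instance_type, number_of_instances) pairs."""
    return sum(NCSV3_VCPUS[instance_type] * count
               for instance_type, count in deployments)


# Endpoints 3 and 4 each run on one Standard_NC6s_v3 instance.
print(required_vcpus([('Standard_NC6s_v3', 1), ('Standard_NC6s_v3', 1)]))  # 12
```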

Choosing the instance count

As the name implies, the instance_count setting determines how many machines you want running at deployment. We’ll set it to one for all endpoints.

cloud/endpoint-1/deployment.yml
...
instance_count: 1
...

Choosing the authentication mode

There are two authentication modes you can choose from: key authentication never expires, while aml_token authentication expires after an hour. The project for this post uses key authentication for all of its endpoints except for endpoint 5, which demonstrates how to use aml_token. The authentication mode can be set in the endpoint YAML in the following way:

cloud/endpoint-1/endpoint.yml
...
auth_mode: key
...
cloud/endpoint-5/endpoint.yml
...
auth_mode: aml_token
...

The difference between key and aml_token will become clear when we invoke the endpoints.

Notice that this setting affects all deployments in the endpoint, so it’s set in the endpoint.yml file, not the deployment.yml file. The section on “Ensuring a safe rollout” explains the responsibilities of a deployment and an endpoint in the practical sense.

Creating the endpoints

At this point, you’ve learned about every single line of YAML code in all endpoint and deployment specification files of the accompanying project. For example, here are the endpoint and deployment files for our first endpoint:

cloud/endpoint-1/endpoint.yml
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: endpoint-managed-fashion-1
auth_mode: key
cloud/endpoint-1/deployment.yml
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: endpoint-managed-fashion-1
model: azureml:model-pytorch-fashion:1
code_configuration:
  code:
    local_path: ../../pytorch-src/
  scoring_script: score.py
environment: azureml:AzureML-pytorch-1.7-ubuntu18.04-py37-cpu-inference:11
instance_type: Standard_DS3_v2
instance_count: 1

The name of an endpoint needs to be unique within a region. You can change the name of your endpoints in the YAML specification files, or you can pass a unique name to the CLI command at creation time, as shown below. You can create the endpoints 1 through 5 using the following CLI commands:

az ml online-endpoint create -f managed-endpoint/cloud/endpoint-X/endpoint.yml --name <ENDPOINTX>
az ml online-deployment create -f managed-endpoint/cloud/endpoint-X/deployment.yml --all-traffic --endpoint-name <ENDPOINTX>

You can now go to the Azure ML studio, click on “Endpoints”, and in the “Real-time endpoints” page you’ll see the list of endpoints you created.

Ensuring a safe rollout

Let’s imagine a scenario where we’ve deployed our PyTorch model and it’s already in use by clients, who invoke it through a managed online endpoint. Suppose our team then decides to migrate all our machine learning code from PyTorch to TensorFlow, including the prediction code for the live endpoint. We create a new version of the code in TensorFlow, which works fine in our internal testing. But opening it up to all clients is a risky move that may reveal issues and cause instability.

That’s where Azure ML’s safe rollout feature comes in. Instead of making an abrupt switch, we can use a “blue-green” deployment approach, where we roll out the new version of the code to a small subset of clients, and tune the size of that subset as we go. After ensuring that the clients calling the new version of the code encounter no issues for a while, we can increase the percentage of clients, until we’ve completed the switch.

Endpoint 6 in the accompanying project will enable this scenario by specifying two deployments:

cloud/endpoint-6/endpoint.yml
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: endpoint-managed-fashion-6
auth_mode: key
cloud/endpoint-6/deployment-blue.yml
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: endpoint-managed-fashion-6
...
cloud/endpoint-6/deployment-green.yml
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: green
endpoint_name: endpoint-managed-fashion-6
...

You can create the endpoint and deployments for endpoint 6 using CLI commands similar to the ones in the previous section. When you’re ready to adjust their traffic allocation, you can do that with an additional command, as shown below:

az ml online-endpoint create -f managed-endpoint/cloud/endpoint-6/endpoint.yml --name <ENDPOINT6>
az ml online-deployment create -f managed-endpoint/cloud/endpoint-6/deployment-blue.yml --all-traffic --endpoint-name <ENDPOINT6>
az ml online-deployment create -f managed-endpoint/cloud/endpoint-6/deployment-green.yml --endpoint-name <ENDPOINT6>
az ml online-endpoint update --name <ENDPOINT6> --traffic "blue=90 green=10"

For more information about safe rollout, check out the documentation.
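
To get a feel for what a 90/10 split means in practice, here's a small simulation. It assumes each request is routed independently at random, weighted by the traffic percentages. That's a simplification for illustration, not a description of how Azure routes requests internally. The parse_traffic and route helpers are hypothetical.

```python
import random
from collections import Counter


def parse_traffic(spec: str) -> dict:
    """Parses a string like 'blue=90 green=10' into {'blue': 90, 'green': 10}."""
    traffic = {}
    for pair in spec.split():
        name, _, percent = pair.partition('=')
        traffic[name] = int(percent)
    return traffic


def route(traffic: dict, rng: random.Random) -> str:
    """Picks a deployment name at random, weighted by its traffic percentage."""
    names = list(traffic)
    weights = [traffic[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


traffic = parse_traffic('blue=90 green=10')
rng = random.Random(0)  # Seeded so that the simulation is repeatable.
counts = Counter(route(traffic, rng) for _ in range(10_000))
print(counts)  # Expect roughly nine blue requests for every green one.
```

With only a few requests the observed split can be far from 90/10, so it's worth letting a new deployment receive enough traffic before drawing conclusions about it.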

Creating the sample request

Before we can invoke the endpoints, we need to create a file containing input data for our prediction code. Recall that in our scenario, the run(...) function takes in the JSON representation of a single image encoded as a matrix, and returns the class that the image belongs to as a string, such as “Shirt”.

We can easily get an image file from our dataset for testing, but we need to convert it into JSON. You can find code to create a JSON sample request in pytorch-src/create_sample_request.py. This code loads Fashion MNIST data, gets an image from the dataset, creates a matrix of shape (28, 28) containing the image’s pixel values, and adds it to a JSON dictionary with key data.

pytorch-src/create_sample_request.py
import json
import os
from pathlib import Path

from train import _get_data

DATA_PATH = 'managed-endpoint/data'
SAMPLE_REQUEST = 'managed-endpoint/sample-request'


def create_sample_request() -> None:
    """Creates a sample request to be used in prediction."""
    batch_size = 64
    (_, test_dataloader) = _get_data(batch_size)

    (x_batch, _) = next(iter(test_dataloader))
    x = x_batch[0, 0, :, :].cpu().numpy().tolist()

    os.makedirs(name=SAMPLE_REQUEST, exist_ok=True)
    with open(Path(SAMPLE_REQUEST, 'sample_request.json'),
              'w',
              encoding='utf-8') as file:
        json.dump({'data': x}, file)


def main() -> None:
    create_sample_request()


if __name__ == '__main__':
    main()

Here’s a bit of the generated sample_request.json file:

sample-request/sample_request.json
{"data": [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.42352941632270813, 0.43921568989753723, 0.46666666865348816, 0.3921568691730499, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.16
...

I’ve checked in the sample request JSON, so you only need to run this code if you want to re-generate it.

Invoking the endpoints using the CLI

We’re ready to invoke the endpoints!

Let’s first invoke them using the CLI. The only two pieces of information we need to pass to the invocation are the name of the endpoint and the request file, as you can see below. (Replace <ENDPOINTX> with the name of the endpoint you’d like to invoke.)

az ml online-endpoint invoke -n <ENDPOINTX> --request-file managed-endpoint/sample-request/sample_request.json
"\"Shirt\""

Let’s take a look at the logs for this endpoint, by going to the Azure ML studio, clicking on the endpoint name, and then “Deployment logs”.

Screenshot of the logs for an endpoint invocation.

If you scroll down a bit, you’ll find the logging we added to the init() function of the scoring file. I invoked the endpoint twice, so I can also see the logging of the run(...) function printed twice.

Screenshot of the logs for an endpoint invocation showing our custom logging.

Invoking the endpoints using REST

We can also invoke the endpoint through its REST (representational state transfer) API. Let’s now come back to the two different authentication modes, key and aml_token, and see how we can invoke endpoints created with each of these alternatives.

Let’s first consider the key authentication mode, which we used for endpoint 1. To find the REST scoring URI for this endpoint and its authentication key, we go to the Azure ML studio, select “Endpoints”, click on the name of the endpoint, and then select the “Consume” tab.

Screenshot showing the REST scoring URI and key for the endpoint created using key authentication.

The bearer token used in the request can be found in the same panel, under “Authentication”. In key authentication mode, our key never expires, so we don’t need to worry about refreshing it. We can execute the following curl command to do a POST that invokes the endpoint:

curl --location \
     --request POST https://<ENDPOINT1>.westus2.inference.ml.azure.com/score \
     --header "Authorization: Bearer gT2d2eHhbErlWy5FijTl1fkgC9kw5eZQ" \
     --header "Content-Type: application/json" \
     --data @managed-endpoint/sample-request/sample_request.json
"Shirt"%

Similar to the CLI invocation, we get a “Shirt” string back.
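
If you're calling the endpoint from application code rather than from a terminal, the same POST can be made from Python. Here's a minimal sketch using only the standard library. The build_scoring_request and invoke helpers are hypothetical names, and you'd pass in the scoring URI and key shown in the “Consume” tab.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri: str, credential: str,
                          image: list) -> urllib.request.Request:
    """Builds (but doesn't send) a POST request equivalent to the curl command."""
    body = json.dumps({'data': image}).encode('utf-8')
    return urllib.request.Request(
        scoring_uri,
        data=body,
        headers={
            'Authorization': f'Bearer {credential}',
            'Content-Type': 'application/json',
        },
        method='POST',
    )


def invoke(request: urllib.request.Request):
    """Sends the request and returns the decoded JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode('utf-8'))
```

Keeping the request construction separate from the network call makes it easy to inspect the headers and body before anything is sent. The same helpers work for both authentication modes, since only the value of the credential differs.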

Now let’s consider endpoint 5, which was created using aml_token authentication mode.

Screenshot showing the REST scoring URI for the endpoint created using aml_token authentication.

As you can see, just like in the previous endpoint, the Azure ML studio gives us a REST scoring URI. And even though it doesn’t give us a token, it tells us what we need to do to get one. Let’s follow the instructions and execute the following command:

az ml online-endpoint get-credentials --name <ENDPOINT5>

You’ll get a JSON dictionary with key accessToken and a long string value, which we’ll abbreviate as <TOKEN>. We can now use it to invoke the endpoint:

curl --location --request POST https://<ENDPOINT5>.westus2.inference.ml.azure.com/score \
     --header "Authorization: Bearer <TOKEN>" \
     --header "Content-Type: application/json" \
     --data @managed-endpoint/sample-request/sample_request.json
"Shirt"%

Tokens expire after one hour, and you can refresh them by executing the same get-credentials call I show above.
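
In a long-running client, you'll want to refresh the token before it expires instead of waiting for a request to fail. Here's one possible sketch of a small cache that does that. The TokenCache class is hypothetical, the fetch_token callable stands in for however you obtain a token (for example, by wrapping the get-credentials call), and the five-minute safety margin is an arbitrary choice.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional


class TokenCache:
    """Caches a token and fetches a new one shortly before it expires."""

    def __init__(self, fetch_token: Callable[[], str],
                 lifetime: timedelta = timedelta(hours=1),
                 margin: timedelta = timedelta(minutes=5)):
        self._fetch_token = fetch_token
        self._lifetime = lifetime
        self._margin = margin
        self._token: Optional[str] = None
        self._fetched_at: Optional[datetime] = None

    def get(self, now: Optional[datetime] = None) -> str:
        """Returns the cached token, refreshing it if it's close to expiring."""
        now = now or datetime.now(timezone.utc)
        expired = (self._fetched_at is None
                   or now - self._fetched_at >= self._lifetime - self._margin)
        if expired:
            self._token = self._fetch_token()
            self._fetched_at = now
        return self._token
```

The optional now parameter exists only to make the class easy to test without waiting an hour; regular callers can simply call get().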

Conclusion

In this post, you’ve seen how to create and invoke managed online endpoints using Azure ML. There are many methods for creating Azure ML resources — here we showed how to use separate YAML files to specify the details for each resource, and how to use the CLI to create them in the cloud. We then discussed the main concepts you need to know to make the right choices when creating an endpoint YAML file. And finally, we saw different ways to invoke an endpoint. I hope that you learned something new, and that you’ll try these features on your own!

The complete code for this post can be found on GitHub.

Thank you to Sethu Raman and Shivani Sambare from Microsoft for reviewing the content in this post.