Creating batch endpoints in Azure ML



Suppose you’ve trained a machine learning model to accomplish some task, and you’d now like to provide that model’s inference capabilities as a service. Maybe you’re writing an application of your own that will rely on this service, or perhaps you want to make the service available to others. This is the purpose of endpoints — they provide a simple web-based API for feeding data to your model and getting back inference results.

Azure ML currently supports three types of endpoints: batch endpoints, Kubernetes online endpoints, and managed online endpoints. I’m going to focus on batch endpoints in this post, but let me start by explaining how the three types differ.

Diagram showing an overview of the types of endpoints.

Batch endpoints are designed to handle large requests, working asynchronously and generating results that are held in blob storage. Because compute resources are only provisioned when the job starts, the latency of the response is higher than with online endpoints. However, that can result in substantially lower costs. Online endpoints, on the other hand, are designed to quickly process smaller requests and provide near-immediate responses. Compute resources are provisioned at the time of deployment and are always up and running, which, depending on your scenario, may mean higher costs than batch endpoints. However, you get real-time responses, which is critical to many scenarios. If you want to deploy an online endpoint, you have two options: Kubernetes online endpoints allow you to manage your own compute resources using Kubernetes, while managed online endpoints rely on Azure to manage compute resources, OS updates, scaling, and security. For more information about the different endpoint types and which one is right for you, check out the documentation.

If you’re interested in managed online endpoints, check out my previous post. In this post, I’ll show you how to work with batch endpoints. We’ll start by training and saving two machine learning models, one using PyTorch and another using TensorFlow. We’ll then write scoring functions that load the models and perform predictions based on user input. After that, we’ll explore how we can create the batch endpoints on Azure, which will require the creation of several resources in the cloud. And finally, we’ll see how we can invoke the endpoints. The code for this project can be found on GitHub.

Throughout this post, I’ll assume you’re familiar with machine learning concepts like training and prediction, but I won’t assume familiarity with Azure.

Azure ML prerequisites

Here’s how you can set up Azure ML to follow the steps in this post.

  • You need to have an Azure subscription. You can get a free subscription to try it out.
  • Create a resource group.
  • Create a new machine learning workspace by following the “Create the workspace” section of the documentation. Keep in mind that you’ll be creating a “machine learning workspace” Azure resource, not a “workspace” Azure resource, which is entirely different!
  • Install the Azure CLI (command-line interface) on your platform of choice. This post has been tested using the CLI on WSL2 (Windows subsystem for Linux) with Ubuntu.
  • Install the ML extension to the Azure CLI by following the “Installation” section of the documentation.
  • Set your default subscription by executing az account set -s "<YOUR_SUBSCRIPTION_NAME_OR_ID>". You can verify your default subscription by executing az account show, or if you have a machine setup similar to mine, by looking at ~/.azure/azureProfile.json.
  • Set your default resource group and workspace by executing az configure --defaults group="<YOUR_RESOURCE_GROUP>" workspace="<YOUR_WORKSPACE>". You can verify your defaults by executing az configure --list-defaults or by looking at ~/.azure/config.
  • You can now open the Azure Machine Learning studio, where you’ll be able to see and manage all the machine learning resources we’ll be creating.
  • Although not essential to run the code in this post, I highly recommend installing the Azure Machine Learning extension for VS Code, if you’re using VS Code.
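
If you prefer to double-check your defaults from a script, here is a minimal sketch that parses text in the INI layout used by ~/.azure/config. It assumes the defaults live in a [defaults] section with group and workspace keys; the helper name read_azure_defaults and the sample values are made up for illustration.

```python
import configparser


def read_azure_defaults(config_text: str) -> dict:
    """Returns the key/value pairs found in the [defaults] section, if any."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    if not parser.has_section('defaults'):
        return {}
    return dict(parser.items('defaults'))


# Hypothetical file contents; in practice you would read ~/.azure/config.
SAMPLE_CONFIG = """
[defaults]
group = my-resource-group
workspace = my-workspace
"""

defaults = read_azure_defaults(SAMPLE_CONFIG)
print(defaults.get('group'), defaults.get('workspace'))
```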

You’re now ready to start working with Azure ML!

Training and saving the models, and creating them on Azure

We’ll start by training two machine learning models to classify Fashion MNIST images — one using PyTorch and another using TensorFlow. If you’d like to explore the training code in detail, check out my previous posts on PyTorch, Keras and TensorFlow. The code associated with this post already includes pre-trained models, so you can just use them as-is. But if you’d like to recreate them, you can set up your machine using the conda files provided and run the training code, which is in pytorch-src/ and tf-src/

In my managed online endpoints post, I save and load just the weights of the models. Here I’ll work with the whole model, which is very similar. First, during training, we need to save the model. This can be done with the code below:

pytorch-src/
torch.save(model, MODEL_PATH)

Next, we need to create the models on Azure. There are many ways to create resources on Azure. My preferred way is to use a separate YAML file for each resource and a CLI command to kick-off the remote creation, so that’s what I’ll show here. Below you can see the YAML files we’ll use in the creation of these models.

name: model-pytorch-batch-fashion
version: 1
local_path: "../pytorch-model/model.pth"

name: model-tf-batch-fashion
version: 1
local_path: "../tf-model/"

If you read my managed online endpoints post, you should already be familiar with the YAML in these files. Please refer to that post for more details on the contents of these files, as well as the best way to create them from scratch.

We’re now ready to create the models on Azure, which we can do with the following CLI commands:

az ml model create -f batch-endpoint/cloud/model-pytorch-batch-fashion.yml
az ml model create -f batch-endpoint/cloud/model-tf-batch-fashion.yml

If you go to the Azure ML studio, and use the left navigation to go to the “Models” page, you’ll see our newly created models listed there.

In order to deploy our Azure ML endpoints, we’ll use endpoint and deployment YAML files to specify the details of the endpoint configurations. I’ll show bits and pieces of these YAML files throughout the rest of this post as I present each setting. Let’s start by taking a look at how the deployment YAML files refer to the models we created on Azure:

model: azureml:model-pytorch-batch-fashion:1
model: azureml:model-tf-batch-fashion:1
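
The two references above appear to follow an azureml:<name>:<version> pattern. As a small illustration, the sketch below builds and splits such strings. The helper names are hypothetical, and the pattern is inferred only from the two lines above, so treat it as an assumption rather than a specification.

```python
def model_reference(name: str, version: int) -> str:
    """Builds a reference string in the azureml:<name>:<version> form."""
    return f'azureml:{name}:{version}'


def parse_model_reference(reference: str) -> tuple:
    """Splits a reference string back into (name, version)."""
    prefix, name, version = reference.split(':')
    if prefix != 'azureml':
        raise ValueError(f'Unexpected prefix: {prefix}')
    return name, int(version)


reference = model_reference('model-pytorch-batch-fashion', 1)
print(reference)
print(parse_model_reference(reference))
```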

Creating the scoring files

When invoked, our endpoint will call a scoring file, which we need to provide. Just like the scoring file for managed online endpoints, this scoring file needs to follow a prescribed structure: it needs to contain an init() function and a run(...) function that are called when the batch job starts to run, after the endpoint is invoked. The init() function is only called once per instance, so it’s a good place to add shared operations such as loading the model. The run(...) function is called once per mini-batch and handles a single mini-batch.
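
To make this contract concrete, here is a toy sketch that runs entirely on your machine. The fake_driver function is a hypothetical stand-in for whatever Azure does behind the scenes, and str.upper stands in for a real model. It simplifies things to a single process, whereas a real job may spread mini-batches across several workers.

```python
calls = []


def init():
    """Runs once; a good place for expensive setup such as loading a model."""
    global model
    model = str.upper  # Stand-in for a real model.
    calls.append('init')


def run(mini_batch):
    """Handles one mini-batch and returns one result per input."""
    calls.append('run')
    return [f'{item}: {model(item)}' for item in mini_batch]


def fake_driver(mini_batches):
    """Hypothetical driver: calls init() once, then run() per mini-batch."""
    init()
    results = []
    for mini_batch in mini_batches:
        results.extend(run(mini_batch))
    return results


results = fake_driver([['a.png', 'b.png'], ['c.png']])
print(calls)    # ['init', 'run', 'run']
print(results)  # ['a.png: A.PNG', 'b.png: B.PNG', 'c.png: C.PNG']
```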

First we’ll take a look at the init() function for the PyTorch model (you’ll find similar TensorFlow code in the post’s project):

import argparse
import logging
import os

import torch
from PIL import Image
from torch import Tensor, nn
from torchvision import transforms

def init():
    global logger
    global model
    global device

    arg_parser = argparse.ArgumentParser(description='Argument parser.')
    arg_parser.add_argument('--logging_level', type=str, help='logging level')
    args, _ = arg_parser.parse_known_args()
    logger = logging.getLogger(__name__)
    logger.setLevel(args.logging_level.upper())
    logger.info('Init started')

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    logger.info('Device: %s', device)

    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'model.pth')

    model = torch.load(model_path, map_location=device)
    model.eval()
    logger.info('Init completed')

In our scenario, the main task of this function is to load the model. The AZUREML_MODEL_DIR environment variable gives us the directory where the model is located on Azure, which we use to construct the model’s path. Once we have the model’s path, we use it to load the model. Because we saved the whole model, not just the weights, we can load it with torch.load directly, without having to instantiate the NeuralNetwork class first.
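
One practical wrinkle: AZUREML_MODEL_DIR is only set when the code runs on Azure, so os.getenv returns None on a local machine and the path join fails. If you want to smoke-test the scoring file locally, a fallback directory helps. The sketch below shows one way to do that; the resolve_model_path helper and the pytorch-model default are my own assumptions, not part of the project.

```python
import os


def resolve_model_path(file_name: str = 'model.pth',
                       local_dir: str = 'pytorch-model') -> str:
    """Uses AZUREML_MODEL_DIR when set, else a local fallback directory."""
    model_dir = os.getenv('AZUREML_MODEL_DIR', local_dir)
    return os.path.join(model_dir, file_name)


print(resolve_model_path())
```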

Notice that logging is done differently from online endpoints. Here, we create and configure a global logger variable, which we then use by calling logger.info(...). You can see in the code above that, in addition to logging the beginning and end of the function, I also log whether the code is running on GPU or CPU.
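
Here is a standalone sketch of the same logger setup that you can run without Azure. It differs from the scoring file in two ways, both of which are assumptions on my part: it takes the argument list as a parameter so it can be exercised locally, and it falls back to the info level when --logging_level is not supplied. The configure_logger name and the extra flag in the usage line are hypothetical.

```python
import argparse
import logging


def configure_logger(argv=None) -> logging.Logger:
    """Creates a logger whose level comes from --logging_level."""
    arg_parser = argparse.ArgumentParser(description='Argument parser.')
    arg_parser.add_argument('--logging_level', type=str, default='info',
                            help='logging level')
    # parse_known_args ignores any flags this parser does not define.
    args, _ = arg_parser.parse_known_args(argv)
    logger = logging.getLogger('score')
    logger.setLevel(args.logging_level.upper())
    return logger


logger = configure_logger(['--logging_level', 'debug', '--some_other_flag', '1'])
logger.info('Init started')
```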

Now let’s look at the run(...) function:

labels_map = {
    0: 'T-Shirt',
    1: 'Trouser',
    2: 'Pullover',
    3: 'Dress',
    4: 'Coat',
    5: 'Sandal',
    6: 'Shirt',
    7: 'Sneaker',
    8: 'Bag',
    9: 'Ankle Boot',
}

def predict(model: nn.Module, x: Tensor) -> torch.Tensor:
    with torch.no_grad():
        y_prime = model(x)
        probabilities = nn.functional.softmax(y_prime, dim=1)
        predicted_indices = probabilities.argmax(1)
    return predicted_indices

def run(mini_batch):
    logger.info('run(%s) started: %s', mini_batch, __file__)
    predicted_names = []
    transform = transforms.ToTensor()

    for image_path in mini_batch:
        image = Image.open(image_path)
        tensor = transform(image).to(device)
        predicted_index = predict(model, tensor).item()
        predicted_names.append(f'{image_path}: {labels_map[predicted_index]}')

    logger.info('Run completed')
    return predicted_names

In my blog post about managed online endpoints, the run(...) function receives a JSON file as a parameter. Batch endpoints work a bit differently — here the run(...) function receives a list of file paths for a mini-batch of data. The data is specified when invoking the endpoint, and the mini-batch size is specified in the deployment YAML file, as we’ll see soon. In this scenario, we’ll invoke the endpoint by referring to the sample-request directory, which contains several images of clothing items, and we’ll set the mini-batch size to 10. Therefore, the run(...) method receives file paths for 10 images within the sample-request directory.
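
To get a feel for the numbers, the sketch below splits a list of file paths into mini-batches of a given size. It only illustrates the arithmetic: how Azure actually partitions and orders the files is not something I can vouch for, and the file names and the count of 200 images are hypothetical.

```python
def split_into_mini_batches(file_paths, mini_batch_size):
    """Splits a list of paths into consecutive chunks of mini_batch_size."""
    return [file_paths[i:i + mini_batch_size]
            for i in range(0, len(file_paths), mini_batch_size)]


# Hypothetical request with 200 images.
paths = [f'sample-request/{i:03}.png' for i in range(1, 201)]
mini_batches = split_into_mini_batches(paths, 10)
print(len(mini_batches), len(mini_batches[0]))  # 20 10
```

With these numbers, run(...) would be called 20 times, each time with 10 file paths.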

For each image in the mini-batch, we transform it into a PyTorch tensor, and pass it as a parameter to our predict(...) function. We then append the prediction to a predicted_names list, and return that list as the prediction result.

Let’s now look at how we specify the location of the scoring file and the mini-batch size in the deployment YAML files:

code_configuration:
  code:
    local_path: ../../pytorch-src/
mini_batch_size: 10

code_configuration:
  code:
    local_path: ../../tf-src/
mini_batch_size: 10

Creating the environments

An Azure Machine Learning environment specifies the runtime where we can run training and prediction code on Azure, along with any additional configuration. In my blog post about managed online endpoints, I present three different ways to create the inference environment for an endpoint: prebuilt Docker images for inference, base images, and user-managed environments. I also describe all the options for adding additional packages available for curated environments and base images.

Batch endpoints also support all three options for creating environments, but they don’t support extending prebuilt Docker images with conda files. In this post’s scenario, we need the Pillow package to read our images in the scoring file, which none of the prebuilt Docker images available includes. Therefore, we use base images and extend them with conda files that install Pillow as well as other packages.

Let’s take a look at the conda files used to extend the base images:

name: pytorch-batch-endpoint-score
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - numpy=1.20
  - python=3.7
  - pytorch=1.7
  - pillow=8.3.1
  - torchvision=0.8.1
  - pip
  - pip:
    - azureml-defaults==1.32.0

name: tf-batch-endpoint-score
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.7
  - pillow=8.3.1
  - pip
  - pip:
    - tensorflow==2.4
    - azureml-defaults==1.32.0

Notice that each of the conda files above includes the azureml-defaults package, which is required for inference on Azure.

We can now create an environment using a base image and each of the conda files above, which we do directly in the deployment YAML files:

  conda_file: score-conda.yml
  conda_file: score-conda.yml

Notice how I specify that I want the latest version available of that image by using the “latest” tag. This is a super handy feature!

Creating the compute cluster

Next, let’s create the compute cluster, where we specify the size of the virtual machine we’ll use to run inference, and how many instances of that VM we want running in the cluster.

name: cluster-cpu
type: amlcompute
size: Standard_DS3_v2
min_instances: 0
max_instances: 4

First we need to specify the name for the cluster — I decided on the descriptive cluster-cpu name. Then we need to choose the compute type. Currently the only compute type supported is amlcompute, so that’s what we specify.

Next we need to choose a VM size. You can see a full list of supported VM sizes in the documentation. I decided to choose a Standard_DS3_v2 VM (a small VM without a GPU) because our inferencing scenario is simple.

And last, I specify that I want a minimum of zero VM instances, and a maximum of four. Depending on the workload at each moment, Azure will decide how many VMs to run and will distribute the work across the VMs appropriately.

We can now create our compute cluster:

az ml compute create -f batch-endpoint/cloud/cluster-cpu.yml

You can go to the Azure ML studio, use the left navigation to go to the “Compute” page, click on “Compute clusters,” and see our newly created compute cluster listed there.

We’re now ready to refer to our compute cluster from within the deployment YAML files:

cloud/endpoint-1/deployment.yml
compute: azureml:cluster-cpu

Creating the endpoints

By now, you’ve seen almost every line of the YAML files used to create the endpoints. Let’s take a look at the complete deployment and endpoint files for endpoint-1 to see what else we’re missing.

name: endpoint-batch-fashion-1
auth_mode: aad_token

name: blue
endpoint_name: endpoint-batch-fashion-1
model: azureml:model-pytorch-batch-fashion:1
code_configuration:
  code:
    local_path: ../../pytorch-src/
environment:
  conda_file: score-conda.yml
compute: azureml:cluster-cpu
mini_batch_size: 10
output_file_name: predictions_pytorch.csv

The schema provides VS Code with the information it needs to make suggestions and warn us of problems. You may have noticed that a schema is present in the YAML file for every resource we’ve created so far. The Azure ML extension for VS Code is useful for those situations when you want to create a resource but are not sure which schema to use. If you have this extension installed, you can click on the Azure icon in the left navigation of VS Code, select your subscription and workspace, then click on the ”+” icon next to a resource type to create a template YAML file for that resource.

You’ll need to specify a name for your endpoint — just make sure that you pick a name that is unique within your resource group’s region. This means that you may need to change the name of the endpoints I provide, which you can do by changing the file directly or by specifying a new name in the endpoint creation command (we’ll see this later). For batch endpoints, always specify the auth_mode to be aad_token.

Keep in mind that unlike managed online endpoints, batch endpoints don’t support blue-green deployment. In my managed online endpoints post, we added two deployments with traffic set to 90 and 10, to test a new version of our deployment on 10% of the inference calls. With batch endpoints, we can also have several deployment files with the same endpoint_name. However, only one deployment can be set as default, and the default deployment gets 100% of the traffic.

Let’s continue exploring the deployment YAML file. The deployment needs to have an endpoint_name, and that name needs to match the name specified in the endpoint YAML file. We’ve already explored in detail the model, code_configuration, environment, compute and mini_batch_size sections. The output_file_name is self-explanatory: it’s the name of the file that will contain all the predictions for our inputs. I’ll show you later where to find it.

The second endpoint is very similar to this one. The main differences are that it points to the TensorFlow model and scoring code, and that it writes its predictions to predictions_tf.csv. Now that you understand the endpoint configuration YAML files in detail, you’re ready to create the endpoints:

az ml batch-endpoint create -f batch-endpoint/cloud/endpoint-1/endpoint.yml --name <ENDPOINT1>
az ml batch-deployment create -f batch-endpoint/cloud/endpoint-1/deployment.yml --set-default --endpoint-name <ENDPOINT1>
az ml batch-endpoint create -f batch-endpoint/cloud/endpoint-2/endpoint.yml --name <ENDPOINT2>
az ml batch-deployment create -f batch-endpoint/cloud/endpoint-2/deployment.yml --set-default --endpoint-name <ENDPOINT2>

If you didn’t specify a unique name in the YAML files, you can do that in the CLI command by replacing <ENDPOINT1> and <ENDPOINT2> with your unique names. Also, notice how we set the deployment as default in the CLI, at the time of its creation.
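
If you need help coming up with a unique name, one simple approach is to append a short random suffix to a base name. The sketch below does that with a UUID. The unique_endpoint_name helper is hypothetical, and I'm not making any claims about Azure's exact naming rules (length limits, allowed characters), so check the result against the documentation before using it.

```python
import uuid


def unique_endpoint_name(base: str) -> str:
    """Appends an 8-character random hex suffix to the base name."""
    suffix = uuid.uuid4().hex[:8]
    return f'{base}-{suffix}'


print(unique_endpoint_name('endpoint-batch-fashion-1'))
```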

You can now go to the Azure ML studio to see your endpoints in the UI. Click on “Endpoints” in the left navigation, then “Batch endpoints” in the top navigation, and you’ll see them listed, as you can see in the image below:

Screenshot of Azure ML studio showing the endpoints we created.

Creating the request files

Next we’ll explore our request files: the list of files we’ll specify when invoking the endpoint, which will then be passed to the run(...) function of the scoring file for inference. If you look at the accompanying project on GitHub, you’ll see a directory called sample-request containing several images of size 28 × 28 pixels, representing clothing items. When invoking the endpoint, we’ll provide the path to this directory.

I decided to include the sample-request directory in the git repo for simplicity. If you want to recreate it, you’ll first need to create the conda environment specified in conda-pytorch.yml (if you haven’t already), then activate it, and finally run the code in the pytorch-src/ file.

from torchvision import datasets
import os

DATA_PATH = 'batch-endpoint/data'
SAMPLE_REQUEST = 'batch-endpoint/sample-request'

def main() -> None:
    """Creates a sample request to be used in prediction."""

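    test_data = datasets.FashionMNIST(
        root=DATA_PATH,
        train=False,
        download=True,
    )  # Assumed arguments: the test split, downloaded to DATA_PATH.

    os.makedirs(name=SAMPLE_REQUEST, exist_ok=True)
    for i, (image, _) in enumerate(test_data):
        if i == 200:
            break
        # Assumed file naming scheme (001.png, 002.png, ...).
        image.save(os.path.join(SAMPLE_REQUEST, f'{i + 1:03}.png'))

if __name__ == '__main__':
    main()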

Invoking the endpoints using CLI

Now that you have the endpoint YAML files and a directory with sample requests, you can invoke the endpoints using the following commands:

az ml batch-endpoint invoke --name <ENDPOINT1> --input-local-path batch-endpoint/sample-request
az ml batch-endpoint invoke --name <ENDPOINT2> --input-local-path batch-endpoint/sample-request

Make sure you use the names you chose in the creation of your endpoints.

Unlike with managed online endpoints, the invocation call will not immediately return the result of your predictions — instead, it kicks off an asynchronous inference run that will produce predictions at a later time. Let’s go to the Azure ML studio and see what’s going on. Click on “Endpoints” in the left navigation, then “Batch endpoints,” and then on the name of one of your endpoints. You’ll be led to a page with two tabs: “Details,” which shows the information you specified in the endpoint’s YAML file, and “Runs,” where we can see the status of asynchronous inference runs associated with the endpoint. Let’s click on “Runs.” You’ll see all the runs that were kicked off by the invoke command, with a status that may be “Running,” “Completed,” or “Failed.”

Screenshot of Azure ML studio run associated with an endpoint.
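
Because the work happens asynchronously, any code that depends on the predictions has to wait for the run to finish. The sketch below shows a generic polling loop for that. The get_status callable is a hypothetical placeholder for however you choose to query the run's status (I'm not showing a real Azure API here), and the status strings are the ones displayed in the studio.

```python
import time


def wait_for_completion(get_status, poll_seconds=30.0, timeout_seconds=3600.0,
                        sleep=time.sleep):
    """Polls get_status() until it reports 'Completed' or 'Failed'."""
    waited = 0.0
    while True:
        status = get_status()
        if status in ('Completed', 'Failed'):
            return status
        if waited >= timeout_seconds:
            raise TimeoutError(f'Run still {status} after {waited} seconds')
        sleep(poll_seconds)
        waited += poll_seconds


# Simulated run that completes on the third status check.
statuses = iter(['Running', 'Running', 'Completed'])
final_status = wait_for_completion(lambda: next(statuses), poll_seconds=0.0)
print(final_status)  # Completed
```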

Now let’s click on the “Display name” of the latest “Completed” run. (If your run is still running, feel free to click on it anyway to see the logs arriving in real time.) This will take you to a diagram that includes a “score” section in green, with the word “Completed.”

Screenshot of Azure ML studio showing a diagram of a completed run.

Next, right-click on the score section and choose “View log” to see the logs for this run.

Screenshot of Azure ML studio showing a diagram of a completed run and a context menu showing "View log".

This will take you to the following page, which shows all the logs for the run. You can read more about what each log means in the documentation. When a run completes successfully, I’m mostly interested in looking at the logs I added in the init() and run(...) functions of the scoring file. Those can be found under logs/user/stdout/. As you can see below, the logs in the init() function appear once, and the logs in the run(...) function appear as many times as the number of mini-batches in the sample request. Here’s what I see after a successful run:

Screenshot of Azure ML studio showing the logs for a completed run.

I encourage you to spend some time getting familiar with the structure of the logs.

Once a run completes successfully, you’ll want to look at the results of the prediction, which are stored in blob storage. You can access these by going back to your run diagram, and right-clicking on the little circle below the “Completed” section. Then choose “Access data” from the context menu.

Screenshot of Azure ML studio showing the link to the predictions.

This takes you to a blob storage location where you can see a file with the name you specified as output_file_name in the deployment YAML file, which in our scenario is either predictions_tf.csv or predictions_pytorch.csv. Right-click on the filename (or click the triple-dot icon to its right) to show a menu of options, including an option to “View/edit” and another to “Download.” These CSV files contain one prediction per line, as we can see below:

Screenshot of predictions file.

Each clothing item in this file corresponds to the prediction for one of the images in the sample request. This is a great achievement — we got our predictions!
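
Once you've downloaded a predictions file, a few lines of Python are enough to summarize it. The sketch below counts how often each label was predicted. It assumes each line has the same "<image path>: <label>" shape as the strings returned by run(...), and the sample lines are made up, so adjust the parsing if your file looks different.

```python
from collections import Counter


def count_predictions(lines):
    """Counts labels in lines shaped like '<image path>: <label>'."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the last ': ' so colons in the path don't interfere.
        _, label = line.rsplit(': ', 1)
        counts[label] += 1
    return counts


# Hypothetical contents of a predictions file.
sample_lines = [
    '/mnt/data/001.png: Ankle Boot',
    '/mnt/data/002.png: Pullover',
    '/mnt/data/003.png: Ankle Boot',
]
print(count_predictions(sample_lines).most_common())
```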

Invoking the endpoints using REST

Alternatively, you can invoke your endpoints using a curl POST command. In this case, we first need to create a dataset with the input data, which we then pass as a parameter to the curl command. Our input data is in the sample-request folder, therefore here’s what the YAML file for our dataset looks like:

name: dataset-input-batch-fashion
local_path: ../sample-request/

We can now create the dataset with the following CLI command:

az ml dataset create -f batch-endpoint/cloud/dataset-input-batch-fashion.yml

If you go to the Azure ML studio and click on “Datasets” in the left navigation, you’ll see your newly created dataset there.

Now let’s look at what the REST call looks like. The rest/ file contains all the commands you need to invoke the batch endpoint using REST. You can reuse this file for any of your projects, by simply replacing the ENDPOINT_NAME, DATASET_NAME, and DATASET_VERSION values with the appropriate information.


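# Replace these values with your own endpoint name, dataset name, and dataset version (version 1 is assumed here).
ENDPOINT_NAME=endpoint-batch-fashion-1
DATASET_NAME=dataset-input-batch-fashion
DATASET_VERSION=1

SUBSCRIPTION_ID=$(az account show --query id | tr -d '\r"')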

RESOURCE_GROUP=$(az group show --query name | tr -d '\r"')

WORKSPACE=$(az configure -l | jq -r '.[] | select(.name=="workspace") | .value')

SCORING_URI=$(az ml batch-endpoint show --name $ENDPOINT_NAME --query scoring_uri -o tsv)

SCORING_TOKEN=$(az account get-access-token --resource --query accessToken -o tsv)

curl --location --request POST $SCORING_URI \
--header "Authorization: Bearer $SCORING_TOKEN" \
--header "Content-Type: application/json" \
--data-raw "{
    \"properties\": {
        \"dataset\": {
            \"dataInputType\": \"DatasetVersion\",
            \"datasetName\": \"$DATASET_NAME\",
            \"datasetVersion\": \"$DATASET_VERSION\"
        },
        \"outputDataset\": {
            \"datastoreId\": \"/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.MachineLearningServices/workspaces/$WORKSPACE/datastores/workspaceblobstore\",
            \"path\": \"$ENDPOINT_NAME\"
        }
    }
}"

Notice that before we execute the curl command, we query for a scoring token, which we then use as the bearer token in our POST call.
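
Escaping JSON inside a shell string is error-prone, so if you ever want to send the same request from Python, it may be easier to build the body as a dictionary and serialize it. The sketch below mirrors the payload in the curl command. The build_request_body helper and the placeholder values are hypothetical, and the code only constructs and prints the body; it doesn't send anything.

```python
import json


def build_request_body(subscription_id, resource_group, workspace,
                       endpoint_name, dataset_name, dataset_version):
    """Builds the same JSON payload that the curl command sends."""
    datastore_id = (
        f'/subscriptions/{subscription_id}'
        f'/resourceGroups/{resource_group}'
        '/providers/Microsoft.MachineLearningServices'
        f'/workspaces/{workspace}/datastores/workspaceblobstore'
    )
    return {
        'properties': {
            'dataset': {
                'dataInputType': 'DatasetVersion',
                'datasetName': dataset_name,
                'datasetVersion': dataset_version,
            },
            'outputDataset': {
                'datastoreId': datastore_id,
                'path': endpoint_name,
            },
        }
    }


# Placeholder values for illustration only.
body = build_request_body(
    subscription_id='00000000-0000-0000-0000-000000000000',
    resource_group='my-resource-group',
    workspace='my-workspace',
    endpoint_name='endpoint-batch-fashion-1',
    dataset_name='dataset-input-batch-fashion',
    dataset_version='1',
)
print(json.dumps(body, indent=4))
```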

Now you can simply run this script to invoke the batch endpoint.


Invoking the endpoint this way triggers the same sequence of events in the Azure ML studio that we’ve already covered in the previous section.


In this post, you learned how to create a batch endpoint on Azure ML. You learned how to write a scoring file, and how to create model and cluster resources on Azure ML. Then you learned how to use those resources to create the endpoint itself, and how to invoke it by giving it a directory of image resources. And finally, you learned to look at the logs and at the file containing the predictions. Congratulations on acquiring a new skill!

The project associated with this post can be found on GitHub.

Thank you to Tracy Chen from Microsoft for reviewing the content in this post.