Creating batch endpoints in Azure ML





Suppose you’ve trained a machine learning model to accomplish some task, and you’d now like to provide that model’s inference capabilities as a service. Maybe you’re writing an application of your own that will rely on this service, or perhaps you want to make the service available to others. This is the purpose of endpoints — they provide a simple web-based API for feeding data to your model and getting back inference results.

Azure ML currently supports three types of endpoints: batch endpoints, Kubernetes online endpoints, and managed online endpoints. I’m going to focus on batch endpoints in this post, but let me start by explaining how the three types differ.

Diagram showing an overview of the types of endpoints.

Batch endpoints are designed to handle large requests, working asynchronously and generating results that are held in blob storage. Because compute resources are only provisioned when the job starts, the latency of the response is higher than with online endpoints. However, that can result in substantially lower costs. Online endpoints, on the other hand, are designed to quickly process smaller requests and provide near-immediate responses. Compute resources are provisioned at the time of deployment and are always up and running, which, depending on your scenario, may mean higher costs than batch endpoints. However, you get real-time responses, which is critical to many scenarios. If you want to deploy an online endpoint, you have two options: Kubernetes online endpoints allow you to manage your own compute resources using Kubernetes, while managed online endpoints rely on Azure to manage compute resources, OS updates, scaling, and security. For more information about the different endpoint types and which one is right for you, check out the documentation.

If you’re interested in managed online endpoints, check out my previous post. In this post, I’ll show you how to work with batch endpoints. We’ll start by getting familiar with our PyTorch model. We’ll then write a scoring function that loads the model and performs predictions based on user input. After that, we’ll explore how we can create a batch endpoint on Azure, which will require the creation of several resources in the cloud. And finally, we’ll see how we can invoke the endpoint. The code for this project can be found on GitHub.

Throughout this post, I’ll assume you’re familiar with machine learning concepts like training and prediction, but I won’t assume familiarity with Azure.

Azure ML setup

Here’s how you can set up Azure ML to follow the steps in this post.

  • You need to have an Azure subscription. You can get a free subscription to try it out.
  • Create a resource group.
  • Create a new machine learning workspace by following the “Create the workspace” section of the documentation. Keep in mind that you’ll be creating a “machine learning workspace” Azure resource, not a “workspace” Azure resource, which is entirely different!
  • If you have access to GitHub Codespaces, click on the “Code” button in this GitHub repo, select the “Codespaces” tab, and then click on “New codespace.”
  • Alternatively, if you plan to use your local machine:
    • Install the Azure CLI by following the instructions in the documentation.
    • Install the ML extension to the Azure CLI by following the “Installation” section of the documentation.
  • In a terminal window, log in to Azure by executing az login --use-device-code.
  • Set your default subscription by executing az account set -s "<YOUR_SUBSCRIPTION_NAME_OR_ID>". You can verify your default subscription by executing az account show, or by looking at ~/.azure/azureProfile.json.
  • Set your default resource group and workspace by executing az configure --defaults group="<YOUR_RESOURCE_GROUP>" workspace="<YOUR_WORKSPACE>". You can verify your defaults by executing az configure --list-defaults or by looking at ~/.azure/config.
  • You can now open the Azure Machine Learning studio, where you’ll be able to see and manage all the machine learning resources we’ll be creating.
  • Although not essential to run the code in this post, I highly recommend installing the Azure Machine Learning extension for VS Code.

You’re now ready to start working with Azure ML!

Training and saving the model

To keep this post simple and focused on endpoints, I provide the already trained model in the GitHub project, under model. This way you can go straight to learning Azure ML endpoints without having to run any code.

If you want to re-create the model provided, you first need to create and activate the conda environment. If you’re running this project on Codespaces, there’s nothing to do — the conda environment is created and activated automatically when the container is created. If you’re running the code locally, you’ll need to execute the following commands from the root of the GitHub repo:

conda env create -f environment.yml
conda activate aml-batch-endpoint

You can then run the training script in the src/ directory. This file saves the model’s weights using code along these lines:
    torch.save(..., path)

For a full explanation of the PyTorch training code, check out my PyTorch blog post.

If you’d like to train on Azure, you can look at the documentation on how to do that.

Creating the model on Azure

Before we can deploy our ML model, we need to create an Azure ML resource that brings it to the cloud. There are a few different ways to create Azure resources — my preferred way is to use YAML files, so this is the method I’ll show in this post.

Below you can see the YAML file we’ll use in the creation of this model:
name: model-batch
version: 1
path: "../model/weights.pth"

If you read my managed online endpoints post, you should already be familiar with the YAML in this file. Please refer to that post for more details on the contents of this file, as well as the best way to create it from scratch.

We’re now ready to create the model on Azure, which we can do by right-clicking on the opened YAML file and selecting “Azure ML: Execute YAML,” or by executing the following CLI command in the terminal:

az ml model create -f cloud/model.yml

If you go to the Azure ML studio, and use the left navigation to go to the “Models” page, you’ll see our newly created model listed there.

In order to deploy our Azure ML endpoint, we’ll use endpoint and deployment YAML files to specify the details of the endpoint configuration. I’ll show bits and pieces of these YAML files throughout the rest of this post as I present each setting. Let’s start by taking a look at how the deployment YAML file refers to the model we created on Azure:
model: azureml:model-batch@latest

Notice that I added a @latest tag after the model name, which selects the latest version of this resource.

Creating the scoring file

When invoked, our endpoint will call a scoring file, which we need to provide. Just like the scoring file for managed online endpoints, this file needs to follow a prescribed structure: it needs to contain an init() function and a run(...) function. In batch endpoints, these functions are called when the batch job starts to run, after the endpoint is invoked. The init() function is only called once per instance, so it’s a good place to add shared operations such as loading the model. The run(...) function is called once per mini-batch (I’ll come back to this later).

First we’ll take a look at the init() function:
import argparse
import logging
import os

import torch
from PIL import Image
from torch import Tensor, nn
from torchvision import transforms

from neural_network import NeuralNetwork


logger = None
model = None
device = None


def init():
    global logger
    global model
    global device

    arg_parser = argparse.ArgumentParser(description='Argument parser.')
    arg_parser.add_argument('--logging_level', type=str, help='logging level')
    args, _ = arg_parser.parse_known_args()
    logger = logging.getLogger(__name__)
    logger.setLevel(args.logging_level.upper())
    logger.info('Init started')

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    logger.info('Device: %s', device)

    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'weights.pth')
    model = NeuralNetwork().to(device)
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.eval()
    logger.info('Init completed')


In our scenario, the main task of this function is to load the model. The AZUREML_MODEL_DIR environment variable gives us the directory where the model is located on Azure, which we use to construct the model’s path. Once we have the model’s path, we use it to load the model.
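
One thing to watch for when testing the scoring file outside of Azure: AZUREML_MODEL_DIR is usually not set on a local machine, and os.path.join raises a TypeError when it receives None. Here is a minimal sketch of a more defensive lookup. The helper name resolve_model_path and the local "model" fallback directory are my own assumptions for illustration, not part of the project.

```python
import os


def resolve_model_path(filename: str = "weights.pth", local_dir: str = "model") -> str:
    """Builds the model path from AZUREML_MODEL_DIR, falling back to a local directory.

    Both the helper name and the local_dir fallback are illustrative assumptions.
    """
    model_dir = os.getenv("AZUREML_MODEL_DIR", local_dir)
    return os.path.join(model_dir, filename)


# On Azure the variable is set by the platform; here it is set by hand to show the effect.
os.environ["AZUREML_MODEL_DIR"] = "some_model_dir"
print(resolve_model_path())
```

With a fallback like this, the same init() code could run both locally and in the cloud without changes.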

Notice that logging is done differently from online endpoints. Here, we create and configure a global logger variable, which we then use by calling logger.info(...). You can see in the code above that, in addition to logging the beginning and end of the function, I also log whether the code is running on GPU or CPU.
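
The init() function above calls args.logging_level.upper(), which would fail if the --logging_level argument were ever missing, because argparse would then supply None. The following sketch shows one way to guard against that by giving the argument a default. The helper name configure_logger, the logger name "score", and the "info" default are assumptions I made for this example.

```python
import argparse
import logging
from typing import List, Optional


def configure_logger(argv: Optional[List[str]] = None) -> logging.Logger:
    """Creates a logger whose level comes from --logging_level, defaulting to INFO.

    The helper name and the default level are assumptions for illustration.
    """
    parser = argparse.ArgumentParser(description="Argument parser.")
    parser.add_argument("--logging_level", type=str, default="info", help="logging level")
    # parse_known_args ignores any extra arguments passed alongside ours.
    args, _ = parser.parse_known_args(argv)

    new_logger = logging.getLogger("score")
    new_logger.setLevel(args.logging_level.upper())
    return new_logger


logger = configure_logger(["--logging_level", "debug", "--some_other_flag", "1"])
logger.info("Init started")
logger.info("Device: %s", "cpu")
```

Passing arguments lazily, as in logger.info("Device: %s", device), means the string is only formatted if the message is actually going to be emitted.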

Now let’s look at the run(...) function:

labels_map = {
    0: 'T-Shirt',
    1: 'Trouser',
    2: 'Pullover',
    3: 'Dress',
    4: 'Coat',
    5: 'Sandal',
    6: 'Shirt',
    7: 'Sneaker',
    8: 'Bag',
    9: 'Ankle Boot',
}


def predict(trained_model: nn.Module, x: Tensor) -> torch.Tensor:
    with torch.no_grad():
        y_prime = trained_model(x)
        probabilities = nn.functional.softmax(y_prime, dim=1)
        predicted_indices = probabilities.argmax(1)
    return predicted_indices


def run(mini_batch):
    logger.info('run(%s) started: %s', mini_batch, __file__)
    predicted_names = []
    transform = transforms.ToTensor()

    for image_path in mini_batch:
        image = Image.open(image_path)
        tensor = transform(image).to(device)
        predicted_index = predict(model, tensor).item()
        predicted_names.append(f'{image_path}: {labels_map[predicted_index]}')

    logger.info('Run completed')
    return predicted_names

In my blog post about managed online endpoints, the run(...) function receives a JSON file as a parameter. Batch endpoints work a bit differently: here the run(...) function receives a list of file paths for a mini-batch of data. The data is specified when invoking the endpoint, and the mini-batch size is specified in the deployment YAML file, as we’ll see soon. In this scenario, we’ll invoke the endpoint by referring to the sample-request directory, which contains several images of clothing items, and we’ll set the mini-batch size to 10. Therefore, each call to the run(...) function receives file paths for up to 10 images within the sample-request directory.
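
To make the call pattern concrete, here is a small simulation of a single instance processing a job: init() runs once, and run(...) runs once per mini-batch. The simulate_instance driver and the file names are hypothetical. On Azure, the batch runtime takes care of splitting the input and may spread the mini-batches across several instances.

```python
from typing import List

calls = {"init": 0, "run": 0}  # counters that let us observe the call pattern


def init() -> None:
    """Stand-in for the scoring file's init(): one-time setup such as loading the model."""
    calls["init"] += 1


def run(mini_batch: List[str]) -> List[str]:
    """Stand-in for run(...): returns one result string per input file."""
    calls["run"] += 1
    return [f"{path}: <prediction>" for path in mini_batch]


def simulate_instance(files: List[str], mini_batch_size: int) -> List[str]:
    """Hypothetical driver that mimics one instance working through a job."""
    init()
    results: List[str] = []
    for start in range(0, len(files), mini_batch_size):
        results.extend(run(files[start:start + mini_batch_size]))
    return results


files = [f"image_{i}.png" for i in range(25)]
results = simulate_instance(files, mini_batch_size=10)
print(calls)  # {'init': 1, 'run': 3}
```

With 25 files and a mini-batch size of 10, run(...) is called three times, and the last mini-batch contains only five files.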

For each image in the mini-batch, we transform it into a PyTorch tensor, and pass it as a parameter to our predict(...) function. We then append the prediction to a predicted_names list, and return that list as the prediction result.
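
A side note on predict(...): it applies softmax before taking the argmax. Because softmax preserves the ordering of its inputs, the predicted index is the same with or without it, so softmax only matters if you also want to report probabilities. The sketch below illustrates this in plain Python with made-up raw outputs. These are not values produced by the real model, and the helper functions are written here just for the example.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of raw model outputs."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def argmax(values: List[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


# Made-up raw outputs for the 10 clothing classes.
logits = [0.1, 2.3, -1.0, 0.5, 0.0, 4.2, 1.1, -0.3, 0.9, 3.0]
probabilities = softmax(logits)

predicted_index = argmax(probabilities)
print(predicted_index, argmax(logits))  # 5 5
```

Index 5 corresponds to 'Sandal' in labels_map.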

Let’s now look at how we specify the location of the scoring file and the mini-batch size in the deployment YAML file:
code_configuration:
  code: ../../src/
mini_batch_size: 10

Creating the environment

An Azure Machine Learning environment specifies the runtime where we can run training and prediction code on Azure, along with any additional configuration. In my blog post about environments on Azure ML, I present three different options for creating environments. Batch endpoints support all three options, but they don’t support extending curated environments with conda files. In this post’s scenario, we need the Pillow package to read our images in the scoring file, which none of the curated environments available includes. Therefore, we use a system-managed environment and extend it with a conda file that installs Pillow as well as other packages.

Let’s take a look at the conda file used to extend the base image:
name: aml-batch-endpoint
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - numpy=1.20
  - python=3.7
  - pytorch=1.7
  - pillow=8.3.1
  - torchvision=0.8.1
  - pip
  - pip:
    - azureml-defaults==1.32.0

Notice that the conda file above includes the azureml-defaults package, which is required for inference on Azure.

We can now specify our environment in the deployment YAML file, which is created using a base image and the conda file above:
environment:
  conda_file: score-conda.yml

Notice how I specify that I want the latest version available of that image by using the “latest” tag. This is a super handy feature!

Creating the compute cluster

Next, let’s create the compute cluster, where we specify the size of the virtual machine we’ll use to run inference, and how many instances of that VM we want running in the cluster. You can learn more about compute on Azure ML in general and compute clusters in particular in my blog post about compute on Azure ML.
name: cluster-cpu
type: amlcompute
size: Standard_DS3_v2
min_instances: 0
max_instances: 4

As you can see, I decided to choose a Standard_DS3_v2 VM (a small VM without a GPU) because our inferencing scenario is simple. I also decided that I want a minimum of zero VM instances, and a maximum of four. Depending on the workload at each moment, Azure will decide how many VMs to run, and it will distribute the work across the VMs appropriately.

We can now create our compute cluster:

az ml compute create -f cloud/cluster-cpu.yml

You can go to the Azure ML studio, use the left navigation to go to the “Compute” page, click on “Compute clusters,” and see our newly created compute cluster listed there.

We’re now ready to refer to our compute cluster from within the deployment YAML file:
compute: azureml:cluster-cpu

Creating the endpoint

By now, you’ve seen almost every line of YAML used to create the endpoint. Let’s take a look at the complete deployment and endpoint files to see what else we’re missing.
# cloud/endpoint/endpoint.yml
name: endpoint-batch
auth_mode: aad_token

# cloud/endpoint/deployment.yml
name: blue
endpoint_name: endpoint-batch
model: azureml:model-batch@latest
code_configuration:
  code: ../../src/
environment:
  conda_file: score-conda.yml
compute: azureml:cluster-cpu
mini_batch_size: 10
output_file_name: predictions_pytorch.csv

The schema provides VS Code with the information it needs to make suggestions and warn us of problems. You may have noticed that a schema is present in the YAML file for every resource we’ve created so far. The Azure ML extension for VS Code is useful for those situations when you want to create a resource but are not sure which schema to use. If you have this extension installed, you can click on the Azure icon in the left navigation of VS Code, select your subscription and workspace, then click on the “+” icon next to a resource type to create a template YAML file for that resource.

You’ll need to specify a name for your endpoint — just make sure that you pick a name that is unique within your resource group’s region. This means that you may need to change the name of the endpoints I provide, which you can do by changing the file directly or by specifying a new name in the endpoint creation command (we’ll see this later). For batch endpoints, always specify the auth_mode to be aad_token.

Keep in mind that unlike managed online endpoints, batch endpoints don’t support blue-green deployment. In my managed online endpoints post we added two deployments with traffic set to 90 and 10, to test a new version of our deployment on 10% of the inference calls. In batch endpoints, we can also have several deployment files with the same endpoint_name. However, only one deployment can be set as default, and the default deployment gets 100% of the traffic.

Let’s now take a look at the deployment YAML file. The deployment needs to have an endpoint_name, which must match the name specified in the endpoint YAML file. We’ve already explored in detail the model, code_configuration, environment, compute, and mini_batch_size sections. The output_file_name is self-explanatory: it’s the name of the file that will contain all the predictions for our inputs. I’ll show you later where to find it.

Now that you understand the endpoint configuration YAML files in detail, you’re ready to create the endpoint:

az ml batch-endpoint create -f cloud/endpoint/endpoint.yml --name <ENDPOINT>
az ml batch-deployment create -f cloud/endpoint/deployment.yml --set-default --endpoint-name <ENDPOINT>

If you didn’t specify a unique name in the YAML files, you can do that in the CLI commands by replacing <ENDPOINT> with your unique name. Also, notice how we set the deployment as default at the time of its creation.

You can now go to the Azure ML studio to see your endpoints in the UI. Click on “Endpoints” in the left navigation, then “Batch endpoints” in the top navigation, and you’ll see them listed:

Screenshot of Azure ML studio showing the endpoints we created.

Creating the request files

Next we’ll explore our request files — the list of files we’ll specify when invoking the endpoint, which will then be passed to the run(...) function of the scoring file for inference. If you look at the accompanying project on GitHub, you’ll see a directory called sample-request containing several images of size 28 × 28 pixels, representing clothing items. When invoking the endpoint, we’ll provide the path to this directory.

I decided to include the sample-request directory in the git repo for simplicity. If you want to recreate it, you’ll first need to create the conda environment specified in the environment.yml at the root of the GitHub repo (if you haven’t already), then activate it, and finally run the sample-request script under src/, shown below.
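import os

from torchvision import datasets

DATA_PATH = 'aml-batch-endpoint/data'
SAMPLE_REQUEST = 'aml-batch-endpoint/sample-request'


def main() -> None:
    """Creates a sample request to be used in prediction."""

    # The arguments below are assumed: the Fashion MNIST test split, downloaded on demand.
    test_data = datasets.FashionMNIST(root=DATA_PATH, train=False, download=True)

    os.makedirs(name=SAMPLE_REQUEST, exist_ok=True)
    for i, (image, _) in enumerate(test_data):
        # Stop once the first 200 images have been saved.
        if i == 200:
            break
        # Without a transform, each item is a PIL image; the file naming is assumed.
        image.save(os.path.join(SAMPLE_REQUEST, f'{i}.png'))


if __name__ == '__main__':
    main()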
Invoking the endpoint using the CLI

Now that you have the endpoint YAML files and a directory with sample requests, you can invoke the endpoint using the following command:

az ml batch-endpoint invoke --input sample-request --name <ENDPOINT>

Make sure you use the name you chose in the creation of your endpoint.

Unlike with managed online endpoints, the invocation call will not immediately return the result of your predictions — instead, it kicks off an asynchronous inference run that will produce predictions at a later time. Let’s go to the Azure ML studio and see what’s going on. Click on “Endpoints” in the left navigation, then “Batch endpoints,” and then on the name of one of your endpoints. You’ll be led to a page with two tabs: “Details,” which shows the information you specified in the endpoint’s YAML file, and “Runs,” where we can see the status of asynchronous inference runs associated with the endpoint. Let’s click on “Runs.” You’ll see all the runs that were kicked off by the invoke command, with a status that may be “Running,” “Completed,” or “Failed.”

Screenshot of Azure ML studio run associated with an endpoint.

Now let’s click on the “Display name” of the latest “Completed” run. (If your run is still running, feel free to click on it anyway to see the logs coming in in real-time.) This will take you to a diagram that includes a “score” section in green, with the word “Completed.”

Screenshot of Azure ML studio showing a diagram of a completed run.

Next, right-click on the score section and choose “View log” to see the logs for this run.

Screenshot of Azure ML studio showing a diagram of a completed run and a context menu showing "View log".

This will take you to the following page, which shows all the logs for the run. You can read more about what each log means in the documentation. When a run completes successfully, I’m mostly interested in looking at the logs I added in the init() and run(...) functions of the scoring file. Those can be found under logs/user/stdout/. As you can see below, the logs in the init() function appear once, and the logs in the run(...) function appear as many times as the number of mini-batches in the sample request. Here’s what I see after a successful run:

Screenshot of Azure ML studio showing the logs for a completed run.

I encourage you to spend some time getting familiar with the structure of the logs.

Once a run completes successfully, you’ll also want to look at the results of the prediction, which are stored in blob storage. You can access these by going back to your run diagram, and right-clicking on the little “score” circle below the “Completed” section. Then choose “Access data” from the context menu.

Screenshot of Azure ML studio the link to show predictions.

This takes you to a blob storage location where you can see a file with the name you specified in the deployment YAML file, which in our scenario is predictions_pytorch.csv. Right-click on the filename (or click the triple-dot icon to its right) to show a menu of options, including an option to “View/edit” and another to “Download.” These CSV files contain one prediction per line, as you can see below:
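
Once you’ve downloaded the predictions file, you may want to process it in code. Below is a minimal sketch that assumes each line has the shape "<image_path>: <label>", matching the strings built in run(...). The helper name parse_predictions and the sample paths are made up for illustration. The real paths depend on where Azure places the input data.

```python
from collections import Counter
from typing import Dict, Iterable


def parse_predictions(lines: Iterable[str]) -> Dict[str, str]:
    """Maps each image path to its predicted label.

    Assumes each line looks like '<image_path>: <label>'. Splitting from the
    right keeps any colons that appear in the path itself intact.
    """
    predictions = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        path, label = line.rsplit(": ", 1)
        predictions[path] = label
    return predictions


# Made-up sample lines standing in for the contents of the predictions file.
sample = [
    "/mnt/input/sample-request/0.png: Ankle Boot",
    "/mnt/input/sample-request/1.png: Pullover",
    "/mnt/input/sample-request/2.png: Trouser",
    "/mnt/input/sample-request/3.png: Trouser",
]
predictions = parse_predictions(sample)
print(Counter(predictions.values()).most_common(1))  # [('Trouser', 2)]
```

To use this with the real file, you could pass an open file handle instead of the sample list, since a file iterates line by line.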

Screenshot of predictions file.

Each clothing item in this file corresponds to the prediction for one of the images in the sample request. This is a great achievement — we got our predictions!

Invoking the endpoint using REST

Alternatively, you can invoke your endpoint using a curl POST command. In this scenario, we first need to create a dataset with the input data, which we’ll then pass as a parameter to the curl command. Our input data is in the sample-request folder, therefore we specify that path in the YAML file that creates the dataset:
name: dataset-invoke-batch
local_path: ../sample-request/

We can now create the dataset with the following CLI command:

az ml dataset create -f cloud/dataset-invoke.yml

If you go to the Azure ML studio and click on “Datasets” in the left navigation, you’ll see your newly created dataset there.

Now let’s look at what the REST call looks like. The script in the rest/ directory contains all the commands you need to invoke the batch endpoint using REST. You can reuse this file for any of your projects by simply replacing ENDPOINT_NAME, DATASET_NAME, and DATASET_VERSION with the appropriate information.

SUBSCRIPTION_ID=$(az account show --query id | tr -d '\r"')

RESOURCE_GROUP=$(az group show --query name | tr -d '\r"')

WORKSPACE=$(az configure -l | jq -r '.[] | select(.name=="workspace") | .value')

SCORING_URI=$(az ml batch-endpoint show --name $ENDPOINT_NAME --query scoring_uri -o tsv)

SCORING_TOKEN=$(az account get-access-token --resource --query accessToken -o tsv)

curl --location --request POST "$SCORING_URI" \
--header "Authorization: Bearer $SCORING_TOKEN" \
--header "Content-Type: application/json" \
--data-raw "{
    \"properties\": {
        \"dataset\": {
            \"dataInputType\": \"DatasetVersion\",
            \"datasetName\": \"$DATASET_NAME\",
            \"datasetVersion\": \"$DATASET_VERSION\"
        },
        \"outputDataset\": {
            \"datastoreId\": \"/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.MachineLearningServices/workspaces/$WORKSPACE/datastores/workspaceblobstore\",
            \"path\": \"$ENDPOINT_NAME\"
        }
    }
}"

Notice that before we execute the curl command, we query for a scoring token, which we then use as the bearer token in our POST call.
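
If you’d rather issue the request from Python, the sketch below builds an equivalent POST request using only the standard library. It constructs the request without sending it, so you can inspect it first. The function name build_invoke_request is hypothetical, and every value passed to it here is a placeholder. The JSON property names mirror the ones in the curl command.

```python
import json
import urllib.request


def build_invoke_request(scoring_uri, token, dataset_name, dataset_version,
                         subscription_id, resource_group, workspace, endpoint_name):
    """Builds (but does not send) a POST request equivalent to the curl command."""
    datastore_id = (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.MachineLearningServices/workspaces/{workspace}"
        "/datastores/workspaceblobstore"
    )
    body = {
        "properties": {
            "dataset": {
                "dataInputType": "DatasetVersion",
                "datasetName": dataset_name,
                "datasetVersion": dataset_version,
            },
            "outputDataset": {
                "datastoreId": datastore_id,
                "path": endpoint_name,
            },
        }
    }
    return urllib.request.Request(
        url=scoring_uri,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# All values below are placeholders, not real identifiers.
request = build_invoke_request(
    scoring_uri="https://example.invalid/jobs",
    token="placeholder-token",
    dataset_name="dataset-invoke-batch",
    dataset_version="1",
    subscription_id="<SUBSCRIPTION_ID>",
    resource_group="<RESOURCE_GROUP>",
    workspace="<WORKSPACE>",
    endpoint_name="endpoint-batch",
)
print(request.get_method(), json.loads(request.data)["properties"]["dataset"])
```

Building the body as a dictionary and serializing it with json.dumps avoids the quote escaping that the shell version needs. To actually send the request, you could pass it to urllib.request.urlopen once the placeholders are replaced with real values.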

You can now run this script to invoke the batch endpoint:


Invoking the endpoint this way triggers the same sequence of events in the Azure ML studio that we covered in the previous section.


In this post, you learned how to create a batch endpoint on Azure ML. You learned how to write a scoring file, and how to create model and compute cluster resources on Azure ML. Then you learned how to use those resources to create the endpoint itself, and how to invoke it by giving it a directory of image resources. And finally, you learned to look at the logs and at the file containing the predictions. Congratulations on acquiring a new skill!

The project associated with this post can be found on GitHub.

Thank you to Tracy Chen from the Azure ML team at Microsoft for reviewing the content in this post.