Creating batch endpoints in Azure ML

Topic: Azure ML


Suppose you’ve trained a machine learning model to accomplish some task, and you’d now like to provide that model’s inference capabilities as a service. Maybe you’re writing an application of your own that will rely on this service, or perhaps you want to make the service available to others. This is the purpose of endpoints — they provide a simple web-based API for feeding data to your model and getting back inference results.

Azure currently supports two types of endpoints: batch endpoints, and online managed endpoints.

Diagram showing an overview of the types of endpoints.

Online managed endpoints are designed to quickly process smaller requests and provide near-immediate responses. Batch endpoints, on the other hand, are designed to handle large requests, working asynchronously and generating results that are held in blob storage. For more information about the different endpoint types and which one is right for you, check out the documentation.

If you’re interested in online managed endpoints, check out my previous post. In this post, I’ll show you how to work with batch endpoints. We’ll start by training and saving two machine learning models, one using PyTorch and another using TensorFlow. We’ll then write scoring functions that load the models and perform predictions based on user input. After that, we’ll explore how to create the batch endpoints on Azure, which requires creating several resources in the cloud. And finally, we’ll see how to invoke them. The code for this project can be found on GitHub.

Throughout this post, I’ll assume you’re familiar with machine learning concepts like training and prediction, but I won’t assume familiarity with Azure.

Azure ML prerequisites

Here’s how you can set up Azure ML to follow the steps in this post.

  • You need to have an Azure subscription. You can get a free subscription to try it out.
  • Create a resource group.
  • Create a new machine learning workspace by following the “Create the workspace” section of the documentation. Keep in mind that you’ll be creating a “machine learning workspace” Azure resource, not a “workspace” Azure resource, which is entirely different!
  • Install the Azure CLI (command-line interface) on your platform of choice. This post has been tested using the CLI on WSL2 (Windows subsystem for Linux) with Ubuntu.
  • Install the ML extension to the Azure CLI by following the “Installation” section of the documentation.
  • Set your default subscription by executing az account set -s "<YOUR_SUBSCRIPTION_NAME_OR_ID>". You can verify your default subscription by executing az account show, or if you have a machine setup similar to mine, by looking at ~/.azure/azureProfile.json.
  • Set your default resource group and workspace by executing az configure --defaults group="<YOUR_RESOURCE_GROUP>" workspace="<YOUR_WORKSPACE>". You can verify your defaults by executing az configure --list-defaults or by looking at ~/.azure/config.
  • You can now open the Azure Machine Learning studio, where you’ll be able to see and manage all the machine learning resources we’ll be creating.
  • Although not essential to run the code in this post, I highly recommend installing the Azure Machine Learning extension for VS Code, if you’re using VS Code.
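If you’d rather check those defaults from a script than by opening the file, here’s a minimal sketch. It assumes ~/.azure/config is an INI-style file with a [defaults] section holding the group and workspace keys; the read_az_defaults helper is a name I made up for illustration, not part of the Azure CLI.

```python
import configparser


def read_az_defaults(config_text: str) -> dict:
  """Return the key/value pairs of the [defaults] section, or {} if absent."""
  parser = configparser.ConfigParser()
  parser.read_string(config_text)
  # Assumption: the CLI stores defaults under a section named "defaults".
  if not parser.has_section('defaults'):
    return {}
  return dict(parser.items('defaults'))
```

To try it on your own machine, pass it the contents of the file, for example read_az_defaults(open(os.path.expanduser('~/.azure/config')).read()) after importing os.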

You’re now ready to start working with Azure ML!

Training and saving the models, and creating them on Azure

We’ll start by training two machine learning models to classify Fashion MNIST images, one using PyTorch and another using TensorFlow. If you’d like to explore the training code in detail, check out my previous posts on PyTorch, Keras and TensorFlow. The code associated with this post already includes pre-trained models, so you can just use them as-is. But if you’d like to recreate them, you can set up your machine using the files under the conda folder and run the training code, which is in pytorch_src/ and tf_src/.

In my online managed endpoints post, I save and load just the weights of the models. Here I’ll work with the whole model, which is very similar. First, during training, we need to save the model. This can be done with the code below:

# pytorch_src/
torch.save(model, 'pytorch_model/model.pth')

Next, we need to create the models on Azure. There are many ways to create resources on Azure. My preferred way is to use a separate YAML file for each resource and a CLI command to kick-off the remote creation, so that’s what I’ll show here. Below you can see the YAML files we’ll use in the creation of these models.

# cloud/pytorch-fashion-model.yml
name: pytorch-fashion-model
version: 1
local_path: '../pytorch_model/model.pth'

# cloud/tf-fashion-model.yml
name: tf-fashion-model
version: 1
local_path: '../tf_model/'

If you read my online managed endpoints post, you should already be familiar with the YAML in these files. Please refer to that post for more details on the contents of these files, as well as the best way to create them from scratch.

We’re now ready to create the models on Azure, which we can do with the following CLI commands:

az ml model create -f cloud/pytorch-fashion-model.yml
az ml model create -f cloud/tf-fashion-model.yml

If you go to the Azure ML studio, and use the left navigation to go to the “Models” page, you’ll see our newly created models listed there.

In order to deploy our Azure ML endpoints, we’ll use YAML files to specify the details of the endpoint configurations. I’ll show bits and pieces of these YAML files throughout the rest of this post as I present each setting. Let’s start by taking a look at how these endpoint YAML files refer to the models we created on Azure:

# cloud/endpoint_1/endpoint.yml
model: azureml:pytorch-fashion-model:1

# cloud/endpoint_2/endpoint.yml
model: azureml:tf-fashion-model:1

Creating the scoring files

When invoked, our endpoint will call a scoring file, which we need to provide. Just like the scoring file for online managed endpoints, this scoring file needs to follow a prescribed structure: it needs to contain an init() function and a run() function that are called when the batch job starts to run, after the endpoint is invoked. The init() function is only called once per instance, so it’s a good place to add shared operations such as loading the model. The run() function is called once per mini-batch and handles the predictions for that mini-batch.
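To make that calling pattern concrete, here’s a small local stand-in for the batch job. It’s only a sketch: the real service spreads mini-batches across the nodes and processes of a cluster, while this hypothetical simulate_batch_job helper runs everything in a single process. What it does show is the contract: init() runs once, then run() is called once per mini-batch.

```python
from typing import Callable, List


def split_into_mini_batches(paths: List[str], size: int) -> List[List[str]]:
  """Chunk the input paths into consecutive groups of at most `size` items."""
  return [paths[i:i + size] for i in range(0, len(paths), size)]


def simulate_batch_job(init: Callable[[], None],
                       run: Callable[[List[str]], List[str]],
                       paths: List[str],
                       mini_batch_size: int) -> List[str]:
  """Call init() once, then run() once per mini-batch, collecting results."""
  init()
  results: List[str] = []
  for mini_batch in split_into_mini_batches(paths, mini_batch_size):
    results.extend(run(mini_batch))
  return results
```

A harness like this is handy for exercising a scoring file on your own machine before deploying it, since it only needs a list of file paths and the two functions.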

Let’s first take a look at the scoring code for the PyTorch model (you’ll find similar TensorFlow code in the post’s project):

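import argparse
import logging
import os

import torch


def init():
  global logger
  global model
  global device

  # Read the logging level passed to the scoring script.
  arg_parser = argparse.ArgumentParser(description="Argument parser.")
  arg_parser.add_argument("--logging_level", type=str, help="logging level")
  args, _ = arg_parser.parse_known_args()
  logger = logging.getLogger(__name__)
  logger.setLevel(args.logging_level.upper())
  logger.info('Init started')

  device = 'cuda' if torch.cuda.is_available() else 'cpu'
  logger.info(f'Device: {device}')

  # AZUREML_MODEL_DIR points to the directory that holds the registered model.
  model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'model.pth')

  # The whole model was saved, so torch.load returns a ready-to-use module.
  model = torch.load(model_path, map_location=device)
  model.eval()
  logger.info('Init completed')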

In our scenario, the main task of this function is to load the model. The AZUREML_MODEL_DIR environment variable gives us the directory where the model is located on Azure, which we use to construct the model’s path. Once we have the model’s path, we use it to load the model. Because we saved the whole model, not just the weights, we can load it with torch.load directly, without having to instantiate the NeuralNetwork class first.
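One thing to keep in mind is that AZUREML_MODEL_DIR is only set when the code runs on Azure, so os.getenv returns None on your own machine and os.path.join fails. If you want to run the scoring file locally, a fallback along these lines can help. This is a sketch under my own assumptions: resolve_model_path is a hypothetical helper, and the local folder name simply mirrors the pytorch_model directory used when saving the model.

```python
import os


def resolve_model_path(file_name: str = 'model.pth',
                       local_dir: str = 'pytorch_model') -> str:
  """Build the model path from AZUREML_MODEL_DIR, or a local folder if unset."""
  # AZUREML_MODEL_DIR is set by Azure ML; the local fallback is an assumption
  # that only matters when running the scoring file outside of Azure.
  model_dir = os.getenv('AZUREML_MODEL_DIR', local_dir)
  return os.path.join(model_dir, file_name)
```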

Notice that logging is done differently from online endpoints. Here, we create and configure a global logger variable, which we then use by calling logger.info(). You can see in the code above that, in addition to logging the beginning and end of the function, I also log whether the code is running on GPU or CPU.
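The logger setup can also be pulled into a helper of its own, which makes it easy to try out without Azure. The sketch below is one possible variation, not the code from the project: build_logger is a made-up name, and the 'info' default is my own addition so the helper still works when --logging_level isn’t passed.

```python
import argparse
import logging
from typing import List, Optional


def build_logger(argv: Optional[List[str]] = None) -> logging.Logger:
  """Create a logger whose level comes from --logging_level (default: INFO)."""
  arg_parser = argparse.ArgumentParser(description='Argument parser.')
  # The default value is an assumption added for local runs.
  arg_parser.add_argument('--logging_level', type=str, default='info',
                          help='logging level')
  # parse_known_args ignores any other arguments passed to the script.
  args, _ = arg_parser.parse_known_args(argv)
  logger = logging.getLogger('score')
  # setLevel accepts level names such as 'INFO' or 'DEBUG'.
  logger.setLevel(args.logging_level.upper())
  return logger
```

Using parse_known_args rather than parse_args matters here, because the script may receive other arguments that it doesn’t declare.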

Now let’s look at the run() function:

import torch
from PIL import Image
from torch import nn
from torchvision import transforms

labels_map = {
    0: 'T-Shirt',
    1: 'Trouser',
    2: 'Pullover',
    3: 'Dress',
    4: 'Coat',
    5: 'Sandal',
    6: 'Shirt',
    7: 'Sneaker',
    8: 'Bag',
    9: 'Ankle Boot',
}


def predict(model: nn.Module, X: torch.Tensor) -> torch.Tensor:
  with torch.no_grad():
    y_prime = model(X)
    probabilities = nn.functional.softmax(y_prime, dim=1)
    predicted_indices = probabilities.argmax(1)
  return predicted_indices


def run(mini_batch):
  logger.info(f'run({mini_batch}) started: {__file__}')
  predicted_names = []
  transform = transforms.ToTensor()
  device = 'cuda' if torch.cuda.is_available() else 'cpu'

  for image_path in mini_batch:
    # Pillow reads the image file; ToTensor converts it to a PyTorch tensor.
    image = Image.open(image_path)
    tensor = transform(image).to(device)
    predicted_index = predict(model, tensor).item()
    predicted_names.append(labels_map[predicted_index])

  logger.info('Run completed')
  return predicted_names

In my blog post about online managed endpoints, the run() function receives a JSON file as a parameter. Because we’re now creating a batch endpoint, here the run() function receives a list of file paths for a mini-batch of data that is specified when invoking the endpoint. In this scenario, we’ll invoke the endpoint by referring to the sample_request directory, which contains several images of clothing items, and therefore the run() method receives the file paths for a mini-batch of these images. The mini-batch size is defined in the YAML for the endpoint, as we’ll see later. For each image in the mini-batch, we transform it into a PyTorch tensor, and pass it as a parameter to our predict() function. We then append the prediction to a predicted_names list, and return that list as the prediction result.
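If the softmax-then-argmax step in predict() feels opaque, here’s the same idea in plain Python, with no PyTorch required. It’s an illustration only: the logits in the usage note are made-up numbers, and predict_name is a hypothetical helper that plays the role of predict() plus the labels_map lookup for a single image.

```python
import math
from typing import List

labels_map = {
    0: 'T-Shirt', 1: 'Trouser', 2: 'Pullover', 3: 'Dress', 4: 'Coat',
    5: 'Sandal', 6: 'Shirt', 7: 'Sneaker', 8: 'Bag', 9: 'Ankle Boot',
}


def softmax(logits: List[float]) -> List[float]:
  """Numerically stable softmax over a single list of logits."""
  shift = max(logits)
  exps = [math.exp(value - shift) for value in logits]
  total = sum(exps)
  return [e / total for e in exps]


def predict_name(logits: List[float]) -> str:
  """Map raw model outputs to a clothing label via softmax + argmax."""
  probabilities = softmax(logits)
  predicted_index = max(range(len(probabilities)), key=probabilities.__getitem__)
  return labels_map[predicted_index]
```

For example, predict_name([0.1, 0.2, 3.0, 0.0, -1.0, 0.5, 0.3, 0.2, 0.1, 0.0]) picks index 2, which maps to 'Pullover'. Since softmax preserves the ordering of its inputs, the argmax of the probabilities is the same as the argmax of the raw outputs.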

Let’s now look at how we specify the location of the scoring file and the mini-batch size in the endpoint YAML files:

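# cloud/endpoint_1/endpoint.yml
code_configuration:
  code:
    local_path: ../../pytorch_src/
mini_batch_size: 10

# cloud/endpoint_2/endpoint.yml
code_configuration:
  code:
    local_path: ../../tf_src/
mini_batch_size: 10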

Creating the environments

An Azure Machine Learning environment specifies the runtime where we can run training and prediction code on Azure, along with any additional configuration. In my blog post about online managed endpoints, I present three different ways to create the inference environment for an endpoint: curated environments, base images, and user-managed environments. I also describe all the options for adding additional packages available for curated environments and base images.

Batch endpoints also support all three options for creating environments, but they don’t support extending prebuilt Docker images with conda files. In this post’s scenario, we need the Pillow package to read our images in the scoring file. Since none of the prebuilt Docker images includes Pillow, we use base images and extend them with conda files that install Pillow as well as other packages.

Let’s take a look at the conda files used to extend the base images:

name: pytorch-batch-endpoint-score
channels:
  - pytorch
  - conda-forge
  - defaults
dependencies:
  - numpy=1.20
  - python=3.7
  - pytorch=1.7
  - pillow=8.3.1
  - torchvision=0.8.1
  - pip
  - pip:
    - azureml-defaults==1.32.0

name: tf-batch-endpoint-score
channels:
  - conda-forge
  - defaults
dependencies:
  - python=3.7
  - pillow=8.3.1
  - pip
  - pip:
    - tensorflow==2.4
    - azureml-defaults==1.32.0

Notice that each of the conda files above includes the azureml-defaults package, which is required for inference on Azure. Now let’s look at the YAML files that we use to create our environments, each of which refers to a base Docker image and a conda file that extends it:

# cloud/endpoint_1/pytorch-cpu-batch-env.yml
name: pytorch-cpu-batch-env
version: 1
conda_file: file:score-conda.yml

# cloud/endpoint_2/tf-cpu-batch-env.yml
name: tf-cpu-batch-env
version: 1
conda_file: file:score-conda.yml

We can now create the environments using the CLI:

az ml environment create -f cloud/endpoint_1/pytorch-cpu-batch-env.yml
az ml environment create -f cloud/endpoint_2/tf-cpu-batch-env.yml

If you look at the “Environments” section of the Azure ML studio, you should see the newly created environments listed there.

Now that the environments are created, we can refer to them in the endpoint YAML files:

# cloud/endpoint_1/endpoint.yml
environment: azureml:pytorch-cpu-batch-env:1

# cloud/endpoint_2/endpoint.yml
environment: azureml:tf-cpu-batch-env:1

Creating the compute clusters

Next, let’s create the compute cluster, where we specify the size of the virtual machine we’ll use to run inference, and how many instances of that VM we want running in the cluster.

name: cpu-cluster
type: amlcompute
size: Standard_DS3_v2
min_instances: 0
max_instances: 4

First we need to specify the name for the cluster — I decided on the descriptive cpu-cluster name. Then we need to choose the compute type. Currently the only compute type supported is amlcompute, so that’s what we specify.

Next we need to choose a VM size. You can see a full list of supported VM sizes in the documentation. I decided to choose a Standard_DS3_v2 VM (a small VM without a GPU) because our inferencing scenario is simple.

And last, I specify that I want a minimum of zero VM instances, and a maximum of four. Depending on the workload at each moment, Azure will decide how many VMs to run and it will distribute the work across the VMs appropriately.

We can now create our compute cluster:

az ml compute create -f cloud/cpu-cluster.yml

You can go to the Azure ML studio, use the left navigation to go to the “Compute” page, click on “Compute clusters,” and see our newly created compute cluster listed there.

We’re now ready to refer to our compute cluster from within the endpoint YAML files:

# cloud/endpoint_1/endpoint.yml and cloud/endpoint_2/endpoint.yml
compute:
  target: azureml:cpu-cluster

Creating the endpoints

By now, you’ve seen almost every line of the YAML files used to create the endpoints. Let’s take a look at the whole file for one of the endpoints to see what else we’re missing.

name: fashion-endpoint-batch-1
type: batch
auth_mode: aad_token
traffic:
  blue: 100

deployments:
  - name: blue
    model: azureml:pytorch-fashion-model:1
    code_configuration:
      code:
        local_path: ../../pytorch_src/
    environment: azureml:pytorch-cpu-batch-env:1
    compute:
      target: azureml:cpu-cluster
    mini_batch_size: 10
    output_file_name: predictions_pytorch.csv

The schema provides VS Code with the information it needs to make suggestions and warn us of problems. You may have noticed that a schema is present in the YAML file for every resource we’ve created so far. The Azure ML extension for VS Code is useful for those situations when you want to create a resource but are not sure which schema to use. If you have this extension installed, you can click on the Azure icon in the left navigation of VS Code, select your subscription and workspace, then click on the “+” icon next to a resource type to create a template YAML file for that resource.

You’ll need to specify a name for your endpoint — just make sure that you pick a name that is unique within your resource group’s region. This means that you may need to change the name of the endpoints in the files I provide. For batch endpoints, always specify the type to be batch, and the auth_mode to be aad_token.

For traffic, in our scenario we just specify one deployment, which we call blue. You could add more than one deployment here, but unlike online managed endpoints, batch endpoints don’t support traffic levels other than 0 and 100. In my online managed endpoints post we added two deployments with traffic set for 90 and 10, to test a new version of our deployment on 10% of the inference calls. Sending requests to multiple deployments is less useful for batch endpoints, because each invocation leads to a variable number of inference calls, so Azure ML can’t guarantee a particular split.

Let’s move on in the exploration of the endpoint YAML file. The deployment needs to have a name and that name needs to match the one specified in the traffic section. We’ve already explored in detail the model, code_configuration, environment, compute and mini_batch_size sections. The output_file_name is self-explanatory — it’s the name of the file that will contain all the predictions for our inputs. I’ll show you later where to find it.
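The two rules we’ve just covered (traffic values must be 0 or 100, and every name under traffic must match a deployment) are easy to check before creating anything in the cloud. Here’s a rough sketch of such a check. It works on plain Python values rather than on the YAML file itself, and check_batch_traffic is a hypothetical helper of mine, not something the Azure CLI provides.

```python
from typing import Dict, List


def check_batch_traffic(traffic: Dict[str, int],
                        deployment_names: List[str]) -> List[str]:
  """Return a list of problems found; an empty list means no issue was found."""
  problems = []
  for name, percent in traffic.items():
    # Every traffic entry should point to a deployment that exists.
    if name not in deployment_names:
      problems.append(f'traffic refers to unknown deployment "{name}"')
    # Batch endpoints only accept traffic levels of 0 or 100.
    if percent not in (0, 100):
      problems.append(f'"{name}" has traffic {percent}; only 0 or 100 is allowed')
  return problems
```

For the endpoint in this post, check_batch_traffic({'blue': 100}, ['blue']) returns an empty list, while a 90/10 split like the one from my online managed endpoints post would be flagged.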

The second endpoint is very similar to this one. The only difference is that it points to the TensorFlow scoring code. Now that you understand the endpoint YAML files in detail, you’re ready to create the endpoints:

az ml endpoint create --type batch -f cloud/endpoint_1/endpoint.yml
az ml endpoint create --type batch -f cloud/endpoint_2/endpoint.yml

You can now go to the Azure ML studio to see your endpoints in the UI. Click on “Endpoints” in the left navigation, then “Batch endpoints” in the top navigation, and you’ll see them listed, as you can see in the image below:

Screenshot of Azure ML studio showing the endpoints we created.

Creating the request files

Next we’ll explore our request files: the list of files we’ll specify when invoking the endpoint, which will then be passed to the run() function of the scoring file for inference. If you look at the accompanying project on GitHub, you’ll see a directory called sample_request containing several images of 28 × 28 pixels, representing clothing items. When invoking the endpoint, we’ll provide the path to this directory.

I decided to include the sample_request directory in the git repo for simplicity. If you want to recreate it, you’ll first need to create the conda environment specified in conda/pytorch-batch-endpoint-train.yml (if you haven’t already), then activate it, and finally run the code in the pytorch_src/ file.

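import os
from torchvision import datasets

def main() -> None:
  dir_name = 'sample_request'

  # Test split; root='data' is an assumed download location.
  test_data = datasets.FashionMNIST(root='data', train=False, download=True)

  os.makedirs(name=dir_name, exist_ok=True)
  for i, (image, _) in enumerate(test_data):
    if i == 200:
      break
    # Without a transform, each item is a PIL image; the file name pattern is an assumption.
    image.save(os.path.join(dir_name, f'{i:03}.png'))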

Invoking the endpoints

Now that you have the endpoint YAML files and a directory with sample requests, you can invoke the endpoints using the following commands:

az ml endpoint invoke --name <ENDPOINT1> --type batch --input-local-path sample_request
az ml endpoint invoke --name <ENDPOINT2> --type batch --input-local-path sample_request

Make sure you replace the endpoint names with the names you used in the creation of your endpoints.
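If you end up invoking several endpoints regularly, you may prefer to assemble these commands from a script. The sketch below builds the same argument list as the CLI commands above and doesn’t run anything by itself. build_invoke_command is a name I made up, and the endpoint name in the usage note is a placeholder.

```python
import shlex
from typing import List


def build_invoke_command(endpoint_name: str,
                         input_path: str = 'sample_request') -> List[str]:
  """Assemble the argument list for the batch invoke command."""
  return ['az', 'ml', 'endpoint', 'invoke',
          '--name', endpoint_name,
          '--type', 'batch',
          '--input-local-path', input_path]


def as_shell_string(command: List[str]) -> str:
  """Render the argument list as a single shell-quoted string."""
  return shlex.join(command)
```

Calling as_shell_string(build_invoke_command('my-endpoint')) gives you the command as text. To execute it, you could pass the list to subprocess.run with check=True, assuming the Azure CLI and its ML extension are installed.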

Unlike with online managed endpoints, the invocation call will not immediately return the result of your predictions — instead, it kicks off an asynchronous inference run that will produce predictions at a later time. Let’s go to the Azure ML studio and see what’s going on. Click on “Endpoints” in the left navigation, then “Batch endpoints,” and then on the name of one of your endpoints. You’ll be led to a page with two tabs: “Details,” which shows the information you specified in the endpoint’s YAML file, and “Runs,” where we can see the status of asynchronous inference runs associated with the endpoint. Let’s click on “Runs.” You’ll see the run that was kicked off by the invoke command, with a status that may be “Running,” “Completed,” or “Failed.”

Screenshot of Azure ML studio run associated with an endpoint.

Now let’s click on the “Display name” of the run. I’ll cover two scenarios here, one where the run status is “Completed” and another one where it’s “Failed.” But feel free to click on the run name even if your status is “Running” to see the logs appearing in real-time and get familiar with the UI.

Let’s first consider the “Completed” scenario. Clicking on the name of the run will take you to a diagram that includes a “score” section in green, with the word “Completed.”

Screenshot of Azure ML studio showing a diagram of a completed run.

Clicking on the score section of the diagram will open a new area to its right showing all the logs for the run. You can read more about what each log means in the documentation. When a run completes successfully, I’m mostly interested in looking at the logs I added in the init() and run() functions of the scoring file. Those can be found under logs/user/stdout/. As you can see below, the logs in the init() function appear once, and the logs in the run() function appear as many times as the number of mini-batches in the sample request. Here’s what I see after a successful run:

Screenshot of Azure ML studio showing the logs for a completed run.

Now let’s consider a “Failed” run. Clicking on the name of the run will take you to a diagram similar to the one below:

Screenshot of Azure ML studio showing a diagram of a failed run.

Clicking on the score section of the diagram will open an area showing the logs. As an example, in one failing case, I opened the file logs/user/error/error.txt and saw the following message: “Please check logs/user/error/* and logs/sys/error/* to see if some errors have occurred.” So, next I opened the file logs/sys/error/ and saw “MKL_THREADING_LAYER=INTEL is incompatible with library. Try to import numpy first…” Adding numpy to the imports in my scoring file fixed the problem.

Screenshot of Azure ML studio showing the logs for a failed run.

I encourage you to spend some time getting familiar with the structure of the logs. The logs/readme.txt file contains a nice overview of the logs, and it’s a great place to start your exploration.

Once a run completes successfully, you’ll want to look at the results of the prediction. These results are saved in blob storage, and can be found with the following steps: just above the log files, click on “Show data outputs”:

Screenshot of Azure ML studio showing the link to show data outputs.

Then click on the icon that says “Access data” when hovering over it, which is the icon shown below:

Screenshot of Azure ML studio showing the “Access data” icon.

This takes you to a blob storage location where you can see a file with the name you specified in the endpoint YAML file, which in our scenario is either predictions_tf.csv or predictions_pytorch.csv. Right-click on the filename (or click the triple-dot icon to its right) to show a menu of options, including an option to “View/edit” and another to “Download.” These CSV files contain one prediction per line, as we can see below:

Screenshot of predictions file.

Each clothing item in this file corresponds to the prediction for one of the images in the sample request. This is a great achievement — we got our predictions!
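Once you’ve downloaded a predictions file, a few lines of Python are enough to summarize it. The sketch below assumes each line contains just the predicted label, which matches what I described above, but do check your own file. count_predictions is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable


def count_predictions(lines: Iterable[str]) -> Counter:
  """Count how often each label appears, skipping blank lines."""
  return Counter(line.strip() for line in lines if line.strip())
```

You could use it with something like count_predictions(open('predictions_pytorch.csv')).most_common() to see which clothing items were predicted most often.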


In this post, you learned how to create a batch endpoint on Azure ML. You learned how to write a scoring file, and how to create a model, environment, and cluster resources on Azure ML. Then you learned how to use those resources to create the endpoint itself, and how to invoke it by giving it a directory of image resources. And finally, you learned to look at the logs and at the file containing the predictions. Congratulations on acquiring a new skill!

The project associated with this post can be found on GitHub.

Thank you to Tracy Chen from Microsoft for reviewing the content in this post.