Creating batch endpoints in Azure ML without using MLflow



Topic: Azure ML: from beginner to pro


My main goal for this post is to demonstrate how you can deploy a non-MLflow model using an Azure ML batch endpoint. Batch endpoints are designed to handle large asynchronous requests, as opposed to online endpoints, which are designed to deliver fast results. MLflow is an open-source platform that, among many other features, provides us with a convention for saving models in a self-contained way.

You can deploy MLflow and non-MLflow models with both types of endpoints. Here’s where you can learn more:

  • In this blog post, you can learn about deploying MLflow models with managed online endpoints.
  • In this blog post, you can learn about deploying non-MLflow models with managed online endpoints.
  • In this blog post, you can learn about deploying MLflow models with batch endpoints.
  • To complete the series, in the current blog post, you’ll learn about deploying non-MLflow models with batch endpoints.

It’s useful to understand all of these scenarios, at least at a high level, because different projects bring different challenges and call for different solutions. If you have a chance to save your model using MLflow, I recommend that you do so, because deployment will be much easier to implement. But I realize that there are many constraints when you’re trying to get a project completed, so I want to give you options.

The code for this project can be found on GitHub. Feel free to follow along on GitHub as you read the post. The README file for the project contains details about Azure and project setup.

Training and inference on your development machine

I always recommend testing your training and inference code on your local machine before you deploy it in the cloud.

The training code for the Fashion MNIST scenario in this post’s project can be found under src/. Because we’re not using MLflow in our scenario, we save our PyTorch model using the following function:

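def save_model(model_dir: str, model: nn.Module) -> None:
    """
    Saves the trained model.
    """
    logging.info("Saving model to %s", model_dir)
    Path(model_dir).mkdir(parents=True, exist_ok=True)
    # Save only the weights (state dict); loading them requires the model class.
    torch.save(model.state_dict(), Path(model_dir, "weights.pth"))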

You can run the training code by clicking on “Run and Debug” on VS Code’s left navigation, selecting “Train locally,” and pressing F5. This generates a new model folder containing the trained model.

If this were a managed online endpoint, we could test deployment locally by adding --local to our usual Azure ML commands (you can learn more about that in my blog post). Local deployment is not supported for batch endpoints though, so we need to write custom code to test local inference. You can write any code that you can easily debug and that gives you confidence in your training code. Here’s my local scoring file for Fashion MNIST:

"""Code that helps us test our neural network before deploying to the cloud."""

import logging
from pathlib import Path

import torch
from torch.utils.data import DataLoader
from torchvision.datasets import FashionMNIST

from dataset import FashionMNISTDatasetFromImages
from neural_network import NeuralNetwork
from utils_score_nn import predict

IMAGES_DIR = "aml_batch_endpoint_no_mlflow/test_data/images/"
MODEL_DIR = "aml_batch_endpoint_no_mlflow/model/weights.pth"

def main() -> None:
    device = "cuda" if torch.cuda.is_available() else "cpu"

    model_path = MODEL_DIR
    model = NeuralNetwork().to(device)
    model.load_state_dict(torch.load(model_path, map_location=device))

    image_paths = [
        f.as_posix() for f in Path(IMAGES_DIR).iterdir() if f.is_file()
    ]
    images_dataset = FashionMNISTDatasetFromImages(image_paths)

    dataloader = DataLoader(images_dataset)
    predicted_indices = predict(dataloader, model, device)
    predictions = [
        FashionMNIST.classes[predicted_index]
        for predicted_index in predicted_indices
    ]
    logging.info("Predictions: %s", predictions)

if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    main()

You can run this code by selecting the “Test locally” configuration and pressing F5. This will produce the following output:

INFO:root:Predictions: ['Ankle boot', 'Pullover', 'Trouser', 'Trouser', 'Shirt', 'Trouser', 'Coat', 'Shirt', 'Sandal', 'Sneaker', 'Coat', 'Sandal', 'Bag', 'Dress', 'Coat', 'Trouser', 'Pullover', 'Pullover', 'Bag', 'T-shirt/top']

You can verify that the predictions are correct by looking at the input images, under test_data/images.
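If you'd like to see the index-to-label step in isolation, here's a minimal sketch. It hard-codes the ten Fashion MNIST class names in label order so that it runs without torchvision; the helper name indices_to_labels is my own and isn't part of the project.

```python
# The ten Fashion MNIST class names, in label-index order (0 through 9).
FASHION_MNIST_CLASSES = [
    "T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
    "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot",
]


def indices_to_labels(predicted_indices: list[int]) -> list[str]:
    """Convert class indices into human-readable labels (hypothetical helper)."""
    return [FASHION_MNIST_CLASSES[int(index)] for index in predicted_indices]


print(indices_to_labels([9, 2, 1]))  # ['Ankle boot', 'Pullover', 'Trouser']
```

The indices 9, 2, and 1 correspond to the first three predictions in the output above.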

Once you’re happy with your training and local inference code, it’s time to deploy in the cloud!

Deployment in the cloud

Our first step is to write a scoring file that will be executed by Azure ML every time the endpoint is invoked. If we had saved our model using MLflow there would be no need for this file, which greatly simplifies development. Here’s what the file looks like for Fashion MNIST:


import argparse
import logging
import os
from dataclasses import dataclass

import torch
from torch.utils.data import DataLoader
from torchvision.datasets import FashionMNIST

from dataset import FashionMNISTDatasetFromImages
from neural_network import NeuralNetwork
from utils_score_nn import predict

@dataclass
class State:
    model: torch.nn.Module
    device: str
    logger: logging.Logger

state = None

def init() -> None:
    global state

    arg_parser = argparse.ArgumentParser(description="Argument parser.")
    arg_parser.add_argument("--logging_level", type=str, help="logging level")
    args, _ = arg_parser.parse_known_args()

    logger = logging.getLogger(__name__)
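    logger.setLevel(args.logging_level.upper())
    logger.info("Init started")

    device = "cuda" if torch.cuda.is_available() else "cpu"
    logger.info("Device: %s", device)

    # AZUREML_MODEL_DIR points at the registered model's root on the compute node.
    # The "model/weights.pth" suffix assumes the local "model" folder was registered as-is.
    model_path = os.path.join(os.getenv("AZUREML_MODEL_DIR", default=""),
                              "model/weights.pth")
    model = NeuralNetwork().to(device)
    model.load_state_dict(torch.load(model_path, map_location=device))

    state = State(model, device, logger)
    logger.info("Init completed")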

def run(mini_batch: list[str]) -> list[str]:
    if state is None:
        return []

    state.logger.info("run(%s) started: %s", mini_batch, {__file__})

    images_dataset = FashionMNISTDatasetFromImages(mini_batch)
    dataloader = DataLoader(images_dataset)
    predicted_indices = predict(dataloader, state.model, state.device)
    predictions = [
        FashionMNIST.classes[predicted_index]
        for predicted_index in predicted_indices
    ]
    state.logger.info("Predictions: %s", predictions)
    state.logger.info("Run completed")

    return predictions

Azure ML expects the scoring file to contain an init() function and a run(...) function, which are called when the batch job starts to run (after the endpoint is invoked).

The init() function is only called once per instance, so it’s a good place to add shared operations such as loading the model. Notice the AZUREML_MODEL_DIR environment variable, which gives us the directory where the model is located on Azure. We use this directory to construct the model’s path, which we then use to load the model. In addition to logging the beginning and end of each function, I also log whether the code is running on GPU or CPU, and, in run(...), the model’s predictions.
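Here's a small sketch of the path-building step on its own, so you can see how the environment variable and the relative path combine. The helper name resolve_model_path and the directory I assign to AZUREML_MODEL_DIR are made up for illustration; on Azure, the service sets the variable for you.

```python
import os
from pathlib import Path


def resolve_model_path(relative_path: str) -> Path:
    """Join AZUREML_MODEL_DIR (or the current directory if unset) with a relative path."""
    model_dir = os.getenv("AZUREML_MODEL_DIR", default="")
    return Path(model_dir, relative_path)


# Simulate the variable being set before init() runs.
# This directory is hypothetical, not a real Azure location.
os.environ["AZUREML_MODEL_DIR"] = "/tmp/azureml-models/demo"

model_path = resolve_model_path("model/weights.pth")
print(model_path.as_posix())  # /tmp/azureml-models/demo/model/weights.pth
```

Falling back to an empty string when the variable is missing means the same function also resolves relative paths on a development machine.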

The run(...) function is called once per mini-batch. When we invoke a batch endpoint, if the size of our data is larger than the size we specify for the mini-batch, this function will be called more than once. I’ll show you later how we can control the size of the mini-batch, when we look at the deployment YAML file specification. In my blog post about managed online endpoints, the run(...) function receives a JSON file as a parameter. Batch endpoints work a bit differently: here the run(...) function receives a list of file paths for a mini-batch of input data.

In this scenario, we’ll invoke the endpoint by referring to the test_data/images directory, which contains 20 images of clothing items, and we’ll set the mini-batch size to 10. Therefore, the run(...) function will be called twice, and will receive file paths for 10 images each time. We open each image in the mini-batch, transform it into a NumPy array, add it to a list, create a DataLoader from the list, and pass the DataLoader as a parameter to our predict(...) function. We get back a list of prediction indices, which we convert to strings representing the associated clothing items.
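To make the "20 images, mini-batches of 10, two calls" arithmetic concrete, here's a rough local simulation of the init()/run(...) contract. It's a simplification: it processes mini-batches sequentially in one process, whereas Azure ML may distribute them across instances. The function names and image file names are hypothetical stand-ins.

```python
from typing import Callable


def split_into_mini_batches(file_paths: list[str], mini_batch_size: int) -> list[list[str]]:
    """Chunk file paths into consecutive groups of at most mini_batch_size items."""
    return [
        file_paths[start:start + mini_batch_size]
        for start in range(0, len(file_paths), mini_batch_size)
    ]


def simulate_batch_job(
    file_paths: list[str],
    mini_batch_size: int,
    init: Callable[[], None],
    run: Callable[[list[str]], list[str]],
) -> list[str]:
    """Call init() once, then run(...) once per mini-batch, collecting all results."""
    init()
    results: list[str] = []
    for mini_batch in split_into_mini_batches(file_paths, mini_batch_size):
        results.extend(run(mini_batch))
    return results


calls: list[int] = []  # records the size of each mini-batch passed to run(...)


def fake_init() -> None:
    """Stand-in for model loading."""


def fake_run(mini_batch: list[str]) -> list[str]:
    """Stand-in for scoring: returns one result string per input path."""
    calls.append(len(mini_batch))
    return [f"scored {path}" for path in mini_batch]


# Hypothetical file names standing in for the 20 test images.
paths = [f"test_data/images/image_{i}.png" for i in range(20)]
outputs = simulate_batch_job(paths, 10, fake_init, fake_run)
print(calls)  # [10, 10]
```

Swapping fake_init and fake_run for the real functions from the scoring file gives you a quick way to exercise them before deploying.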

Our scoring file is complete, and we’re ready to start creating cloud resources to support our deployment. In order to deploy a simple batch endpoint, we’ll need to create the following resources on Azure ML:

  • A cluster of CPU VMs where the batch job will run.
  • The trained model.
  • A batch endpoint.
  • The deployment associated with the endpoint.

Let’s start with the CPU cluster. Here’s cluster-cpu.yml, the YAML definition for our CPU cluster:

$schema: https://azuremlschemas.azureedge.net/latest/amlCompute.schema.json
name: cluster-cpu
type: amlcompute
size: Standard_DS4_v2
min_instances: 0
max_instances: 4

We specify a schema to get Intellisense and error messages during development, then give the resource a name and a type (which is always amlcompute for a compute cluster). We picked the Standard_DS4_v2 CPU VM size — take a look at my blog post about compute for guidance on which VM to pick. And we instruct Azure to give us however many instances we need to run our batch job, between 0 and 4. We can create our compute resource with the following CLI command:

az ml compute create -f cloud/cluster-cpu.yml

You can verify that the cluster was created correctly by going to the Azure ML Studio, and then clicking on “Compute” on the left navigation, followed by “Compute clusters.” You should see a compute cluster named “cluster-cpu” on this page:

Screenshot showing the CPU cluster in the Studio.

Next we register a model asset with Azure ML. We could have created a YAML definition file, but since we just need a few properties, we can add them directly to the command:

az ml model create --path model --name model-batch-no-mlflow --version 1

You can verify that the model was created in the Studio, in the “Models” section:

Screenshot showing the model in the Studio.

Next we create the batch endpoint. Here’s the YAML definition file, endpoint.yml:

$schema: https://azuremlschemas.azureedge.net/latest/batchEndpoint.schema.json
name: endpoint-batch-no-mlflow
auth_mode: aad_token

In addition to the schema and name, we specify the aad_token authentication mode (this is the only mode supported at the moment for batch endpoints). We can execute the following command to create the endpoint in the cloud:

az ml batch-endpoint create -f cloud/endpoint.yml

You can verify its creation in the Studio, by clicking on “Endpoints” and then on “Batch endpoints.”

Screenshot showing the endpoint in the Studio.

And finally, we create the deployment. My blog post on managed online endpoints demonstrates how a managed online endpoint can have more than one deployment, with partial traffic directed to each deployment. Batch endpoints can also have multiple deployments, but the traffic can’t be split — it needs to be directed to a single deployment.
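The difference in routing can be pictured with a toy example. This is only an illustration of the idea, not how Azure implements it: the online-style function picks a deployment in proportion to traffic percentages, while the batch-style function always returns the default deployment. Both function names are my own.

```python
import random


def pick_online_deployment(traffic: dict[str, int], rng: random.Random) -> str:
    """Online-endpoint style: choose a deployment in proportion to its traffic share."""
    names = list(traffic)
    weights = [traffic[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def pick_batch_deployment(default_deployment: str) -> str:
    """Batch-endpoint style: every invocation goes to the default deployment."""
    return default_deployment


rng = random.Random(0)  # seeded so repeated runs behave the same
online_picks = [pick_online_deployment({"blue": 90, "green": 10}, rng) for _ in range(1000)]
batch_picks = [pick_batch_deployment("blue") for _ in range(1000)]

print(set(online_picks))  # both deployments receive some requests
print(set(batch_picks))   # only the default deployment does
```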

Let’s take a look at the deployment.yml YAML definition file:

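$schema: https://azuremlschemas.azureedge.net/latest/batchDeployment.schema.json
name: blue
endpoint_name: endpoint-batch-no-mlflow
model: azureml:model-batch-no-mlflow:1
code_configuration:
  code: ../src/
  # scoring_script: (file name of the scoring file shown earlier)
environment:
  # image: (base Docker image for the environment)
  conda_file: score-conda.yml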
compute: azureml:cluster-cpu
mini_batch_size: 10
output_file_name: predictions_pytorch.csv

We give it a schema, a name, and the name of the endpoint we want to associate with the deployment. We refer to the trained model we registered on Azure ML, and we specify the location of the scoring file we looked at earlier. The environment contains all the software dependencies needed to run our scoring file — you can read more about environments in my blog post on the topic. We specify the VM cluster we created earlier in the cloud. We give it a mini-batch size, which determines how many images the run(...) function will receive, and we choose a name for the output prediction file.

There’s a lot of information in the deployment definition file. If our model had been saved using MLflow, we could have skipped the scoring file and the environment, because MLflow is capable of inferring those automatically.

Let’s create the deployment in the cloud:

az ml batch-deployment create -f cloud/deployment.yml --set-default

The --set-default flag indicates that we want this deployment to receive all the traffic directed to the endpoint. You can verify that the deployment was created by clicking on the endpoint in the Studio, and making sure that you see a deployment named “blue” under the “Deployment summary” section.

Screenshot showing the deployment in the Studio.
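As a recap, here are the four creation commands from this post collected in dependency order. This is just a sketch of one way to script them: it prints the commands rather than executing them, and the helper name render_commands is my own.

```python
import shlex

# Resource-creation commands from this post, in the order they must run:
# compute and model first, then the endpoint, then the deployment that ties them together.
CREATE_COMMANDS = [
    ["az", "ml", "compute", "create", "-f", "cloud/cluster-cpu.yml"],
    ["az", "ml", "model", "create", "--path", "model",
     "--name", "model-batch-no-mlflow", "--version", "1"],
    ["az", "ml", "batch-endpoint", "create", "-f", "cloud/endpoint.yml"],
    ["az", "ml", "batch-deployment", "create", "-f", "cloud/deployment.yml", "--set-default"],
]


def render_commands(commands: list[list[str]]) -> list[str]:
    """Turn each argument list into a shell-safe command string (dry run only)."""
    return [shlex.join(command) for command in commands]


for line in render_commands(CREATE_COMMANDS):
    print(line)
```

To actually execute them, you could pass each argument list to subprocess.run(command, check=True), assuming the Azure CLI and its ml extension are installed and you're logged in.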

Now that we have all resources created in the cloud, we can invoke the endpoint by giving it a folder of images for it to score asynchronously:

az ml batch-endpoint invoke --name endpoint-batch-no-mlflow --input test_data/images

This operation will take several minutes to complete. You can follow its progress by going to the Studio, clicking on “Jobs,” the name of your endpoint, and then the latest job (you can get to the same batch job by clicking on the “Endpoints” left navigation tab instead). You’ll see a diagram containing the input data linked to the batch job in blue, while the operation is still in progress:

Screenshot showing the batch job running.

When the operation completes, the diagram turns green. Anytime during or after the run, you can click on the little icon on the right of the page (or double click on the batchscoring rectangle) to look at the logs.

Screenshot of the Studio UI after the batch job completes.

Remember the scoring file, where we added some custom logging? You can look at these logs in logs/user/stdout/

Screenshot of our custom logs.

To look at the predictions file, click on “Show data outputs,” and then on the “Access data” icon.

Screenshot of the "Access data" icon to get to our predictions.

You will be taken to the blob storage location that contains the predictions file. If you right-click on the file and select “View/edit,” you can see your predictions:

Screenshot of our predictions.
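If you download the predictions file, a few lines of Python are enough to summarize it. This sketch assumes the file contains one predicted label per line, which is my assumption about the output layout rather than something shown in this post, and it uses a made-up in-memory sample instead of a real file.

```python
from collections import Counter
from typing import Iterable


def count_predictions(lines: Iterable[str]) -> Counter:
    """Count labels, ignoring blank lines and surrounding whitespace."""
    return Counter(line.strip() for line in lines if line.strip())


# Made-up sample standing in for the contents of the predictions file.
sample = ["Ankle boot\n", "Pullover\n", "Trouser\n", "Trouser\n", "\n"]
counts = count_predictions(sample)
print(counts.most_common(1))  # [('Trouser', 2)]
```

With a real download, you could pass an open file handle instead of the sample list, since iterating over a text file yields its lines.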

And we’re done! :)


In this post, you learned how to deploy a non-MLflow model using a batch endpoint. Compared to deploying an MLflow model, the workflow you learned today is a bit more involved:

  • We always need to write custom code to test local inference (when using MLflow, we may be able to use the MLflow CLI).
  • We need to write a scoring file that runs in the cloud when the endpoint is invoked.
  • We need to specify an environment containing the software dependencies for the scoring file.

If you’re working with Azure ML, it’s important to understand the tradeoffs of different solutions so that you can make the best decision for your scenario. Hopefully this blog post helped you do just that.

I’ll conclude with a table summarizing local endpoint input type, local inference recommended technique, and cloud endpoint input type for managed online endpoints and batch endpoints, assuming that your model is not saved using MLflow. For more details about the managed online endpoint scenario, check out my blog post on the topic.

Table comparing non-MLflow endpoints.

Thank you for reading!
