Introduction to Azure ML





If you’re used to training machine learning models on your local hardware, you may have experienced frustration as you hit the limits of what it can do. Maybe you’ve trained a model overnight or over several days for a school project, preventing you from using your laptop for other homework. Or maybe you’ve deployed a model on your company’s hardware, only to find that your product went viral unexpectedly and your hardware couldn’t scale to meet demand. Or maybe, as your team grows, you need a better way to manage the workflow of your machine learning project. These are a few common scenarios that you can address by moving your training and deployment to the cloud.

There are currently three major offerings in the AI cloud space: Amazon SageMaker from AWS, Vertex AI from Google Cloud, and Azure ML from Microsoft. This article is the first in a series that will cover Azure ML in detail.

In this post, I will provide you with an overview of the major Azure ML concepts you need to understand to become effective at using this platform. Learning the concepts will make it easier for you to follow the code samples in my upcoming posts, and will provide a good foundation for writing your own code.

Let’s get started.

Major concepts in Azure ML

In essence, Azure ML allows you to train and deploy your machine learning models in the cloud. This sounds pretty simple, but once you start analyzing all the different scenarios that ML practitioners need to support, there’s more to it than meets the eye. Our goal for Azure ML is to provide our users with a platform that strikes the right balance between flexibility and simplicity.

When working with Azure ML, you’ll create several cloud resources that will help you implement your scenario. The diagram below shows the main resources that make up Azure ML:

Overview of Azure ML resources

Here’s a high-level overview of these resources:

(I’ll explain why some of these are highlighted in orange later.)


A workspace is a central resource used to register all other resources, and to track the history of your training job executions. You can learn more about it in the documentation.


A datastore is a reference to an existing storage account on Azure. You can learn more about it in the documentation.


A job is a resource that specifies all the details needed to execute your training code in the cloud: inputs and outputs, the type of hardware to use, software to install, and how to run your code. You can learn more about training models in the documentation. There are currently four types of jobs:

  • Command Job - Contains information to execute a single command. You can learn more about this in my blog post.

  • Pipeline Job - Enables you to break up your machine learning task into several steps, and to specify their inter-dependencies. Check out my blog post on the topic to learn more.

  • Sweep Job - Helps you do hyperparameter tuning. Learn more about this topic in my blog post.

  • AutoML Job - Automates the process of comparing the performance of different models with different parameters, for your particular scenario.
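To make the idea of a job more concrete, here is a minimal Python sketch of the kind of information a command job carries. It uses only the standard library and is not the Azure ML SDK: the class name CommandJobSpec and the script, environment, and cluster names are hypothetical placeholders. The field names loosely mirror keys used in Azure ML’s command job YAML, so treat this as an illustration rather than a reference.

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class CommandJobSpec:
    """Illustrative stand-in for the details a command job captures."""
    command: str       # how to run your code
    code: str          # folder that contains the training code
    environment: str   # software to install
    compute: str       # hardware to run on
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)


job = CommandJobSpec(
    command="python train.py --data ${{inputs.training_data}}",
    code="./src",
    environment="my-training-environment",  # hypothetical name
    compute="my-cpu-cluster",               # hypothetical name
    inputs={"training_data": {"type": "uri_folder", "path": "./data"}},
    outputs={"model": {"type": "mlflow_model"}},
)

# Print the specification so you can see every detail in one place.
print(json.dumps(asdict(job), indent=2))
```

Each field maps to one of the details listed above: inputs and outputs, the hardware (compute), the software (environment), and how to run the code (command).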


Azure ML supports four different types of assets:

  • Data - A reference to a data source location. This is the data that will be used to train your model.

  • Model - A reference to the location of a trained model.

  • Environment - The software runtime and libraries that you want installed on the hardware where you’ll be training or deploying your model. You can learn more about it in my blog post about environments.

  • Component - A reusable piece of code with inputs and outputs, similar to a function. Components are typically used as steps in a pipeline job, so I talk about them in my blog post on pipelines.


You can learn everything about compute in my blog post on this topic. Azure ML supports three major types of compute:

  • Compute Instance - A virtual machine set up for running ML code during development.

  • Compute Cluster (AmlCompute) - A set of virtual machines that can scale up and down automatically based on workload.

  • Attached Compute - Any compute target that you manage yourself, outside of Azure ML. The most commonly used one is Azure Arc-enabled Kubernetes, which you can read about in the documentation.


An endpoint is a web-based API for feeding data to your model and getting back inference results. In Azure ML, an endpoint can have any number of deployments, which specify the resources that do the actual inferencing. You can learn more about endpoints and deployments in the documentation.

  • Managed Online Endpoint - Endpoint designed to quickly process smaller requests and provide near-immediate responses. I cover these in two blog posts, one focusing on MLflow models, and another focusing on non-MLflow models.

  • Batch Endpoint - Endpoint designed to handle large requests, working asynchronously and generating results that are held in blob storage. I also cover these in two blog posts, one focusing on MLflow models, and another focusing on non-MLflow models.

  • Managed Online Deployment and Batch Deployment - When using these deployments, Azure manages compute resources, OS updates, scaling, and security.

  • Kubernetes Online Deployment - When using this deployment, you manage your own resources using Kubernetes.

Different ways of creating resources

Now that you have a good high-level understanding of the resources that make up Azure ML, you might be wondering how you can create these resources. Azure ML supports four different methods for resource management:

  • Azure ML Studio

    The Azure ML Studio is a web portal that allows you to create, maintain, and visualize your Azure ML resources. Here’s a screenshot of what my portal looks like at the moment:

    Screenshot showing the Azure ML Studio.

    You can see that the menu on the left aligns roughly with the Azure ML resources in the earlier diagram. The menu item I have selected, “Jobs,” shows all the training jobs I completed recently. I can click on each of them to see more details. As you learn more about these resources in upcoming blog posts, the UI in the Studio will start to make sense to you. You can learn more about the Azure ML Studio in the documentation.

  • Azure ML CLI

    The Azure ML CLI is an extension to the Azure CLI, providing commands that allow you to manipulate resources by specifying their details in YAML files. Azure ML CLI commands follow the pattern “az ml <noun> <verb> <options>”. For example:

    az ml model create --file model.yml

    How do you know which nouns you can use? The nouns are marked in orange in my diagram earlier in this post. You can also run az ml --help. To know which verbs are allowed for each noun, you can run az ml <noun> --help.

    You can install the Azure ML CLI by following these instructions. A full reference for these commands can be found in the documentation.

  • Azure ML SDK

    The Azure ML SDK is a Python package that allows you to create and maintain Azure ML resources using code. You can install it by following these instructions, and you can find a full reference for this package in the documentation.

  • REST

    Azure ML also provides a REST API that allows you to manipulate resources. You can learn more about it in the documentation.
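The “az ml <noun> <verb> <options>” pattern described for the CLI is regular enough that you can compose commands programmatically. The sketch below is a hypothetical helper (az_ml_command is my own name, not part of any Azure tool) that only builds the command string and does not run it. The file name job.yml is a placeholder.

```python
import shlex


def az_ml_command(noun, verb, **options):
    """Compose an `az ml <noun> <verb> <options>` command line as a string."""
    parts = ["az", "ml", noun, verb]
    for name, value in options.items():
        # A keyword such as `resource_group` becomes the flag `--resource-group`.
        parts.append("--" + name.replace("_", "-"))
        parts.append(str(value))
    # shlex.join quotes any argument that contains spaces or special characters.
    return shlex.join(parts)


print(az_ml_command("model", "create", file="model.yml"))
print(az_ml_command("job", "create", file="job.yml"))
```

The first call reproduces the example command shown for the CLI. In practice you would usually type these commands directly or place them in a shell script.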

Which of these four Azure ML resource management methods should you choose? Well, it depends.

The Azure ML Studio guides you through many of the less intuitive choices you need to make, so it’s great for beginners. It also shows all the resources you’ve created, together with their relevant details, in a single location, and it offers great visualizations for your logs. However, it doesn’t provide you with an efficient way to repeat your steps. As an advanced user, I use it in three situations: when I want to double-check certain choices while creating resources with one of the more repeatable methods, when I need to visualize my resources and logs, and when I need to create or maintain resources and know that I won’t need to repeat those steps in the future.

I personally use the Azure ML CLI method the most, for the following reasons:

  • It’s easily repeatable. For example, if my resource creation isn’t quite right the first time, I can easily update it by making minor tweaks in the YAML file and re-running the CLI command.
  • It’s language independent (unlike the SDK, which I’ll cover next). You might think that this is not an advantage for you, because everyone on your team is using Python anyway. But maybe you’ll hire a great data scientist who’s more comfortable with R. Or maybe Julia overtakes Python in popularity someday, and your team ports all existing code to Julia. Using the CLI might pay off in the future.
  • It’s the easiest method to separate your machine learning logic from your cloud-specific logic. What I mean by this is that it should be absolutely clear to anyone looking at your project which files are needed to solve your machine learning task, and which files are needed to bring training and deployment to the cloud. If your team decides to switch to a different cloud provider in the future, it should be easy to replace a set of cloud-related files with another, without touching the actual machine learning code. When using the CLI, all Azure ML configuration files are written in YAML, and your ML code is written in Python or R, so this distinction is clear.

The Azure ML Python SDK is also great when you need a repeatable way to manipulate resources. The majority of data scientists and ML engineers are familiar with Python these days, and are pretty comfortable adding a new package to their workflow. As I mentioned, I tend to prefer the CLI for most uses, but there are some scenarios that have me reaching for the SDK. For example, if I need to create a large number of components in a pipeline and they’re all similar to each other, it’s much easier to use a for loop in code than to copy and paste the same YAML multiple times. Also, I think it’s not that hard to keep your machine learning code separate from your cloud code, with a little bit of intentional project planning. I discuss how I organize my SDK code in this blog post.
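Here’s a small sketch of why a loop helps when many components look alike. To keep it runnable without an Azure subscription, it uses plain Python dictionaries as stand-ins for component definitions rather than actual SDK objects. The step names, script names, and the make_component helper are all hypothetical.

```python
# Hypothetical preprocessing steps that differ only in name and script.
STEP_NAMES = ["resize", "normalize", "augment"]


def make_component(step_name):
    """Return a plain dict describing one pipeline step (not an SDK object)."""
    return {
        "name": f"{step_name}_images",
        "type": "command",
        "code": "./src",
        "command": "python " + step_name + ".py --data ${{inputs.data}}",
        "inputs": {"data": {"type": "uri_folder"}},
    }


# One loop replaces three nearly identical copy-and-pasted definitions.
components = []
for step_name in STEP_NAMES:
    components.append(make_component(step_name))

for component in components:
    print(component["name"], "->", component["command"])
```

Adding a fourth step means adding one string to the list, instead of duplicating another block of YAML.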

The REST API is mostly used by ISVs (Independent Software Vendors) that are building solutions on top of Azure ML. I rarely use our REST API.

You can choose a single method for your project, or you can mix them in a single project, as I explain in this blog post. If you’re a data scientist or ML engineer, I think that you should be familiar with the first three methods, so that you can choose whichever one is most appropriate for each scenario.

Notebooks vs Python files

Azure ML works with both notebooks and Python files, and you’ll find many sample apps using both methods. In this blog, however, I’ll be using projects containing Python files pretty much exclusively. I find that notebooks are great for the initial phase where I’m experimenting with different techniques, but they’re not ideal for deploying my code to production. Since the purpose of Azure ML is to get our code ready for production, I’ve decided to demonstrate Azure ML techniques with projects containing Python files only.

Why are notebooks less than ideal for deploying production code?

There are many reasons. As I mentioned earlier, I think it’s important to keep your ML logic separate from your cloud solution logic. This is much easier to achieve using a project that contains multiple Python and YAML files, as opposed to a single monolithic notebook file.

Also, I like to create utility files that I reuse across many projects. I have files with reusable machine learning code, plotting code, image-related code, and so on. Reusing common code makes me much more productive, because I don’t need to start from scratch for each project. It’s far easier to include and organize these utilities in a new project if I use Python files instead of notebooks.

In addition, keep in mind that Command Components contain reusable pieces of code that must be defined in separate Python files. You could develop the code within a notebook, then resort to the %%writefile cell magic to save it in a Python file. However, as the complexity of your project increases, this approach quickly becomes unmaintainable.

These are just some of the reasons why I’ve decided not to use notebooks in this blog. But if you prefer to use notebooks for your project, all the concepts you’ll learn here will still apply, and you can easily convert my code into notebook form.


MLflow is an open source platform for the machine learning lifecycle. Many companies and individuals contribute to this open source project, including Microsoft. Azure ML has fully embraced MLflow, and my recommendation is for you to take advantage of its capabilities if you can. I’ll show examples that work with and without MLflow in the blog posts coming up in this series, and you’ll see first-hand the many advantages of using it. For now, I’ll give you a quick overview of its functionality.

  • MLflow tracking

    I love MLflow’s ability to track different types of data (numbers, strings, dictionaries, plots, images, files, and more), and its support for visualizing that data. For example, whenever I log a numeric metric like accuracy or loss while training, MLflow will show me a graph without the need for any additional code. When training locally, you can use the MLflow UI for visualization. And because MLflow is tightly integrated into Azure ML, you can also use the Azure ML Studio to see the same visualizations, whether you’re training locally or in the cloud.

  • MLflow model deployment

    MLflow defines a standard format for packaging machine learning models, regardless of which technology was used to create them. This enables services to consume different types of models by simply supporting the interface defined by MLflow. As you can imagine, this is a very attractive feature because it facilitates the sharing of models across different technologies. MLflow’s standard format has gained wide adoption in the industry, so it’s no surprise that Azure ML supports it too.

    One advantage of saving a model using MLflow when working with Azure ML becomes apparent when you deploy the model. Deploying a non-MLflow model requires you to write a “scoring” code file, but deploying an MLflow model can be accomplished without writing any additional code. You can read more about this scenario in the documentation.

  • MLflow model registry

    You can manage your Azure ML model registry using any of the resource management methods I explained earlier in this post (CLI, SDK, or Studio). Alternatively, you can use MLflow’s model registry API. Any registry calls you make through the MLflow API also update your Azure ML model registry, whether your code runs in the cloud or locally with a tracking URI. You can read more about this in the documentation, or you can look at this code sample for more details.

  • MLflow projects

    MLflow also defines a standard format for packaging your data science code. You can use the MLflow project API to run and track your code locally, to run your code locally and track it in the cloud, or to both run and track your code in the cloud. You can read more about this scenario in the documentation.
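To give you a feel for the “scoring” file that a non-MLflow deployment needs, here is a minimal sketch. It follows the convention of an init function that runs once when the deployment starts and a run function that handles each request. The model here is a stand-in that simply sums each row, and the "data" key in the request is an assumption for illustration. A real scoring file would load your trained model from disk in init.

```python
import json

model = None


def init():
    """Called once when the deployment starts: load the model into memory."""
    global model
    # Stand-in model (sums each row); a real file would load a trained model.
    model = lambda rows: [sum(row) for row in rows]


def run(raw_data):
    """Called for each request: parse the input, predict, return the result."""
    rows = json.loads(raw_data)["data"]
    return {"predictions": model(rows)}


# Simulate what the inference server does: initialize once, then score.
init()
print(run(json.dumps({"data": [[1, 2], [3, 4]]})))
```

With an MLflow model, this file isn’t needed, because the packaged model already describes how it should be loaded and invoked.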


I hope this article gave you a good high-level overview of Azure ML’s major concepts. In future posts, I’ll dive into the details of how you can use these concepts to train and deploy machine learning models on Azure ML.
