Today Microsoft and PyTorch announced a “PyTorch Fundamentals” tutorial, which you can find on Microsoft’s site and on PyTorch’s site. The code in this post is based on the code appearing in that tutorial, and forms the foundation for a series of other posts, where I’ll explore other machine learning frameworks and show integration with Azure ML.

In this post, I’ll explain how you can create a basic neural network in PyTorch, using the Fashion MNIST dataset as a data source. The neural network we’ll build takes as input images of clothing, and classifies them according to their contents, such as “Shirt,” “Coat,” or “Dress.”

I’ll assume that you have a basic conceptual understanding of neural networks, and that you’re comfortable with Python, but I assume no knowledge of PyTorch.

Let’s start by getting familiar with the data we’ll be using, the Fashion MNIST dataset. This dataset contains 70,000 black-and-white images of articles of clothing — 60,000 meant to be used for training and 10,000 meant for testing. The images are square and contain 28 × 28 = 784 pixels, where each pixel is represented by a value between 0 and 255. Each of these images is associated with a label, which is an integer between 0 and 9 that classifies the article of clothing. The following dictionary helps us understand the clothing categories corresponding to these integer labels:

```
labels_map = {
    0: 'T-Shirt',
    1: 'Trouser',
    2: 'Pullover',
    3: 'Dress',
    4: 'Coat',
    5: 'Sandal',
    6: 'Shirt',
    7: 'Sneaker',
    8: 'Bag',
    9: 'Ankle Boot',
}
```

Here’s a random sampling of 9 images from the dataset, along with their labels:

A PyTorch tensor is the data structure used to store the inputs and outputs of a deep learning model, as well as any parameters that need to be learned during training. It’s a super important concept to understand if you’re going to be working with PyTorch.

Mathematically speaking, a tensor is just a generalization of vectors and matrices. A vector is a one-dimensional array of values, a matrix is a two-dimensional array of values, and a tensor is an array of values with any number of dimensions. A PyTorch `tensor`, much like NumPy’s `ndarray`, gives us a way to represent multidimensional data, but with added tricks, such as the ability to perform operations on a GPU and the ability to calculate derivatives.
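To make the idea of “any number of dimensions” concrete, here is a minimal sketch that creates tensors with zero, one, two, and four dimensions. The variable names and the batch of 64 blank images are my own illustrative assumptions, not part of the original tutorial code.

```python
import torch

# A scalar, a vector, a matrix, and a 4-dimensional tensor.
scalar = torch.tensor(7.0)
vector = torch.tensor([1.0, 2.0, 3.0])
matrix = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
# Hypothetical batch of 64 single-channel 28 x 28 images, filled with zeros.
images = torch.zeros(64, 1, 28, 28)

for name, t in [('scalar', scalar), ('vector', vector),
                ('matrix', matrix), ('images', images)]:
    # ndim is the number of dimensions; shape is the size in each dimension.
    print(name, t.ndim, tuple(t.shape))
```

The `ndim` attribute reports the number of dimensions, so a scalar tensor has `ndim` 0 and an empty shape.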

Suppose we want to represent this 3 × 2 matrix in PyTorch:

$$X = \begin{bmatrix} 1 & 2 \\ 3 & 4 \\ 5 & 6 \end{bmatrix}$$

Here’s the code to create the corresponding tensor:

```
X = torch.tensor([[1, 2], [3, 4], [5, 6]])
```

We can inspect the tensor’s `shape` attribute to see how many dimensions it has and the size in each dimension. The `device` attribute tells us whether the tensor is stored on the CPU or GPU, and the `dtype` attribute indicates what kind of values it holds. We use the `type()` method to check the type of the tensor itself.

```
print(X.shape)
print(X.device)
print(X.dtype)
print(X.type())
```

```
torch.Size([3, 2])
cpu
torch.int64
torch.LongTensor
```

If you consult this table in the PyTorch docs, you’ll see that this all makes sense: a tensor with a `dtype` of `torch.int64` on the CPU has a `type` of `torch.LongTensor`.

There are many ways to move this tensor to the GPU (assuming that you have a GPU and CUDA set up on your machine). One way is to change its device to `'cuda'`:

```
device = 'cuda' if torch.cuda.is_available() else 'cpu'
X = X.to(device)
print(X.device)
print(X.type())
```

```
cuda:0
torch.cuda.LongTensor
```

If you’ve used NumPy ndarrays before, you might be happy to know that PyTorch tensors can be indexed in a familiar way. We can slice a tensor to view a smaller portion of it:

```
X = X[0:2, 0:1]
```

We get this 2 × 1 tensor:

$$\begin{bmatrix} 1 \\ 3 \end{bmatrix}$$

We can also convert tensors to and from NumPy arrays, and have a NumPy ndarray and PyTorch tensor share the same underlying memory (as long as the tensor is on the CPU, just like the ndarray):

```
X = X[0:2, 0:1].cpu() # [1, 3] on the CPU
array = X.numpy() # [1, 3]
Y = torch.from_numpy(array) # [1, 3]
array[0, 0] = 2 # [2, 3]
print(X)
print(Y)
```

```
tensor([[2],
        [3]])
tensor([[2],
        [3]])
```

If your tensor contains a single value, the `item()` method is a handy way to get that value as a scalar:

```
Z = torch.tensor([6])
scalar = Z.item()
print(scalar)
```

```
6
```

I mentioned earlier that tensors also help with calculating derivatives. I will explain how that works later in this post, in the section titled PyTorch autograd on a simple scenario.

PyTorch’s torchvision package gives us a super easy way to get the Fashion MNIST data, by simply instantiating the `datasets.FashionMNIST` class. The `root` parameter specifies the local path where we want the data to go; `train` should be set to True to get the training set, and to False to get the test set; `download` is set to True to ensure that the data is downloaded to the location specified in `root`; and `transform` contains any transformations we want to perform on the data.

In this case, we apply a `ToTensor()` transform, which does two things:

- Converts each image into a PyTorch tensor, whose shape is [number of channels, height, width]. If we were working with color images, we’d have three channels (red, green, and blue). But because our images are black and white, the number of channels is one. Height and width are both 28 pixels in our scenario, so the shape of each of our images is [1, 28, 28].
- Converts the value of each pixel from integers between 0 and 255 into floating-point numbers between 0 and 1.

```
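from typing import Tuple

from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor


def get_data(batch_size: int) -> Tuple[DataLoader, DataLoader]:
    # Download (if needed) and load the training set as tensors.
    training_data = datasets.FashionMNIST(
        root='data',
        train=True,
        download=True,
        transform=ToTensor(),
    )
    # Download (if needed) and load the test set as tensors.
    test_data = datasets.FashionMNIST(
        root='data',
        train=False,
        download=True,
        transform=ToTensor(),
    )
    # Wrap each dataset in a DataLoader that yields shuffled mini-batches.
    train_dataloader = DataLoader(training_data, batch_size=batch_size, shuffle=True)
    test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=True)
    return (train_dataloader, test_dataloader)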
```
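To make the two effects of the `ToTensor()` transform concrete, here is a rough sketch that mimics them by hand on a made-up image. This is only an approximation of the transform’s effect for illustration, not torchvision’s implementation, and `raw_image` is a hypothetical stand-in for one Fashion MNIST image.

```python
import torch

# Hypothetical 28 x 28 grayscale image with integer pixel values in [0, 255].
raw_image = torch.randint(0, 256, (28, 28), dtype=torch.uint8)

# Step 1: add a leading channel dimension, giving shape [1, 28, 28].
# Step 2: convert to floating point and rescale the values to [0, 1].
image = raw_image.unsqueeze(0).to(torch.float32) / 255.0

print(image.shape)
print(image.dtype)
print(image.min().item(), image.max().item())
```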

The `datasets.FashionMNIST` class derives from `Dataset`, a base class provided by PyTorch for holding data. If we were using custom data, we would have to create our own class that derives from `Dataset` and override two functions: `__len__(self)` returns the length of the dataset, and `__getitem__(self, idx)` returns the item corresponding to an index. In our scenario, torchvision makes life easy for us by giving us the data already contained in a `Dataset`.
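As a minimal sketch of what a custom `Dataset` could look like, the class below overrides the two functions mentioned above. The `SquaresDataset` name and its contents are hypothetical and chosen only to keep the example small.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical dataset whose item at index i is the pair (i, i squared)."""

    def __init__(self, size: int):
        self.size = size

    def __len__(self) -> int:
        # Number of items in the dataset.
        return self.size

    def __getitem__(self, idx: int):
        # Return the (input, label) pair for the given index.
        x = torch.tensor([float(idx)])
        y = torch.tensor([float(idx ** 2)])
        return x, y


dataset = SquaresDataset(5)
print(len(dataset))
print(dataset[3])
```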

We then wrap the training and test datasets into instances of the `DataLoader` class, which gives us an iterable over the `Dataset`. If we were to write a `for` loop over one of these `DataLoader` instances, each iteration would retrieve the next batch of images and corresponding labels in the `Dataset`, where the number of images is determined by the `batch_size` parameter passed to the `DataLoader` constructor. This functionality will be important later, when we train our model, so we’ll come back to it.
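Here is a small sketch of what iterating over a `DataLoader` looks like. To avoid downloading anything, it uses a `TensorDataset` of 10 random “images” as an assumed stand-in for Fashion MNIST; with a batch size of 4, the last batch holds the 2 leftover items.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical stand-in for Fashion MNIST: 10 random images with random labels.
images = torch.rand(10, 1, 28, 28)
labels = torch.randint(0, 10, (10,))
dataloader = DataLoader(TensorDataset(images, labels), batch_size=4, shuffle=False)

batch_shapes = []
for X, y in dataloader:
    # Each iteration yields one mini-batch of images and their labels.
    batch_shapes.append((tuple(X.shape), tuple(y.shape)))
    print(X.shape, y.shape)
```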

Our goal is to classify an input image into one of the 10 classes of clothing, so we will define our neural network to take as input a tensor of shape [1, 28, 28] and output a vector of size 10, where the index of the largest value in the output corresponds to the integer label for the class of clothing in the image. For example, if we use an image of an ankle boot as input, we might get an output vector whose largest value appears at index 9 (counting from zero). As we showed in the Data section above, index 9 corresponds to the “Ankle Boot” category, so this would indicate that our neural network correctly classified the image of an ankle boot.

Here’s a visualization of the structure of the neural network we chose for this scenario:

Because each image has 28 × 28 = 784 pixels, we need 784 nodes in the input layer (one for each pixel value). We decided to add two hidden layers with 512 nodes, each followed by a ReLU (rectified linear unit) activation function. We want the output of our network to be a vector of size 10, therefore our output layer needs to have 10 nodes.

In PyTorch, a neural network is defined as a class that derives from the `nn.Module` base class. Here’s the code that represents our network design:

```
import torch
from torch import nn


class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        self.sequence = nn.Sequential(
            nn.Flatten(),
            nn.Linear(28*28, 512),
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, 10)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y_prime = self.sequence(x)
        return y_prime
```

The `Flatten` layer turns our input tensor of shape [1, 28, 28] into a vector of size 784. The `Linear` layers are also known as “fully connected” or “dense” layers because they connect all nodes from the previous layer with each of their own nodes. Notice how each `Linear` constructor is passed the size of the previous layer and the size of the current layer. The `ReLU` layers take the output of the previous layer and pass it through a “Rectified Linear Unit” activation function, which adds non-linearity to the computation. The `Sequential` layer combines all the other layers. Lastly, we define the `forward` method, which supplies a tensor `x` as input to the `sequence` of layers and produces the `y_prime` vector as a result.
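One quick way to sanity-check a design like this is to push a batch of random data through it and inspect the output shape and parameter count. The sketch below rebuilds the same stack of layers as a bare `nn.Sequential` so that it runs on its own; the batch of 64 random images is an assumption for illustration.

```python
import torch
from torch import nn

# The same layers as in the network above, rebuilt here so the snippet is self-contained.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 512),
    nn.ReLU(),
    nn.Linear(512, 512),
    nn.ReLU(),
    nn.Linear(512, 10),
)

# Hypothetical mini-batch of 64 random single-channel 28 x 28 images.
X = torch.rand(64, 1, 28, 28)
y_prime = model(X)

# Total number of learnable values (weights and biases) across all Linear layers.
num_parameters = sum(p.numel() for p in model.parameters())

print(y_prime.shape)
print(num_parameters)
```

`Flatten` keeps the batch dimension and flattens the rest, so each image yields one vector of 10 values. The parameter count is (784 × 512 + 512) + (512 × 512 + 512) + (512 × 10 + 10) = 669,706.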

Now that we have a neural network model, we’re interested in training it, but part of the training process requires calculating derivatives that involve tensors. So let’s learn about PyTorch’s built-in automatic differentiation engine, `autograd`, using a very simple example. Let’s consider the following two tensors:

$$U = \begin{bmatrix} 1 & 2 \end{bmatrix}, \quad V = \begin{bmatrix} 3 & 4 \\ 5 & 6 \end{bmatrix}$$

Now let’s suppose that we want to multiply $U$ by $V$ to get $W$, and then sum all the values in $W$ to get a scalar $f$:

$$W = UV, \quad f = \sum_{j} W_{1j}$$

Our goal is to calculate the derivative of $f$ with respect to each value of $U$ and $V$. We set the `requires_grad` flag to true during the construction of the tensors to tell `autograd` to help us calculate those derivatives. Then, when we execute a function that takes our tensors as input, `autograd` records any information necessary to compute the derivative of that function with respect to our tensors. Calling `backward()` on the output tensor then kicks off the actual computations of those derivatives. Afterwards, we can access the derivatives by inspecting the `grad` attribute of each input tensor.

```
# Decimal points in tensor values ensure they are floats, which autograd requires.
U = torch.tensor([[1., 2.]], requires_grad=True)
V = torch.tensor([[3., 4.], [5., 6.]], requires_grad=True)
W = torch.matmul(U, V)
f = W.sum()
f.backward()
print(U.grad)
print(V.grad)
```

```
tensor([[ 7., 11.]])
tensor([[1., 1.],
        [2., 2.]])
```

Internally, a dynamic directed acyclic graph (DAG) of instances of type `Function` is created to represent the operations we’re performing on our tensors. These `Function` instances have a `forward` method that, given one or more inputs, executes the computations needed to produce an output. And they have a `backward` method that calculates the derivatives of the function with respect to each of its inputs. The diagram below illustrates the DAG that corresponds to our current example. Instances of type `Tensor` flow left-to-right as the functions are being executed, and right-to-left as the derivatives are being calculated.

Let’s take a look at the math used to compute the derivatives. You only need to understand matrix multiplication and partial derivatives to follow along, but if the math isn’t as interesting to you, feel free to skip to the next section.

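We’ll start by thinking of $U$ and $V$ in terms of their individual elements:

$$U = \begin{bmatrix} u_1 & u_2 \end{bmatrix}, \quad V = \begin{bmatrix} v_{11} & v_{12} \\ v_{21} & v_{22} \end{bmatrix}$$

Then the scalar function $f$ is the sum of the elements of $W = UV$:

$$W = \begin{bmatrix} u_1 v_{11} + u_2 v_{21} & u_1 v_{12} + u_2 v_{22} \end{bmatrix}$$

$$f = u_1 v_{11} + u_2 v_{21} + u_1 v_{12} + u_2 v_{22}$$

We can now calculate the derivatives of $f$ with respect to $U$ and $V$:

$$\frac{\partial f}{\partial U} = \begin{bmatrix} v_{11} + v_{12} & v_{21} + v_{22} \end{bmatrix} = \begin{bmatrix} 7 & 11 \end{bmatrix}$$

$$\frac{\partial f}{\partial V} = \begin{bmatrix} u_1 & u_1 \\ u_2 & u_2 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 2 & 2 \end{bmatrix}$$

As you can see, when we plug in the numerical values of $U$ and $V$, the result matches the output that `autograd` gave for `U.grad` and `V.grad`.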
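The hand calculation can also be checked in code. For the function that sums the product of `U` and `V`, the derivative with respect to `U` holds the row sums of `V`, and the derivative with respect to `V` repeats each element of `U` across a row. The sketch below compares those closed-form results with what `autograd` reports; the `expected_*` names are my own.

```python
import torch

U = torch.tensor([[1., 2.]], requires_grad=True)
V = torch.tensor([[3., 4.], [5., 6.]], requires_grad=True)
f = torch.matmul(U, V).sum()
f.backward()

# Closed form: df/dU is the row sums of V, shaped like U.
expected_U_grad = V.detach().sum(dim=1).reshape(1, 2)
# Closed form: df/dV repeats each element of U across a row.
expected_V_grad = U.detach().T.expand(2, 2)

print(torch.equal(U.grad, expected_U_grad))
print(torch.equal(V.grad, expected_V_grad))
```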

We saw how automatic differentiation works for a simple calculation. Let’s now explore how it works for a neural network. We’ll consider a very small neural network, consisting of an input layer and an output layer, with no hidden layers:

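When training a neural network, our goal is to find parameter values that will enable the network to produce predicted labels as similar as possible to the actual labels provided in the training data. In our current very small network, the parameters include two weights $w_1$ and $w_2$, and a bias $b$.

Let’s now analyze the calculations that happen when we give it some input data. We can represent the input data and weights as vectors:

$$X = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}, \quad W = \begin{bmatrix} w_1 & w_2 \end{bmatrix}$$

The calculations in the linear layer give us the predicted value $y'$:

$$y' = WX + b = w_1 x_1 + w_2 x_2 + b$$

We’ll use the `MSELoss` function to calculate the loss as the mean squared error between the predicted value $y'$ and the actual value $y$. With a single prediction, this is simply:

$$loss = (y' - y)^2$$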

The DAG for this simple neural network should now be easy to understand:

In the scenarios shown here, our graphs are really just simple trees. But as you can imagine, when building larger neural networks, the complexity of these graphs increases.

Technically, you could write the code for this simple neural network by spelling out all the individual operations, as in the code sample below. However, you typically wouldn’t; you would instead create a neural network similar to the one I showed earlier, but with a single `Linear` layer. Creating a neural network with a `Linear` layer is a simpler solution with a higher level of abstraction, and therefore scales better to more complex networks. But for now, let’s write out all the steps:

```
W = torch.tensor([[1., 2.]], requires_grad=True)
X = torch.tensor([[3.], [4.]])
b = torch.tensor([5.], requires_grad=True)
y = torch.tensor([[6.]])
y_prime = torch.matmul(W, X) + b
loss_fn = torch.nn.MSELoss()
loss = loss_fn(y_prime, y)
loss.backward()
print(W.grad)
print(b.grad)
```

```
tensor([[60., 80.]])
tensor([20.])
```
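These gradients can be reproduced by hand with the chain rule. For a squared-error loss on a single prediction, the derivative with respect to the weights is twice the error times the input, and the derivative with respect to the bias is twice the error. The sketch below is my own check rather than part of the original sample, and the `manual_*` names are hypothetical.

```python
import torch

W = torch.tensor([[1., 2.]], requires_grad=True)
X = torch.tensor([[3.], [4.]])
b = torch.tensor([5.], requires_grad=True)
y = torch.tensor([[6.]])

y_prime = torch.matmul(W, X) + b            # 1*3 + 2*4 + 5 = 16
loss = torch.nn.MSELoss()(y_prime, y)       # (16 - 6)^2 = 100
loss.backward()

# Chain rule by hand: d(loss)/dW = 2 * (y' - y) * X^T and d(loss)/db = 2 * (y' - y).
error = (y_prime - y).detach()              # 10
manual_W_grad = 2 * error * X.T             # 20 * [3, 4]
manual_b_grad = (2 * error).reshape(1)      # 20

print(torch.allclose(W.grad, manual_W_grad))
print(torch.allclose(b.grad, manual_b_grad))
```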

Setting `requires_grad` to `True` for a particular tensor should only be done when we need to calculate the gradient with respect to that tensor, because it adds a bit of overhead to the forward pass. Notice that in the code above, I only add `requires_grad` to the `W` and `b` tensors, which are the parameters we want to learn, and not to the input `X` or the label `y`.

Also, there are some scenarios where we want to do a forward pass in the neural network without calculating any of the gradients, for example when we want to test the network, make a prediction, or run the frozen layers of an already-trained network that we’re fine-tuning. In those scenarios, we can wrap the operations in a `torch.no_grad()` block, which tells `autograd` to skip any gradient-related setup in the forward pass.

```
U = torch.tensor([[1., 2.]], requires_grad=True)
V = torch.tensor([[3., 4.], [5., 6.]], requires_grad=True)
W = torch.matmul(U, V)
print(W.requires_grad)
with torch.no_grad():
    W = torch.matmul(U, V)
print(W.requires_grad)
```

```
True
False
```

One interesting property of PyTorch DAGs is that they are dynamic. This means that after each forward and backward pass, we’re free to change the structure of the graph and the shape of the tensors that flow through it, because the graph will be re-created in the next pass. This is a very powerful feature that allows tremendous flexibility when training a model.
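As a small illustration of this dynamic behavior, the sketch below runs a function whose number of operations depends on an ordinary Python argument. Each call builds a fresh graph, so the backward pass works no matter how many times the loop ran. The `repeated_double` function is a made-up example, not part of the original code.

```python
import torch


def repeated_double(x: torch.Tensor, times: int) -> torch.Tensor:
    # The number of multiplications recorded in the graph depends on `times`.
    for _ in range(times):
        x = x * 2
    return x.sum()


w = torch.tensor([1., 1.], requires_grad=True)
grads = []
for times in (1, 3):
    w.grad = None                          # discard the gradient from the previous pass
    repeated_double(w, times).backward()   # a new graph is built on every call
    grads.append(w.grad.clone())
    print(times, w.grad)
```

Doubling once gives a derivative of 2 for each element, and doubling three times gives 8.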

In order to understand what happens during training, we need to add a little bit more detail to our neural network visualization.

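Notice that we’ve added weights $W$ and biases $b$ to the `Linear` layers. Our goal when training our network (also known as fitting) is to find the parameters $W$ and $b$ that minimize a loss function, which measures how far the predicted labels are from the actual labels. In this scenario we use `CrossEntropyLoss`, which is provided to us by PyTorch.

Mathematically speaking, we can now think of our neural network as a function $\ell(X, y, W, b)$ that takes the input data $X$, the actual labels $y$, and the parameters $W$ and $b$, and returns the loss as a scalar.

Our goal is to find the parameters $W$ and $b$ that minimize $\ell$.

To implement the gradient descent algorithm, we iteratively improve our estimates of $W$ and $b$ according to these update formulas:

$$W \leftarrow W - \alpha \frac{\partial \ell}{\partial W}$$

$$b \leftarrow b - \alpha \frac{\partial \ell}{\partial b}$$

The parameter $\alpha$ is the learning rate, which controls the size of each step. The derivatives are calculated using `autograd`, which we learned in the previous section.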
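To see gradient descent in isolation, here is a minimal sketch that applies the idea by hand to the tiny one-layer network from the previous section: on each step, every parameter moves against its gradient, scaled by a learning rate. The learning rate of 0.01 and the 20 steps are arbitrary choices of mine, and in practice an optimizer object does this work for us.

```python
import torch

W = torch.tensor([[1., 2.]], requires_grad=True)
b = torch.tensor([5.], requires_grad=True)
X = torch.tensor([[3.], [4.]])
y = torch.tensor([[6.]])
alpha = 0.01  # assumed learning rate, chosen only for this illustration
loss_fn = torch.nn.MSELoss()

losses = []
for step in range(20):
    loss = loss_fn(torch.matmul(W, X) + b, y)
    losses.append(loss.item())
    loss.backward()
    # Update the parameters without recording the update in the graph.
    with torch.no_grad():
        W -= alpha * W.grad
        b -= alpha * b.grad
    # Reset the gradients so the next backward pass starts from zero.
    W.grad.zero_()
    b.grad.zero_()

print(losses[0], losses[-1])
```

The loss starts at 100 and shrinks toward zero as the parameters are updated.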

When we put all of these ideas together, we get the backpropagation algorithm. This algorithm consists of four steps:

- a forward pass through the model to compute the predicted value, `y_prime = model(X)`
- a calculation of the loss using a loss function, `loss = loss_fn(y_prime, y)`
- a backward pass from the loss function through the model to calculate derivatives, `loss.backward()`
- a gradient descent step to update $W$ and $b$ using the derivatives calculated in the backward pass, `optimizer.step()`

Here’s the complete code:

```
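from typing import Tuple

import torch
from torch.nn import CrossEntropyLoss
from torch.optim import Optimizer


def fit_one_batch(X: torch.Tensor, y: torch.Tensor, model: NeuralNetwork,
                  loss_fn: CrossEntropyLoss, optimizer: Optimizer) -> Tuple[torch.Tensor, torch.Tensor]:
    y_prime = model(X)           # forward pass
    loss = loss_fn(y_prime, y)   # loss calculation
    optimizer.zero_grad()        # reset gradients from the previous step
    loss.backward()              # backward pass
    optimizer.step()             # gradient descent step
    return (y_prime, loss)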
```

The reason we call `zero_grad()` before the backward pass is a bit of a technicality. During the backward pass, gradients are accumulated by adding them to the `grad` attribute of a tensor, so we need to explicitly ensure that the `grad` attribute of each parameter is reset to zero before this pass.
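The accumulation behavior is easy to observe on a single tensor. In this small sketch, which is my own illustration, calling `backward()` twice without resetting doubles the stored gradient, and zeroing it restores the expected value.

```python
import torch

w = torch.tensor(2., requires_grad=True)

(3 * w).backward()
first = w.grad.item()    # derivative of 3w is 3

(3 * w).backward()
second = w.grad.item()   # the new gradient is added to the old one

w.grad.zero_()           # reset the gradient, as optimizer.zero_grad() does for each parameter
(3 * w).backward()
third = w.grad.item()

print(first, second, third)
```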

You might be wondering how many images we pass as our input `X`. Recall that when we created our `DataLoader` instances, we passed a `batch_size` to the constructor; this is the size we chose for our mini-batches. A `DataLoader` gives us an iterator that returns a mini-batch of data on each iteration. Therefore, we just need to iterate over a `DataLoader`, and pass each mini-batch to the backpropagation algorithm to advance one more step in the gradient descent algorithm. Each time we do that, we’re one step closer to discovering weights $W$ and biases $b$ that give us a low loss.

```
def fit(device: str, dataloader: DataLoader, model: nn.Module,
        loss_fn: CrossEntropyLoss, optimizer: Optimizer) -> None:
    ...
    for batch_index, (X, y) in enumerate(dataloader):
        ...
        fit_one_batch(X, y, model, loss_fn, optimizer)
        ...
```

I’ve omitted all but the most important lines of code here. The code I left out calculates and prints the loss and accuracy of each mini-batch, so we can follow the training progress of our network.
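For readers curious about the omitted bookkeeping, here is one way the accuracy and loss of a mini-batch might be computed. This is a sketch under my own assumptions and not the code from the original project; the tiny batch of 4 predictions over 3 classes is made up.

```python
import torch

# Hypothetical model outputs for a mini-batch of 4 images and 3 classes.
y_prime = torch.tensor([[2.0, 0.1, 0.3],
                        [0.2, 1.5, 0.1],
                        [0.3, 0.2, 0.9],
                        [1.1, 0.4, 0.2]])
# Hypothetical actual labels for the same 4 images.
y = torch.tensor([0, 1, 2, 1])

# The predicted class is the index of the largest output value in each row.
predicted = y_prime.argmax(dim=1)
correct_count = (predicted == y).sum().item()
accuracy = correct_count / len(y)
loss = torch.nn.CrossEntropyLoss()(y_prime, y).item()

print(f'accuracy: {accuracy * 100:>0.1f}%, loss: {loss:>8f}')
```

Here three of the four predictions match their labels, so the accuracy for this batch is 75%.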

An “epoch” of training refers to a complete iteration over all mini-batches in the dataset. Neural networks typically require many epochs of training to achieve good predictions. In this sample, we restrict the code to just two epochs, but in a real project you would want to set it to a much higher number.

```
def training_phase(device: str):
    learning_rate = 0.1
    batch_size = 64
    epochs = 2
    (train_dataloader, test_dataloader) = get_data(batch_size)
    model = NeuralNetwork().to(device)
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
    print('\nFitting:')
    for epoch in range(epochs):
        print(f'\nEpoch {epoch + 1}\n-------------------------------')
        fit(device, train_dataloader, model, loss_fn, optimizer)
    ...
```

Notice that we also specify an optimizer, which is how we choose the optimization algorithm we want to use. For this example, we use a variant of the gradient descent algorithm, SGD, which stands for Stochastic Gradient Descent. You can see that we pass `model.parameters()` to the optimizer’s constructor, which tells the optimizer which tensors it should modify when taking an optimization step.

There are many other types of optimizers you can choose, as you can see in the PyTorch docs. Understanding the intricacies of each one is a fascinating topic that I’ll leave for a future post.

Running the previous code produces the output below, showing that the accuracy tends to increase and the loss tends to decrease as we iterate over mini-batches and epochs.

```
Fitting:
Epoch 1
-------------------------------
[Batch 100 - 6400 items] accuracy: 49.1%, loss: 1.442805
[Batch 200 - 12800 items] accuracy: 54.8%, loss: 0.948315
[Batch 300 - 19200 items] accuracy: 57.3%, loss: 0.999786
[Batch 400 - 25600 items] accuracy: 59.5%, loss: 0.961919
[Batch 500 - 32000 items] accuracy: 61.0%, loss: 1.109915
[Batch 600 - 38400 items] accuracy: 63.3%, loss: 0.659384
[Batch 700 - 44800 items] accuracy: 65.7%, loss: 0.575162
[Batch 800 - 51200 items] accuracy: 67.6%, loss: 0.461994
[Batch 900 - 57600 items] accuracy: 69.2%, loss: 0.523541
[Batch 938 - 60000 items] accuracy: 69.8%, loss: 0.543552
Epoch 2
-------------------------------
[Batch 100 - 6400 items] accuracy: 83.3%, loss: 0.390897
[Batch 200 - 12800 items] accuracy: 83.2%, loss: 0.456197
[Batch 300 - 19200 items] accuracy: 83.6%, loss: 0.406075
[Batch 400 - 25600 items] accuracy: 83.8%, loss: 0.735562
[Batch 500 - 32000 items] accuracy: 83.8%, loss: 0.363857
[Batch 600 - 38400 items] accuracy: 83.9%, loss: 0.221844
[Batch 700 - 44800 items] accuracy: 84.0%, loss: 0.639660
[Batch 800 - 51200 items] accuracy: 84.0%, loss: 0.289021
[Batch 900 - 57600 items] accuracy: 84.1%, loss: 0.468118
[Batch 938 - 60000 items] accuracy: 84.2%, loss: 0.122774
```

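After we’ve trained the network and have found parameters $W$ and $b$ that give us a low loss on the training data, we can evaluate the network on the test data to see how well it generalizes.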

When evaluating a network, we just want to traverse the network forward and calculate the loss. We’re not learning any parameters, so we don’t need the backward step. Therefore, as we saw earlier, we can improve efficiency in the forward step by wrapping it in a `torch.no_grad()` block. The evaluation code for a single batch looks like this:

```
def evaluate_one_batch(X: torch.Tensor, y: torch.Tensor, model: NeuralNetwork,
                       loss_fn: CrossEntropyLoss) -> Tuple[torch.Tensor, torch.Tensor]:
    with torch.no_grad():
        y_prime = model(X)
        loss = loss_fn(y_prime, y)
    return (y_prime, loss)
```

The code to perform this evaluation for all batches in the dataset is similar to the corresponding code in the training section, except for the added call to `model.eval()`. This call affects the behavior of some layers (such as `Dropout` and `BatchNorm` layers), which need to work differently depending on whether we’re training or evaluating the model.
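Our network has no such layers, so the call makes no practical difference here, but the effect is easy to see on a standalone `Dropout` layer. The sketch below is my own illustration: in training mode the layer zeroes a random subset of values and rescales the rest, while in evaluation mode it passes its input through unchanged.

```python
import torch
from torch import nn

layer = nn.Dropout(p=0.5)
x = torch.ones(1000)

layer.train()          # training mode: each value is zeroed with probability p
train_out = layer(x)   # surviving values are scaled by 1 / (1 - p) = 2

layer.eval()           # evaluation mode: dropout is disabled
eval_out = layer(x)

print(train_out.unique())
print(torch.equal(eval_out, x))
```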

```
def evaluate(device: str, dataloader: DataLoader, model: nn.Module,
             loss_fn: CrossEntropyLoss) -> Tuple[float, float]:
    ...
    model.eval()
    with torch.no_grad():
        for (X, y) in dataloader:
            ...
            (y_prime, loss) = evaluate_one_batch(X, y, model, loss_fn)
            ...
```

In this project, we do the evaluation for all the test data just once, and get the test loss and accuracy. Ideally these would be pretty similar to the training values printed earlier, and if not, we might need to adjust our model or data.

```
print('\nEvaluating:')
(test_loss, test_accuracy) = evaluate(device, test_dataloader, model, loss_fn)
print(f'Test accuracy: {test_accuracy * 100:>0.1f}%, test loss: {test_loss:>8f}')
```

When you run the code above, you’ll see output similar to the following:

```
Evaluating:
Test accuracy: 85.1%, test loss: 0.415654
```

Assuming we achieved pretty good accuracy during the training and testing phases, we can now use the trained model for inference; in other words, to predict the classification of images that the network has never seen before. Just like the evaluation code, we wrap the prediction code in a `torch.no_grad()` block because we don’t need to calculate derivatives. Unlike the fitting and evaluation code though, this time we don’t need to calculate the loss. In fact, we can’t calculate the loss when classifying images whose actual label we don’t know.

By calling the model with input data, we get a vector `y_prime` with one value for each of the 10 classes, and we could simply take the `argmax` of the vector to get our prediction. In this sample, however, we use a `softmax` function first, which converts `y_prime` into probabilities that are between 0 and 1 and add up to 1. For example, the `probabilities` tensor below might tell us that our input image has 30% probability of being a dress, 25% probability of being a coat, and so on.

```
def predict(model: nn.Module, X: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        y_prime = model(X)
        probabilities = nn.functional.softmax(y_prime, dim=1)
        predicted_indices = probabilities.argmax(1)
    return predicted_indices
```
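The sketch below, which uses made-up output values for two images and three classes, shows the two properties of `softmax` that matter here: each row of probabilities adds up to 1, and the index of the largest value is the same before and after the conversion.

```python
import torch
from torch import nn

# Hypothetical raw model outputs for 2 images and 3 classes.
y_prime = torch.tensor([[1.0, 2.0, 3.0],
                        [0.5, 0.1, 0.2]])
probabilities = nn.functional.softmax(y_prime, dim=1)

print(probabilities)
print(probabilities.sum(dim=1))                    # each row adds up to 1
print(y_prime.argmax(1), probabilities.argmax(1))  # same predicted indices
```

Because `softmax` preserves the ordering of the values, it doesn’t change the prediction; it only makes the outputs easier to interpret.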

The true power of a neural network is its ability to make predictions for data that has not been seen during training or testing. But for the sake of simplicity, we’ll demonstrate how prediction works by re-using a few of the images that we used during testing. The code below retrieves the first three images in our test data and uses them to make predictions. We then print the actual and predicted labels.

```
def inference_phase(device: str):
    batch_size = 64
    ...
    model.eval()
    (_, test_dataloader) = get_data(batch_size)
    (X_batch, actual_index_batch) = next(iter(test_dataloader))
    X = X_batch[0:3, :, :, :]
    X = X.to(device)
    actual_indices = actual_index_batch[0:3]
    predicted_indices = predict(model, X)
    print('\nPredicting:')
    for (actual_index, predicted_index) in zip(actual_indices, predicted_indices):
        actual_name = labels_map[actual_index.item()]
        predicted_name = labels_map[predicted_index.item()]
        print(f'Actual: {actual_name}, Predicted: {predicted_name}')
```

With only two epochs of training, our network isn’t very accurate, but it gets two out of three predictions right:

```
Predicting:
Actual: Pullover, Predicted: Pullover
Actual: Shirt, Predicted: Pullover
Actual: Sneaker, Predicted: Sneaker
```

For the sake of simplicity, the code sample for this post includes the training, testing, and prediction phases in one program. In practice, though, training and testing are performed together, while prediction is often done in a separate program, executed at a different time or on a different machine. To help with this scenario, PyTorch offers a variety of ways to save and load a trained neural network model.

In this project, at the end of the fitting and evaluation phases, we save the optimized values of $W$ and $b$ using `torch.save()` and `model.state_dict()`.

```
torch.save(model.state_dict(), 'outputs/weights.pth')
```

Then, once we’re ready to do inference, we create a new model using the `NeuralNetwork` constructor and populate its parameters by calling `model.load_state_dict()` and `torch.load()`.

```
model = NeuralNetwork().to(device)
model.load_state_dict(torch.load('outputs/weights.pth'))
```
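Here is a self-contained sketch of the full round trip, using a temporary folder and a deliberately tiny model so that it runs quickly. The `make_model` helper and the file location are assumptions for illustration. Passing `map_location` to `torch.load()` is optional, but it can help when the weights were saved on a GPU and are loaded on a machine without one.

```python
import os
import tempfile

import torch
from torch import nn


def make_model() -> nn.Module:
    # Hypothetical small model, standing in for the NeuralNetwork class.
    return nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))


trained = make_model()
X = torch.rand(2, 1, 28, 28)

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'weights.pth')
    # Save only the parameters, not the model class itself.
    torch.save(trained.state_dict(), path)
    # Create a fresh model and copy the saved parameters into it.
    restored = make_model()
    restored.load_state_dict(torch.load(path, map_location='cpu'))

restored.eval()
with torch.no_grad():
    same = torch.equal(trained(X), restored(X))
print(same)
```

The restored model produces the same outputs as the original because its parameters are identical.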

For more information and alternative techniques, see the PyTorch tutorial on saving and loading models.

In this blog post, you learned how to use PyTorch to load data; create, train, and test a neural network; and make a prediction. You didn’t just cover these topics on the surface — you went deeper and learned about the details of PyTorch’s automatic differentiation engine, gradient descent, and the backpropagation algorithm. Congratulations on this achievement! You’re now ready to apply what you’ve learned to your own data.

The complete code for this post can be found on GitHub.