Comparing PyTorch and TensorFlow implementations

Topic: Deep learning

Introduction

How do PyTorch code and TensorFlow code compare? Maybe you’re in the early stages of your machine learning journey and deciding which framework to embrace, or maybe you’re an experienced ML practitioner considering a change of framework. Either way, you’re in the right place.

If you’re interested in a high-level comparison between the frameworks, considering the popularity and capabilities of each, this article is a good resource. In this post, we’re going to dig a bit deeper and look at actual code. In my PyTorch, Keras, and TensorFlow posts, I explain how you can classify images from the Fashion MNIST dataset, while introducing key concepts of each of these machine learning frameworks. In this post, I’ll focus on comparing the code written in each of these frameworks.

I assume that you’re familiar with machine learning concepts, and that you’ve used at least one of the frameworks before. If you’ve read one of my three introductory posts (PyTorch, Keras, or TensorFlow), you’ll be well prepared to understand this post.

All the code shown in this post can be found on GitHub.

Getting the data

The Fashion MNIST dataset is a collection of 70,000 grayscale images of articles of clothing, each 28×28 pixels, along with corresponding labels. Each label is an integer from 0 to 9, with the following meaning:

labels_map = {
    0: 'T-Shirt',
    1: 'Trouser',
    2: 'Pullover',
    3: 'Dress',
    4: 'Coat',
    5: 'Sandal',
    6: 'Shirt',
    7: 'Sneaker',
    8: 'Bag',
    9: 'Ankle Boot',
}

The PyTorch and TensorFlow versions of the code that loads the Fashion MNIST dataset appear very different, but their behavior is in fact quite similar.

PyTorch
def _get_data(batch_size: int) -> Tuple[DataLoader, DataLoader]:
    """Downloads Fashion MNIST data, and returns two DataLoader objects
    wrapping test and training data."""
    training_data = datasets.FashionMNIST(
        root=DATA_PATH,
        train=True,
        download=True,
        transform=ToTensor(),
    )

    test_data = datasets.FashionMNIST(
        root=DATA_PATH,
        train=False,
        download=True,
        transform=ToTensor(),
    )

    train_dataloader = DataLoader(training_data,
                                  batch_size=batch_size,
                                  shuffle=True)
    test_dataloader = DataLoader(test_data, batch_size=batch_size, shuffle=True)

    return (train_dataloader, test_dataloader)
TensorFlow
def _get_data(batch_size: int) -> Tuple[tf.data.Dataset, tf.data.Dataset]:
    """Downloads Fashion MNIST data, and returns two Dataset objects
    wrapping test and training data."""
    (training_images, training_labels), (
        test_images, test_labels) = tf.keras.datasets.fashion_mnist.load_data()

    train_dataset = tf.data.Dataset.from_tensor_slices(
        (training_images, training_labels))
    test_dataset = tf.data.Dataset.from_tensor_slices(
        (test_images, test_labels))

    train_dataset = train_dataset.map(
        lambda image, label: (tf.cast(image, tf.float32) / 255.0, label))
    test_dataset = test_dataset.map(
        lambda image, label: (tf.cast(image, tf.float32) / 255.0, label))

    train_dataset = train_dataset.batch(batch_size).shuffle(500)
    test_dataset = test_dataset.batch(batch_size).shuffle(500)

    return (train_dataset, test_dataset)

PyTorch and TensorFlow both make popular datasets easily available to their users, including the Fashion MNIST dataset. PyTorch exposes it through the torchvision.datasets.FashionMNIST class, and TensorFlow through the tensorflow.keras.datasets.fashion_mnist module.

In both PyTorch and TensorFlow, we apply the same transformation to the image pixels, converting from integer values between 0 and 255 to floating-point values between 0 and 1. In PyTorch we use the built-in ToTensor data transformation, while in TensorFlow we write the transformation code explicitly. Also, we shuffle the data in both frameworks — in PyTorch we specify our shuffle preferences to the DataLoader, and in TensorFlow we call the shuffle function on the Dataset.
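
To make the pixel transformation concrete, here is a minimal framework-free sketch of the scaling step. The scale_pixels helper and the sample values are made up for illustration. It only shows the arithmetic and leaves out everything else that ToTensor does, such as rearranging dimensions.

```python
def scale_pixels(pixels):
    """Map integer pixel intensities in [0, 255] to floats in [0.0, 1.0]."""
    return [p / 255.0 for p in pixels]


row = [0, 51, 255]           # three made-up pixel intensities
scaled = scale_pixels(row)   # darkest pixel -> 0.0, brightest -> 1.0
print(scaled)
```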

In both cases, we obtain an iterable object that lets us iterate over batches of data of a specified size. In PyTorch, we obtain a DataLoader instance, while in TensorFlow we obtain a Dataset instance. For most practical purposes they work the same way: if we iterate over them using a for loop, we get a batch of batch_size items (specified as a parameter) on each iteration, until we’ve gone through the full dataset. The final batch may be smaller when the dataset size isn’t a multiple of batch_size. We’ll see code for this later, in the training section.
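
The batching behavior described above can be sketched without either framework. The iterate_batches generator below is a hypothetical stand-in, not the DataLoader or Dataset implementation. It assumes the default settings of both, where a final partial batch is kept rather than dropped.

```python
def iterate_batches(items, batch_size):
    """Yield consecutive slices of `items` holding at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


samples = list(range(10))   # ten made-up items standing in for images
batch_sizes = [len(batch) for batch in iterate_batches(samples, 4)]
print(batch_sizes)          # [4, 4, 2]
```

With the 60,000 Fashion MNIST training images and a batch size of 64, the same logic gives 937 full batches followed by one batch of 32.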

Creating the model

For model creation, we’ll look at PyTorch, the higher-level TensorFlow Keras API, and lower-level TensorFlow.

The differences between PyTorch and Keras are mainly cosmetic. PyTorch’s Linear layers are known as Dense layers in Keras. PyTorch Linear layers require both the number of inputs and the number of outputs, while Keras Dense layers only need the number of outputs (they infer the number of inputs from the outputs of previous layers). And PyTorch requires a Linear layer and its associated activation function to be specified as separate layers, while Keras’ Dense layer permits you to specify an activation function by name as a convenience.

PyTorch
class NeuralNetwork(nn.Module):
    """Neural network that classifies Fashion MNIST-style images."""

    def __init__(self):
        super().__init__()
        self.sequence = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 20),
                                      nn.ReLU(), nn.Linear(20, 10))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y_prime = self.sequence(x)
        return y_prime
Keras
class NeuralNetwork(tf.keras.Model):
    """Neural network that classifies Fashion MNIST-style images."""

    def __init__(self):
        super().__init__()
        self.sequence = tf.keras.Sequential([
            tf.keras.layers.Flatten(input_shape=(28, 28)),
            tf.keras.layers.Dense(20, activation='relu'),
            tf.keras.layers.Dense(10)
        ])

    def call(self, x: tf.Tensor) -> tf.Tensor:
        y_prime = self.sequence(x)
        return y_prime
TensorFlow
class NeuralNetwork(tf.keras.Model):
    """Neural network that classifies Fashion MNIST-style images."""

    def __init__(self):
        super().__init__()
        initializer = tf.keras.initializers.GlorotUniform()
        self.w1 = tf.Variable(initializer(shape=(784, 20)))
        self.b1 = tf.Variable(tf.zeros(shape=(20,)))
        self.w2 = tf.Variable(initializer(shape=(20, 10)))
        self.b2 = tf.Variable(tf.zeros(shape=(10,)))

    def call(self, x: tf.Tensor) -> tf.Tensor:
        x = tf.reshape(x, [-1, 784])
        x = tf.matmul(x, self.w1) + self.b1
        x = tf.nn.relu(x)
        x = tf.matmul(x, self.w2) + self.b2
        return x
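
As a rough cross-check of the layer sizes used in the three definitions above, the sketch below counts the trainable parameters and computes the sampling range of the Glorot uniform initializer. Both helper names are my own. The formula is the standard Glorot (Xavier) uniform limit, which I'm assuming matches what tf.keras.initializers.GlorotUniform computes for a 2D weight matrix.

```python
import math


def dense_param_count(n_inputs, n_outputs):
    """Weights (n_inputs * n_outputs) plus one bias per output unit."""
    return n_inputs * n_outputs + n_outputs


def glorot_uniform_limit(n_inputs, n_outputs):
    """Glorot uniform draws weights from the range [-limit, limit]."""
    return math.sqrt(6.0 / (n_inputs + n_outputs))


layer_shapes = [(28 * 28, 20), (20, 10)]   # (inputs, outputs) of the two layers
total_params = sum(dense_param_count(i, o) for i, o in layer_shapes)
first_limit = glorot_uniform_limit(*layer_shapes[0])
print(total_params)            # 15910
print(round(first_limit, 4))   # 0.0864
```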

Defining the model using lower-level TensorFlow components looks a bit different from the two other implementations because we’re being explicit about the calculations that happen under the hood. We need to define our own weight and bias parameters as tf.Variables, and we need to manually perform the additions, multiplications and activation function calls that are encapsulated in the PyTorch and Keras layers. Still, for an example as simple as this, the code isn’t that complicated.
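
To show how little is hidden inside those layers, here is the same multiply, add, and ReLU sequence in plain Python for a single input vector. The tiny 2-2-1 network and its weights are made up so the arithmetic is easy to follow by hand. The helper names are hypothetical and nothing here is framework code.

```python
def matvec_plus_bias(x, w, b):
    """Return x @ w + b for a vector x, a weight matrix w (one row per input), and a bias b."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) + b[j]
            for j in range(len(b))]


def relu(values):
    """Replace negative values with zero."""
    return [max(0.0, v) for v in values]


def forward(x, w1, b1, w2, b2):
    """Two dense layers with a ReLU in between, like the models above."""
    hidden = relu(matvec_plus_bias(x, w1, b1))
    return matvec_plus_bias(hidden, w2, b2)


# Made-up parameters: 2 inputs -> 2 hidden units -> 1 output.
w1 = [[1.0, -1.0],
      [0.5, -2.0]]
b1 = [0.0, 1.0]
w2 = [[2.0],
      [3.0]]
b2 = [0.5]

logits = forward([1.0, 2.0], w1, b1, w2, b2)
print(logits)   # [4.5]: the second hidden unit is negative, so ReLU zeroes it
```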

In both PyTorch and TensorFlow, you can create your own layers with custom behavior, if the provided ones don’t work for your purposes. And in both frameworks, you can create a model by writing your own class, or you can just use the framework’s built-in Sequential class directly as your model. I show how to use your own class above, which permits breakpoints within the forward or call method.

Training and testing the neural network

The code to train a neural network is a bit different in these three frameworks. Let’s start by comparing the code to train the network on a single batch of data.

PyTorch
def _fit_one_batch(x: torch.Tensor, y: torch.Tensor, model: NeuralNetwork,
                   loss_fn: CrossEntropyLoss,
                   optimizer: Optimizer) -> Tuple[torch.Tensor, torch.Tensor]:
    """Trains a single minibatch (backpropagation algorithm)."""
    y_prime = model(x)
    loss = loss_fn(y_prime, y)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    return (y_prime, loss)
TensorFlow
@tf.function
def _fit_one_batch(
        x: tf.Tensor, y: tf.Tensor, model: tf.keras.Model,
        loss_fn: tf.keras.losses.Loss, optimizer: tf.keras.optimizers.Optimizer
) -> Tuple[tf.Tensor, tf.Tensor]:
    """Trains a single minibatch (backpropagation algorithm)."""
    with tf.GradientTape() as tape:
        y_prime = model(x, training=True)
        loss = loss_fn(y, y_prime)

    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))

    return (y_prime, loss)

In both PyTorch and TensorFlow, we need the same four steps required to execute the backpropagation algorithm: we do one forward pass by executing the model, we call the loss function, we do one backward pass to calculate the gradients (in loss.backward and tape.gradient), and we take a gradient descent step by applying the calculated gradients (in optimizer.step and optimizer.apply_gradients). The PyTorch call to optimizer.zero_grad is just a technicality — gradients get accumulated when calculated, and since PyTorch doesn’t clear them at the end of a backward pass, we need to do that manually.
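
The accumulation behavior is easier to see in a toy model. The sketch below is plain Python, not PyTorch. ToyParameter, backward, and sgd_step are hypothetical names that mimic a single scalar weight trained on a squared error.

```python
class ToyParameter:
    """A scalar parameter whose gradient accumulates across backward passes."""

    def __init__(self, value):
        self.value = value
        self.grad = 0.0


def backward(param, x, target):
    """Add d/dw of the squared error (w * x - target) ** 2 to param.grad."""
    param.grad += 2.0 * (param.value * x - target) * x


def sgd_step(param, learning_rate):
    """Move the parameter against its gradient."""
    param.value -= learning_rate * param.grad


w = ToyParameter(0.0)
backward(w, x=1.0, target=2.0)
backward(w, x=1.0, target=2.0)   # no reset in between, so the gradients add up
accumulated_grad = w.grad        # -8.0, twice the single-pass gradient of -4.0

w.grad = 0.0                     # the role played by optimizer.zero_grad()
backward(w, x=1.0, target=2.0)
sgd_step(w, learning_rate=0.1)   # w moves from 0.0 toward the target
```

Without the reset, every step would use the sum of all gradients computed so far instead of the gradient for the current batch.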

Notice that in TensorFlow, the forward pass needs to be executed within the tf.GradientTape context. That’s because, unlike PyTorch, TensorFlow does not record forward operations automatically for playback during differentiation. Therefore we need to explicitly tell TensorFlow to hold on to all the data it will later need to calculate derivatives.
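
To give some intuition for what "recording operations for playback" means, here is a toy gradient tape for scalars. It is only a sketch of the general idea, reverse-mode differentiation over a recorded list of operations, and not how TensorFlow implements tf.GradientTape. Value and ToyTape are hypothetical names.

```python
class Value:
    """A scalar wrapped in an object so the tape can refer to it."""

    def __init__(self, data):
        self.data = data


class ToyTape:
    """Records scalar additions and multiplications for later differentiation."""

    def __init__(self):
        # Each entry: (output, [(input, d_output/d_input), ...]).
        self.operations = []

    def mul(self, a, b):
        out = Value(a.data * b.data)
        self.operations.append((out, [(a, b.data), (b, a.data)]))
        return out

    def add(self, a, b):
        out = Value(a.data + b.data)
        self.operations.append((out, [(a, 1.0), (b, 1.0)]))
        return out

    def gradient(self, target, source):
        """Replay the recording backwards, applying the chain rule."""
        grads = {id(target): 1.0}
        for out, inputs in reversed(self.operations):
            upstream = grads.get(id(out), 0.0)
            for inp, local_derivative in inputs:
                grads[id(inp)] = (grads.get(id(inp), 0.0)
                                  + upstream * local_derivative)
        return grads.get(id(source), 0.0)


tape = ToyTape()
x = Value(2.0)
three = Value(3.0)
y = tape.add(tape.mul(x, x), tape.mul(three, x))   # y = x^2 + 3x
dy_dx = tape.gradient(y, x)                        # 2x + 3 at x = 2, so 7.0

untracked = Value(x.data * x.data)                 # computed outside the tape
```

Because untracked was never routed through the tape, the toy has nothing to replay for it and reports a gradient of 0.0. This is the same reason the forward pass has to run inside the tf.GradientTape context.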

Notice also the @tf.function decorator in the TensorFlow code. This tells TensorFlow to trace the function into a graph and run it in “graph execution” mode, instead of the “eager execution” mode that is used by default. In graph execution mode, we cannot set breakpoints as usual, so we typically develop the code in eager execution mode and add the decorator only when the code is complete. However, code that has been compiled into a static graph usually runs faster and can be exported to run in environments without Python, which enables deployment to production scenarios such as embedded devices.

It’s worth mentioning that PyTorch can also take advantage of the benefits of a static graph with the help of TorchScript, which ships as part of PyTorch. We won’t go into the details of TorchScript in this post.

Now let’s look at the code that trains the network on the entire dataset by repeatedly calling the _fit_one_batch(...) function.

PyTorch
def _fit(device: str, dataloader: DataLoader, model: nn.Module,
         loss_fn: CrossEntropyLoss,
         optimizer: Optimizer) -> Tuple[float, float]:
    """Trains the given model for a single epoch."""
    loss_sum = 0
    correct_item_count = 0
    item_count = 0

    # Used for printing only.
    batch_count = len(dataloader)
    print_every = 100

    model.to(device)
    model.train()

    for batch_index, (x, y) in enumerate(dataloader):
        x = x.float().to(device)
        y = y.long().to(device)

        (y_prime, loss) = _fit_one_batch(x, y, model, loss_fn, optimizer)

        correct_item_count += (y_prime.argmax(1) == y).sum().item()
        # loss is the batch mean, so weight it by the batch size.
        loss_sum += loss.item() * len(x)
        item_count += len(x)

        # Printing progress.
        if ((batch_index + 1) % print_every == 0) or ((batch_index + 1)
                                                      == batch_count):
            accuracy = correct_item_count / item_count
            average_loss = loss_sum / item_count
            print(f'[Batch {batch_index + 1:>3d} - {item_count:>5d} items] ' +
                  f'loss: {average_loss:>7f}, ' +
                  f'accuracy: {accuracy*100:>0.1f}%')

    average_loss = loss_sum / item_count
    accuracy = correct_item_count / item_count

    return (average_loss, accuracy)
TensorFlow
def _fit(dataset: tf.data.Dataset, model: tf.keras.Model,
         loss_fn: tf.keras.losses.Loss,
         optimizer: tf.optimizers.Optimizer) -> Tuple[float, float]:
    """Trains the given model for a single epoch."""
    loss_sum = 0
    correct_item_count = 0
    item_count = 0

    # Used for printing only.
    batch_count = len(dataset)
    print_every = 100

    for batch_index, (x, y) in enumerate(dataset):
        x = tf.cast(x, tf.float32)
        y = tf.cast(y, tf.int64)

        (y_prime, loss) = _fit_one_batch(x, y, model, loss_fn, optimizer)

        correct_item_count += (tf.math.argmax(y_prime,
                                              axis=1) == y).numpy().sum()
        # loss is the batch mean, so weight it by the batch size.
        loss_sum += loss.numpy() * len(x)
        item_count += len(x)

        # Printing progress.
        if ((batch_index + 1) % print_every == 0) or ((batch_index + 1)
                                                      == batch_count):
            accuracy = correct_item_count / item_count
            average_loss = loss_sum / item_count
            print(f'[Batch {batch_index + 1:>3d} - {item_count:>5d} items] ' +
                  f'loss: {average_loss:>7f}, ' +
                  f'accuracy: {accuracy*100:>0.1f}%')

    average_loss = loss_sum / item_count
    accuracy = correct_item_count / item_count

    return (average_loss, accuracy)

The _fit(...) functions above simply iterate through the iterators we created earlier (the PyTorch DataLoader or the TensorFlow Dataset), which give us a batch of data in each iteration. We then pass the batch to the _fit_one_batch(...) function that we saw earlier. The rest of the code just keeps track of certain metrics and prints them as the training progresses.
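
Here is one hedged way to combine per-batch results into epoch-level metrics. It assumes the loss function returns the mean loss of each batch, which is the default reduction of both nn.CrossEntropyLoss and SparseCategoricalCrossentropy. The summarize_epoch helper and its input tuples are made up for illustration.

```python
def summarize_epoch(batch_results):
    """Combine (mean_loss, correct_count, batch_size) tuples into epoch metrics."""
    loss_total = 0.0
    correct_total = 0
    item_total = 0
    for mean_loss, correct_count, batch_size in batch_results:
        loss_total += mean_loss * batch_size   # undo the per-batch averaging
        correct_total += correct_count
        item_total += batch_size
    return loss_total / item_total, correct_total / item_total


# Two made-up batches: 4 items with mean loss 1.0, then 2 items with mean loss 2.0.
results = [(1.0, 3, 4), (2.0, 1, 2)]
average_loss, accuracy = summarize_epoch(results)
```

Weighting each batch mean by its size keeps a smaller final batch from counting as much as a full one.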

Next, let’s take a look at the code that evaluates our model’s performance on a single batch.

PyTorch
def _evaluate_one_batch(
        x: torch.Tensor, y: torch.Tensor, model: NeuralNetwork,
        loss_fn: CrossEntropyLoss) -> Tuple[torch.Tensor, torch.Tensor]:
    """Evaluates a single minibatch."""
    with torch.no_grad():
        y_prime = model(x)
        loss = loss_fn(y_prime, y)

    return (y_prime, loss)
TensorFlow
@tf.function
def _evaluate_one_batch(
        x: tf.Tensor, y: tf.Tensor, model: tf.keras.Model,
        loss_fn: tf.keras.losses.Loss) -> Tuple[tf.Tensor, tf.Tensor]:
    """Evaluates a single minibatch."""
    y_prime = model(x, training=False)
    loss = loss_fn(y, y_prime)

    return (y_prime, loss)

While evaluating a batch, we only need to do a forward pass through the network to obtain a prediction, and call the loss function to evaluate it. We don’t need to calculate the derivatives in a backward pass. Since PyTorch records the operations needed for differentiation by default, this time we wrap the forward pass in a torch.no_grad() context to avoid the cost of that unneeded bookkeeping. There’s no need to change the TensorFlow code, since TensorFlow only records operations that run inside a tf.GradientTape context.

Similarly to the _fit(...) function, in both PyTorch and TensorFlow, the _evaluate(...) function iterates through each batch of data and calls the _evaluate_one_batch(...) function. Most of the remaining code below accumulates the loss and accuracy metrics.

PyTorch
def _evaluate(device: str, dataloader: DataLoader, model: nn.Module,
              loss_fn: CrossEntropyLoss) -> Tuple[float, float]:
    """Evaluates the given model for the whole dataset once."""
    loss_sum = 0
    correct_item_count = 0
    item_count = 0

    model.to(device)
    model.eval()

    with torch.no_grad():
        for (x, y) in dataloader:
            x = x.float().to(device)
            y = y.long().to(device)

            (y_prime, loss) = _evaluate_one_batch(x, y, model, loss_fn)

            correct_item_count += (y_prime.argmax(1) == y).sum().item()
            # loss is the batch mean, so weight it by the batch size.
            loss_sum += loss.item() * len(x)
            item_count += len(x)

        average_loss = loss_sum / item_count
        accuracy = correct_item_count / item_count

    return (average_loss, accuracy)
TensorFlow
def _evaluate(dataset: tf.data.Dataset, model: tf.keras.Model,
              loss_fn: tf.keras.losses.Loss) -> Tuple[float, float]:
    """Evaluates the given model for the whole dataset once."""
    loss_sum = 0
    correct_item_count = 0
    item_count = 0

    for (x, y) in dataset:
        x = tf.cast(x, tf.float32)
        y = tf.cast(y, tf.int64)

        (y_prime, loss) = _evaluate_one_batch(x, y, model, loss_fn)

        correct_item_count += (tf.math.argmax(
            y_prime, axis=1).numpy() == y.numpy()).sum()
        # loss is the batch mean, so weight it by the batch size.
        loss_sum += loss.numpy() * len(x)
        item_count += len(x)

    average_loss = loss_sum / item_count
    accuracy = correct_item_count / item_count
    return (average_loss, accuracy)

Now that we have code that trains and evaluates our model using the entire dataset, let’s call it. We want to feed the full dataset into the neural network multiple times (or “epochs”) for training, and just once for evaluation. The code below shows how to do that using PyTorch, TensorFlow, and Keras. While the PyTorch and TensorFlow versions call the _fit(...) and _evaluate(...) functions presented earlier, the Keras version takes advantage of built-in methods model.fit and model.evaluate. If you plan on using the usual training and evaluation loops, Keras will save you from writing a bunch of code, since you won’t need to provide _fit(...), _fit_one_batch(...), _evaluate(...), and _evaluate_one_batch(...) functions.

PyTorch
def training_phase(device: str):
    """Trains the model for a number of epochs, and saves it."""
    learning_rate = 0.1
    batch_size = 64
    epochs = 5

    (train_dataloader, test_dataloader) = _get_data(batch_size)

    model = NeuralNetwork()

    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)

    print('\n***Training***')
    for epoch in range(epochs):
        print(f'\nEpoch {epoch + 1}\n-------------------------------')
        (train_loss, train_accuracy) = _fit(device, train_dataloader, model,
                                            loss_fn, optimizer)
        print(f'Train loss: {train_loss:>8f}, ' +
              f'train accuracy: {train_accuracy * 100:>0.1f}%')

    print('\n***Evaluating***')
    (test_loss, test_accuracy) = _evaluate(device, test_dataloader, model,
                                           loss_fn)
    print(f'Test loss: {test_loss:>8f}, ' +
          f'test accuracy: {test_accuracy * 100:>0.1f}%')

    torch.save(model.state_dict(), WEIGHTS_PATH)
TensorFlow
def training_phase():
    """Trains the model for a number of epochs, and saves it."""
    learning_rate = 0.1
    batch_size = 64
    epochs = 5

    (train_dataset, test_dataset) = _get_data(batch_size)

    model = NeuralNetwork()

    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    optimizer = tf.optimizers.SGD(learning_rate)

    print('\n***Training***')
    t_begin = time.time()

    for epoch in range(epochs):
        print(f'\nEpoch {epoch + 1}\n-------------------------------')
        (train_loss, train_accuracy) = _fit(train_dataset, model, loss_fn,
                                            optimizer)
        print(f'Train loss: {train_loss:>8f}, ' +
              f'train accuracy: {train_accuracy * 100:>0.1f}%')

    t_elapsed = time.time() - t_begin
    print(f'\nTime per epoch: {t_elapsed / epochs :>.3f} sec')

    print('\n***Evaluating***')
    (test_loss, test_accuracy) = _evaluate(test_dataset, model, loss_fn)
    print(f'Test loss: {test_loss:>8f}, ' +
          f'test accuracy: {test_accuracy * 100:>0.1f}%')

    model.save_weights(WEIGHTS_PATH)
Keras
def training_phase():
    """Trains the model for a number of epochs, and saves it."""
    learning_rate = 0.1
    batch_size = 64
    epochs = 5

    (train_dataset, test_dataset) = _get_data(batch_size)

    model = NeuralNetwork()

    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
    optimizer = tf.keras.optimizers.SGD(learning_rate)
    metrics = ['accuracy']
    model.compile(optimizer=optimizer, loss=loss_fn, metrics=metrics)

    print('\n***Training***')
    model.fit(train_dataset, epochs=epochs)

    print('\n***Evaluating***')
    (test_loss, test_accuracy) = model.evaluate(test_dataset)
    print(f'Test loss: {test_loss:>8f}, ' +
          f'test accuracy: {test_accuracy * 100:>0.1f}%')

    model.save(WEIGHTS_PATH)
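
Both loss objects used above take the raw outputs of the network (logits) together with integer class labels. nn.CrossEntropyLoss does so by design, and SparseCategoricalCrossentropy does so because we pass from_logits=True. Note that the argument order differs: the PyTorch loss is called with the predictions first, and the Keras loss with the labels first. The plain-Python sketch below shows the calculation I understand both to perform for a single example. The function names are my own.

```python
import math


def cross_entropy_from_logits(logits, label):
    """Negative log-probability of `label` under softmax(logits)."""
    peak = max(logits)   # subtracting the max keeps exp() from overflowing
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[label]


def mean_cross_entropy(batch_logits, batch_labels):
    """Average the per-example losses over a batch."""
    losses = [cross_entropy_from_logits(logits, label)
              for logits, label in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)


uniform_loss = cross_entropy_from_logits([0.0, 0.0], 0)      # ln(2): a coin flip
confident_right = cross_entropy_from_logits([10.0, 0.0], 0)  # close to zero
confident_wrong = cross_entropy_from_logits([10.0, 0.0], 1)  # close to 10
batch_loss = mean_cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1])
```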

Making a prediction

As you can see below, the code to make a prediction is very similar in PyTorch and TensorFlow. The main difference is the additional torch.no_grad() call in the PyTorch code, which we’ve already covered. Also, just like before, we annotate the TensorFlow function with @tf.function when we’re done debugging, to get the benefits of graph execution.

PyTorch
def _predict(model: nn.Module, x: torch.Tensor, device: str) -> np.ndarray:
    """Makes a prediction for input x."""
    model.to(device)
    model.eval()

    x = torch.from_numpy(x).float().to(device)

    with torch.no_grad():
        y_prime = model(x)
        probabilities = nn.functional.softmax(y_prime, dim=1)
        predicted_indices = probabilities.argmax(1)
    return predicted_indices.cpu().numpy()
TensorFlow
@tf.function
def _predict(model: tf.keras.Model, x: np.ndarray) -> tf.Tensor:
    """Makes a prediction for input x."""
    y_prime = model(x, training=False)
    probabilities = tf.nn.softmax(y_prime, axis=1)
    predicted_indices = tf.math.argmax(input=probabilities, axis=1)
    return predicted_indices
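
Both _predict(...) functions turn logits into probabilities with softmax and then pick the most likely class with argmax. Here is a small standard-library sketch of those two steps, using made-up logits and hypothetical helper names.

```python
import math


def softmax(logits):
    """Convert logits into probabilities that sum to 1."""
    peak = max(logits)   # subtracting the max keeps exp() from overflowing
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


logits = [1.5, -0.3, 4.2, 0.0]   # made-up scores for four classes
probabilities = softmax(logits)
predicted_index = argmax(probabilities)
```

Softmax preserves the ordering of its inputs, so taking argmax of the logits directly would give the same index. The softmax step only matters if you also want the probabilities.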

In addition, you’ll notice that the PyTorch code contains a model.eval() line of code, and that the TensorFlow code passes training=False to the model. These both tell the corresponding models to execute in inference mode, which changes the behavior of some layers (for example, dropout and batch normalization layers).
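
The network in this post contains neither dropout nor batch normalization, so switching modes doesn't change its output. To illustrate why the switch matters in general, here is a toy version of "inverted" dropout, the variant I'm assuming both frameworks implement. The function and its parameters are hypothetical and not taken from either library.

```python
import random


def dropout(values, rate, training, rng):
    """Inverted dropout: active only while training, identity at inference."""
    if not training:
        return list(values)
    keep_probability = 1.0 - rate
    # Zero each value with probability `rate`, and scale the survivors up so
    # the expected value of each activation stays the same.
    return [v / keep_probability if rng.random() < keep_probability else 0.0
            for v in values]


rng = random.Random(0)
activations = [1.0, 2.0, 3.0, 4.0]
train_output = dropout(activations, rate=0.5, training=True, rng=rng)
eval_output = dropout(activations, rate=0.5, training=False, rng=rng)
```

In training mode each activation is either zeroed or doubled, while in inference mode the activations pass through unchanged. Forgetting model.eval() or training=False would make predictions randomly noisy for a model with such a layer.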

The PyTorch and TensorFlow versions of our inference_phase function can now obtain a predicted label by passing an image of an ankle boot to the appropriate version of the _predict(...) function above.

PyTorch
def inference_phase(device: str):
    """Makes a prediction for a local image."""
    print('\n***Predicting***')

    model = NeuralNetwork()
    model.load_state_dict(torch.load(WEIGHTS_PATH))

    with Image.open(IMAGE_PATH) as image:
        x = np.asarray(image).reshape((-1, 28, 28)) / 255.0

    predicted_index = _predict(model, x, device)[0]
    predicted_class = labels_map[predicted_index]

    print(f'Predicted class: {predicted_class}')
TensorFlow
def inference_phase():
    """Makes a prediction for a local image."""
    print('\n***Predicting***')

    model = NeuralNetwork()
    model.load_weights(WEIGHTS_PATH)

    with Image.open(IMAGE_PATH) as image:
        x = np.asarray(image).reshape((-1, 28, 28)) / 255.0

    predicted_index = _predict(model, x).numpy()[0]
    predicted_name = labels_map[predicted_index]

    print(f'Predicted class: {predicted_name}')
Keras
def inference_phase():
    """Makes a prediction for a local image."""
    print('\n***Predicting***')

    model = tf.keras.models.load_model(WEIGHTS_PATH)

    with Image.open(IMAGE_PATH) as image:
        x = np.asarray(image).reshape((-1, 28, 28)) / 255.0

    predicted_index = np.argmax(model.predict(x))
    predicted_name = labels_map[predicted_index]

    print(f'Predicted class: {predicted_name}')

In contrast with TensorFlow and PyTorch, the Keras version calls the built-in model.predict method instead of a custom-written _predict(...) function. Once again, if our scenario can be accomplished using the standard prediction routine, Keras requires less code.

Conclusion

In this post, you saw how to accomplish the same training scenario in PyTorch and TensorFlow. The PyTorch, Keras and TensorFlow code for this project can be found on GitHub.

Historically, PyTorch and TensorFlow have had very distinct strengths, with PyTorch being much easier to debug, and TensorFlow having much better production deployment capabilities. For this reason, PyTorch has been more popular in academic settings, while TensorFlow has led in industry. But in recent years the functionality gap has narrowed, with TensorFlow adding a debug-friendly eager execution mode and PyTorch improving its deployment offerings with TorchScript. I’m curious to see if and how the balance will shift in light of these new developments, and what the future will bring to both frameworks.