Bea Stollnitz - How to structure your machine learning projects using GitHub and VS Code

Introduction

Having a well-thought-out process to structure your machine learning projects enables you to create new GitHub repositories quickly, and encourages you to embrace elegant software architecture from the very beginning. In this post I’ll show you how I organize the files in my machine learning projects, and I’ll explain the reasoning behind each decision. Some of the information included here is specific to VS Code, but even if you prefer a different editor, you’ll still benefit from most of the content of this article.

The template I use to create new machine learning projects can be found on GitHub.

TLDR

If you want to start a new machine learning project from my GitHub template, navigate to the repo on GitHub, and click on the “Use this template” button. GitHub template repositories are super handy — they allow me and others to generate new repositories containing the same structure, branches, and files as the template!

Screenshot of the "Use this template" button.

The next page will guide you through the settings you need to choose for your project (such as a repository name and privacy settings):

Screenshot of the screen where you choose your project settings.

Once your repo is created, click on “Actions” in the top menu, and wait for the GitHub action to complete:

Screenshot of the completed GitHub Action.

When you see a green checkmark, your project is ready for you to start coding!

Keep reading if you want to understand the reasoning behind each file added to your project, and how the GitHub template was created.

Basic files

Let’s start by covering the most basic and essential files you’ll find in the project you created from the template:

.gitignore

The .gitignore file tells Git which files it should ignore when you commit changes to the repo. If you were creating a new repo without using a template, you’d have a chance to select a pre-configured .gitignore file. The file I added to the template is simply the file generated after I selected the pre-configured “Python” file at repo creation time:

Screenshot of the gitignore setting.

This is a long file, and I show just the beginning below:

https://github.com/bstollnitz/ml-template/blob/main/.gitignore

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
...
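
To get a feel for what these three patterns cover, here’s a minimal Python sketch that approximates the matching with the standard library’s fnmatch module. It is an illustration only: Git’s real matching rules are richer (negation, anchoring, nested .gitignore files), and the names PATTERNS and is_ignored are hypothetical helpers, not part of Git or of the template.

```python
from fnmatch import fnmatch
from pathlib import PurePosixPath

# The three patterns shown in the excerpt above.
PATTERNS = ["__pycache__/", "*.py[cod]", "*$py.class"]


def is_ignored(path: str) -> bool:
    """Roughly approximate Git's matching for the simple patterns above."""
    parts = PurePosixPath(path).parts
    if not parts:
        return False
    for pattern in PATTERNS:
        if pattern.endswith("/"):
            # A trailing slash means "a directory with this name", so look
            # for it among the parent directories of the path.
            if pattern.rstrip("/") in parts[:-1]:
                return True
        elif fnmatch(parts[-1], pattern):
            # A pattern without a slash is compared with the file name.
            return True
    return False


print(is_ignored("model.pyc"))
print(is_ignored("train.py"))
print(is_ignored("src/__pycache__/notes.txt"))
```

The second pattern is the one worth remembering: the bracket expression [cod] matches exactly one of the letters c, o, or d, so compiled files are ignored while the .py sources are still tracked.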

LICENSE

The LICENSE file tells others which legal license applies to the code in the current repo. This is also a setting that can be set when a new repo is created:

Screenshot of the LICENSE setting.

I generally choose the MIT license because it allows my blog readers to use my code for commercial purposes, which I know is a concern for developers working for larger companies with more processes in place. Adding the right license for your code is extremely important — it may be the key factor that protects your company’s IP and keeps it in business, or that enables you to contribute to a high-impact open-source project or commercial product.

README.md

The README.md file contains information about the code in the current repo. I left this file blank in my template, but I still wanted it to be present so that I’m reminded to add content to it when I use the template to create a new project. As with the other two files, you can create this file by simply checking a checkbox when you create a new repo:

Screenshot of the README setting.

environment.yml

I use Miniconda as my environment manager for Python. Since my projects always include a conda-style environment.yml file containing all the packages needed to run my code, it makes sense to add this file to my template. I included in this file a few packages that I use pretty much every time I create a machine learning project: Python, PyTorch, Yapf and Pylint — I’ll talk about the last two in detail in the next section.

https://github.com/bstollnitz/ml-template/blob/main/environment.yml

name: <GITHUB_REPOSITORY>
channels:
  - anaconda
  - pytorch
  - conda-forge
dependencies:
  - python==3.10.4
  - pytorch==1.11.0
  - yapf==0.31.0
  - pylint==2.12.2

The versions in the snippet above are the ones I use at the time of writing. As part of my workflow, the next time I update my packages to a new set of compatible versions, I’ll make that update in this template. Since all my future projects will use the template, they will all have the updated versions.

If you look at the environment.yml file in the project you created from the template, you’ll notice that the name of your conda environment matches your repository name, instead of the <GITHUB_REPOSITORY> token I added in the GitHub template. I’ll come back to this later, when I explain how the template was created.

You can install and activate a conda environment with the following commands:

conda env create -f environment.yml
conda activate <name of your environment>

You can read more about common conda commands in the documentation.
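
If you’d rather not look up the environment name by hand, here’s a small Python sketch that reads it from the file and builds the two commands. It assumes the name is stored on a top-level "name:" line, as in the template, and it uses plain string handling rather than a YAML parser. The function names read_env_name and conda_commands are hypothetical. The -f flag tells conda which environment file to use.

```python
from pathlib import Path


def read_env_name(env_file: str) -> str:
    """Return the value of the top-level `name:` line of an environment file."""
    for line in Path(env_file).read_text(encoding="utf-8").splitlines():
        if line.startswith("name:"):
            return line.split(":", 1)[1].strip()
    raise ValueError(f"No top-level 'name:' entry found in {env_file}")


def conda_commands(env_file: str = "environment.yml") -> list:
    """Build the commands that create and then activate the environment."""
    name = read_env_name(env_file)
    return [f"conda env create -f {env_file}", f"conda activate {name}"]


if __name__ == "__main__":
    # Prints the commands only when an environment file is actually present.
    if Path("environment.yml").exists():
        for command in conda_commands():
            print(command)
```

The sketch only prints the commands; you still run them yourself in a terminal.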

Linting and automatic formatting

A good choice of linter and formatter makes a world of difference when writing Python code in VS Code!

A typical Python linter analyzes your source code, verifies that it follows the PEP8 official style guide for Python, and warns you of any instances where it doesn’t. The PEP8 style guide provides guidance on matters such as indentation, maximum line length, and variable naming conventions. I like to use Pylint as my linter because in addition to PEP8 style checks, it also does error checking — detecting when I’ve used a module without importing it, for example. Pylint is the most popular linter for Python at the time I’m writing this.

A typical Python formatter auto-formats your Python code according to the PEP8 standard. For example, imagine that you have a line of code that’s longer than the maximum line length recommended by PEP8. Running the linter will give you a warning, but it won’t fix the issue for you. That’s where the formatter comes in: when you run it, it breaks up the code onto multiple lines automatically. I like to use YAPF written by Google, because in addition to making sure my code conforms to PEP8, it also makes it look good. In the example I mentioned, YAPF won’t just break up the line so that it doesn’t violate PEP8’s max line length, it also breaks it up so that it’s as easy as possible to read.
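
To make the difference concrete, here’s a toy Python sketch of the "warn, but don’t fix" half of that story. It is not how Pylint works internally. It’s just a hypothetical find_long_lines helper that reports lines longer than PEP8’s 79-character limit, applied to a made-up call before and after wrapping. The wrapped version was written by hand in the spirit of what a formatter produces, so don’t read it as YAPF’s exact output.

```python
MAX_LINE_LENGTH = 79  # PEP8's recommended maximum


def find_long_lines(source: str, max_length: int = MAX_LINE_LENGTH) -> list:
    """Return (line_number, length) for every line longer than max_length."""
    return [
        (number, len(line))
        for number, line in enumerate(source.splitlines(), start=1)
        if len(line) > max_length
    ]


# A single long line of (hypothetical) code: a linter flags it and stops.
before = ("result = train_model(training_data, validation_data, "
          "learning_rate=0.001, epochs=20)\n")

# The same call wrapped by hand, similar in spirit to a formatter's output.
after = ("result = train_model(training_data,\n"
         "                     validation_data,\n"
         "                     learning_rate=0.001,\n"
         "                     epochs=20)\n")

print(find_long_lines(before))  # one entry: the line is too long
print(find_long_lines(after))   # no entries: every line fits
```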

The configuration for Pylint is typically done in a .pylintrc file. The .pylintrc file that I use follows the Google Python style guide, and can be downloaded from that same page. Before I was using a linter I was already following Google’s excellent style guide, so the addition of a linter brought great benefits for me.

Occasionally, I don’t want a particular rule to be enforced, so I disable it in one of two ways:

If I want it to be disabled for the whole project, I add it to the “disable” section in the .pylintrc file. For example:

https://github.com/bstollnitz/ml-template/blob/main/.pylintrc

disable=abstract-method,
    ...
    zip-builtin-not-iterating,

If I want it to be disabled just for a particular instance, I add a comment immediately above the line with the lint warning. For example:

https://github.com/bstollnitz/sindy/blob/main/sindy/lorenz-pysindy/src/2_fit.py

# pylint: disable=unused-import
from pysindy.differentiation import FiniteDifference, SINDyDerivative
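
One nuance worth knowing about these comments: Pylint treats a disable comment differently depending on where it sits. The sketch below shows both placements, using standard-library imports as stand-ins so that it runs anywhere. The imports themselves are arbitrary.

```python
# Trailing form: the message is silenced for this one line only.
import json  # pylint: disable=unused-import

# Standalone form: the message is silenced from this comment to the end of
# the enclosing scope, which here is the rest of the module...
# pylint: disable=unused-import
import math
import os

# ...unless it is switched back on explicitly.
# pylint: enable=unused-import
```

If you want the narrowest possible exception, the trailing form is the safer choice.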

The configuration for YAPF is done in a .style.yapf file. Here are the contents of this file in the GitHub template:

https://github.com/bstollnitz/ml-template/blob/main/.style.yapf

[style]
based_on_style = google

The Formatting style section of YAPF’s documentation lists the four base styles supported by YAPF: “pep8”, “google”, “yapf”, and “facebook”. I chose Google’s style because it follows the Google Python style guide — it’s a good idea for the linter and formatter to follow the same style guidelines. The YAPF docs contain a lot more information to further customize how you want YAPF to work.

How do you run the linter and formatter? You can kick them off manually from the command line, but the best way is to configure your editor to run them automatically. Keep reading to learn how I configure VS Code to do this.

Configuring VS Code settings

VS Code is highly customizable through the use of a wide range of settings. There are a few different locations where you can add these settings, though, and choosing the right place can make a big difference in how efficient you are at working across projects and machines.

Any time I want to add a new setting to VS Code, I choose one of the following three locations for the setting:

.devcontainer/devcontainer.json — GitHub Codespaces is a cloud-based development environment that runs your code in a container. This is a recent GitHub feature that I’m super excited about — I recommend that you check it out if you haven’t already! I add to this file settings that are specific to running the project in a GitHub codespace. For example: this is a good place to set the default Python interpreter path for the codespace.
.vscode/settings.json — I add to this file settings that are specific to the project, regardless of whether I run the project on my local machine or in a codespace. This is where I add my linter and formatter choices, as we’ll see soon.
VS Code user settings with “Settings Sync” — If I want a particular setting to apply to all projects across machines, I add it to my VS Code settings and enable “Settings Sync.”

I have another post where I explain in detail how I configure GitHub Codespaces. In the remainder of this section, I’ll cover the two other setting locations.

Let’s first look at .vscode/settings.json. I set my linter and formatter settings in this file within each project, because I may want to customize them per project. Storing these settings in each project also guarantees that my collaborators share the same settings, keeping the code consistently formatted. I don’t recommend including these settings in your devcontainer.json file because you typically want your development environment to be the same locally and in Codespaces.

Here are the contents of my .vscode/settings.json file in the GitHub template:

https://github.com/bstollnitz/ml-template/blob/main/.vscode/settings.json

{
    "python.linting.pylintEnabled": true,
    "python.formatting.provider": "yapf",
    "editor.rulers": [
        80
    ],
    "editor.formatOnSave": true,
}

The first two lines specify my choices of linter (Pylint) and formatter (YAPF). The third line instructs VS Code to display a thin vertical line at character 80, since the max line length recommended by PEP8 is 79 characters. This just helps me to visualize where my code should wrap. The fourth line tells VS Code to run YAPF every time I save my code. This is super handy! I can write my code without worrying about making it pretty, and a simple “Ctrl + S” formats it exactly the way I want it!
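
A side note in case you ever read this file from a script: VS Code treats its settings files as "JSON with Comments", which is why the trailing comma after the last entry is tolerated. Python’s json module is stricter and rejects it. Here’s a minimal sketch that works around this for simple files like this one. The load_settings helper is hypothetical, and its regular expression is deliberately naive: it doesn’t handle comments, and it would also alter a string value that happens to contain a comma followed by a closing bracket.

```python
import json
import re

SETTINGS_TEXT = """
{
    "python.linting.pylintEnabled": true,
    "python.formatting.provider": "yapf",
    "editor.rulers": [
        80
    ],
    "editor.formatOnSave": true,
}
"""


def load_settings(text: str) -> dict:
    """Parse settings text after dropping commas that precede } or ]."""
    cleaned = re.sub(r",\s*([}\]])", r"\1", text)
    return json.loads(cleaned)


settings = load_settings(SETTINGS_TEXT)
print(settings["python.formatting.provider"])
print(settings["editor.rulers"])
```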

Now let’s talk about your VS Code user settings. To change those settings, click on the button with a gear-shaped icon at the bottom-left of the VS Code window, and choose “Settings”:

Screenshot of the gear button with popup menu open and a "Settings" option highlighted.

Once you’ve opened the settings editor, you can browse and search for all VS Code settings. By default, any setting that you change in this editor will apply to VS Code on the machine you are currently using. For a consistent development experience across your local machine, Codespaces, and any other machine where you use VS Code, I recommend turning on “Settings Sync.” This enables all your settings to be associated with your GitHub (or Microsoft) account, and it causes them to sync every time you open VS Code as that user. You can turn on “Settings Sync” by clicking again on the VS Code gear button, and then “Turn on Settings Sync…”:

Screenshot of the gear button with popup menu open and a "Settings Sync" option highlighted.

You’ll then be taken to the following dialog, where you can configure which settings you want to sync. I like to keep all the checkboxes checked:

Screenshot of the "Settings Sync" options.

Next, click on the “Sign in & Turn on” button, select your account (GitHub or Microsoft), and your settings will be synced. If you have conflicts in settings from different machines, you’ll be given a chance to select which settings you want to prevail. In addition to seeing your preferences in the “Settings” editor, you can also see them in JSON format, by going to the Command Palette and selecting “Preferences: Open Settings (JSON).” I like to use the “Settings” tab to browse all the settings that are available to me, and the JSON file to quickly glance at all the settings I have customized.

My VS Code user settings are not specific to machine learning projects; they apply to every project! For example, this is where I set the color theme for the VS Code user interface ("workbench.colorTheme": "Default Dark+"), and where I instruct VS Code to show me differences in whitespace when comparing two files ("diffEditor.ignoreTrimWhitespace": false).

I’ve been a fan of Settings Sync for a while, because it enables me to re-install Visual Studio Code and immediately start working in a familiar environment. But with Codespaces, it’s more important than ever. It plays a big role in ensuring that my cloud environment feels as comfortable as my local one.

GitHub template

Creating a GitHub template from a project is easy: go to the project Settings, and check the “Template repository” checkbox:

Screenshot of the "Template repository" checkbox.

You can read more about this in the documentation.

Often GitHub templates contain GitHub Actions that customize the projects generated from the template in some way. A GitHub Action workflow is defined in a YAML file, and at a high level consists of a trigger event and a sequence of GitHub actions that execute when the event is triggered. The GitHub template associated with this post contains a GitHub Action workflow that is triggered when a new project is created using the template, and renames the conda environment and source code directory to match the repository name.

https://github.com/bstollnitz/ml-template/blob/main/.github/workflows/initial_setup.yml

# This GitHub Action workflow runs when a new repository is created from this
# template repo.

name: Initial Setup

on:
  push:
    branches:
      - main

env:
  REPO_NAME: ${{ github.event.repository.name }}

jobs:
  initial_setup:
    runs-on: ubuntu-latest
    if: ${{ github.event.created }}
    steps:
      # GitHub Action marketplace: https://github.com/marketplace/actions/checkout
      - name: Checkout
        uses: actions/checkout@v3

      # GitHub Action marketplace: https://github.com/marketplace/actions/find-and-replace
      - name: Replace token with repository name
        uses: jacobtomlinson/gha-find-replace@v2
        with:
          find: "<GITHUB_REPOSITORY>"
          replace: ${{ env.REPO_NAME }}
          exclude: "{.git,.github}/**"

      - name: Rename source code directory to repository name
        run: mv ml-template $REPO_NAME

      # GitHub Action marketplace: https://github.com/marketplace/actions/git-auto-commit
      - name: Commit all changes
        uses: stefanzweifel/git-auto-commit-action@v4
        with:
          commit_message: Initial repo setup
          commit_options: "--no-verify --signoff"

GitHub Action workflows are always added to a directory named .github/workflows. Our workflow has four sections: a name, a trigger event, a section for environment variables, and a jobs section containing the actions we want to trigger.

We want this workflow to be triggered the first time we push files to a project created from this template. We achieve this by specifying that we want the trigger event to be a push to the main branch, and then refining it with an if statement that says that we’re really just interested in the first push that creates the project. The if statement accesses the GitHub context, which contains information about the workflow run. Then it reads the context’s event property, which holds the full webhook payload of the triggering event (a push, in this case). Finally, it drills into the created key of the push payload, which is true for the first push of a repository.
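
If it helps to see that condition outside of YAML, here’s a minimal Python sketch of the same check. The payloads are hypothetical and heavily trimmed (a real push payload carries many more fields), and the function names are my own. The sketch keeps only the two pieces the workflow reads: the created flag and the repository name.

```python
def should_run_initial_setup(event: dict) -> bool:
    """Mirror the job condition `if: ${{ github.event.created }}`."""
    return bool(event.get("created", False))


def repo_name(event: dict) -> str:
    """Mirror the REPO_NAME value: `github.event.repository.name`."""
    return event["repository"]["name"]


# Hypothetical, trimmed-down push payloads for illustration.
first_push = {
    "ref": "refs/heads/main",
    "created": True,
    "repository": {"name": "my-ml-project"},
}
later_push = {
    "ref": "refs/heads/main",
    "created": False,
    "repository": {"name": "my-ml-project"},
}

for event in (first_push, later_push):
    print(repo_name(event), should_run_initial_setup(event))
```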

Once the event is triggered, we want to rename the ml-template folder (which will contain the source code) and the conda environment name to match the name of the repository. We achieve this with four steps within the job, which run sequentially:

The first step executes the Checkout action, which checks out the code in the new repository created from the template.
The second step replaces all instances of the token <GITHUB_REPOSITORY> with the repository name, using the jacobtomlinson/gha-find-replace action. In our scenario there’s just one instance of the token — it’s the name of the conda environment in the environment.yml file.
The third step renames the ml-template folder to the name of the repository. This is the folder where you’ll add your source code.
The fourth step commits all changes to the repo, using the stefanzweifel/git-auto-commit-action action.
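
To make the two middle steps less of a black box, here’s a rough local equivalent in Python. It is a sketch under a few assumptions: it is not the code of the gha-find-replace action, it only approximates the "{.git,.github}/**" exclusion by skipping those two top-level directories, and the function names are hypothetical. The checkout and commit steps are left out, since they only make sense inside a workflow run.

```python
from pathlib import Path

TOKEN = "<GITHUB_REPOSITORY>"
EXCLUDED_DIRS = {".git", ".github"}


def replace_token(root: Path, repo_name: str) -> list:
    """Replace TOKEN with repo_name in text files outside EXCLUDED_DIRS."""
    changed = []
    for path in sorted(root.rglob("*")):
        relative = path.relative_to(root)
        if not path.is_file() or relative.parts[0] in EXCLUDED_DIRS:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except UnicodeDecodeError:
            continue  # Skip files that are not UTF-8 text.
        if TOKEN in text:
            path.write_text(text.replace(TOKEN, repo_name), encoding="utf-8")
            changed.append(relative.as_posix())
    return changed


def rename_source_dir(root: Path, repo_name: str,
                      old_name: str = "ml-template") -> None:
    """Rename the source code directory, like the workflow's `mv` step."""
    (root / old_name).rename(root / repo_name)


def initial_setup(root: Path, repo_name: str) -> list:
    """Run the token replacement and then the directory rename."""
    changed = replace_token(root, repo_name)
    rename_source_dir(root, repo_name)
    return changed
```

Calling initial_setup(Path("."), "my-ml-project") in a fresh copy of the template would return the list of files it rewrote, which in this scenario should be just environment.yml.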

How did I know to use these particular actions? I found them all (except the rename command) by searching the GitHub Action marketplace, which I find so handy! No need to re-invent the wheel!

When you create your repository from the template, check the directory that’s created at the root level — its name should match the name you chose for your repository. And likewise for the name of the conda environment, which you can find in the environment.yml file. You’ll notice that the YAML file with the actions workflow is added to your repository. You can delete the whole .github/workflows directory at this point, if you want.

Conclusion

I hope that you’re inspired by the information I shared, and that you’ll use it to enhance your existing machine learning workflow. And if you have tricks of your own that improve on my current process, please do reach out! I would love to hear about them and update this post for everyone’s benefit!

Thank you for reading!