Topic: Development tools
Having a well-thought-out process to structure your machine learning projects enables you to create new GitHub repositories quickly, and encourages you to embrace elegant software architecture from the very beginning. In this post I’ll show you how I organize the files in my machine learning projects, and I’ll explain the reasoning behind each decision. Some of the information included here is specific to VS Code, but even if you prefer a different editor, you’ll still benefit from most of the content of this article.
The template I use to create new machine learning projects can be found on GitHub.
If you want to start a new machine learning project from my GitHub template, navigate to the repo on GitHub, and click on the “Use this template” button. GitHub template repositories are super handy — they allow me and others to generate new repositories containing the same structure, branches, and files as the template!
The next page will guide you through the settings you need to choose for your project (such as a repository name and privacy settings):
Once your repo is created, click on “Actions” in the top menu, and wait for the GitHub action to complete:
When you see a green checkmark, your project is ready for you to start coding!
Keep reading if you want to understand the reasoning behind each file added to your project, and how the GitHub template was created.
Let’s start by covering the most basic and essential files you’ll find in the project you created from the template:
The .gitignore file tells Git which files it should exclude when committing the project to the GitHub repo. If you were creating a new repo without using a template, you’d have a chance to select a pre-configured .gitignore file. The file I added to the template is simply the one generated when I selected the pre-configured “Python” option at repo creation time. This is a long file, so I show just the beginning below:
https://github.com/bstollnitz/ml-template/blob/main/.gitignore
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
...
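Since this is a template for machine learning projects, you may also want to append a few entries of your own for large files that don’t belong in version control. The paths below are purely hypothetical examples; adjust them to wherever your project stores data and model outputs:
# Project-specific entries (example paths).
data/
outputs/
*.ckpt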
The LICENSE file tells others which legal license applies to the code in the current repo. This is also a setting that can be chosen when a new repo is created:
I generally choose the MIT license because it allows my blog readers to use my code for commercial purposes, which I know is a concern for developers working for larger companies with more processes in place. Adding the right license for your code is extremely important — it may be the key factor that protects your company’s IP and keeps it in business, or that enables you to contribute to a high-impact open-source project or commercial product.
The README.md file contains information about the code in the current repo. I left this file blank in my template, but I still wanted it to be present so that I’m reminded to add content to it when I use the template to create a new project. Similarly to the other two files, this file can be created by simply checking a checkbox during the creation of a new repo:
I use Miniconda as my environment manager for Python. Since my projects always include a conda-style environment.yml file containing all the packages needed to run my code, it makes sense to add this file to my template. I included in this file a few packages that I use pretty much every time I create a machine learning project: Python, PyTorch, YAPF, and Pylint — I’ll talk about the last two in detail in the next section.
https://github.com/bstollnitz/ml-template/blob/main/environment.yml
name: <GITHUB_REPOSITORY>
channels:
- anaconda
- pytorch
- conda-forge
dependencies:
- python==3.10.4
- pytorch==1.11.0
- yapf==0.31.0
- pylint==2.12.2
The versions in the snippet above are the ones I use at the time of writing. The next time I update my packages to a new set of compatible versions, I’ll make that update in this template. Since all my future projects will use the template, they will all pick up the updated versions.
If you look at the environment.yml file in the project you created from the template, you’ll notice that the name of your conda environment matches your repository name, instead of the <GITHUB_REPOSITORY> token I added in the GitHub template. I’ll come back to this later, when I explain how the template was created.
You can create and activate the conda environment with the following commands:
conda env create -f environment.yml
conda activate <name of your environment>
You can read more about common conda commands in the documentation.
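For example, here are a few other common commands that come in handy (the exact flags can vary a little between conda versions, so check the documentation if one of them doesn’t work for you):
# List all conda environments on this machine.
conda env list
# Update the active environment after editing environment.yml, removing
# packages that are no longer listed.
conda env update -f environment.yml --prune
# Leave the environment when you're done.
conda deactivate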
A good choice of linter and formatter makes a world of difference when writing Python code in VS Code!
A typical Python linter analyzes your source code, verifies that it follows PEP8 (the official style guide for Python), and warns you of any instances where it doesn’t. The PEP8 style guide provides guidance on matters such as indentation, maximum line length, and variable naming conventions. I like to use Pylint as my linter because in addition to PEP8 style checks, it also does error checking — detecting when I’ve used a module without importing it, for example. Pylint is the most popular linter for Python at the time I’m writing this.
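As a quick illustration (this snippet is made up and not part of the template), Pylint flags the following function with an undefined-variable error, because the os module is used without being imported:
def print_working_directory():
    # Pylint reports "undefined-variable" here, because "os" was never imported.
    print(os.getcwd())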
A typical Python formatter auto-formats your Python code according to the PEP8 standard. For example, imagine that you have a line of code that’s longer than the maximum line length recommended by PEP8. Running the linter will give you a warning, but it won’t fix the issue for you. That’s where the formatter comes in: when you run it, it breaks up the code onto multiple lines automatically. I like to use YAPF, which was written by Google, because in addition to making sure my code conforms to PEP8, it also makes it look good. In the example I mentioned, YAPF won’t just break up the line so that it doesn’t violate PEP8’s max line length; it breaks it up so that it’s as easy as possible to read.
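To make this concrete, here’s a rough sketch with made-up function and variable names. A call that’s too long for one line, such as:
model = train_model(training_data, validation_data, learning_rate=0.001, num_epochs=100, batch_size=32)
might be reformatted by YAPF into something like the following (the exact output depends on your YAPF version and style settings):
model = train_model(training_data,
                    validation_data,
                    learning_rate=0.001,
                    num_epochs=100,
                    batch_size=32)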
The configuration for Pylint is typically done in a .pylintrc file. The .pylintrc file that I use follows the Google Python style guide, and can be downloaded from that same page. I was already following Google’s excellent style guide before I started using a linter, so adding a linter that enforces it brought great benefits for me.
Occasionally, I don’t want a particular rule to be enforced, so I disable it in one of two ways.
The first way is to disable the rule globally, by adding it to the disable list in the .pylintrc file. For example:
https://github.com/bstollnitz/ml-template/blob/main/.pylintrc
disable=abstract-method,
...
zip-builtin-not-iterating,
The second way is to disable the rule locally, by adding a comment to the code in question. For example, the following comment disables the unused-import warning in one of my other projects:
https://github.com/bstollnitz/sindy/blob/main/sindy/lorenz-pysindy/src/2_fit.py
# pylint: disable=unused-import
from pysindy.differentiation import FiniteDifference, SINDyDerivative
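You can also append the comment to a single line if you only want to silence one specific occurrence. Here’s a hypothetical one-line example that exempts a single use of eval from the eval-used warning:
result = eval(expression)  # pylint: disable=eval-used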
The configuration for YAPF is done in a .style.yapf file. Here are the contents of this file in the GitHub template:
https://github.com/bstollnitz/ml-template/blob/main/.style.yapf
[style]
based_on_style = google
The Formatting style section of YAPF’s documentation lists the four base styles supported by YAPF: “pep8”, “google”, “yapf”, and “facebook”. I chose Google’s style because it follows the Google Python style guide — it’s a good idea for the linter and formatter to follow the same style guidelines. The YAPF docs contain a lot more information to further customize how you want YAPF to work.
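For example, if you preferred a longer line length than the default, you could add an individual option under the base style. The column_limit option is one of YAPF’s configuration knobs; the value below is just an illustration, not what the template uses:
[style]
based_on_style = google
column_limit = 100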
How do you run the linter and formatter? You can kick them off manually from the command line, but the best way is to configure your editor to run them automatically. Keep reading to learn how I configure VS Code to do this.
VS Code is highly customizable through the use of a wide range of settings. There are a few different locations where you can add these settings, though, and choosing the right place can make a big difference in how efficient you are at working across projects and machines.
Any time I want to add a new setting to VS Code, I choose one of the following three locations for the setting:
.devcontainer/devcontainer.json — GitHub Codespaces is a cloud-based development environment that runs your code in a container. This is a recent GitHub feature that I’m super excited about — I recommend that you check it out if you haven’t already! I add to this file settings that are specific to running the project in a GitHub codespace. For example, this is a good place to set the default Python interpreter path for the codespace.
.vscode/settings.json — I add to this file settings that are specific to the project, regardless of whether I run the project on my local machine or in a codespace. This is where I add my linter and formatter choices, as we’ll see soon.
VS Code user settings — I add here settings that I want to apply to every project I work on, such as my preferred color theme.
I have another post where I explain in detail how I configure GitHub Codespaces. In the remainder of this section, I’ll cover the two other setting locations.
Let’s first look at .vscode/settings.json. I set my linter and formatter settings in this file within each project, because I may want to customize them per project. Storing these settings in each project also guarantees that my collaborators share the same settings, keeping the code consistently formatted. I don’t recommend moving these settings into your devcontainer.json file, because that file only takes effect when the project runs in a codespace (or dev container), and you typically want your development environment to behave the same locally and in Codespaces.
Here are the contents of my .vscode/settings.json file in the GitHub template:
https://github.com/bstollnitz/ml-template/blob/main/.vscode/settings.json
{
    "python.linting.pylintEnabled": true,
    "python.formatting.provider": "yapf",
    "editor.rulers": [
        80
    ],
    "editor.formatOnSave": true
}
The first two settings specify my choices of linter (Pylint) and formatter (YAPF). The editor.rulers setting instructs VS Code to display a thin vertical line at character 80, since the max line length recommended by PEP8 is 79 characters; this just helps me visualize where my code should wrap. The editor.formatOnSave setting tells VS Code to run YAPF every time I save my code. This is super handy! I can write my code without worrying about making it pretty, and a simple “Ctrl + S” formats it exactly the way I want it!
Now let’s talk about your VS Code user settings. To change those settings, click on the button with a gear-shaped icon at the bottom-left of the VS Code window, and choose “Settings”:
Once you’ve opened the settings editor, you can browse and search for all VS Code settings. By default, any setting that you change in this editor will apply to VS Code on the machine you are currently using. For a consistent development experience across your local machine, Codespaces, and any other machine where you use VS Code, I recommend turning on “Settings Sync.” This enables all your settings to be associated with your GitHub (or Microsoft) account, and it causes them to sync every time you open VS Code as that user. You can turn on “Settings Sync” by clicking again on the VS Code gear button, and then “Turn on Settings Sync…“:
You’ll then be taken to the following dialog, where you can configure which settings you want to sync. I like to keep all the checkboxes checked:
Next, click on the “Sign in & Turn on” button, select your account (GitHub or Microsoft), and your settings will be synced. If you have conflicts in settings from different machines, you’ll be given a chance to select which settings you want to prevail. In addition to seeing your preferences in the “Settings” editor, you can also see them in JSON format, by going to the Command Palette and selecting “Preferences: Open Settings (JSON).” I like to use the “Settings” tab to browse all the settings that are available to me, and the JSON file to quickly glance at all the settings I have customized.
My VS Code user settings are not specific to machine learning projects — they apply to every project! For example, this is where I set the color theme for the VS Code user interface (“workbench.colorTheme”: “Default Dark+”), and where I instruct VS Code to show me differences in whitespace when comparing two files (“diffEditor.ignoreTrimWhitespace”: false).
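In JSON form, those two user preferences look like this:
{
    "workbench.colorTheme": "Default Dark+",
    "diffEditor.ignoreTrimWhitespace": false
}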
I’ve been a fan of Settings Sync for a while, because it enables me to re-install Visual Studio Code and immediately start working in a familiar environment. But with Codespaces, it’s more important than ever. It plays a big role in ensuring that my cloud environment feels as comfortable as my local one.
Creating a GitHub template from a project is easy: go to the project Settings, and check the “Template repository” checkbox:
You can read more about this in the documentation.
Often GitHub templates contain GitHub Actions that customize the projects generated from the template in some way. A GitHub Action workflow is defined in a YAML file, and at a high level consists of a trigger event and a sequence of GitHub actions that execute when the event fires. The GitHub template associated with this post contains a GitHub Action workflow that is triggered when a new project is created from the template, and renames the conda environment and source code directory to match the repository name.
https://github.com/bstollnitz/ml-template/blob/main/.github/workflows/initial_setup.yml
# This GitHub Action workflow runs when a new repository is created from this
# template repo.
name: Initial Setup
on:
  push:
    branches:
      - main
env:
  REPO_NAME: ${{ github.event.repository.name }}
jobs:
  initial_setup:
    runs-on: ubuntu-latest
    if: ${{ github.event.created }}
    steps:
      # GitHub Action marketplace: https://github.com/marketplace/actions/checkout
      - name: Checkout
        uses: actions/checkout@v3
      # GitHub Action marketplace: https://github.com/marketplace/actions/find-and-replace
      - name: Replace token with repository name
        uses: jacobtomlinson/gha-find-replace@v2
        with:
          find: "<GITHUB_REPOSITORY>"
          replace: ${{ env.REPO_NAME }}
          exclude: "{.git,.github}/**"
      - name: Rename source code directory to repository name
        run: mv ml-template $REPO_NAME
      # GitHub Action marketplace: https://github.com/marketplace/actions/git-auto-commit
      - name: Commit all changes
        uses: stefanzweifel/git-auto-commit-action@v4
        with:
          commit_message: Initial repo setup
          commit_options: "--no-verify --signoff"
GitHub Action workflows are always added to a directory named .github/workflows. Our workflow has four sections: a name, a trigger event, a section for environment variables, and a jobs section containing the actions we want to trigger.
We want this workflow to be triggered the first time we push files to a project created from this template. We achieve this by specifying that we want the trigger event to be a push to the main branch, and then refining it with an if statement that says that we’re really just interested in the first push, which is the one that creates the project. The if statement accesses the GitHub context, which contains information about the workflow run. It then gets the context’s event property, which in this case is the push event and returns the full event webhook payload. And finally it drills into the created key of the push webhook payload, which is true for the first push of a repository.
Once the event is triggered, we want to rename the ml-template folder (which will contain the source code) and the conda environment name to match the name of the repository. We achieve this with four steps within the job, which run sequentially:
The first step checks out the repository, using the actions/checkout action.
The second step replaces all instances of the <GITHUB_REPOSITORY> token with the repository name, using the jacobtomlinson/gha-find-replace action. In our scenario there’s just one instance of the token — it’s the name of the conda environment in the environment.yml file.
The third step renames the ml-template folder with the name of the repository. This is the folder where you’ll add your source code.
The fourth step commits all changes, using the stefanzweifel/git-auto-commit-action action.
How did I know to use these particular actions? I found them all (except the rename command) by searching the GitHub Action marketplace, which I find so handy! No need to re-invent the wheel!
When you create your repository from the template, check the directory that’s created at the root level — its name should match the name you chose for your repository. And likewise for the name of the conda environment, which you can find in the environment.yml file. You’ll also notice that the YAML file with the actions workflow was added to your repository. You can delete the whole .github/workflows directory at this point, if you want.
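If you do decide to delete it, ordinary git commands will do the job. For example:
git rm -r .github/workflows
git commit -m "Remove initial setup workflow"
git push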
I hope that you’re inspired by the information I shared, and that you’ll use it to enhance your existing machine learning workflow. And if you have tricks of your own that improve on my current process, please do reach out! I would love to hear about them and update this post for everyone’s benefit!
Thank you for reading!