A binary or package (used interchangeably) is a pre-built collection of C++ libraries, header files, python bits, and other files. We build these and distribute them so that users do not need to install from source.
A binary configuration is the collection of parameters that identifies a single binary: the package type (conda, wheel/manywheel, or libtorch), the python version, and the CPU/CUDA variant. These show up as the PACKAGE_TYPE, DESIRED_PYTHON, and DESIRED_CUDA variables in the build scripts below.
The binaries are built in CircleCI. There are nightly binaries, built every night at 9pm PST (midnight EST), and release binaries corresponding to PyTorch releases, usually every few months.
We have 3 types of binary packages: conda packages, pip packages (the manywheels and wheels described below), and libtorch packages (C++-only zips of the libraries and headers).
All binaries are built in CircleCI workflows except Windows. There are checked-in workflows (committed into .circleci/config.yml) that build the nightlies every night. Releases are built by manually pushing a PR that builds the suite of release binaries (overwriting config.yml to build the release).
Some quick vocab: a "workflow" is CircleCI's term for a DAG of jobs, and a "job" is a single unit of work run on one executor. The nightly binaries are spread across 3 workflows, with one job (actually 3 jobs: build, test, and upload) per binary configuration.
The jobs are in https://github.com/pytorch/pytorch/tree/main/.circleci/verbatim-sources. Jobs are made of multiple steps, and there are some shared steps used by all the binaries/smokes. The steps of these jobs all delegate to scripts in https://github.com/pytorch/pytorch/tree/main/.circleci/scripts.
CircleCI creates a final yaml file by inlining every <<* segment, so if we were to keep all the code in the config.yml itself then the config size would go over 4 MB and cause infra problems.
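Concretely (as described later in this doc), config.yml is generated from the verbatim-sources with .circleci/regenerate.sh, which must be run with python 3.7. A minimal sketch of the check you should do after editing the verbatim-sources:

```
# After editing anything under .circleci/verbatim-sources/, regenerate the
# checked-in config and look at what changed (must be run with python 3.7).
.circleci/regenerate.sh
git diff .circleci/config.yml   # shows the generated changes you still need to commit
```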
CircleCI has several executor types; macos, machine, and docker are the ones we use. The 'machine' executor gives you two cores on some linux VM. The 'docker' executor gives you considerably more cores (nproc was 32 instead of 2 back when I tried in February). Since the docker executors are faster, we run everything that we can in docker.
binary_run_in_docker.sh is how the binary test jobs and the binary smoke test jobs share their docker start-up code.
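The sketch below is a conceptual illustration of the pattern that script implements (start a container, copy in what the step needs, run the step inside it); the image name, file names, and variables are examples, not the actual script's contents.

```
# Conceptual sketch only, not binary_run_in_docker.sh itself.
set -ex
DOCKER_IMAGE=${DOCKER_IMAGE:-pytorch/conda-cuda}   # example image

# Start a long-lived container
id=$(docker run --detach --tty "$DOCKER_IMAGE")

# Copy in the code and the command to run
docker cp ./pytorch "$id:/pytorch"
docker cp ./my_test_step.sh "$id:/my_test_step.sh"   # hypothetical step script

# Execute the step inside the container and propagate its exit code
docker exec "$id" bash /my_test_step.sh
```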
We want all the nightly binary jobs to run on the exact same git commit, so we wrote our own checkout logic to ensure that the same commit was always picked. Later, CircleCI changed the setup to use a single pytorch checkout persisted through the workspace (our config file was too big, so they wanted to move a lot of the setup code into scripts; those scripts needed the code repo to exist before they could be called, so they added a prerequisite 'setup' job to check out the code and persist the needed scripts to the workspace). The changes to the binary jobs were not properly tested, so they all broke because the pytorch code they relied on no longer existed. We hotfixed the problem by adding the pytorch checkout back to binary_checkout, so now there are two checkouts of pytorch on the binary jobs. This still needs to be fixed properly, but it requires careful tracing of which code is being called where.
The code that runs the binaries lives in two places: the normal github.com/pytorch/pytorch repo, and github.com/pytorch/builder, a repo that defines how all the binaries are built. The relevant code is:
```
# All code needed to set-up environments for build code to run in,
# but only code that is specific to the current CI system
pytorch/pytorch
- .circleci/                # Folder that holds all circleci related stuff
  - config.yml              # GENERATED file that actually controls all circleci behavior
  - verbatim-sources        # Used to generate job/workflow sections in ^
  - scripts/                # Code needed to prepare circleci environments for binary build scripts
- setup.py                  # Builds pytorch. This is wrapped in pytorch/builder
- cmake files               # used in normal building of pytorch

# All code needed to prepare a binary build, given an environment
# with all the right variables/packages/paths.
pytorch/builder
# Given an installed binary and a proper python env, runs some checks
# to make sure the binary was built the proper way. Checks things like
# the library dependencies, symbols present, etc.
- check_binary.sh
# Given an installed binary, runs python tests to make sure everything
# is in order. These should be de-duped. Right now they both run smoke
# tests, but are called from different places. Usually just call some
# import statements, but also has overlap with check_binary.sh above
- run_tests.sh
- smoke_test.sh
# Folders that govern how packages are built. See paragraphs below
- conda/
  - build_pytorch.sh        # Entrypoint. Delegates to proper conda build folder
  - switch_cuda_version.sh  # Switches the active CUDA installation in Docker
  - pytorch-nightly/        # Build-folder
- manywheel/
  - build_cpu.sh            # Entrypoint for cpu builds
  - build.sh                # Entrypoint for CUDA builds
  - build_common.sh         # Actual build script that ^^ call into
- wheel/
  - build_wheel.sh          # Entrypoint for wheel builds
- windows/
  - build_pytorch.bat       # Entrypoint for wheel builds on Windows
```
Every type of package has an entrypoint build script that handles all the important logic.
Linux, MacOS and Windows use the same code flow for the conda builds.
Conda packages are built with conda-build; see https://conda.io/projects/conda-build/en/latest/resources/commands/conda-build.html
Basically, you pass conda-build a recipe folder (pytorch-nightly/ above) that contains a build script and a meta.yaml. The meta.yaml specifies what python environment to build the package in and what dependencies the resulting package should have, and the build script gets called inside that environment to build the thing. The tl;dr on conda-build is that it creates a fresh build environment from meta.yaml, runs build.sh inside it, packages up the result, and runs whatever tests meta.yaml specifies.
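As a rough illustration, a conda-build invocation for a recipe folder looks like the sketch below. The flags and output path are examples; the real entrypoint (builder/conda/build_pytorch.sh) passes its own set.

```
# Minimal sketch of a conda-build invocation against the pytorch-nightly/
# recipe folder. Flags and paths are illustrative, not the real entrypoint's.
conda install -y conda-build

conda build pytorch-nightly/ \
    --python 3.7 \
    --output-folder /final_pkgs \
    --no-anaconda-upload
```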
The build.sh we use is essentially a wrapper around python setup.py build, but it also manually copies some of our dependent libraries into the resulting tarball and messes with some rpaths.
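The snippet below is an illustrative sketch (not the actual build.sh) of that "copy a dependent library into the package and fix the rpath" pattern on linux. The library and directory names are made up; SP_DIR is a real variable conda-build sets for build scripts, and on MacOS install_name_tool would play the role of patchelf.

```
# Illustrative sketch of bundling a dependency and fixing rpaths (linux case).
set -ex

TORCH_LIB_DIR="$SP_DIR/torch/lib"

# Bundle a (hypothetical) dependent shared library next to the torch libraries
cp /opt/somedep/lib/libsomedep.so "$TORCH_LIB_DIR/"

# Point the library's rpath at its own directory ($ORIGIN) so it finds the
# bundled copy instead of an absolute build-machine path
patchelf --set-rpath '$ORIGIN' "$TORCH_LIB_DIR/libtorch.so"
```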
The entrypoint file builder/conda/build_pytorch.sh (the conda entrypoint listed above) is complicated because it handles the conda builds for Linux, MacOS, and Windows in a single script.
Manywheels are pip packages for linux distros. Note that these manywheels are not actually manylinux compliant.
builder/manywheel/build_cpu.sh and builder/manywheel/build.sh (for CUDA builds) just set different env vars and then call into builder/manywheel/build_common.sh, as sketched below.
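Here is an illustrative sketch of that thin-wrapper pattern; the variable names are examples, not the exact set the real scripts export.

```
# Sketch of a build_cpu.sh-style wrapper: export the configuration, then
# delegate everything to the shared script.
set -ex

export DESIRED_CUDA=cpu          # build.sh would export a CUDA version instead

# Everything interesting happens in the shared script
source "$(dirname "$0")/build_common.sh"
```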
The entrypoint file builder/manywheel/build_common.sh is really, really complicated because it contains the actual build logic for every python version and every CPU/CUDA variant that the thin wrappers above select between.
The entrypoint file builder/wheel/build_wheel.sh is complicated because, even though these are pip wheels, the MacOS Python wheels are still built in conda environments, and some of the dependencies present during the build also come from conda.
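A minimal sketch of that "build a pip wheel from inside a conda env" flow is below. This is not builder/wheel/build_wheel.sh itself; the package list and output path are illustrative.

```
# Sketch only: conda provides the python and build dependencies, setup.py
# produces the wheel.
set -ex

conda create -yn wheel_build python=3.7 numpy pyyaml setuptools cmake ninja
conda activate wheel_build

cd /pytorch
python setup.py bdist_wheel -d /final_pkgs
```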
The entrypoint file builder/windows/build_pytorch.bat is complicated because, just like on MacOS, the Windows Python wheels are still built in conda environments, and some of the dependencies present during the build also come from conda.
Libtorch packages are built in the wheel build scripts: manywheel/build_*.sh for linux and build_wheel.sh for mac. There are several things wrong with this arrangement, starting with the fact that libtorch is a C++-only package being produced by the python wheel scripts.
All linux builds occur in docker images; pytorch/conda-cuda (used in the local-build example below) is one of them. The Dockerfiles are available in pytorch/builder, but there is no circleci job or script to build these docker images, and they cannot be built locally unless you have the correct local packages/paths. Only Soumith can build them right now.
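If you only need to use one of the published images (not rebuild it), pulling it is enough. For example:

```
# Pull the image used in the local-build example later in this doc
docker pull pytorch/conda-cuda
```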
tl;dr make a PR that looks like https://github.com/pytorch/pytorch/pull/21159
Sometimes we want to push a change to main and then rebuild all of today's binaries after that change. As of May 30, 2019 there isn't a way to manually run a workflow in the UI. You can manually re-run a workflow, but it will use the exact same git commits as the first run and will not include any changes. So we have to make a PR and then force circleci to run the binary workflow instead of the normal tests. The above PR is an example of how to do this; essentially you copy-paste the binarybuilds workflow steps into the default workflow steps. If you need to point the builder repo to a different commit then you'd need to change https://github.com/pytorch/pytorch/blob/main/.circleci/scripts/binary_checkout.sh#L42-L45 to checkout what you want.
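For illustration only, pinning a builder checkout to a specific commit looks roughly like the snippet below; the placeholder variable is hypothetical and this is not the actual code at the linked lines of binary_checkout.sh.

```
# Hypothetical illustration of pinning the builder repo to a chosen commit/branch
git clone https://github.com/pytorch/builder.git /builder
pushd /builder
git checkout "${BUILDER_COMMIT_OR_BRANCH}"   # whatever commit/branch you want to test
popd
```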
Writing PRs that test the binaries is annoying, since the default circleci jobs that run on PRs are not the jobs that you want to run. Likely, changes to the binaries will touch something under .circleci/ and require that .circleci/config.yml be regenerated (.circleci/config.yml controls all circleci behavior and is generated using .circleci/regenerate.sh, which must be run with python 3.7). But you also need to manually hardcode the binary jobs that you want to test into the .circleci/config.yml workflow, so you should actually make at least two commits: one for your changes and one to temporarily hardcode the jobs. See https://github.com/pytorch/pytorch/pull/22928 as an example of how to do this.
```
# Make your changes
touch .circleci/verbatim-sources/nightly-binary-build-defaults.yml

# Regenerate the yaml, has to be in python 3.7
.circleci/regenerate.sh

# Make a commit
git add .circleci *
git commit -m "My real changes"
git push origin my_branch

# Now hardcode the jobs that you want in the .circleci/config.yml workflows section
# Also eliminate ensure-consistency and should_run_job checks
# e.g. https://github.com/pytorch/pytorch/commit/2b3344bfed8772fe86e5210cc4ee915dee42b32d

# Make a commit you won't keep
git add .circleci
git commit -m "[DO NOT LAND] testing binaries for above changes"
git push origin my_branch

# Now you need to make some changes to the first commit.
git rebase -i HEAD~2   # mark the first commit as 'edit'

# Make the changes
touch .circleci/verbatim-sources/nightly-binary-build-defaults.yml
.circleci/regenerate.sh

# Amend the commit and continue the rebase
git add .circleci
git commit --amend
git rebase --continue

# Update the PR, need to force since the commits are different now
git push origin my_branch --force
```
The advantage of this flow is that you can make new changes to the base commit and regenerate the .circleci config without having to rewrite which binary jobs you want to test on. The downside is that all updates will be force pushes.
You can easily build Linux binaries locally using docker.
```
# Run the docker
# Use the correct docker image, pytorch/conda-cuda used here as an example
#
# -v path/to/foo:path/to/bar makes path/to/foo on your local machine (the
#    machine that you're running the command on) accessible to the docker
#    container at path/to/bar. So if you then run `touch path/to/bar/baz`
#    in the docker container then you will see path/to/foo/baz on your local
#    machine. You could also clone the pytorch and builder repos in the docker.
#
# If you know how, add ccache as a volume too and speed up everything
docker run \
    -v your/pytorch/repo:/pytorch \
    -v your/builder/repo:/builder \
    -v where/you/want/packages/to/appear:/final_pkgs \
    -it pytorch/conda-cuda /bin/bash

# Export whatever variables are important to you. All variables that you'd
# possibly need are in .circleci/scripts/binary_populate_env.sh
# You should probably always export at least these 3 variables
export PACKAGE_TYPE=conda
export DESIRED_PYTHON=3.7
export DESIRED_CUDA=cpu

# Call the entrypoint
# `|& tee foo.log` just copies all stdout and stderr output to foo.log
# The builds generate lots of output so you probably need this when
# building locally.
/builder/conda/build_pytorch.sh |& tee build_output.log
```
Building CUDA binaries on docker
You can build CUDA binaries on CPU-only machines, but you can only run CUDA binaries on CUDA machines. This means that you can build a CUDA binary in docker on your laptop if you so choose (though it will take a long time).
For Facebook employees, ask about beefy machines that have docker support and use those instead of your laptop; it will be 5x as fast.
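If you do build a CUDA binary locally, the flow is the same as the CPU example above. A hedged sketch follows; the DESIRED_CUDA value, the docker image you pick, and the entrypoint all depend on the configuration you want (check .circleci/scripts/binary_populate_env.sh for the accepted values), and cu101 here is just an example.

```
# Same local-docker flow as the CPU example, sketched for a CUDA manywheel build
export PACKAGE_TYPE=manywheel
export DESIRED_PYTHON=3.7
export DESIRED_CUDA=cu101   # example value; pick the CUDA version you need

# CUDA manywheel builds go through the CUDA entrypoint
/builder/manywheel/build.sh |& tee build_output.log
```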
There's no easy way to generate reproducible, hermetic MacOS environments. If you have a Mac laptop then you can try to emulate the .circleci environments as much as possible, but you probably have packages in /usr/local/, possibly installed by brew, that will interfere with the build. If you're trying to repro an error on a MacOS build in .circleci and you can't seem to repro it locally, then my best advice is actually to iterate on .circleci :/
But if you want to try, then I'd recommend:
```
# Create a new terminal
# Clear your LD_LIBRARY_PATH and trim as much out of your PATH as you
# know how to do

# Install a new miniconda
# First remove any other python or conda installation from your PATH
# Always install miniconda 3, even if building for Python <3
new_conda="$HOME/my_new_conda"
conda_sh="$HOME/install_miniconda.sh"
curl -o "$conda_sh" https://repo.anaconda.com/miniconda/Miniconda3-latest-MacOSX-x86_64.sh
chmod +x "$conda_sh"
"$conda_sh" -b -p "$new_conda"
rm -f "$conda_sh"
export PATH="$new_conda/bin:$PATH"

# Create a clean python env
# All MacOS builds use conda to manage the python env and dependencies
# that are built with, even the pip packages
conda create -yn binary python=3.7   # match DESIRED_PYTHON below
conda activate binary

# Export whatever variables are important to you. All variables that you'd
# possibly need are in .circleci/scripts/binary_populate_env.sh
# You should probably always export at least these 3 variables
export PACKAGE_TYPE=wheel
export DESIRED_PYTHON=3.7
export DESIRED_CUDA=cpu

# Call the entrypoint you want
path/to/builder/wheel/build_wheel.sh
```
N.B. installing a brand new miniconda is important. This has to do with how conda installations work. See the "General Python" section above, but the tl;dr is that:

1. Installing miniconda prepends path/to/conda_root/bin to your PATH.
2. Creating and activating a new env prepends its bin directory as well, so your PATH looks like path/to/conda_root/envs/new_env/bin:path/to/conda_root/bin:$PATH.
3. Now say you (or some code that you ran) call some binary foo:
   - If you installed foo in new_env, then path/to/conda_root/envs/new_env/bin/foo will get called, as expected.
   - If you forgot to install foo in new_env but happened to previously install it in your root conda env (called 'base'), then unix/linux will still find path/to/conda_root/bin/foo. This is dangerous, since foo can be a different version than you want; foo can even be for an incompatible python version!

Newer conda versions and proper python hygiene can prevent this, but just install a new miniconda to be safe.
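A quick sanity check for this, under the assumption that your env is named binary as in the example above (the tools checked here are just examples):

```
# Verify that the tools you are about to build with come from the new env,
# not from the root conda install
conda activate binary
which python
which cmake
# Every printed path should start with .../envs/binary/bin,
# not with the conda root's bin/ directory.
```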
TODO: fill in