Using vendor in Python

This article describes how to properly vendor third-party libraries in a Python library. The audience is narrow, and most Python developers will never need this technique, but I am writing it up in the spirit of sharing. As a software author, you should respect the work of other library authors, and vendoring properly is part of that.

WHAT - What is a vendor?

Vendoring means embedding the code of a third-party library directly in your software (the practice is common in languages such as C and Go). Unlike declaring the library in a dependency file, vendoring ships the third-party code as part of your software, either unchanged or modified. You therefore need to pay attention to license restrictions: if the upstream library is under a GPL-family license, the copyleft terms may extend to the software that vendors it.

WHY - When do I use vendor in Python?

As I said at the beginning, the scope is very narrow and there are three scenarios.

1. The software must be self-contained with zero dependencies. In the Python world, the heaviest user of vendoring is pip, which we use every day; at the time of writing, pip._vendor contains 25 dependencies. pip is the standard Python installer, so it cannot declare dependencies: they would have to be installed before pip could run, and installing them requires pip, which is circular. Basic build tools such as setuptools are in the same position.
2. The software depends on a specific version of an upstream library. This includes cases where the upstream library makes breaking changes frequently, so its API is unstable. If you simply pin third-party-lib==1.0.0 as a dependency, you create a conflict with any other software in the same environment that depends on this library at a different version. Vendoring removes this strict pin from your declared dependencies.
3. The software needs changes to the upstream library that, because of how upstream is maintained, cannot be merged and released through a PR or similar means. If the license permits, you can embed the source code through vendoring and modify it yourself.

In fact, scenarios 2 and 3 do not strictly require vendoring. You can also fork the library into your own git repository and depend on it as a git dependency, or publish the fork as a new PyPI package. Vendoring is simply one of the easiest options.

There is one more constraint: in Python, only pure Python libraries can be vendored, because compiled extension modules are specific to a platform and interpreter.
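A quick way to check the pure Python constraint is to scan an unpacked library for compiled artifacts. The following is a minimal sketch, not part of any tool mentioned here; the helper names and the suffix list are my own assumptions.

```python
from pathlib import Path

# Suffixes of compiled extension modules and shared libraries.
# This list is illustrative, not exhaustive.
BINARY_SUFFIXES = {".so", ".pyd", ".dll", ".dylib"}


def find_binary_files(package_dir):
    """Return sorted paths under package_dir that look like compiled artifacts."""
    root = Path(package_dir)
    return sorted(
        path
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in BINARY_SUFFIXES
    )


def is_pure_python(package_dir):
    """Return True when no compiled artifacts were found under package_dir."""
    return not find_binary_files(package_dir)
```

If find_binary_files returns anything, the library is probably not a good candidate for vendoring.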

HOW - How should I vendor?

Vendoring is not a simple copy-and-paste. In my opinion, you have to pay attention to the following two points.

Comply with the open source license, and put the license files in the vendor directory as well.
When you change the vendored source code, record the change as a patch file so that it can be contributed back upstream when the time is right.

So vendoring is not copy-and-paste but a compromise with the status quo within the open source framework, and the ultimate goal is to stop vendoring.
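The license requirement can be checked mechanically. Here is a rough sketch that lists vendored package directories with no license file next to their code. The function names and the list of file-name prefixes are assumptions of mine, and the sketch only looks at package directories, not single-file modules.

```python
from pathlib import Path

# File-name prefixes commonly used for license texts (an assumption).
LICENSE_PREFIXES = ("license", "licence", "copying", "notice")


def has_license_file(package_dir):
    """Return True if package_dir directly contains a license-like file."""
    return any(
        entry.is_file() and entry.name.lower().startswith(LICENSE_PREFIXES)
        for entry in Path(package_dir).iterdir()
    )


def packages_missing_license(vendor_dir):
    """Return names of vendored package directories without a license file."""
    return sorted(
        entry.name
        for entry in Path(vendor_dir).iterdir()
        if entry.is_dir()
        # Skip helper directories such as __pycache__ or hidden folders.
        and not entry.name.startswith(("_", "."))
        and not has_license_file(entry)
    )
```

Running packages_missing_license("mypackage/vendor") in a test would fail the build when a license goes missing.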

In Python, besides putting the vendored libraries in a directory inside the code base (e.g. mypackage/vendor), you need to rewrite all import statements to point to this directory. For example, import requests becomes from mypackage.vendor import requests. PDM contains such a directory, and I manage it with the same tool pip uses: vendoring. It is sparsely documented (few projects need it). It provides the following functions.

read a requirements.txt to download the dependencies to the specified directory
download the LICENSE files of all libraries into this directory
read the patch file from a specified path and apply it to the source code
rewrite all import statements to point to the vendor directory
update the vendor version
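To make the import rewriting step concrete, here is a simplified sketch of the idea. It is not the implementation used by vendoring: rewrite_imports is a hypothetical helper, and it only handles the two plainest forms (import name and from name... import ...), ignoring aliases and dynamic imports.

```python
import re


def rewrite_imports(source, namespace, vendored):
    """Prefix simple imports of the vendored libraries with namespace."""
    for name in vendored:
        escaped = re.escape(name)
        # "import requests" -> "from mypackage.vendor import requests"
        source = re.sub(
            rf"^([ \t]*)import {escaped}$",
            rf"\g<1>from {namespace} import {name}",
            source,
            flags=re.MULTILINE,
        )
        # "from requests.x import y" -> "from mypackage.vendor.requests.x import y"
        source = re.sub(
            rf"^([ \t]*)from {escaped}(\.|\s)",
            rf"\g<1>from {namespace}.{name}\g<2>",
            source,
            flags=re.MULTILINE,
        )
    return source
```

Imports that such simple rules miss are the reason a tool needs an escape hatch for manual substitution rules.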

The procedure follows the steps above. First create a mypackage/vendor directory, then create a vendors.txt in it and list the dependencies (in requirements.txt format).

requests==2.24.0
click==8.0.1

Then in pyproject.toml under the project root path, add the following.

[tool.vendoring]
destination = "mypackage/vendor/"   # path of the vendor directory
requirements = "mypackage/vendor/vendors.txt"  # path of the requirements file
namespace = "mypackage.vendor"  # prefix used when rewriting imports

protected-files = ["__init__.py", "README.md", "vendors.txt"]  # files to keep every time the libraries are re-vendored
patches-dir = "tasks/patches"  # directory containing the patch files

[tool.vendoring.transformations]
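substitute = [  # replacement rules for imports the automatic rewrite does not cover
  # match is assumed to be a regular expression here, so the literal parentheses are escaped
  {match = '__import__\("requests"\)', replace = '__import__("mypackage.vendor.requests")'}
]
drop = [   # files to remove from the vendored libraries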
    "bin/",
    "*.so",
    "typing.*",
    "*/tests/"
]

Finally, run vendoring sync and the vendor directory is populated automatically.

A patch file is simply the output of git diff, so the changes can be replayed on the freshly downloaded source whenever the vendor directory is rebuilt. To generate a patch:

1. Run vendoring sync once after configuration and commit the result to the local repository (commit only, do not push).
2. Modify the vendored source code.
3. Run git diff --patch <file_path> > <patches_dir>/<file_name>.patch to save the patch file to patches_dir.
4. Review the patch file and revert any rewritten import statements to their original form, e.g. from mypackage.vendor import requests back to import requests. This is necessary because patches are applied before imports are rewritten, so the patch must contain the unrewritten imports. Be careful not to change any whitespace while editing; patch files are whitespace-sensitive.
5. Run git add . && git commit --amend to commit the changes.
6. Run vendoring sync again to verify. If everything works there should be no changes, which means the vendoring process is reproducible.
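Reverting the imports in a patch by hand is error-prone, so it can be scripted. This is a small sketch under my own assumptions: restore_original_imports is a hypothetical helper, it only covers the two plain import forms, and it leaves all other characters (including whitespace) untouched.

```python
import re


def restore_original_imports(patch_text, namespace):
    """Undo namespace-prefixed imports in a patch without touching whitespace."""
    ns = re.escape(namespace)
    # "from mypackage.vendor import requests" -> "import requests"
    text = re.sub(rf"\bfrom {ns} import ", "import ", patch_text)
    # "from mypackage.vendor.requests import x" -> "from requests import x"
    return re.sub(rf"\bfrom {ns}\.", "from ", text)
```

You would read the patch file, pass its text through this function with namespace set to mypackage.vendor, write it back, and still review the result before committing.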
