Writing and Publishing Python Modules

There is a time and a place for specific solutions, and for generalizable ones. This article is aimed at the latter, using python as an example. When building out large systems, it’s important to keep in mind the DRY principle — Don’t Repeat Yourself! Repetition in a code base can turn a simple change into a tangled, error-prone mess — hence, spaghetti code. One way to reduce repetition in a single code base is to modularize functionality — by decomposing repetitious code into modular functions, an update to a single function replaces the arduous task of updating code in different corners of the codebase. But what if you’re building something bigger, and shared code exists beyond a single repo?

Enter, packages. Every major programming language has some mechanism through which code can be shared and used by different people — Ruby has Gems, Node has npm, and Python has pip. Packages help reduce the effort required to compose software by modularizing pre-built functionality that can be imported and used in any codebase. If you wanted to write a python program to graph a dataset, you could write your own library to facilitate this. But graphing is a common problem, and common problems often already have solutions — the path of least resistance is to import an existing library (such as matplotlib) and use that for your graphing needs (unless, of course, your use case requires a level of complexity and sophistication that existing packages can’t provide, in which case a more custom solution is required).

import matplotlib.pyplot as plt

plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.show()

Writing Custom Modules #

There are many reasons to write custom modules — you might want to package some functionality that would be useful across your organization’s codebases, you might want to open source a solution that worked well for a particular problem you solved, or you might just be interested in how programmers share code. Whatever the case, mature languages will almost certainly have the facilities to share code. In fact, I’d argue that any moderately useful language should have some type of package management solution — it’s impossible to efficiently write code if every time you want to do something as common as print to stdout, you have to write the functionality to support the use case yourself. Of course, this is coupled with the question of how extensive a language’s standard library should be — that’s a discussion beyond the scope of this article, but I point the curious reader to Guy Steele’s fantastic talk on language design, titled Growing a Language.

Exploring Package Management in Python #

Note: I use the terms package and module interchangeably. Semantically speaking, a package is a collection of modules, whereas a module is a single file. For the sake of simplicity, the examples in the following section use a module, but the same lessons apply to packages in general.

With sufficient motivation for shared packages, we can now look into the tools within python that make this magic possible. pip is the utility that allows you to install packages in python — pip is actually a recursive acronym for “pip installs packages”. Using pip is very simple. In our above example using matplotlib, we can install the package by simply running:

$ pip install matplotlib

Note: Some python installations might not come with pip out-of-box. From the pip documentation:

pip is already installed if you are using Python 2 >=2.7.9 or Python 3 >=3.4 downloaded from python.org or if you are working in a Virtual Environment created by virtualenv or pyvenv. Just make sure to upgrade pip.

Now we know how to install packages — but where do those packages come from?

PyPI: Python Package Index #

The Python Package Index is the most popular public index of python packages. Anybody can publish to or install from PyPI. If you have shared code that is sensitive (for example, shared code to handle auth within an organization), there are excellent private options, such as JFrog Artifactory.

pip defaults to using the PyPI index when installing packages, which you can confirm by observing the output from the following command:

$ pip install -h 
  -i, --index-url <url>       Base URL of the Python Package Index (default https://pypi.org/simple).

If you’re using a private option like Artifactory, you would need to create a configuration file that knows to use Artifactory when attempting to resolve packages.
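As a sketch, a pip configuration file (~/.config/pip/pip.conf on Linux, pip.ini on Windows) can point pip at a private index. The URL below is a placeholder, not a real endpoint:

```ini
# pip.conf / pip.ini
# index-url replaces the default PyPI index for all pip commands.
[global]
index-url = https://mycompany.jfrog.io/artifactory/api/pypi/pypi-virtual/simple
```

With this in place, every `pip install` resolves packages against the private index instead of PyPI.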

Writing a Python Module #

Writing a module in python to be shared across files in a single repo is as simple as defining a function and importing it in other files. If I define the following function in my_cool_function.py:

def my_cool_function():
    print("This is a super super useful and cool function!")
    return True

I can import this function in another file as easily as:

from path.to.my_cool_function import my_cool_function

# invoke the function
my_cool_function()

But perhaps, my_cool_function is so cool and useful that you feel the need to share it with the world — you want to publish your module to PyPI. The difficult part is writing the module — actually publishing it is seamless once you have the proper infrastructure in place.

Publishing a Python Module #

The base infrastructure for writing a module is the actual module code, and a few additional files. The basic module structure should look something like this:

├── MANIFEST.in
├── README.md
├── requirements-dev.txt
├── requirements.txt
├── setup.py
└── src
    ├── __init__.py
    ├── command_line.py
    └── my_cool_function.py

The actual source code exists in src/. Along with the file we have for our code, we also have an __init__.py — this file lets python know that the files in this directory are part of a package. This is a contrived example — in reality, your module may have many more files in the src directory that serve as support for your core module.

The requirements.txt and requirements-dev.txt are standard within python projects — they contain all of your project’s dependencies (the split between a regular and a dev file is up to the programmer, but a common convention is runtime dependencies in requirements.txt and development-only tools, like test runners and linters, in requirements-dev.txt). Think of these files as containing a list of libraries that any end user would need to install in order to properly use your library (when the module is installed via pip, these dependencies are downloaded automatically, provided they are wired into setup.py as shown below).
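For instance, a requirements.txt is just a plain text file with one dependency per line, optionally pinned to a version. The packages and version numbers here are purely illustrative:

```
matplotlib==3.8.4
requests>=2.28,<3
```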

The README.md is standard across most repos (not just python projects) — this should contain a description of your module and whatever other information you want to include.

Finally, the files that actually provide the backbone of your module are MANIFEST.in and setup.py. MANIFEST.in lists what additional files to include as part of your module (other than your source files) and setup.py is the actual module configuration. The setup.py file contains all of the metadata and information required to properly package your module for whatever index you want to upload to. An example might look something like this:

from setuptools import setup

# Should match git tag
VERSION = '0.1.1'

def readme():
    with open('README.md') as f:
        return f.read()

with open('requirements.txt') as file:
    REQUIRED_MODULES = [line.strip() for line in file]

with open('requirements-dev.txt') as file:
    DEVELOPMENT_MODULES = [line.strip() for line in file]

setup(name='my_cool_function',
      version=VERSION,
      description='Starter project for python modules',
      long_description=readme(),
      long_description_content_type='text/markdown',
      keywords='my cool function',
      author='Your Name',
      author_email='[email protected]',
      packages=['src'],
      install_requires=REQUIRED_MODULES,
      extras_require={'dev': DEVELOPMENT_MODULES},
      )

The above example is configured to include the README file along with regular and dev dependencies in the package definition. Most of this should be self-explanatory, but of note is the name argument to the setup function, which is the actual name of the package (i.e. you would install this module by running pip install my_cool_function). The packages arg specifies the location of the source files, in this case, src. The rest of the options are metadata for the package — for an extended explanation of each, check out the docs.

With the proper infrastructure in place, the final step is to actually upload and publish the package. To do so with PyPI, you need to create an account at https://pypi.org/. Once you have an account, you’ll want to generate an API token, which authenticates your uploads to the index.

To actually upload your archive, first generate the distribution archives with the following:

$ python3 setup.py sdist bdist_wheel
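If the build succeeds, a dist/ directory appears containing the archives. The exact file names depend on your package’s name and version; the ones below are illustrative:

```
dist/
├── my_cool_function-0.1.1-py3-none-any.whl
└── my_cool_function-0.1.1.tar.gz
```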

and then, to actually upload the package to PyPI (or whatever private index you have configured), install the twine utility (if you don’t already have it) and run it on your distribution archives:

$ python3 -m pip install --user --upgrade twine # if you don't have twine installed
$ python3 -m twine upload dist/*

You can verify that your package will upload properly using TestPyPI, a separate index meant to facilitate the testing of package uploads — the python documentation has an excellent tutorial on this exact process: https://packaging.python.org/tutorials/packaging-projects/.

Now that your package is published, you can easily pip install it: pip install my_cool_function

Maintaining a Module #

When maintaining a python module (or any code, for that matter), you should automate as much of the work as possible. While we can’t really automate writing a module (at least not entirely), we can automate its integration and deployment. Whether or not you practice rigorous Test Driven Development (TDD), the long-term correctness of your code depends on automated testing. Discussion of effective testing is beyond the scope of this article, but there are lots of ways to test your code — my recommended solution is pytest.
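As a sketch of what such tests might look like, here is a minimal pytest file for my_cool_function. The function is redefined inline so the example is self-contained; in a real project you would import it from your package instead:

```python
# test_my_cool_function.py -- a minimal pytest example.
# pytest discovers files and functions prefixed with `test_` automatically.

def my_cool_function():
    print("This is a super super useful and cool function!")
    return True

def test_my_cool_function_returns_true():
    # The function should always report success.
    assert my_cool_function() is True

def test_my_cool_function_prints(capsys):
    # capsys is a built-in pytest fixture that captures stdout/stderr.
    my_cool_function()
    captured = capsys.readouterr()
    assert "useful and cool" in captured.out
```

Running `pytest` in the project root will collect and execute both tests.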

In the spirit of developing shared code, I have a starter repo called module_starter_cli that comes preconfigured with all the necessary infrastructure to automate testing via pytest, versioning and publishing through python-semantic-release, and deployment through CircleCI. Forking this repo for downstream projects gives you the ability to run automated tests on every branch for every pull request, and it will automatically version your code and handle the hassle of deployment, creating a seamless and low-friction development cycle. I can go into detail on this setup and the philosophy behind shared infrastructure code in a future article. Till then, I hope this article helps your solutions become a little more generalizable.

