Testing and continuous integration

Kyle Niemeyer

Oregon State University

2025-02-03

Testing your code

How do you know your code gives the right answers?
… what about after you make changes?
Legacy code: “code without tests” (Michael Feathers, Working Effectively with Legacy Code)

Testing your code

when to test: always.

where to test: external test suite.

Example: tests subdirectory inside package.

Perfect is the enemy of good; a basic level of tests is better than nothing. But a rigorous test suite will save you time and potential problems in the long run.

Why test?

Testing is a core principle of scientific software; it ensures results are trustworthy.

Scientific and engineering software is used for planes, power plants, satellites, and decisionmaking. Thus, correctness of this software is pretty important.

And we all know how easy it is to have mistakes in code without realizing it…

Testing important for contributions

As a contributor, how do I know if my PR hasn’t broken anything?
As a commiter, how do I know if your PR is ok?
Unit tests are often a great repository of how different features of a library actually work

What and how to test?

def kepler_loc(p1, p2, dt, t):
    '''Use Kepler's Laws to predict location of celestial body'''
    ...
    return p3

def test_kepler_loc():
    p1 = jupiter(two_days_ago)
    p2 = jupiter(yesterday)
    exp = jupiter(today)
    obs = kepler_loc(p1, p2, 1, 1)
    if exp != obs:
        raise ValueError("Jupiter is not where it should be!")

What is a test?

Tests compare expected outputs versus observed outputs for known inputs. They do not inspect the body of the function directly. In fact, the body of a function does not even have to exist for a valid test to be written.

def test_func():
    exp = get_expected()
    obs = func(*args, **kwargs)
    assert exp == obs

Good idea: test through assertions

For exactness:

def test_kepler_loc():
    p1 = jupiter(two_days_ago)
    p2 = jupiter(yesterday)
    exp = jupiter(today)
    obs = kepler_loc(p1, p2, 1, 1)
    assert exp == obs

For approximate exactness:

import numpy as np
def test_kepler_loc():
    p1 = jupiter(two_days_ago)
    p2 = jupiter(yesterday)
    exp = jupiter(today)
    obs = kepler_loc(p1, p2, 1, 1)
    assert np.allclose(exp, obs)

Test using pytest

# content of test_sample.py
def inc(x):
    return x + 1

def test_answer():
    assert inc(3) == 5

$ pytest

pytest finds all testing modules and functions, and runs them.

Example

$ tree simple_packaging
simple-packaging
├── LICENSE
├── README.md
├── pyproject.toml
├── src
│   └── rosen
│       ├── __init__.py
│       ├── _version.py
│       └── example.py
└── tests
    └── test_example.py

4 directories, 7 files

$ pip install -e .
$ pytest tests

Types of testing

Kinds of tests

unit test: interrogate individual functions and methods
integration test: verify that multiple pieces of the code work together
regression test: confirm that results match prior code results (“blessed” results, assumed correct)

Developers warn against having unit tests without any integration tests…

Testing numbers

When testing using integers, use assert a == b
When testing using floating-point numbers, use a == pytest.approx(b) or np.allclose(a,b), for both scalars and arrays/lists
Both approx() and allclose allow you to specify relative and absolute tolerances

Test generators (handy)

import numpy as np
import pytest

# contents of add.py
def add2(x, y):
    return x + y

class Test(object):
@pytest.mark.parametrize('exp, x, y', [
    (4, 2, 2),
    (5, -5, 10),
    (42, 40, 2),
    (-128, 0, -128),
])
def test_add2(self, x, y, exp):
    obs = add2(x, y)
    assert pytest.approx(obs) == exp

Types of unit tests

Interior tests: precise points/values do not matter
edge test: test examines beginning or end of a range
Best practice: test all edges and at least one interior point.

Corner cases

Two or more edge cases combined:

import numpy as np

def sinc2d(x, y):
    '''(Describe the function here)'''
    if x == 0.0 and y == 0.0:
        return 1.0
    elif x == 0.0:
        return np.sin(y) / y
    elif y == 0.0:
        return np.sin(x) / x
    else:
        return (np.sin(x) / x) * (np.sin(y) / y)

Corner cases

import numpy as np
from pytest import approx
from mod import sinc2d

def test_internal():
    exp = (2.0 / np.pi) * (-2.0 / (3.0 * np.pi))
    obs = sinc2d(np.pi / 2.0, 3.0 * np.pi / 2.0)
    assert exp == approx(obs)
def test_edge_x():
    exp = (-2.0 / (3.0 * np.pi))
    obs = sinc2d(0.0, 3.0 * np.pi / 2.0)
    assert exp == approx(obs)
def test_edge_y():
    exp = (2.0 / np.pi)
    obs = sinc2d(np.pi / 2.0, 0.0)
    assert exp == approx(obs)
def test_corner():
    exp = 1.0
    obs = sinc2d(0.0, 0.0)
    assert exp == approx(obs)

Test-Driven Development (TDD)

Concept: Write the tests first.
Before you write any lines of a function, first write the test for that function.

Initial version

from pytest import approx
from mod import std

def test_std1():
    obs = std([0.0, 2.0])
    exp = 1.0
    assert exp == approx(obs)

def std(vals):
    # this must be cheating.
    return 1.0

Improved version

from pytest import approx
from mod import std

def test_std1():
    obs = std([0.0, 2.0])
    exp = 1.0
    assert exp == approx(obs)

def test_std2():
    obs = std()
    exp = 0.0
    assert exp == approx(obs)

def test_std3():
    obs = std([0.0, 4.0])
    exp = 2.0
    assert exp == approx(obs)

def std(vals):
    # a bit better, but still not quite generic
    if len(vals) == 0:
        return 0.0
    return vals[-1] / 2.0

Final version

from pytest import approx
from mod import std

def test_std1():
    obs = std([0.0, 2.0])
    exp = 1.0
    assert exp == approx(obs)

def test_std2():
    obs = std()
    exp = 0.0
    assert exp == approx(obs)

def test_std3():
    obs = std([0.0, 4.0])
    exp = 2.0
    assert exp == approx(obs)

def test_std4():
    obs = std([1.0, 1.0, 1.0])
    exp = 0.0
    assert exp == approx(obs)

def std(vals):
    # finally some math
    n = len(vals)
    if n == 0:
        return 0.0
    mu = sum(vals) / n
    var = 0.0
    for val in vals:
        var = var + (val - mu)**2
    return (var / n)**0.5

Summary: TDD

Can be a helpful concept for writing code, when you know what you want your code to do.
Good idea to write tests early and often

Test coverage

Meaning: percentage of code for which a test exists, determined by number of line executed during tests.

`pytest-cov`

Instructions:

Install pytest-cov using pip install pytest-cov
pytest -vv --cov./
Look at your coverage; are you at or near 100%?
Get more detailed information by having it create a report: pytest -vv --cov=./ --cov-report html

Test coverage example

# content of test_sample.py
def inc(x):
    if x < 0:
        return x - 1
    return x + 1

def test_answer():
    assert inc(3) == 4

$ pytest -vv test_sample.py --cov=./

test_sample.py::test_answer PASSED

---------- coverage: platform darwin, python 3.13.1-final-0 -----------
Name      Stmts   Miss  Cover
-----------------------------
test_sample.py       6      1    83%

Increasing coverage

# content of test_sample.py
def inc(x):
    if x < 0:
        return x - 1
    return x + 1

def test_answer():
    assert inc(3) == 4

def test_answer_negative():
    assert inc(-2) == -3

$ pytest -vv test_sample.py --cov=./

test_sample.py::test_answer PASSED
test_sample.py::test_answer_negative PASSED

---------- coverage: platform darwin, python 3.13.1-final-0 -----------
Name      Stmts   Miss  Cover
-----------------------------
test_sample.py       8      0   100%

Test coverage overview/takeaways

Work towards 100% coverage, but don’t obsess over it.

Use coverage to help identiy missing edge/corner cases.

Continuous integration (CI)

Ensure all changes to your project pass tests through automated test & build process.

Importance of testing + CI for research software

10.1021/acs.orglett.9b03216

Recent paper found that calculating nuclear magnetic resonance chemical shifts on different operating systems lead to different outputs!

Using Python scripts published in Nature Protocols in 2014, cited over 130 times.

Error due to differences in file sorting across platforms…

Could have easily been caught with a CI system that tested on multiple platforms.

GitHub Actions

GitHub now provides a free CI system called GitHub Actions

$ mkdir -p .github/workflows
$ touch .github/workflows/run-tests.yml

name: Python package

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]

jobs:
  build:
    strategy:
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
        python-version: ["3.9", "3.10", "3.11"]

    runs-on: ${{ matrix.os }}
    steps:
    - uses: actions/checkout@v4
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v3
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install .
        pip install flake8 pytest
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
      run: |
        pytest

Fun with CI

You can add badges to your README!

![example workflow](https://github.com/github/docs/actions/workflows/main.yml/badge.svg)

Can also add badge for test coverage: https://github.com/marketplace/actions/coverage-badge

CI can also be used to build packages and deploy to PyPI, build documentation, etc.