Cross-Branch Testing

This was originally a newsletter post I wrote back in December, but I kept coming back to the idea and wanted to make it easier to find. I’ve made some small edits for flow.

There’s a certain class of problems that’s hard to test:

  1. The output isn’t obviously inferable from the input. The code isn’t just solving a human-tractable problem really quickly; it’s doing something where we don’t know the answer without running the program.
  2. The output doesn’t have “mathematical” properties, like invertibility or commutativity.
  3. Errors in the function can be “subtle”: a bug might affect only a small subset of possible inputs, so a handful of hand-picked test examples would all still pass.

Some examples of this: computer vision code, simulations, lots of business logic, most “data pipelines”. I was struggling to find a good semantic term for this class of problems, then gave up and started mentally calling them “rho problems”. There’s no meaning behind the name, just the first word that popped into my head.

Rho problems are complex enough that providing a realistic example would take too long to set up and explain. Instead, here’s a purely contrived one.

def f(x, y, z):
  out = 0
  for i in range(10):
    out = out * x + abs(y*z - i**2)
    x, y, z = y+1, z, x
  return abs(out)%100 < 10

Not code you’d see in the real world, but it has all the properties of a rho problem. Imagine this is the code in our deploy branch, and we try to refactor it on a new branch. The refactor introduces a subtle bug that affects only about 1% of possible inputs (the two versions disagree exactly when abs(out) % 100 == 9), which we’ll simulate with this change:

def f(x, y, z):
  out = 0
  for i in range(10):
    out = out * x + abs(y*z - i**2)
    x, y, z = y+1, z, x
- return abs(out)%100 < 10
+ return abs(out)%100 < 9

What kind of test would detect this error? Standard options are:

  • Unit testing. The problem is that if the error is uniformly distributed, each unit test only has a 1%-ish chance of finding the bug; we’d need to test a lot more inputs (see the sketch below).
  • Property testing. Doesn’t work because we don’t have easily inferable mathematical properties.
  • Metamorphic testing. There’s no obvious relation between the outputs of related inputs: knowing the value of f(1, 1, 1) tells us little about f(1, 1, 2).

(Yes, I know we could try decomposing the function and testing the parts individually. That’s not the point: we’re trying to find general principles here regardless of the problem.)
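To make the unit-testing point concrete, here’s a hypothetical example-based test for f (the expected values are just what the deploy branch returns for these inputs). All three hand-picked cases land far from the abs(out) % 100 == 9 boundary, so the test passes on the deploy branch and on the buggy refactor alike:

def test_f_examples():
    # Hand-picked inputs; both versions of f return False for all of them,
    # so this suite stays green even though the refactor is wrong for ~1%
    # of inputs.
    assert f(1, 1, 1) is False
    assert f(0, 0, 0) is False
    assert f(2, 3, 4) is False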

But we do have a “correct” reference point: the deploy branch! We can use property testing where the property is that our refactored code always gives the same output as our regular code.1 Step one, use git worktree:

git worktree add ref deploy

That creates a new ref folder in our directory containing a checkout of our deploy branch. If you modify code in the ref folder and commit, the commit goes to deploy, not the current branch. If our code lives in foo.py, the deploy version is ref/foo.py: our function is foo.f and the deploy version is ref.f. We can write a property test (using hypothesis and pytest):

from hypothesis import given
from hypothesis.strategies import integers

import ref.foo as ref
import foo


# The property: the refactored f and the deploy-branch f agree on every input.
@given(integers(), integers(), integers())
def test_f(a, b, c):
    assert ref.f(a, b, c) == foo.f(a, b, c)

Running this gives us an erroneous input:

Falsifying example: test_f(
    a=5, b=6, c=5,
)
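To convince yourself the counterexample is real, you can replay the deploy-branch loop for (5, 6, 5) by hand or in a REPL. The two return conditions disagree here precisely because abs(out) % 100 comes out to exactly 9:

# Replaying the body of the deploy-branch f on the falsifying example.
out = 0
x, y, z = 5, 6, 5
for i in range(10):
    out = out * x + abs(y*z - i**2)
    x, y, z = y+1, z, x

print(abs(out) % 100)  # 9: passes `< 10` but fails `< 9`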

The falsifying example itself is nondeterministic; I’ve also gotten (0, 4, 0) and (-114, -114, -26354) as error cases.2 Regardless, I still think it works as a proof of concept: we’ve found an input where the original code and the refactor diverge. The main blocker I see is that version control systems aren’t designed for this kind of use case, but it seems close enough that a sufficiently motivated person could pull it off.

There’s a few other newsletter pieces I’m turning into blog posts, but they’re all undergoing heavy revisions. I expect this will be the only piece I transfer (mostly) as-is.3 Anyway, I update the newsletter a little over once a week, and most of them will remain exclusive to the newsletter, so if you enjoyed this, why not subscribe? You can also check out the archives here.


  1. This bears a lot of similarity to a technique known as “snapshot testing”, where you compare the output to a result stored in a text file. The difference is that this is generative: we aren’t tied to fixed inputs. This also counts as metamorphic testing. Instead of our relation being between inputs, it’s between versions of the functions. Metamorphic testing is pretty neat! [return]
  2. After originally putting this on the newsletter, David MacIver, the inventor of Hypothesis, wrote an essay on why reducing this problem is hard. [return]
  3. As a quick data point for what I mean by “as-is”: the last newsletter piece I converted into a blog post was Science Turf War. The newsletter was ~2,000 words; the final version was ~6,000. [return]