Hypothesis Testing with Oracle Functions • Hillel Wayne

This post is about the Hypothesis testing library. If you’re unfamiliar with it, check it out! Property tests are a fantastic addition to your testing suite. All examples use pytest.

Imagine we’ve written a bubblesort:

def bubblesort(l: List[int]) -> List[int]:
    # blah blah blah

What are some of the properties we could test? We could check that the input has the same length as the output:

from hypothesis import given
from hypothesis.strategies import lists, integers 

@given(lists(integers()))
def test_same_length(l):
    assert len(bubblesort(l)) == len(l)

Or we could test that the sort is idempotent:

@given(lists(integers()))
def test_idempotent(l):
    assert bubblesort(bubblesort(l)) == bubblesort(l)

But neither test really gets at what we want: confirming the list is actually sorted. Fortunately, though, there’s a simple hypothesis test for that:

@given(lists(integers()))
def test_sorts(l):
    assert bubblesort(l) == sorted(l)

We just need to check that the output of the function matches the output of a sorting function! In this case, sorted is an oracle, something we know is correct and can compare our solution to. This may look like cheating (“if we already have a sorting function, why would we be writing another?”), but this can be a surprisingly powerful testing strategy. Here are some common use-cases:

Refactoring with Oracles

One use case for an oracle is when you refactor an existing function: you want to clean up or optimize something you’re confident is correct and want to make sure your changes don’t break anything. Let’s take a toy example:

def is_prime(x: int) -> bool:
    if x < 2:
        return False
    for i in range(2, x):
        if x % i == 0:
            return False
    return True

A common optimization we can do is only check up to the square root of the number:

from math import ceil, sqrt

def new_is_prime(x: int) -> bool:
    # blah blah blah
    for i in range(2, ceil(sqrt(x))):
    # sup

This seems like it works for simple test cases, but if we compare it to our original function, we find an error:

@given(integers())
def test_matches_oracle(x):
    assert is_prime(x) == new_is_prime(x)

This error comes from how range works: list(range(2, 2)) == [], so our function fails for 4. We found this by using our original version as an oracle against the refactored version.

Special-Case Oracles

Sometimes we know our function works in specific cases but does not generalize. For example, the following function ostensibly takes a list and returns the highest possible product of three elements:

# We'll assume that all inputs will have more than 3 elements
# So no need for a bounds check
def buggy_max_triplet_prod(l: List[int]) -> int:
    a, b, c = sorted(l, reverse=True)[0:3]
    return a * b * c

This is correct when all of the elements are positive. However, it fails for [-4, -3, 1, 2], returning -6 instead of 12. We can still use this as an oracle, though. Since we know it’s correct for certain lists, our final function must match its output for those lists.

@given(lists(integers(min_value=0), min_size=3)) # only positive numbers
def test_matches_on_positive(l):
    assert max_triplet_prod(l) == buggy_max_triplet_prod(l)

Over our restricted domain, the broken function is correct, and can be used as an oracle.

Inefficient Oracles

Sometimes there’s a solution to a problem that we know for sure works but is too slow/memory-intensive/etc for production code. We can use that solution as an oracle to test a more sophisticated function. If you wanted to write a function to insert into a sorted array, you could use the following as a reference oracle:

def ordered_insert_oracle(new_element: int, ordered_list: List[int]) -> List[int]:
    return sorted(ordered_list + [new_element])

@given(lists(integers), integers())
def test_ordered_insert(l, x):
    assert ordered_insert(l, x) == ordered_insert_oracle(l, x)

The oracle will be O(n*log(n)), while a well written ordered insert will be O(n). That might matter in production, but it won’t matter too much in testing, so we can benefit from an oracle here.

Reverse Oracles

Sometimes we can go the other way: instead of writing an oracle that will always get the right answer, we can write a function that takes an answer and finds a question for it. This gets a bit complicated, so let’s walk through an example. We’re writing our own counter.

from random import shuffle

@given(lists(tuples(integers(), integers(min_value=1, max_value=10)), unique_by=lambda x: x[0]))
def test_counter(l):
    test_list = []
    for val, times in l:
        test_list += [val] * times
    shuffle(test_list)
    assert count(test_list) == {val:times for val, times in l}

Here, we’re taking a possible answer and generating a list that matches that. For example, if our test generates l = [(5, 2), (1, 1)], that means we want our function to return {5: 2, 1: 1}. So we find a list that, if count is written correctly, would match that output. Our implementation is to first create the list [5, 5, 1] and then shuffle it.

This test ends up being more complicated than the others because it’s easier for the test itself to have bugs. Note the unique_by; that’s because our test will bug if we generate l = [(1, 1), (1, 1)]: we’ll assert {1: 2} == {1: 1}.

Caveats

Most of these caveats apply to property-based tests (PBTs) in general.

You’ll want to write some unit tests to sanity-check your oracles. This throws you down a rabbit hole (“I have to test my tests?!”), but you get much wider test coverage with PBTs in return for the extra complexity.
Sometimes coming up with the oracles can be as or more difficult than writing the final functions. As always, writing good code involves tradeoffs.
The further you get from pure functions the harder writing PBTs gets. Reverse oracles help a bit here.
When writing bullet points like this I’m always tempted to add something completely unrelated. Here’s something I wrote about how much I like dogs.
Don’t try to be overly clever and use oracles in cases where more basic PBTs (or even unit tests) suffice.
The benefits PBT provides are orthogonal to the benefits you get from unit tests, type systems, formal methods, and getting a full night’s sleep. Ideally you’d want to have all five.
Ruby doesn’t have a good PBT library and that makes me sad.