Weak and Strong Fairness

I’m in the process of revising some of my workshops. I realized that one particular important aspect of TLA+, fairness, is something that I’ve struggled to explain well. Then again, everybody struggles to explain it well! It’s also a great example of how messy concurrent systems can get, so I figured it would make a good post for the blog.

Stuttering

Take the following spec of a clock. I used TLA+ but it should mostly be clear from context:

---- MODULE clock ----
EXTENDS TLC, Integers

VARIABLES hour

Init == hour = 12

Next == 
  IF hour = 23
  THEN hour' = 0 \* hour' is "next value of hour"
  ELSE hour' = hour + 1

\* Our system starts at Init
\* And evolves according to Next
Spec == Init /\ [][Next]_hour

Liveness == <>(hour = 15)

====

I have the property <>(hour = 15), a.k.a. eventually the time is 3 PM. Does the spec guarantee that this property holds?

No, it does not! Here’s one way it could fail:

The time ticks from 12 PM to 1 PM
The time ticks from 1 PM to 2 PM
I throw the clock in a lake.

Crazy, right? But the entire idea of specification is to see what assumptions we need to make our system safe. If I do not explicitly say that I cannot throw the clock in a lake, the spec permits me to throw the clock in a lake. The spec also permits timeouts, infinite delays, crashes, anything that is equivalent to the system not making progress.

TLA+ specifications must be stutter-invariant. A stutter is a temporal step where we reassign all of the variables to the same values. To an outside observer, this looks like nothing has happened. Stutter-invariance is what makes composing specs in TLA+ a miserable experience, as opposed to other specification languages, where composing specs is a punishment for sinners in hell.

A system is unfair if an infinite sequence of stutter steps is a valid behavior. This is equivalent to no agents in the system ever making progress. Our clock is unfair, so it’s allowed to stutter an infinite number of times.

<Init>    time = 12

<Next>    time = 13

<Next>    time = 14
<Stutter> time = 14
<Stutter> time = 14
<Stutter> time = 14
...

This is something that most learners encounter relatively late in learning TLA+. Most people get along fine for a long time with just safety properties, which can only be violated by entering a new state.¹ So even if the system infinitely stutters all the safety properties will still hold. However, liveness properties require the system to eventually reach a state and can be blocked by stuttering. In fact, if the entire spec is unfair no liveness properties can hold because the system might never take a step in the first place.²

We usually aren’t interested in unfair systems. The next step up from unfairness is weak fairness. An action is weakly fair if, if it always can happen, then it eventually will happen. If a thread can always make progress, it will make progress. In the language of TLA+ we write it as

Spec == Init /\ [][Next]_hour /\ WF_hour(Next)

If Next is possible, it will eventually happen, which guarantees Liveness. If you want to read about how we formalize fairness, see the appendix.

Fairness on actions

Weak fairness applies to specific actions- descriptions of how the state can change- in the spec, not the whole spec. This is gives us a lot more flexibility. If I instead implemented the spec as

Tick ==
  IF hour = 23
  THEN hour' = 0
  ELSE hour' = hour + 1

Break == 
  hour /= -1 /\ hour' = -1

Next == Tick \/ Break

Spec == Init /\ [][Next]_hour /\ WF_hour(Next)

Next is an action composed of two sub-actions, Tick and Break. Our spec only ensures that the two actions together are fair, not each individually. This means we might never see Break, as the spec could choose Tick every single time. If we instead wrote

Spec == Init /\ [][Next]_hour /\ WF_hour(Break)

Then we’d ensure at least one Break action.³

We get a lot of expressive power from this level of precision. For example, we can declare only a mission-critical subset of our system fair and let peripheral elements stay unfair. Or we make only part of a workflow fair, so that we aren’t guaranteed to start the workflow but are guaranteed to finish it if we do. You can even make actions that aren’t part of Next fair, though that’s a “here be dragons” kind of thing.

Strong fairness

“Weak” fairness suggests the existence of “strong” fairness. Weak fairness says that if an action is always enabled, it eventually happens. But it cannot guarantee that a “cyclically enabled” action will happen.

For this spec, we have a worker doing some abstract job. It can succeed or fail. If it fails, it retries until it succeeds. We make both Succeed and Retry weakly fair and leave Fail unfair.

VARIABLES status

Init == status = "start"

ChangeStatus(from, to) == status = from /\ status' = to

Succeed == ChangeStatus("start", "done")
Fail == ChangeStatus("start", "fail")
Retry == ChangeStatus("fail", "start")

Next == Succeed \/ Fail \/ Retry \/ UNCHANGED status

Fairness ==
  /\ WF_status(Succeed)
  /\ WF_status(Retry)

Spec == Init /\ [][Next]_status /\ Fairness

Liveness == <>(status = "done")

====

Does Liveness hold? It does not! Our fairness clause only says that if Succeed is guaranteed if it is permanently enabled. The problem it’s not permanently enabled. We could have the following error trace:

<Init>    status = "start"

<Fail>*   status = "fail"

<Retry>   status = "start"
<Fail>*   status = "fail"
<Retry>   status = "start"
...

After every step marked *, status /= "start", so Succeed is not enabled. Retry is enabled, and no action at this point can disable it, so it’s guaranteed to happen. Now we’re back with status = "start", and Succeed is enabled again. But then Fail happens and changes status…

Since Succeed keeps flipping between enabled and disabled, weak fairness can’t guarantee it happens. If we want to make sure Succeed happens we need to make it strongly fair. Strong fairness says that if an action isn’t permanently disabled it will eventually happen. Unlike weak fairness the action can be intermittently enabled and is still guaranteed to happen.

Fairness ==
+ /\ SF_status(Succeed)
- /\ WF_status(Succeed)
  /\ WF_status(Retry)

This satisfies Liveness.

Strong fairness is considered less important than weak fairness. Weak fairness is both a safer assumption and more realistically feasible. Even in cases where strong fairness is required to satisfy liveness properties, people sometimes opt to drop the property instead of build in strong fairness. This is why the Microsoft thread scheduler isn’t actually strongly fair.⁴ They cannot guarantee that all threads make progress, but they make it probabilistically near-certain, and in return they avoid lock convoys.

Thoughts

I find this all very intuitive, but that may just be because I’ve worked with it for a long time. I consistently find that beginners struggle with these concepts. Maybe it’s something that’s easier to teach in writing than in the workshop. If so, I hope this made it a little bit clearer to you.

I shared the first draft of this essay on my newsletter. If you like my writing, why not subscribe?

Appendix: Formalizing Fairness

We formally define weak fairness as follows:

WF_v(A) == 
<>[](ENABLED <<A>>_v) => []<><<A>>_v

That’s a whole lot of non-alphanumeric characters. Going through it step by step:

             <<A>>_v

<<A>>_v is the action “A happens and the value of v changes”, which I’ll call an A-change.

    (ENABLED <<A>>_v)

ENABLED is a special operator that, very roughly, is true if the action can happen, even if it doesn’t happen. If we wrote

    (ENABLED <<A>>_v) =>     <<A>>_v

That would say “if an A-change is possible, then it happens.” This would break stutter-invariance, as we couldn’t have the A-change be enabled and then stutter. But if we instead write

    (ENABLED <<A>>_v) =>   <><<A>>_v

Then it’s fine. The <> means the A-change will eventually happen. It’s fine for there to be another stutter in between, or even completely different states, as long as there is a future point where it happens.

    (ENABLED <<A>>_v) => []<><<A>>_v

With just <> we say that the A-change eventually happens but not how often it happens. It’s valid for the A-change to happen just once. Instead we write []<> for “always eventually”, a.k.a. the A-change will happen an infinite number of times.

<>[](ENABLED <<A>>_v) => []<><<A>>_v

Finally, we prepend the condition with “eventually always”: instead of forcing this if the A-change is enabled, we only force it if we reach a point where the A-change is permanently enabled. This gives us weak fairness.

To get strong fairness, all we do is replace the initial <>[] with []<>:

[]<>ENABLED <<A>>_v => []<><<A>>_v

Instead of guaranteeing an A-change if it’s eventually always enabled, we guarantee it if it’s always eventually enabled. Strong fairness is “more powerful” because it always implies weak fairness: Any action that is permanently enabled is also cyclically enabled.

This is actually the definition of a safety property: a property where every counterexample has a finite prefix. If infinite stuttering breaks safety, then some finite prefix of it also breaks safety, removing the need for infinite stuttering. The TLA+ model checker instead divides properties into “invariants” and “temporal properties”. All invariants are safety properties, while temporal properties include both safety and liveness properties. ^[return]
Okay yes, there are some pathological liveness properties that can hold, like <>TRUE. ^[return]
Why does this only ensure at least one Break and not an infinite number of them? After the Break step it’s disabled until the next Tick step. Since neither Next nor Tick are fair, the spec can now stutter indefinitely at hour = -1. Temporal logic gets very subtle! ^[return]
They say of their scheduling system is unfair, but the example they give is a thread repeatedly running, but being blocked. So the system is weakly fair but not strongly fair. ^[return]