Software Friction

In his book On War, Clausewitz defines friction as the difference between military theory and reality:

Thus, then, in strategy everything is very simple, but not on that account very easy. Everything is very simple in war, but the simplest thing is difficult. These difficulties accumulate and produce a friction, which no man can imagine exactly who has not seen war.

As an instance of [friction], take the weather. Here, the fog prevents the enemy from being discovered in time, a battery from firing at the right moment, a report from reaching the general; there, the rain prevents a battalion from arriving, another from reaching in right time, because, instead of three, it had to march perhaps eight hours; the cavalry from charging effectively because it is stuck fast in heavy ground.

Ever since reading this, I’ve been seeing “friction” everywhere in software development:

A vendor’s API doesn’t work quite as you thought it did, or it did and then they changed it.
Bugs. Security alerts. A dependency upgrade breaks something.
Someone gets sick. Someone’s kid gets sick. Someone leaves the company. Someone leaves for Burning Man.
The requirements are unclear, or a client changes what they want during development. A client changes what they want after development.
A laptop breaks or gets stolen. Slack goes down for the day.
Tooling breaks. Word changes every font to Wingdings. (This is a real thing.)

This list is non-exhaustive, and it’s not possible to catalogue every source of friction.

Some Properties of Friction

Friction matters more over large time horizons and large scopes, simply because more things can go wrong.
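A back-of-the-envelope sketch makes the "more things can go wrong" point concrete. Suppose a project has some number of steps, and each step independently has a small chance of hitting a setback. Both the independence and the 2% figure are assumptions made up for illustration, not measurements of any real project, and the function name is hypothetical. The chance of at least one setback is then 1 - (1 - p)^n, which climbs quickly as the number of steps grows:

```python
def chance_of_any_setback(p: float, steps: int) -> float:
    """Probability that at least one of `steps` independent steps hits a setback.

    Assumes every step fails independently with the same probability `p`,
    which is a simplification chosen purely for illustration.
    """
    return 1 - (1 - p) ** steps


# Illustrative only: a made-up 2% chance of a setback per step.
for steps in (1, 10, 50, 200):
    print(f"{steps:>3} steps: {chance_of_any_setback(0.02, steps):.0%}")
```

Real setbacks are rarely independent or equally likely, so treat this as intuition for the trend rather than an estimate.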

Friction compounds with itself: two setbacks are more than twice as bad as one setback. This is because most systems are at least somewhat resilient and can adjust themselves around some problem, but that makes the next issue harder to deal with.
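One toy way to see the compounding, assuming resilience looks like a fixed amount of schedule slack that absorbs delays until it runs out. The function, the 3-day setbacks, and the 2 days of slack are all hypothetical numbers chosen for illustration:

```python
def schedule_slip(setback_days: list[int], slack_days: int) -> int:
    """Days of visible delay left over after the slack absorbs what it can."""
    return max(0, sum(setback_days) - slack_days)


# Hypothetical numbers: each setback costs 3 days, the plan has 2 days of slack.
one = schedule_slip([3], slack_days=2)     # slack soaks up most of the first setback
two = schedule_slip([3, 3], slack_days=2)  # slack is already spent when the second hits
print(one, two)
```

In this sketch the first setback costs one visible day and the second costs three more, because the slack that softened the first is no longer there for the second.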

(This is a factor in the controversial idea of “don’t deploy on Fridays”. The friction caused by a mistake during deployment, or of needing to do a rollback, would be made much worse by the friction of people going offline for the weekend. The controversy is between people saying “don’t do this” and people advocating for systemic changes to the process. Either way the goal is to make sure friction doesn’t cause problems; the debate is over how exactly to do this.)

Addressing friction can also create other sources of friction, like if you upgrade a dependency to fix a security alert but the new version is subtly backwards-incompatible. And then if you’re trying to fix this with a teammate who lives in a different timezone…

Addressing Friction

Friction is inevitable and impossible to fully remove. I don’t think it’s even possible to fully anticipate it. But there are things that can be done to reduce it, and plans can be made more resilient to it. I don’t have insight into how military planners reduce friction. This is stuff I’ve seen in software:

Smaller scopes and shorter iterations

This is the justification for “agile” over “waterfall”. When you have short timelines then there’s less room for friction to compound. The more you’re doing and the longer your timeline the more uncertainty you have and the more places things can go wrong. You still have room for friction if you’re doing lots of small sprints back to back, though. Then you’re just running an inefficient marathon.

More autonomy

Friction is the difference between the model and the world, and at a high level you can only see models. If people have enough autonomy to make locally smart decisions, then they can recover from friction more easily. But if people get so much autonomy they isolate, they can make things much worse. I once saw an engineer with a lot of autonomy delete a database that was “too slow”.

Redundancy

This could be spare equipment in storage, high bus factors, or adding slack to a schedule. Then if something goes wrong you can fix it more quickly, leaving less room for another problem to compound. This comes at the cost of efficiency under normal circumstances, which is why projects naturally drift towards less redundancy.

Better planning

Good planning won’t identify all sources of friction, but planning will identify more sources, and that’s a big benefit. For example, writing formal specifications can expose problems in the design or turn unknown-unknowns into known-unknowns (which you can then study in more detail). This can be the difference between being blindsided by 5 things and being blindsided by 15 things. This is why I’m so bullish on formal methods.

Automation

This is a double-edged sword. On one hand, automating processes leaves less room for people to make mistakes. On the other, automated processes can have their own bugs, which create their own sources of friction. Also, if the automation runs long enough people will forget how it works or the full scope of what it does, leaving everyone completely unprepared for when it breaks. Automation can come at the cost of experience.

Experience

The more problems you’ve encountered, the more problems you will see coming, and the more experience you’ll have recovering from problems. Unfortunately this is something you mostly have to learn the hard way. But one shortcut is…

Gaming

One interesting book on this is the Naval War College Fundamentals of War Gaming. In it they argue that there are two purposes to wargaming: gathering information on how situations can develop, and giving commanders (some) experience in a safe environment. If a trainee learns that “weather can disrupt your plan” in a wargame, they don’t have to learn it with real human lives. Similarly, I’d rather practice how to retrieve database backups when I’m not desperately trying to restore a dropped table. To my understanding, both security and operations teams use gaming for this reason.

(At the same time, people have to devote time to running and participating in games, which is inefficient in the same way adding redundancy is.)

Checklists and runbooks

Ways of formalizing tacit knowledge of dealing with particular problems.

Questions I have about friction

Is it useful to subcategorize sources of friction? Does calling a tooling problem “technical” as opposed to “social” friction do anything useful for us?

How do other fields handle friction? I asked some people in the construction industry about friction and they recognized the idea but didn’t have a word for it. What about event planners, nurses, military officers?

How do we find the right balance between “doing X reduces the effect of friction” and “not doing X is more efficient right now”?

Is friction important to individuals? Do I benefit from thinking about friction on a project, even if nobody else on my team does?

Thanks to Jimmy Koppel for feedback. If you liked this post, come join my newsletter! I write new essays there every week.

Update 2024-05-30

I’ve collected some of the comments I received on this post here.