What We Know We Don't Know
NOTE: There is a newer version of this talk.
Official Description: There are many things in software we believe are true but very little we know. Maybe testing reduces bugs, or maybe it’s just superstition. If we want to improve our craft, we need a way to distinguish fact from fallacy. We need to look for evidence, placing our trust in the hard data over our opinions.
Empirical Software Engineering is the study of what actually works in programming. Instead of trusting our instincts we collect data, run studies, and peer-review our results. This talk is all about how we empirically find the facts in software and some of the challenges we face, with a particular focus on software defects and productivity.
Actual Description: Nothing is real, we don’t understand what we’re doing, and the only way to write good software is to stop drinking coffee. Burn it all down. Burn it to the ground.
Talk is here, slides are here.
Sources
I referenced a bunch of papers in my talk. These are links so you can read them yourself:
Intro
- Small Functions Have Problems
- Tech and GDP
- Big Data is slower than laptops
- Programmers make the same mistakes / N-version programming fails
An Experimental Evaluation of the Assumption of Independence in Multi-Version Programming
A Reply to the Criticisms of the Knight and Leveson Experiment
Methods
- Abbreviated Code is readable
- Fixing Faults in C and Java Source Code: Abbreviated vs. Full-Word Identifier Names (preprint)
- Code Smells
- On the diffuseness and the impact on maintainability of code smells: a large scale empirical investigation
- Programming Language Effects
A Large Scale Study of Programming Languages and Code Quality in Github (original faulty study)
On the Impact of Programming Languages on Code Quality (replication study)
Software Defects
- Parachute Studies
- Parachute use to prevent death and major trauma related to gravitational challenge: systematic review of randomised controlled trials
- Defect detection in code
- “Beyond Lines of Code: Do we Need More Complexity Metrics?”, Herraiz & Hassan, Making Software (ch 8)
- Conway’s Law
- “Conway’s Corollary”, Making Software (ch 11)
- Defect detection in organizations
- The Influence of Organizational Structure On Software Quality: An Empirical Case Study
- Worst bugs are design bugs
- “Where Do Most Software Flaws Come From?” Dewayne Perry. Making Software (ch 25)
- Testing
- Simple Testing Can Prevent Most Critical Failures
- Test-Driven Development
Realizing quality improvement through test driven development (positive results)
Analyzing the Effects of Test Driven Development in GitHub (negative results)
- Code Review
Best Kept Secrets of Peer Code Review
An Empirical Study of the Impact of Modern Code Review Practices on Software Quality
- Sleep
Impact of a Night of Sleep Deprivation on Novice Developers’ Performance
- Overwork
- Stress
Additional Sources
- Recommended Reading
- Free research
FAQ
These are questions I got after the talk. I will add more as people ask them.
I’ve often heard pairing referred to as “continuous code review”. How do we then reconcile the fact that code review has a detectable positive impact, while pairing hasn’t?
There is some evidence that pair programming is helpful; see the work of Laurie Williams. But there’s more evidence that code review reduces bugs, and that evidence shows a bigger effect. Pairing is nice, but it’s not “continuous code review.”
Have you tried ssrn for research papers?
Nope!
Questions on Data
For the questions where people asked me what the research says about X, I decided to instead talk about how I would go about researching it. My goal is to get people interested in researching what they need for themselves.
Has there been any research into the benefit of formal QA with a separate test team?
This is a tricky one. We’d need to figure out the terms people use for a QA team. I suspect it wouldn’t be found in software engineering research, which focuses primarily on the programmers. There might be some, but I suspect there’s a lot more in research on project management and process engineering. So that’s where I’d start.
Alternatively, we can look at the effectiveness of practices that use a separate, formal QA team. One such process is Clean Room Engineering, which claims to have a very low defect rate. But you’d need to find some way of teasing out how much each aspect of CRE contributes to the whole.
How can we measure the happiness at work generated from good readability, code that’s easy to change, and code covered by tests?
With a survey. You can usually trust people to understand their own emotions well enough, assuming everybody has enough trust in each other.
If I were running a study on this, I’d probably start with a case study of organizations that improved their tooling, and compare the before and after.
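As a rough illustration of what that before/after comparison could look like, here’s a minimal sketch in Python. It assumes Likert-scale (1–5) “happiness at work” responses collected from the same developers before and after a tooling change; the numbers are made up, and a Wilcoxon signed-rank test is just one reasonable choice for paired ordinal data, not the only one.

```python
# Minimal sketch of a before/after survey comparison.
# Assumes 1-5 Likert responses from the same developers before and after
# a tooling change. The data below is invented for illustration only.
from scipy import stats

before = [3, 2, 4, 3, 2, 3, 4, 2, 3, 3]  # hypothetical pre-change scores
after  = [4, 3, 4, 4, 3, 3, 5, 3, 4, 3]  # hypothetical post-change scores

# Wilcoxon signed-rank test: a paired, non-parametric test, which is a
# reasonable default for ordinal survey data (it drops ties by default).
statistic, p_value = stats.wilcoxon(before, after)
print(f"W = {statistic}, p = {p_value:.3f}")
```

A real study would of course need more participants, a control for other changes happening at the same time, and ideally a qualitative follow-up; the point is only that the analysis itself is not the hard part.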
In your opinion, what’s the best metric for software quality?
¯\_(ツ)_/¯
Do you know of any studies about the effectiveness of end-to-end testing?
Is there any evidence about different levels of testing (Unit, Integration, etc.) and how much to focus on each?
Barry Boehm and Capers Jones have done research on this, but there are reasons to be skeptical of their findings. There’s other stuff out there, too, but there are two challenges you should be aware of:
- Nobody’s ever consistent with their terms. One paper might use unit test, integration test, and acceptance test, while another might use structural test, functional test, and end-to-end test. Researchers aren’t much better than programmers here.
- Modern testing frameworks only appeared in the mid-90’s, or at least only started going mainstream around then. Earlier papers probably aren’t using them, and this may change their conclusions.
Can we do useful empirical studies of our own processes?
Yes! Making Software covers this with the fantastic chapter “Mining Your Own Evidence”, by Kim Herzig and Andreas Zeller. Not only do they talk about how we can study our own processes, they link a lot of other relevant sources, too. A great technique when researching is to read a paper and see what it cites, as well as what other papers cite the same things. Something like Semantic Scholar really helps here.
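If you want to automate that kind of citation chasing, here’s a minimal sketch against the Semantic Scholar Graph API (the /paper/search, /references, and /citations endpoints). The endpoint names and response fields (citedPaper, citingPaper) are my reading of the public API docs, so check the current documentation before relying on them; unauthenticated requests are rate-limited.

```python
# Minimal sketch of citation chasing with the Semantic Scholar Graph API.
# Assumes the /paper/search, /paper/{id}/references and /paper/{id}/citations
# endpoints; verify against the current API docs before using seriously.
import requests

BASE = "https://api.semanticscholar.org/graph/v1"

def find_paper_id(title_query):
    """Look up a paper by title search and return the top hit's ID, if any."""
    resp = requests.get(
        f"{BASE}/paper/search",
        params={"query": title_query, "fields": "title,year", "limit": 1},
    )
    resp.raise_for_status()
    results = resp.json().get("data", [])
    return results[0]["paperId"] if results else None

def related_papers(paper_id):
    """Return what this paper cites (references) and what cites it (citations)."""
    related = {}
    for direction in ("references", "citations"):
        resp = requests.get(
            f"{BASE}/paper/{paper_id}/{direction}",
            params={"fields": "title,year", "limit": 20},
        )
        resp.raise_for_status()
        key = "citedPaper" if direction == "references" else "citingPaper"
        related[direction] = [item[key] for item in resp.json().get("data", [])]
    return related

# Example: chase citations from the Knight & Leveson n-version paper.
pid = find_paper_id("Experimental Evaluation of the Assumption of Independence in Multiversion Programming")
if pid:
    for direction, papers in related_papers(pid).items():
        print(direction, [p.get("title") for p in papers])
```

Walking one or two hops of references and citations like this is usually enough to find the small cluster of papers everyone in a subfield is responding to.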