What We Know We Don't Know
NOTE: There is a newer version of this talk.
Official Description: There are many things in software we believe are true but very little we know. Maybe testing reduces bugs, or maybe it’s just superstition. If we want to improve our craft, we need a way to distinguish fact from fallacy. We need to look for evidence, placing our trust in the hard data over our opinions.
Empirical Software Engineering is the study of what actually works in programming. Instead of trusting our instincts we collect data, run studies, and peer-review our results. This talk is all about how we empirically find the facts in software and some of the challenges we face, with a particular focus on software defects and productivity.
Actual Description: Nothing is real, we don’t understand what we’re doing, and the only way to write good software is to stop drinking coffee. Burn it all down. Burn it to the ground.
Talk is here, slides are here.
Sources
I referenced a bunch of papers in my talk. These are links so you can read them yourself:
Intro
- Small Functions Have Problems
- Tech and GDP
- Big Data is slower than laptops
- Programmers make the same mistakes / N-version programming fails
An Experimental Evaluation of the Assumption of Independence in Multi-Version Programming
A Reply to the Criticisms of the Knight and Leveson Experiment
Methods
- Abbreviated Code is readable
- Fixing Faults in C and Java Source Code: Abbreviated vs. Full-Word Identifier Names (preprint)
- Code Smells
- On the diffuseness and the impact on maintainability of code smells: a large scale empirical investigation
- Programming Language Effects
A Large Scale Study of Programming Languages and Code Quality in Github (original faulty study)
On the Impact of Programming Languages on Code Quality (replication study)
Software Defects
- Parachute Studies
- Parachute use to prevent death and major trauma related to gravitational challenge: systematic review of randomised controlled trials
- Defect detection in code
- “Beyond Lines of Code: Do we Need More Complexity Metrics?”, Herraiz & Hassan, Making Software (ch 8)
- Conway’s Law
- “Conway’s Corollary”, Making Software (ch 11)
- Defect detection in organizations
- The Influence of Organizational Structure On Software Quality: An Empirical Case Study
- Worst bugs are design bugs
- “Where Do Most Software Flaws Come From?” Dewayne Perry. Making Software (ch 25)
- Testing
- Simple Testing Can Prevent Most Critical Failures
- Test-Driven Development
Realizing quality improvement through test driven development (positive results)
Analyzing the Effects of Test Driven Development in GitHub (negative results)
- Code Review
Best Kept Secrets of Peer Code Review
An Empirical Study of the Impact of Modern Code Review Practices on Software Quality
- Sleep
Impact of a Night of Sleep Deprivation on Novice Developers’ Performance
- Overwork
- Stress
Additional Sources
- Recommended Reading
- Free research
FAQ
These are questions I got after the talk. I will add more as people ask them.
I’ve often heard pairing referred to as “continuous code review”. How do we then reconcile the fact that code review has a detectable positive impact, while pairing hasn’t?
There is some evidence that pair programming is helpful; see the work of Laurie Williams. But there’s more evidence that code review reduces bugs, and that evidence shows a bigger effect. Pairing is nice, but it’s not “continuous code review.”
Have you tried ssrn for research papers?
Nope!
Questions on Data
For the questions where people asked me what the research says about X, I decided to instead talk about how I would go about researching it. My goal is to get people interested in researching what they need for themselves.
Has there been any research into the benefit of formal QA with a separate test team?
This is a tricky one. We’d need to figure out the terms people use for a QA team. I suspect it wouldn’t be found in software engineering research, which focuses primarily on the programmers. There might be some, but I suspect there’s a lot more in research on project management and process engineering. So that’s where I’d start.
Alternatively, we can look at the effectiveness of practices that use a separate, formal QA team. One such process is Clean Room Engineering, which claims to have a very low defect rate. But you’d need to find some way of teasing out how much each aspect of CRE contributes to the whole.
How can we measure the happiness at work generated from good readability, code that’s easy to change, and code covered by tests?
With a survey. You can usually trust people to understand their own emotions well enough, assuming everybody has enough trust in each other.
If I were running a study on this, I’d probably start with a case study of organizations that improved their tooling, and compare the before and after.
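As a rough illustration of what that before/after comparison could look like, here’s a minimal sketch in Python. It assumes Likert-scale (1–5) “happiness at work” responses collected from the same developers before and after a tooling change; the numbers are made up, and a Wilcoxon signed-rank test is just one reasonable choice for paired ordinal data, not the only one.

```python
# Minimal sketch of a before/after survey comparison.
# Assumes 1-5 Likert responses from the same developers before and after
# a tooling change. The data below is invented for illustration only.
from scipy import stats

before = [3, 2, 4, 3, 2, 3, 4, 2, 3, 3]  # hypothetical pre-change scores
after  = [4, 3, 4, 4, 3, 3, 5, 3, 4, 3]  # hypothetical post-change scores

# Wilcoxon signed-rank test: a paired, non-parametric test, which is a
# reasonable default for ordinal survey data (it drops ties by default).
statistic, p_value = stats.wilcoxon(before, after)
print(f"W = {statistic}, p = {p_value:.3f}")
```

A real study would of course need more participants, a control for other changes happening at the same time, and ideally a qualitative follow-up; the point is only that the analysis itself is not the hard part.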
In your opinion, what’s the best metric for software quality?
¯\_(ツ)_/¯
Do you know of any studies about the effectiveness of end-to-end testing?
Is there any evidence about different levels of testing (Unit, Integration, etc.) and how much to focus on each?
Barry Boehm and Capers Jones have done research on this, but there are reasons to be skeptical of their findings. There’s other stuff out there, too, but there are two challenges you should be aware of:
- Nobody’s ever consistent with their terms. One paper might use unit test, integration test, and acceptance test, while another might use structural test, functional test, and end-to-end test. Researchers aren’t much better than programmers here.
- Modern testing frameworks only appeared in the mid-90’s, or at least only started going mainstream around then. Earlier papers probably aren’t using them, and this may change their conclusions.
Can we do useful empirical studies of our own processes?
Yes! Making Software covers this with the fantastic chapter “Mining Your Own Evidence”, by Kim Herzig and Andreas Zeller. Not only do they talk about how we can study our own processes, they link a lot of other relevant sources, too. A great technique when researching is to read a paper and see what it cites, as well as what other papers cite the same things. Something like Semantic Scholar really helps here.
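If you want to automate that kind of citation chasing, here’s a minimal sketch against the Semantic Scholar Graph API (the /paper/search, /references, and /citations endpoints). The endpoint names and response fields (citedPaper, citingPaper) are my reading of the public API docs, so check the current documentation before relying on them; unauthenticated requests are rate-limited.

```python
# Minimal sketch of citation chasing with the Semantic Scholar Graph API.
# Assumes the /paper/search, /paper/{id}/references and /paper/{id}/citations
# endpoints; verify against the current API docs before using seriously.
import requests

BASE = "https://api.semanticscholar.org/graph/v1"

def find_paper_id(title_query):
    """Look up a paper by title search and return the top hit's ID, if any."""
    resp = requests.get(
        f"{BASE}/paper/search",
        params={"query": title_query, "fields": "title,year", "limit": 1},
    )
    resp.raise_for_status()
    results = resp.json().get("data", [])
    return results[0]["paperId"] if results else None

def related_papers(paper_id):
    """Return what this paper cites (references) and what cites it (citations)."""
    related = {}
    for direction in ("references", "citations"):
        resp = requests.get(
            f"{BASE}/paper/{paper_id}/{direction}",
            params={"fields": "title,year", "limit": 20},
        )
        resp.raise_for_status()
        key = "citedPaper" if direction == "references" else "citingPaper"
        related[direction] = [item[key] for item in resp.json().get("data", [])]
    return related

# Example: chase citations from the Knight & Leveson n-version paper.
pid = find_paper_id("Experimental Evaluation of the Assumption of Independence in Multiversion Programming")
if pid:
    for direction, papers in related_papers(pid).items():
        print(direction, [p.get("title") for p in papers])
```

Walking one or two hops of references and citations like this is usually enough to find the small cluster of papers everyone in a subfield is responding to.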