Reliability’s Skinny
It’s a well-known fact of systems engineering that the reliability of a linear system is the product of the reliabilities of the system’s components. For example, imagine a hip system with three components.

Each component in this example system has its reliability measured and the values are each determined to be 90%. If you weren’t a systems engineer (like most of us), you’d probably figure the reliability of this entire system is then 90%. That answer, however, isn’t correct: .90 * .90 * .90 is actually .729, or roughly .73. That is, the overall reliability of this system is about 73%.
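The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name `series_reliability` is made up for this example, and it assumes the components fail independently of one another.

```python
import math


def series_reliability(component_reliabilities):
    """Reliability of a linear system: the product of its components' reliabilities.

    Assumes each component fails independently of the others.
    """
    return math.prod(component_reliabilities)


# Three components, each measured at 90% reliability.
overall = series_reliability([0.90, 0.90, 0.90])
print(f"{overall:.3f}")  # prints 0.729
```

Note that `math.prod` needs Python 3.8 or later.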
Ever driven across a bridge that was 73% reliable? If you had a pen that only worked 73% of the time, wouldn’t you throw it out?
We assume most bridges we drive over are 100% reliable and most pens we use are 100% reliable until they run out of ink. To gain that reliability the builders of bridges and makers of pens ensure reliability at the lowest possible level because that’s the only way to ensure the overall reliability.
This principle, by the way, is why in the Golden Age of Disco Japanese car makers began to eclipse US automakers in sales. The reliability of Japanese-made cars was simply much better than that of their US counterparts because their makers realized they had to ensure reliability at the lowest possible level.
Now imagine a software system, which, by the way, is nonlinear (which essentially means you also have to consider the reliability of the interface or connector between objects). Ever worked on a software system with only three components (i.e., objects)? Most software systems have hundreds if not thousands of objects!
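To get a feel for how quickly this compounds, here is a rough sketch. It makes a big simplifying assumption that is not in the post: the objects sit in a simple chain, with one connector between each neighboring pair, and everything fails independently. Real object graphs are messier, so treat the numbers as illustrative only. The function name and the 99% figures are hypothetical.

```python
def chain_reliability(n_objects, object_reliability, connector_reliability=1.0):
    """Reliability of n objects in a chain joined by n - 1 connectors.

    Simplifying assumption: every object and connector fails independently,
    and the system works only if all of them work.
    """
    if n_objects < 1:
        raise ValueError("need at least one object")
    n_connectors = n_objects - 1
    return (object_reliability ** n_objects) * (connector_reliability ** n_connectors)


# Hypothetical figures: every object and every connector is 99% reliable.
for n in (3, 100, 1000):
    print(n, f"{chain_reliability(n, 0.99, 0.99):.6f}")
```

Even with 99% reliable parts, the system-level figure drops off sharply as the object count climbs, and the connectors only make it worse.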
If you wanted to build a software application that had an SLA or QLA of 100% (or close) you’d absolutely have to ensure reliability at the individual object level. In fact, if you can’t ensure and measure reliability at the lowest level, you can’t possibly do that at the system level.
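Turning the product rule around shows how demanding that is. The sketch below asks what reliability each object would need to hit a system-level target, assuming (purely for illustration) that all objects are equally reliable, fail independently, and that connectors are perfect. The function name and the 99.9% target are hypothetical.

```python
def required_component_reliability(system_target, n_components):
    """Per-component reliability needed so that n equal, independent
    components multiply out to the system-level target."""
    if not 0 < system_target <= 1:
        raise ValueError("system_target must be in (0, 1]")
    if n_components < 1:
        raise ValueError("need at least one component")
    # Inverse of the product rule: r ** n == target, so r == target ** (1 / n).
    return system_target ** (1.0 / n_components)


# Hypothetical target: a 99.9% system-level SLA.
for n in (3, 100, 1000):
    print(n, f"{required_component_reliability(0.999, n):.7f}")
```

The per-object requirement is always at least as strict as the system target, and it gets stricter with every object you add.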
Yet here is how we, as an industry, have largely been constructing and delivering software: design it, build it, then throw it over the wall to QA, who test at the system level and inevitably find some number of defects. At some point, we then unleash the system on its customers, who unsurprisingly also find defects, sometimes to the detriment of corporate profits. That’s so establishment!
Bottom line: if we are to build software systems which are truly reliable, we have to ensure reliability at the object level, which can only be achieved through unit testing. Otherwise, we can’t possibly hope to build highly reliable applications.
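As a minimal sketch of what object-level testing looks like, here is a unit test written with Python's standard `unittest` module. The `Account` class is a hypothetical object invented for this example; the point is only that each behavior of a single object is verified in isolation, before the object is wired into a larger system.

```python
import unittest


class Account:
    """Hypothetical object under test (not from any real system)."""

    def __init__(self, balance=0):
        if balance < 0:
            raise ValueError("balance cannot be negative")
        self.balance = balance

    def withdraw(self, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount
        return self.balance


class AccountTest(unittest.TestCase):
    def test_withdraw_reduces_balance(self):
        self.assertEqual(Account(100).withdraw(30), 70)

    def test_overdraw_is_rejected_and_balance_unchanged(self):
        account = Account(100)
        with self.assertRaises(ValueError):
            account.withdraw(101)
        self.assertEqual(account.balance, 100)

    def test_non_positive_amount_is_rejected(self):
        with self.assertRaises(ValueError):
            Account(100).withdraw(0)
```

Saved in a file, the tests can be run with `python -m unittest` followed by the file name.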
Sunday 05 Feb 2006 | Developer Testing
Good points, Andy. How about writing a few posts on some techniques you recommend for increasing reliability? One problem, for instance, is the definition of reliability. For a bridge (which arguably has a singular purpose) or a pen (same argument), reliability is achieved if it meets its objectives.
One problem I’ve seen in software is that we don’t really know what the purpose of the software is (agreement on highly detailed specifications and requirements) and therefore can’t really assure reliability of an entire system even if the components are considered reliable.
Working in regulated environments like medical devices, I have learned that the biggest impediment to reliability is to define the specific use of the device and test the heck out of it for a small set of requirements. Once it grows beyond a particular suitability range (large requirements) and the system grows complex, reliability becomes quite difficult.
Check out this article for a group of guys focusing on high-integrity software design:
http://www.healthcareguy.com/index.php/archives/165
[...] My friend Andy Glover (finally!) started his blog and he’s got a great first article on software reliability. He’s already a published author and writes for online magazines but as an expert in software quality it’s great to have him write regularly on his blog so that time between articles is reduced. [...]
[...] At its core, a developer test verifies a portion of code in an isolated manner that can’t be achieved through later-cycle, neat-o functional-style testing. From a reliability standpoint, in linear systems (which are substantially less complex than software systems, which are non-linear), “the product of the reliability of each of the system’s components” equals the reliability of the overall system. Moreover: “If you wanted to build a software application that had an SLA or QLA of 100% (or close) you’d absolutely have to ensure reliability at the individual object level. In fact, if you can’t ensure and measure reliability at the lowest level, you can’t possibly do that at the system level.” [...]