QSM: Volume 1.2: Why Software Gets In Trouble

Part IV. Fault Patterns

Chapter 1: Observing and Reasoning About Errors

Summary

One of the reasons organizations have trouble dealing with software errors is the many conceptual errors they make concerning errors.
Some people make errors into a moral issue, losing track of the business justification for the way in which they are handled.
Quality is not the same thing as absence of errors, but the presence of many errors can destroy any other measures of quality in a product.
Organizations that don't handle error very well also don't talk very clearly about error. For instance, they frequently fail to distinguish faults from failures, or use faults to blame people in the organization.
Well-functioning organizations can be recognized by the organized way they use faults and failures as information to control their process. The System Trouble Incident (STI) and the System Fault Analysis (SFA) are the fundamental sources of information about failures and faults.
Error-handling processes come in at least five varieties: detection, location, resolution, prevention, and distribution.
In addition to conceptual errors, there are a number of common observational errors people make about errors, including Selection Fallacies, getting observations backwards, and the Controller Fallacy.

Chapter 2: The Failure Detection Curve

Summary

Failure detection is dominated by the tautology that the easiest failures to detect are the first failures to detect, so that as detection proceeds, the work gets harder, producing a characteristic Failure Detection Curve with a long tail.
The long tail of the Failure Detection Curve is one of the principal reasons managers misestimate failure detection tasks.
Because the Failure Detection Curve represents a natural dynamic, there is nothing we can do to perform better than it says. We can, however, perform much worse, if we're not careful of how we manage the failure detection process.
The Failure Detection Curve is not all bad news. The pattern of detected failures over time can be used as a predictor of the time to reach any specified level of failure detection, as long as nothing is happening to undermine test coverage.
Some of the things that can undermine test coverage are blocking faults, masking faults, and late releases to test.
Late-finishing modules may arise from a cycle of poor coding, which means that they are more likely to be fault-prone modules. Management policies designed to speed testing of late-finishing modules may actually make the problem worse, and may account for much so-called "bad luck" in estimating.
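The long tail of the Failure Detection Curve can be illustrated with a small calculation. The following is only a sketch under stated assumptions, not a model taken from the book: each failure is given a hypothetical fixed chance of being detected in any one test period, and the mix of easy, medium, and hard failures is invented for illustration.

```python
def expected_detection_curve(detect_probs, periods):
    """Expected cumulative number of failures detected after each test period.

    detect_probs holds one hypothetical per-period detection probability
    for each failure present in the system (an assumption of this sketch).
    """
    curve = []
    for t in range(1, periods + 1):
        # A failure with per-period probability p has been detected at
        # least once by period t with probability 1 - (1 - p) ** t.
        curve.append(sum(1 - (1 - p) ** t for p in detect_probs))
    return curve


def periods_to_reach(detect_probs, fraction, max_periods=10_000):
    """First period at which the expected detected share reaches `fraction`."""
    target = fraction * len(detect_probs)
    for t in range(1, max_periods + 1):
        if sum(1 - (1 - p) ** t for p in detect_probs) >= target:
            return t
    return None  # not reached within max_periods


# Invented mix: 50 easy, 30 medium, and 20 hard-to-detect failures.
probs = [0.5] * 50 + [0.1] * 30 + [0.01] * 20

curve = expected_detection_curve(probs, 100)
half = periods_to_reach(probs, 0.5)
ninety = periods_to_reach(probs, 0.9)
print(f"periods to 50%: {half}, periods to 90%: {ninety}")
```

Under these assumptions the easy failures are found almost at once, while the last share of failures takes many times longer, which is the shape that leads to misestimation. The same function can be used as a rough predictor only as long as the per-period probabilities stay stable, that is, as long as nothing undermines test coverage.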

Chapter 3: Locating The Faults Behind The Failures

Summary

System size has a direct effect on the dynamics of fault location, but there are indirect effects as well. We use divide and conquer to beat the Size/Complexity Dynamic, and we also divide the labor to beat delivery time. These efforts, however, lead to a number of indirect effects of system size on fault location time.
You can learn a great deal about an organization's culture by observing how it handles its STIs. In particular, you can learn to what degree its cultural pattern is under stress from increased customer or problem demands.
An important dynamic describes the circulation of STIs: circulation time grows non-linearly as the number of STIs in circulation increases.
Process errors such as losing STIs also increase location time.
Political issues, such as status boundaries, can also contribute non-linearly to extending location time. Management action to reduce circulation time by punishing those who hold STIs can lead to the opposite effect.
In general, poorly controlled handling of STIs leads to an enlarged administrative burden, which in turn leads to even more poorly controlled handling of STIs. When STIs get out of hand, management needs to study what information this gives them about their cultural pattern, then take action to get at the root causes, not merely the symptoms.

Chapter 4: Fault Resolution Dynamics

Summary

Basic fault resolution dynamics are another case of Size/Complexity Dynamics, with more faults and more complexity per fault leading to a non-linear increase in fault resolution time as systems grow larger.
Side effects add more non-linearity to fault resolution. Either we take more time to consider side effects, or we create side effects when we change one thing and inadvertently change another.
The most obvious type of side effect is fault feedback, which can be measured by the Fault Feedback Ratio (FFR). Fault feedback is the creation of faults while resolving other faults. Faults can be either functional or performance faults.
The FFR is a sensitive measure of project control breakdown. In a well-controlled project, FFR should decline as the project approaches its scheduled end.
One way to keep the FFR under control is to institute careful reviewing of fault resolutions, even if they are "only one line of code." The assumption that small changes can't cause trouble leads to small changes causing more trouble than bigger changes.
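As a rough illustration of how the FFR might be tracked, the sketch below assumes the ratio is computed as faults created by fixes divided by faults resolved in the same period. That formula, the function names, and the weekly counts are all assumptions made for the example, not definitions taken from the summary above.

```python
def fault_feedback_ratio(faults_created_by_fixes, faults_resolved):
    """Assumed definition: new faults introduced per fault resolved."""
    if faults_resolved == 0:
        return None  # ratio is undefined when nothing was resolved
    return faults_created_by_fixes / faults_resolved


def ffr_is_declining(ratios):
    """True if each known FFR value is no higher than the one before it."""
    known = [r for r in ratios if r is not None]
    return all(later <= earlier for earlier, later in zip(known, known[1:]))


# Hypothetical weekly counts: (faults created by fixes, faults resolved).
weekly_counts = [(6, 20), (5, 25), (3, 30), (1, 25)]
weekly_ffr = [fault_feedback_ratio(c, r) for c, r in weekly_counts]

print([round(r, 2) for r in weekly_ffr])
print("declining:", ffr_is_declining(weekly_ffr))
```

A series that stops declining, or starts rising, as the scheduled end approaches would be the kind of signal the FFR is meant to expose.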
There are a number of ways in which a system deteriorates besides the addition of faults and performance inefficiencies, and these ways do not show up in ordinary project measurements. For instance, design integrity breaks down, documentation is not kept current, and coding style becomes patchy. All of these lead to a decrease in the system's maintainability.
When the integrity of a modular, or "black box," design breaks down, the system shows a growing "ripple effect" from each change. That is, one change ripples through to cause many other changes.
If we are to avoid deterioration of systems, they must not only be maintained, but their maintainability must also be maintained.
Managers and developers often show overconfidence in the initial design as protection against maintenance difficulties. This kind of overconfidence can easily lead to a Titanic Effect, because the thought that nothing can go wrong with the code exposes the code to all sorts of ways of going wrong.

Part V. Pressure Patterns

Chapter 5: Power, Pressure, and Performance

Summary

The Pressure/Performance Relationship says that added pressure can boost performance for a while, then starts to get no response, then leads to collapse.
Pressure to find the last fault can easily prolong the time to find the last fault, perhaps indefinitely.
The Stress/Control Dynamic explains that we not only respond to the external pressures, but to internal pressures we place on ourselves when we think we are losing control. This dynamic makes the Pressure/Performance Relationship even more non-linear.
Breakdown under pressure comes in many forms. Judgment may be the first thing to go, especially in response to pressure from peers to see things their way.
When people leave a project, either physically or mentally, their departure adds pressure on the remaining people, who are then more likely to leave themselves.
Managers may create a Pile-On Dynamic by choosing to give new assignments only to those people who are already the reigning experts. This adds to their load, and their expertise, which makes it more likely they'll get the next assignment.
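The Pile-On Dynamic is a reinforcing loop, and a few lines of simulation make that visible. This is a minimal sketch with invented names and numbers: it assumes every assignment goes to whoever currently has the most expertise, and that each assignment adds one unit of both load and expertise.

```python
def simulate_pile_on(initial_expertise, assignments):
    """Give each new assignment to the current reigning expert.

    Assumption of this sketch: one assignment adds one unit of load and
    one unit of expertise to the person who receives it.
    """
    expertise = dict(initial_expertise)
    load = {name: 0 for name in expertise}
    for _ in range(assignments):
        # The manager picks the person with the highest expertise so far.
        expert = max(expertise, key=expertise.get)
        load[expert] += 1
        expertise[expert] += 1
    return load, expertise


# Hypothetical team with a small initial difference in expertise.
load, expertise = simulate_pile_on({"ana": 3, "ben": 2, "cho": 2}, 10)
print(load)
print(expertise)
```

Under this selection rule the small initial lead never closes: one person ends up carrying every assignment while the others gain no expertise at all.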
Some people respond to stress with a Panic Reaction, even though the situation is not anything like life-threatening. Such people must not be in high-stress projects, or they will only add to the stress.
Pressure can be managed. It helps if the workers are self-regulating, the managers are empowering, and responsiveness, rather than performance, is used to measure readiness for more pressure.

Chapter 6: Handling Breakdown Pressure

Summary

Software projects commonly break down when the reality of time finally forces them to realize where they actually are. When this happens, however, the symptoms displayed are unique to each project and each individual.
Many symptoms are equivalent to shuffling work around, accomplishing nothing or, even worse, actually sending the project backwards. One such backwards dynamic is the attempt to beat Brooks's Law through splitting tasks among existing workers.
Ineffective priority schemes are common ways of doing nothing. These include setting everything to number one priority, choosing your own priority independent of project priority, or simply doing the easiest task first.
A final way of doing nothing is to circulate "hot potatoes," which are tasks that management counts against you if they are on your desk when "measurement" time comes.
There are a number of ways to observe that managers are actually doing nothing. They may
- be accepting poor quality products
- not be accepting schedule slippage
- be accepting of resource overruns
- be unavailable to their workers
- assert that they have no time to do the project right
A sure sign that a project is breaking down under time pressure is when managers and workers start short-circuiting procedures. This invariably creates a boomerang effect in which the very quality the manager intended to improve is made worse by the short-circuiting action.
The decision to ship poor quality to save time and resources always creates a boomerang effect. Bypassing quality assurance is similar. Both of these tactics lead, among other things, to destruction of the development process, more emergencies and interruptions, and devastation of morale.
When morale deteriorates into project depression, process quality will not be maintained, let alone improved. Trust built before the crisis will help an organization recover more quickly, but attempts to build trust during the crisis will probably backfire, especially if they take the form of saying "Trust me!"
Multiple customers increase the pressure on the boomerang cycle, up to the point that the resultant poor quality drives away customers, thus stabilizing the organization—or killing it.

Chapter 7: What We've Managed To Accomplish

Summary

In spite of the impression we might get from studying our failures, we've managed to accomplish a great deal in the past four decades of the software industry.
One of the reasons we've accomplished a great deal is the quality of our thinking, which is the strongest asset many of us have, when we use it.
Our industry has probably suffered because of the process by which we select our managers. People who select themselves into programming work probably are not the best "naturals" for management jobs. Nevertheless, they could learn to do a good job of managing, if they were given the training. As long as we don't honor management, however, they're not likely to receive one-tenth the management training they need.
The accomplishments of the software industry are much greater than you would believe if you listened to the purveyors of software and hardware tools. It is in their interest to make us believe that we're not doing very well, but that their tool will be the magic bullet we need.
We tend to be suckers for magic bullets because we want to accomplish great things, but great things are usually accomplished through a series of small steps, contrary to the popular image.
We may fail to recognize how much our productivity has increased because we are so ambitious. Once we succeed in doing something well, we immediately attempt something more grand, without stopping to take stock of our accomplishments.
Each pattern has contributed to the development of our industry. Pattern 0 has made computers less frightening to the general public. Pattern 1 has made many innovations that have contributed to our productivity. Pattern 2 has strung these innovations together into methodologies that make it possible to complete many larger projects in routine ways. Pattern 3 has taught us what is needed to keep even larger projects under control. The contributions of Patterns 4 and 5 are still more in terms of visions of possibilities, but that's as important to progress as actual accomplishments.
Meta-patterns are the development patterns of the culture of the industry as a whole. Once again, each pattern has contributed to the development of meta-patterns, and we are not only learning to handle software, but are learning how to learn to handle software.
