Differences between revisions 4 and 5

QSM: Volume 1.2: Why Software Gets In Trouble

Contents

QSM: Volume 1.2: Why Software Gets In Trouble
1. Part IV. Fault Patterns
2. Part V. Pressure Patterns

Part IV. Fault Patterns

Chapter 1: Observing and Reasoning About Errors

Summary

One of the reasons organizations have trouble dealing with software errors is the many conceptual errors they make concerning errors.
Some people make errors into a moral issue, losing track of the business justification for the way in which they are handled.
Quality is not the same thing as absence of errors, but the presence of many errors can destroy any other measures of quality in a product.
Organizations that don't handle error very well also don't talk very clearly about error. For instance, they frequently fail to distinguish faults from failures, or use faults to blame people in the organization.
Well-functioning organizations can be recognized by the organized way they use faults and failures as information to control their process. The System Trouble Incident (STI) and the System Fault Analysis (SFA) are the fundamental sources of information about failures and faults.
Error-handling processes come in at least five varieties: detection, location, resolution, prevention, and distribution.
In addition to conceptual errors, there are a number of common observational errors people make about errors, including Selection Fallacies, getting observations backwards, and the Controller Fallacy.

Chapter 2: The Failure Detection Curve

Summary

Failure detection is dominated by the tautology that the easiest failures to detect are the first failures to detect, so that as detection proceeds, the work gets harder, producing a characteristic Failure Detection Curve with a long tail.
The long tail of the Failure Detection Curve is one of the principal reasons managers misestimate failure detection tasks.
Because the Failure Detection Curve represents a natural dynamic, there is nothing we can do to perform better than it says. We can, however, perform much worse, if we're not careful about how we manage the failure detection process.
The Failure Detection Curve is not all bad news. The pattern of detected failures over time can be used as a predictor of the time to reach any specified level of failure detection, as long as nothing is happening to undermine test coverage.
Some of the things that can undermine test coverage are blocking faults, masking faults, and late releases to test.
Late finishing modules may arise from a cycle of poor coding, which means that they are more likely to be fault-prone modules. Management policies designed to speed testing of late finishing modules may actually make the problem worse, and may account for much so-called "bad luck" estimating.
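
The long tail of the Failure Detection Curve can be illustrated with a toy model. The sketch below is not from the book: it assumes that each latent failure has a fixed, independent chance of being detected per unit of test effort, and that these chances differ widely from failure to failure. The function names `expected_detected` and `effort_to_reach`, and the detectability values, are hypothetical.

```python
def expected_detected(detectabilities, effort):
    """Expected number of failures detected after `effort` units of testing.

    Assumption (illustrative only): failure i is detected in any one unit
    of effort with probability detectabilities[i], independently.
    """
    return sum(1 - (1 - p) ** effort for p in detectabilities)


def effort_to_reach(detectabilities, fraction, max_effort=100_000):
    """Smallest whole number of effort units at which the expected share of
    detected failures reaches `fraction`, or None if max_effort is too small."""
    target = fraction * len(detectabilities)
    for effort in range(max_effort + 1):
        if expected_detected(detectabilities, effort) >= target:
            return effort
    return None


# Hypothetical mix: a few easy failures, many moderate ones, some hard ones.
detectabilities = [0.5] * 10 + [0.1] * 20 + [0.01] * 10

for fraction in (0.5, 0.75, 0.9, 0.99):
    print(fraction, effort_to_reach(detectabilities, fraction))
```

Under these assumptions the easy failures are found almost at once, while the last few percent take far more effort than the first half, which is the shape that leads to misestimated detection tasks.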

Chapter 3: Locating The Faults Behind The Failures

Summary

System size has a direct effect on the dynamics of fault location, but there are indirect effects as well. We use divide and conquer to beat the Size/Complexity Dynamic, and we also divide the labor to beat delivery time. These efforts, however, lead to a number of indirect effects of system size on fault location time.
You can learn a great deal about an organization's culture by observing how it handles its STIs. In particular, you can learn to what degree its cultural pattern is under stress of increased customer or problem demands.
An important dynamic describes the circulation of STIs, which grows non-linearly the more STIs are in circulation.
Process errors such as losing STIs also increase location time.
Political issues, such as status boundaries, can also contribute non-linearly to extending location time. Management action to reduce circulation time by punishing those who hold STIs can lead to the opposite effect.
In general, poorly controlled handling of STIs leads to an enlarged administrative burden, which in turn leads to even more poorly controlled handling of STIs. When STIs get out of hand, management needs to study what information that gives them about their cultural pattern, then take action to get at the root causes, not merely the symptoms.

Chapter 4: Fault Resolution Dynamics

Summary

Basic fault resolution dynamics are another case of Size/Complexity Dynamics, with more faults and more complexity per fault leading to a non-linear increase in fault resolution time as systems grow larger.
Side effects add more non-linearity to fault resolution. Either we take more time to consider side effects, or we create side effects when we change one thing and inadvertently change another.
The most obvious type of side effect is fault feedback, which can be measured by the Fault Feedback Ratio (FFR). Fault feedback is the creation of faults while resolving other faults. Faults can be either functional or performance faults.
The FFR is a sensitive measure of project control breakdown. In a well-controlled project, FFR should decline as the project approaches its scheduled end.
One way to keep the FFR under control is to institute careful reviewing of fault resolutions, even if they are "only one line of code." The assumption that small changes can't cause trouble leads to small changes causing more trouble than bigger changes.
There are a number of ways in which a system deteriorates besides the addition of faults and performance inefficiencies, and these ways do not show up in ordinary project measurements. For instance, design integrity breaks down, documentation is not kept current, and coding style becomes patchy. All of these lead to a decrease in the system's maintainability.
When the integrity of a modular, or "black box," design breaks down, the system shows a growing "ripple effect" from each change. That is, one change ripples through to cause many other changes.
If we are to avoid deterioration of systems, they must not only be maintained, but their maintainability must also be maintained.
Managers and developers often show overconfidence in the initial design as protection against maintenance difficulties. This kind of overconfidence can easily lead to a Titanic Effect, because the thought that nothing can go wrong with the code exposes the code to all sorts of ways of going wrong.
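
The Fault Feedback Ratio can be tracked with very little machinery. The sketch below assumes the FFR is computed as faults created by resolutions divided by faults resolved in the same period; the function names and the weekly counts are hypothetical, chosen only to show what a declining FFR looks like.

```python
def fault_feedback_ratio(faults_created, faults_resolved):
    """FFR for one period, assumed here to be faults created while
    resolving other faults, divided by the number of faults resolved."""
    if faults_resolved <= 0:
        raise ValueError("faults_resolved must be positive")
    return faults_created / faults_resolved


def is_declining(ratios):
    """True if every ratio is strictly lower than the one before it."""
    return all(later < earlier for earlier, later in zip(ratios, ratios[1:]))


# Hypothetical (faults created, faults resolved) counts for four weeks.
weekly_counts = [(6, 20), (4, 20), (2, 25), (1, 30)]
weekly_ffr = [fault_feedback_ratio(c, r) for c, r in weekly_counts]

print(weekly_ffr, is_declining(weekly_ffr))
```

A series that stops declining, or starts rising, as the scheduled end approaches would be the warning sign the summary describes.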

Part V. Pressure Patterns

Chapter 5: Power, Pressure, and Performance

Summary

The Pressure/Performance Relationship says that added pressure can boost performance for a while, then starts to get no response, then leads to collapse.
Pressure to find the last fault can easily prolong the time to find the last fault, perhaps indefinitely.
The Stress/Control Dynamic explains that we not only respond to the external pressures, but to internal pressures we place on ourselves when we think we are losing control. This dynamic makes the Pressure/Performance Relationship even more non-linear.
Breakdown under pressure comes in many forms. Judgment may be the first thing to go, especially in response to peer pressure to see things their way.
As people leave a project, either physically or mentally, it adds pressures to the remaining people, who are then more likely to leave themselves.
Managers may create a Pile-On Dynamic by choosing to give new assignments only to those people who are already the reigning experts. This adds to their load, and their expertise, which makes it more likely they'll get the next assignment.
Some people respond to stress with a Panic Reaction, even though the situation is not anything like life-threatening. Such people must not be in high-stress projects, or they will only add to the stress.
Pressure can be managed. It helps if the workers are self-regulating, if the managers are empowering, and if responsiveness, rather than performance, is used to measure readiness for more pressure.
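
The Pile-On Dynamic is a positive feedback loop, and a few lines of simulation make the loop visible. This is a toy sketch, not a model from the book: it assumes every new assignment goes to whoever currently has the most expertise, and that each assignment adds one unit to both that person's load and expertise. The function name `pile_on` and the team data are hypothetical.

```python
def pile_on(expertise, assignments):
    """Hand out `assignments` tasks, always to the current top expert.

    Returns (expertise, load) dictionaries after all tasks are assigned.
    Ties go to whoever appears first in the input dictionary.
    """
    expertise = dict(expertise)  # copy so the caller's data is untouched
    load = {name: 0 for name in expertise}
    for _ in range(assignments):
        expert = max(expertise, key=expertise.get)
        load[expert] += 1       # the assignment adds to their load...
        expertise[expert] += 1  # ...and to their expertise
    return expertise, load


# Hypothetical team with a small initial difference in expertise.
team = {"Ana": 5, "Ben": 4, "Cy": 4}
final_expertise, final_load = pile_on(team, 10)
print(final_expertise, final_load)
```

Because the top expert's lead only grows, a small initial difference is enough to route every later assignment to the same person.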

Chapter 6: Handling Breakdown Pressure

Summary

Software projects commonly break down when the reality of time finally forces them to realize where they actually are. When this happens, however, the symptoms displayed are unique to each project and each individual.
Many symptoms are equivalent to shuffling work around, accomplishing nothing or, even worse, actually sending the project backwards. One such backwards dynamic is the attempt to beat Brooks's Law through splitting tasks among existing workers.
Ineffective priority schemes are common ways of doing nothing. These include setting everything to number one priority, choosing your own priority independent of project priority, or simply doing the easiest task first.
A final way of doing nothing is to circulate "hot potatoes," which are tasks that management counts against you if they are on your desk when "measurement" time comes.
There are a number of ways to observe that managers are actually doing nothing. They may
- be accepting poor quality products
- not be accepting schedule slippage
- be accepting of resource overruns
- be unavailable to their workers
- assert that they have no time to do the project right
A sure sign that a project is breaking down under time pressure is when managers and workers start short-circuiting procedures. This invariably creates a boomerang effect in which the very quality the manager intended to improve is made worse by the short-circuiting action.
The decision to ship poor quality to save time and resources always creates a boomerang effect. Bypassing quality assurance is similar. Both of these tactics lead, among other things, to destruction of the development process, more emergencies and interruptions, and devastation of morale.
When morale deteriorates into project depression, process quality will not be maintained, let alone improved. Trust built before the crisis will help an organization recover more quickly, but attempts to build trust during the crisis will probably backfire—especially if they are in the form of telling: "Trust me!"
Multiple customers increase the pressure on the boomerang cycle, up to the point that the resultant poor quality drives away customers, thus stabilizing the organization—or killing it.

Chapter 7: What We've Managed To Accomplish

Summary

In spite of the impression we might get from studying our failures, we've managed to accomplish a great deal in the past four decades of the software industry.
One of the reasons we've accomplished a great deal is the quality of our thinking, which is the strongest asset many of us have, when we use it.
Our industry has probably suffered because of the process by which we select our managers. People who select themselves into programming work probably are not the best "naturals" for management jobs. Nevertheless, they could learn to do a good job of managing, if they were given the training. As long as we don't honor management, however, they're not likely to receive one-tenth the management training they need.
The accomplishments of the software industry are much greater than you would believe if you listened to the purveyors of software and hardware tools. It is in their interest to make us believe that we're not doing very well, but that their tool will be the magic bullet we need.
We tend to be suckers for magic bullets because we want to accomplish great things, but great things are usually accomplished through a series of small steps, contrary to the popular image.
We may fail to recognize how much our productivity has increased because we are so ambitious. Once we succeed in doing something well, we immediately attempt something more grand, without stopping to take stock of our accomplishments.
Each pattern has contributed to the development of our industry. Pattern 0 has made computers less frightening to the general public. Pattern 1 has made many innovations that have contributed to our productivity. Pattern 2 has strung these innovations together into methodologies that make it possible to complete many larger projects in routine ways. Pattern 3 has taught us what is needed to keep even larger projects under control. The contributions of Patterns 4 and 5 are still more in terms of visions of possibilities, but that's as important to progress as actual accomplishments.
Meta-patterns are the development patterns of the culture of the industry as a whole. Once again, each pattern has contributed to the development of meta-patterns, and we are not only learning to handle software, but are learning how to learn to handle software.

- Revision 4 as of 2020-10-03 16:16:47
  Size: 15435
  Editor: 정수
  Comment:
+ Revision 5 as of 2020-10-03 16:18:05
  Size: 16893
  Editor: 정수
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 34:
- 1. System size has a direct effect on the dynamics of fault location, but there are indirect effects as well. We use divide and conquer to beat the Size/Complexity Dynamic, and we also divide the labor to beat delivery time. These efforts, however, lead to a number of indirect effects of system size on fault location time.
 2. You can learn a great deal about its culture by observing how an organization handles its STIs. In particular, you can learn to what degree its cultural pattern is under stress of increased customer or problem demands.
 3. An important dynamic describes the circulation of STIs, which grows non-linearly the more STIs are in circulation.
 4. Process errors such as losing STIs also increase location time.
 5. Political issues, such as status boundaries, can also contribute non-linearly to extending location time. Management action to reduce circulation time by punishing those who hold STIs can lead to the opposite effect.
 6. In general, poorly controlled handling of STIs leads to an enlarged administrative burden, which in turn leads to even more poorly controlled handling of STIs. When STIs get out of hand, management needs to study what information that gives them about their cultural pattern, then take action to get at the root causes, not merely the symptoms.
+ 1. System size directly affects the dynamics of fault location, but it has indirect effects too. We use divide and conquer to beat the Size/Complexity Dynamic, and we also divide the labor to beat delivery time. These efforts, however, give rise to several indirect effects of system size on fault location time. ~-System size has a direct effect on the dynamics of fault location, but there are indirect effects as well. We use divide and conquer to beat the Size/Complexity Dynamic, and we also divide the labor to beat delivery time. These efforts, however, lead to a number of indirect effects of system size on fault location time.-~
 2. By observing how an organization handles its STIs, you can learn a lot about its culture. In particular, you can learn how much stress its cultural pattern is under from growing customer or problem demands. ~-You can learn a great deal about its culture by observing how an organization handles its STIs. In particular, you can learn to what degree its cultural pattern is under stress of increased customer or problem demands.-~
 3. An important dynamic describes the circulation of STIs, which grows non-linearly as more STIs circulate. ~-An important dynamic describes the circulation of STIs, which grows non-linearly the more STIs are in circulation.-~
 4. Process errors such as losing STIs also lengthen location time. ~-Process errors such as losing STIs also increase location time.-~
 5. Political issues such as status boundaries can also contribute non-linearly to longer location time. Management action that tries to cut circulation time by punishing those who hold STIs can backfire. ~-Political issues, such as status boundaries, can also contribute non-linearly to extending location time. Management action to reduce circulation time by punishing those who hold STIs can lead to the opposite effect.-~
 6. In general, sloppy handling of STIs increases the administrative burden, which in turn leads to sloppier handling of STIs. When STIs get out of control, management should study what information they are giving, then take action that gets at the root causes rather than mere symptoms. ~-In general, poorly controlled handling of STIs leads to an enlarged administrative burden, which in turn leads to even more poorly controlled handling of STIs. When STIs get out of hand, management needs to study what information that gives them about their cultural pattern, then take action to get at the root causes, not merely the symptoms.-~

Diff for "QSM/Vol1/Vol 1.2 Summaries"
