QSM: Volume 1.2: Why Software Gets In Trouble

Contents

QSM: Volume 1.2: Why Software Gets In Trouble
1. Part IV. Fault Patterns
2. Part V. Pressure Patterns

Part IV. Fault Patterns

Chapter 1: Observing and Reasoning About Errors

Summary

One of the reasons organizations have trouble dealing with software errors is the many conceptual errors they make about errors.
Some people make errors into a moral issue, losing track of the business justification for the way in which they are handled.
Quality is not the same thing as absence of errors, but the presence of many errors can destroy any other measures of quality in a product.
Organizations that don't handle errors very well also don't talk very clearly about them. For instance, they frequently fail to distinguish faults from failures, or use faults to blame people in the organization.
Well-functioning organizations can be recognized by the organized way they use faults and failures as information to control their process. The System Trouble Incident (STI) and the System Fault Analysis (SFA) are the fundamental sources of information about failures and faults.
Error-handling processes come in at least five varieties: detection, location, resolution, prevention, and distribution.
In addition to conceptual errors, there are a number of common observational errors people make about errors, including Selection Fallacies, getting observations backwards, and the Controller Fallacy.

Chapter 2: The Failure Detection Curve

Summary

Failure detection is dominated by the tautology that the easiest failures to detect are the first failures detected, so that as detection proceeds, the work gets harder, producing a characteristic Failure Detection Curve with a long tail.
The long tail of the Failure Detection Curve is one of the principal reasons managers misestimate failure detection tasks.
Because the Failure Detection Curve represents a natural dynamic, there is nothing we can do to perform better than it says. We can, however, perform much worse if we're not careful about how we manage the failure detection process.
The Failure Detection Curve is not all bad news. The pattern of detected failures over time can be used as a predictor of the time to reach any specified level of failure detection, as long as nothing is happening to undermine test coverage.
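The long tail and the prediction idea can be illustrated with a small sketch. It is not taken from the book: it assumes, purely for illustration, that each failure is found independently at a constant rate per unit of test effort, and the three difficulty groups and their rates are made-up numbers. The function names are hypothetical.

```python
import math

# Hypothetical population of failures: (how many, detection rate per unit of effort).
# Assumption for illustration only: easy failures are found 100x faster than hard ones.
GROUPS = [(50, 1.0), (30, 0.1), (20, 0.01)]


def expected_detected(groups, effort):
    """Expected number of failures detected after `effort` units of testing."""
    return sum(count * (1 - math.exp(-rate * effort)) for count, rate in groups)


def effort_to_reach(groups, fraction, tol=1e-6):
    """Smallest effort at which the expected detected fraction reaches `fraction`."""
    if not 0 < fraction < 1:
        raise ValueError("fraction must be strictly between 0 and 1")
    target = fraction * sum(count for count, _ in groups)
    lo, hi = 0.0, 1.0
    while expected_detected(groups, hi) < target:  # grow the bracket
        hi *= 2
    while hi - lo > tol:  # bisect; expected_detected is increasing in effort
        mid = (lo + hi) / 2
        if expected_detected(groups, mid) < target:
            lo = mid
        else:
            hi = mid
    return hi


for level in (0.5, 0.8, 0.95):
    print(f"{level:.0%} detected after about {effort_to_reach(GROUPS, level):.1f} units")
```

Under these assumptions, each further slice of failures costs far more effort than the slice before it, which is the long tail. The same curve, fitted to detections observed so far, is what would let a manager estimate the effort needed to reach a chosen detection level.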
Some of the things that can undermine test coverage are blocking faults, masking faults, and late releases to test.
Late-finishing modules may arise from a cycle of poor coding, which means that they are more likely to be fault-prone modules. Management policies designed to speed testing of late-finishing modules may actually make the problem worse, and may account for much so-called "bad luck" estimating.

Chapter 3: Locating The Faults Behind The Failures

Summary

System size has a direct effect on the dynamics of fault location, but there are indirect effects as well. We use divide and conquer to beat the Size/Complexity Dynamic, and we also divide the labor to beat delivery time. These efforts, however, lead to a number of indirect effects of system size on fault location time.
You can learn a great deal about an organization's culture by observing how it handles its STIs. In particular, you can learn to what degree its cultural pattern is under stress from increased customer or problem demands.
An important dynamic describes the circulation of STIs, which grows non-linearly the more STIs are in circulation.
Process errors such as losing STIs also increase location time.
Political issues, such as status boundaries, can also contribute non-linearly to extending location time. Management action to reduce circulation time by punishing those who hold STIs can lead to the opposite effect.
In general, poorly controlled handling of STIs leads to an enlarged administrative burden, which in turn leads to even more poorly controlled handling of STIs. When STIs get out of hand, management needs to study what information that gives them about their cultural pattern, then take action to get at the root causes, not merely the symptoms.

Chapter 4: Fault Resolution Dynamics

Summary

Basic fault resolution dynamics are another case of Size/Complexity Dynamics, with more faults and more complexity per fault leading to a non-linear increase in fault resolution time as systems grow larger.
Side effects add more non-linearity to fault resolution. Either we take more time to consider side effects, or we create side effects when we change one thing and inadvertently change another.
The most obvious type of side effect is fault feedback, which can be measured by the Fault Feedback Ratio (FFR). Fault feedback is the creation of faults while resolving other faults. Faults can be either functional or performance faults.
The FFR is a sensitive measure of project control breakdown. In a well-controlled project, FFR should decline as the project approaches its scheduled end.
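As a minimal sketch of how the FFR might be tracked, the code below assumes the ratio is computed per reporting period as faults created by fixes divided by faults resolved. That formula, the function names, and the sample counts are illustrative assumptions, not definitions quoted from the book.

```python
def fault_feedback_ratio(faults_created, faults_resolved):
    """Assumed definition: faults introduced by fixes per fault resolved."""
    if faults_resolved <= 0:
        raise ValueError("need at least one resolved fault to compute a ratio")
    return faults_created / faults_resolved


def ffr_trend(periods):
    """periods: list of (faults_created, faults_resolved), oldest first."""
    return [fault_feedback_ratio(created, resolved) for created, resolved in periods]


def is_declining(ratios):
    """True if every period's FFR is strictly lower than the one before it."""
    return all(later < earlier for earlier, later in zip(ratios, ratios[1:]))


# Made-up counts for two hypothetical projects nearing their scheduled end.
controlled = ffr_trend([(6, 20), (4, 20), (2, 20)])
slipping = ffr_trend([(2, 20), (5, 20), (9, 20)])

print(controlled, is_declining(controlled))
print(slipping, is_declining(slipping))
```

A rising series like the second one is the kind of signal the summary describes as a breakdown of project control: each round of fixes is creating more new faults than the last.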
One way to keep the FFR under control is to institute careful reviewing of fault resolutions, even if they are "only one line of code." The assumption that small changes can't cause trouble leads to small changes causing more trouble than bigger changes.
There are a number of ways in which a system deteriorates besides the addition of faults and performance inefficiencies, and these ways do not show up in ordinary project measurements. For instance, design integrity breaks down, documentation is not kept current, and coding style becomes patchy. All of these lead to a decrease in the system's maintainability.
When the integrity of a modular, or "black box," design breaks down, the system shows a growing "ripple effect" from each change. That is, one change ripples through to cause many other changes.
If we are to avoid deterioration of systems, they must not only be maintained, but their maintainability must also be maintained.
Managers and developers often show overconfidence in the initial design as protection against maintenance difficulties. This kind of overconfidence can easily lead to a Titanic Effect, because the thought that nothing can go wrong with the code exposes the code to all sorts of ways of going wrong.

Part V. Pressure Patterns

Chapter 5: Power, Pressure, and Performance

Summary

The Pressure/Performance Relationship says that added pressure can boost performance for a while, then starts to get no response, then leads to collapse.
Pressure to find the last fault can easily prolong the time to find the last fault, perhaps indefinitely.
The Stress/Control Dynamic explains that we respond not only to external pressures, but also to internal pressures we place on ourselves when we think we are losing control. This dynamic makes the Pressure/Performance Relationship even more non-linear.
Breakdown under pressure comes in many forms. Judgment may be the first thing to go, especially in response to pressure from peers to see things their way.
As people leave a project, either physically or mentally, it adds pressure to the remaining people, who are then more likely to leave themselves.
Managers may create a Pile-On Dynamic by choosing to give new assignments only to those people who are already the reigning experts. This adds to their load, and their expertise, which makes it more likely they'll get the next assignment.
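The self-reinforcing loop in the Pile-On Dynamic can be seen in a toy simulation. Everything here is an illustrative assumption: the starting expertise scores, the rule that each assignment adds one unit of load and one unit of expertise, and the `least_loaded` policy, which is included only as a contrast and is not a recommendation from the book.

```python
def assign_tasks(expertise, n_tasks, policy):
    """Hand out n_tasks one at a time; return (loads, expertise) afterwards."""
    expertise = list(expertise)  # copy so the caller's list is untouched
    loads = [0] * len(expertise)
    for _ in range(n_tasks):
        who = policy(expertise, loads)
        loads[who] += 1
        expertise[who] += 1  # assumption: each task adds one unit of expertise
    return loads, expertise


def reigning_expert(expertise, loads):
    """Pile-on policy: always pick whoever currently has the most expertise."""
    return max(range(len(expertise)), key=lambda i: expertise[i])


def least_loaded(expertise, loads):
    """Hypothetical contrast policy: pick whoever currently has the lightest load."""
    return min(range(len(loads)), key=lambda i: loads[i])


pile_on_loads, pile_on_expertise = assign_tasks([3, 2, 1], 9, reigning_expert)
spread_loads, spread_expertise = assign_tasks([3, 2, 1], 9, least_loaded)

print("pile-on:", pile_on_loads, pile_on_expertise)
print("spread: ", spread_loads, spread_expertise)
```

With the pile-on policy, the person who starts slightly ahead receives every assignment, so both the load gap and the expertise gap widen with each task.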
Some people respond to stress with a Panic Reaction, even though the situation is not anything like life-threatening. Such people must not be in high-stress projects, or they will only add to the stress.
Pressure can be managed. It helps if the workers are self-regulating, the managers are empowering, and responsiveness, rather than performance, is used to measure readiness for more pressure.

Chapter 6: Handling Breakdown Pressure

Summary

Software projects commonly break down when the reality of time finally forces them to realize where they actually are. When this happens, however, the symptoms displayed are unique to each project and each individual.
Many symptoms are equivalent to shuffling work around, accomplishing nothing or, even worse, actually sending the project backwards. One such backwards dynamic is the attempt to beat Brooks's Law through splitting tasks among existing workers.
Ineffective priority schemes are common ways of doing nothing. These include setting everything to number one priority, choosing your own priority independent of project priority, or simply doing the easiest task first.
A final way of doing nothing is to circulate "hot potatoes," which are tasks that management counts against you if they are on your desk when "measurement" time comes.
There are a number of ways to observe that managers are actually doing nothing. They may:
- be accepting poor quality products
- not be accepting schedule slippage
- be accepting of resource overruns
- be unavailable to their workers
- assert that they have no time to do the project right
A sure sign that a project is breaking down under time pressure is when managers and workers start short-circuiting procedures. This invariably creates a boomerang effect in which the very quality the manager intended to improve is made worse by the short-circuiting action.
The decision to ship poor quality to save time and resources always creates a boomerang effect. Bypassing quality assurance is similar. Both of these tactics lead, among other things, to destruction of the development process, more emergencies and interruptions, and devastation of morale.
When morale deteriorates into project depression, process quality will not be maintained, let alone improved. Trust built before the crisis will help an organization recover more quickly, but attempts to build trust during the crisis will probably backfire, especially if they take the form of saying "Trust me!"
Multiple customers increase the pressure on the boomerang cycle, up to the point that the resultant poor quality drives away customers, thus stabilizing the organization, or killing it.

Chapter 7: What We've Managed To Accomplish

Summary

In spite of the impression we might get from studying our failures, we've managed to accomplish a great deal in the past four decades of the software industry.
One of the reasons we've accomplished a great deal is the quality of our thinking, which is the strongest asset many of us have, when we use it.
Our industry has probably suffered because of the process by which we select our managers. People who select themselves into programming work probably are not the best "naturals" for management jobs. Nevertheless, they could learn to do a good job of managing, if they were given the training. As long as we don't honor management, however, they're not likely to receive one-tenth the management training they need.
The accomplishments of the software industry are much greater than you would believe if you listened to the purveyors of software and hardware tools. It is in their interest to make us believe that we're not doing very well, but that their tool will be the magic bullet we need.
We tend to be suckers for magic bullets because we want to accomplish great things, but great things are usually accomplished through a series of small steps, contrary to the popular image.
We may fail to recognize how much our productivity has increased because we are so ambitious. Once we succeed in doing something well, we immediately attempt something more grand, without stopping to take stock of our accomplishments.
Each pattern has contributed to the development of our industry. Pattern 0 has made computers less frightening to the general public. Pattern 1 has made many innovations that have contributed to our productivity. Pattern 2 has strung these innovations together into methodologies that make it possible to complete many larger projects in routine ways. Pattern 3 has taught us what is needed to keep even larger projects under control. The contributions of Patterns 4 and 5 are still more in terms of visions of possibilities, but that's as important to progress as actual accomplishments.
Meta-patterns are the development patterns of the culture of the industry as a whole. Once again, each pattern has contributed to the development of meta-patterns, and we are not only learning to handle software, but are learning how to learn to handle software.
