Department of Computer Science
University of Maryland
College Park, Maryland 20742
Software inspections have long been considered to be an effective way to detect and remove defects from software. However, there are costs associated with carrying out inspections and these costs may outweigh the expected benefits.
It is important to understand the tradeoffs between these costs and benefits. We believe that these are driven by several mechanisms, both internal and external to the inspection process. Internal factors are associated with the manner in which the steps of the inspection are organized into a process (structure), as well as the manner in which each step is carried out (technique). External ones include differences in reviewer ability and code quality (inputs), and interactions with other inspections, the project schedule, personal calendars, etc. (environment).
We started a study to identify the mechanisms that strongly influence an inspection's costs and effectiveness. Most of the existing literature on inspections has discussed how to get the most benefit out of inspections by proposing changes to the process structure, but with little or no empirical work conducted to demonstrate how they worked better and at what cost.
We hypothesized that these changes will affect the defect detection effectiveness of the inspection, but that any increase in effectiveness will have a corresponding increase in inspection interval and effort. We evaluated this hypothesis with a controlled experiment on a live development project at Lucent Technologies, using professional software developers.
We found that these structural changes were largely ineffective in improving the effectiveness of inspections, but certain treatments dramatically increased the inspection interval. We also noted a large amount of unexplained variance in the data suggesting that other factors must have a strong influence on inspection performance.
On further investigation, we found that the inputs into the process (reviewers and code units) account for more of the variation than the original treatment variables, leading us to conclude that better techniques by which reviewers detect defects, not better process structures, are the key to improving inspection effectiveness.
For twenty years, a simple three-step inspection procedure has been widely accepted as the most cost-effective way to detect and remove defects from software. First, each member of a team of reviewers reads the artifact separately, detecting as many defects as possible (Preparation). Next, these newly discovered defects are collected and discussed, usually at a team meeting (Collection). Then the author corrects them (Repair).
In many software organizations, inspections take up a significant part of a project's time and effort. For example, a typical release of Lucent Technologies' 5ESS™ switch (0.5M lines of added and changed code per release on a base of 5M lines) can require roughly 1500 inspections, each with five or more participants. Since this represents 15% of a project's total effort, it is important to understand when, if ever, the costs outweigh the benefits. Several studies have tried to establish that inspections are worth doing. We believe, however, that there are questions that have not been completely answered.
Several new inspection methods have recently been proposed[10, 44, 38, 59]. Although each claims to improve effectiveness, in most cases, little or no empirical work was conducted to find out what causes them to work better, why, and at what cost.
We believe that differences in effectiveness and cost associated with inspections come from many sources, both internal and external to the process.
Internal sources include factors from the process structure (the manner in which the steps of the inspection are organized into a process, e.g., number of persons to use, number of sessions, etc.) and from the process techniques (the manner in which each step is carried out, e.g., reading technique used, computer support, etc.).
External sources include factors from the process inputs (differences in reviewers' abilities and in code unit quality) and from the process environment (changes in schedules, calendars, workload, etc.).
As many of the new inspection methods propose changes to the process structure, we began by examining factors related to the process structure. We hypothesized that manipulating these factors will affect the effectiveness of the inspection, but that any increase in effectiveness will have a corresponding increase in interval.
We evaluated this hypothesis with a controlled experiment in a live development project at Lucent Technologies. The project was to develop an internal compiler known as P5CC. The finished system had around 65K lines of C++ code, and involved 6 developers. We collected data from the code inspections associated with this project.
Although conducting an experiment of this type appears to be prohibitively expensive and risky, we found that novel statistical methods, well-defined instrumentation, and clever design could be used to minimize its cost. In addition to evaluating our hypotheses, carrying out this experiment allowed us to address the issues previously mentioned regarding costs and benefits of inspections.
We believe that different process structures have different tradeoffs between effectiveness, interval, and effort. Specifically, we hypothesized that (1) inspections with large teams have longer intervals, but find no more defects than smaller teams, (2) multiple-session inspections with multiple small teams are more effective than single-session inspections with one large team, but at the cost of significantly increasing interval, and (3) repairing the defects found in one session of a multiple-session inspection before carrying out the next session will catch even more defects but also take significantly longer than having multiple sessions carried out in parallel.
For this experiment, I joined the P5CC development team in the role of inspection quality engineer (IQE). My duties included assigning treatments and reviewers to upcoming inspections, observing and taking notes during the inspection meetings, attending the team meetings, handling the inspection paperwork, collecting and extracting the data from the forms, and maintaining the database of inspection data. I did not have any code development responsibilities. The experiment spanned 18 months, and we collected data from 88 code inspections.
We found that (1) inspection intervals and observed defect densities for large teams were not significantly different than for smaller teams, (2) inspections with two sessions were not significantly different than inspections with one session, and did not have significantly longer intervals, and (3) there was no difference in effectiveness of two-session inspections which did repair in between sessions and those which did not, though repairing in between sessions did significantly increase the interval. In short, we found that changes in process structure were largely ineffective in improving the effectiveness of inspections, but that certain combinations dramatically increased the inspection interval.
We also observed a large amount of unexplained variance within each treatment, suggesting that other sources of variation have a significant influence on inspection performance. Therefore, we extended the results of our experiment by examining the process inputs, modeling their influence on inspection effectiveness and interval.
We found that differences in process inputs account for more of the variation than the differences in process structure. In particular, reviewer variability significantly affects inspection effectiveness. This suggests that improving the techniques by which reviewers study code and detect defects is a promising way to improve inspection effectiveness but that changing structural organizations is not.
To study the costs and benefits of inspections, researchers are faced with two questions. (1) How should the costs and benefits of inspections be measured? (2) What factors strongly drive these measurements?
Several studies have addressed these questions, usually at one of two different levels of analysis, global (which examines the effect of one or more inspection methods on the entire development process) or local (which compares inspection methods, but without regard to their effect on the entire development process).
We survey existing inspection research with the goal of understanding how much progress has been made on each question and at each level of analysis.
The idea of reviewing software is as old as programming itself. It started with informal reviews as programmers realized that writing completely accurate programs was too great a problem for the unaided human mind. As software projects increased in size and ambition, the steps of the review process were gradually written down and formalized. By the 1960's, most large projects had some kind of formal reviewing procedure. In 1976, Fagan published an influential paper describing an inspection process in use at IBM, with a detailed cost-benefit analysis showing its cost-effectiveness. It was the first widely publicized work on inspections which had at its center the three main steps, Preparation, Collection, and Repair. Since then, there has been a proliferation of inspection methods. These may be organized according to the following taxonomy of inspection methods.
We choose to describe inspection methods based on the following attributes: (1) team size, (2) number of sessions, (3) coordination between multiple sessions, (4) collection technique, (5) defect detection method, and (6) use of post-collection feedback. Although other classification schemes could also be used, we believe these attributes represent underlying mechanisms that drive the costs and benefits of inspections.
Team sizes can be small (1-4 reviewers) or large (more than 4 reviewers). The inspection team is normally composed of several reviewers. Presumably, this allows a wide variety of defects to be found since each reviewer relies on different expertise and experiences when inspecting. Thus, the larger and more varied the team, the better the coverage. However, large teams require more effort since more people analyze the artifact (which is often unfamiliar to them). This also reduces the time they can spend on other development work. In addition, it becomes harder to find a suitable meeting time as the number of attendees grows. Finally, it is more difficult for everyone to contribute fully during the meeting because of limited air time.
Smaller teams take less effort and are easier to schedule. However, they risk missing more defects and becoming superficial if personnel with required domain expertise are not included.
The number of sessions refers to the number of times the artifact undergoes the inspection process, possibly with different teams of inspectors. Multiple-session inspections will find more defects than single-session inspections as long as some important or subtle defects escape detection by any one inspection session. Also, conducting the inspection in several sessions with small subteams may be more effective than doing a single session with one large team. The main problem with multiple sessions is that inspection effort increases as the number of sessions grows.
For multiple-session inspections, there is the additional option of conducting the sessions in parallel - with each session inspecting the same version of the artifact - or in sequence - with defects found in one session being repaired before going on to the next session. Parallel sessions will be more effective only if different teams find few defects in common. They should also have nearly the same interval to completion as single-session inspections since the meetings can be scheduled to occur at nearly the same time. In addition, the author can collect all defect reports and do just one pass at the rework. But collecting the reports takes more effort, especially in sorting out which issues from different reports actually refer to the same defect in the artifact.
Sequential sessions shouldn't duplicate issues since those found by an earlier team would have already been repaired. More defects may be found, since cleaning out old defects might make it easier to find new ones. However, it does take longer because the author cannot schedule the next phase of the inspection until defects from the first session have been resolved.
The collection technique refers to whether a collection meeting is to be held (group-centered) or not (individual-centered). Although there is almost always some meeting between reviewers and the artifact's author to deliver the reviewer's findings, the main goal of group-centered meetings is to find defects. Many people consider the meeting to be the central step of the inspection process because they believe that several people working together will find defects that none of them would find while working separately. This is known as ``synergy''. Meetings also serve as a way to spread domain knowledge since unfamiliar inspectors interact with more experienced developers. Finally, meetings provide a natural milestone for the project under development. It does, however, take time and effort to schedule a meeting, and recent studies have shown that meetings do not create as much synergy as previously believed. In addition, the problems of improperly held meetings are well-documented[19, 48]. These include free-riding (one person depending on others to do the work), conformance pressure (the tendency to follow the majority opinion), evaluation apprehension (failure to raise a seemingly ``stupid'' issue for fear of embarrassment), attention blocking (failure to comprehend someone else's contribution and to build on it), dominance (a single person dominating the meeting), and others.
Individual-centered inspections avoid these problems by eliminating the inspection meeting or de-emphasizing it (e.g., making it optional, making attendance optional, etc.). However, they risk losing the meeting synergy.
Preparation, the first step of the inspection process, is accomplished through the application of defect detection methods. These are composed of defect detection techniques, individual reviewer responsibilities, and a policy for coordinating responsibilities among the review team. Defect detection techniques range in prescriptiveness from intuitive, nonsystematic procedures (such as ad hoc or checklist techniques) to explicit and highly systematic procedures (such as scenarios or correctness proofs).
A reviewer's individual responsibility may be general, to identify as many defects as possible, or specific, to focus on a limited set of issues (such as ensuring appropriate use of hardware interfaces, identifying untestable requirements, or checking conformity to coding standards).
Individual responsibilities may or may not be coordinated among the review team members. When they are not coordinated, all reviewers have identical responsibilities. In contrast, each reviewer in a coordinated team has different responsibilities.
The most frequently used detection methods (ad hoc and checklist) rely on nonsystematic techniques. Reviewer responsibilities are general and identical. However, multiple-session inspection approaches normally require reviewers to carry out specific and distinct responsibilities.
In most inspections, the author is left alone after the inspection meeting to analyze the issues raised and deal with the rework. Consequently, the development community may not learn why defects were made, nor how they could have been avoided. Some authors argue that a brainstorming meeting should be held after the inspection meeting to determine the root cause of each issue recorded in the meeting.
The problems with this are the same as with other meetings: they require more effort and congest schedules as well as suffer from other group-interaction problems.
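The six attributes described above can be captured in a small record type, which makes it easier to compare methods side by side. The following is a minimal sketch only: the class and field names are assumptions introduced for illustration, and the example instance is hypothetical rather than a description of any published method. The small/large boundary follows the team-size definition given earlier (1-4 reviewers is small, more than 4 is large).

```python
from dataclasses import dataclass, replace
from enum import Enum


class Coordination(Enum):
    NOT_APPLICABLE = "single session"
    PARALLEL = "parallel"        # every session inspects the same version
    SEQUENTIAL = "sequential"    # repair between sessions


class Collection(Enum):
    GROUP_CENTERED = "group-centered"            # collection meeting held
    INDIVIDUAL_CENTERED = "individual-centered"  # meeting dropped or optional


@dataclass(frozen=True)
class InspectionMethod:
    """One point in the six-attribute taxonomy (names are illustrative)."""
    name: str
    team_size: int                   # reviewers per session
    sessions: int
    coordination: Coordination
    collection: Collection
    detection_method: str            # e.g. "ad hoc", "checklist", "scenario"
    post_collection_feedback: bool   # root-cause meeting after collection?

    @property
    def team_size_class(self) -> str:
        # 1-4 reviewers counts as small, more than 4 as large.
        return "small" if self.team_size <= 4 else "large"


# Hypothetical method, used only to exercise the record type.
example = InspectionMethod(
    name="two parallel sessions, two reviewers each",
    team_size=2,
    sessions=2,
    coordination=Coordination.PARALLEL,
    collection=Collection.GROUP_CENTERED,
    detection_method="checklist",
    post_collection_feedback=False,
)

# A single-session variant with a larger team.
large_variant = replace(
    example,
    name="one session, five reviewers",
    team_size=5,
    sessions=1,
    coordination=Coordination.NOT_APPLICABLE,
)

print(example.team_size_class, large_variant.team_size_class)
```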
Computer support adds a new dimension to the inspection process. By automating some parts of the process and providing computer support for others, the inspection process can possibly be made more effective and efficient. For example, during preparation computer support allows artifacts to be inspected, inspector comments to be recorded and project management reports to be handled online. This eliminates much of the bulky printed materials and the forms normally generated by inspections.
Software tools can also perform automated detection of simple defects, freeing inspectors to concentrate on major defects. Using such tools requires that artifacts be specified in some formal notation or programming language. For example, a C language-specific inspection tool called ICICLE uses lint to identify C program constructs that may indicate the presence of defects. It also checks the C program against its own rule-based system.
Computer support for meetings can reduce the cost of meetings. With videoconferencing, inspectors in different locations can easily meet. Computer support can also mitigate the group-interaction-related problems by allowing meetings to be held in ``nominal'' fashion, where inspectors do not have to actually meet, but can just asynchronously place their comments in a central repository which others can read and extend at their convenience.
The main disadvantage is inadequate technological support. Most computer-aided inspection systems are still in the research labs and not yet ready for industrial use. In addition, some special equipment may be needed for videoconferencing.
We present some example inspection methods in the literature. Table 2.1.2 gives a summary and comparative view.
In 1976, Fagan published an influential paper detailing a software inspection process used at IBM. Basically, it consists of six steps.
Before the meeting, one person is designated as the team leader or moderator, who orchestrates the meeting. Another person, designated as the reader, paraphrases the artifact. Defects are found during the reader's discourse and questions are pursued only to the point that defects are recognized. The issues found are noted in an inspection report and the author is required to resolve them. (Extensive solution hunting is discouraged during inspection.) The inspection meeting lasts no more than two hours to prevent exhaustion.
Many software organizations have adopted this process for their own review procedures. The term ``software inspection'' is now almost exclusively associated with some form of this method.
Bisant and Lyle proposed modifying the Fagan inspection by reducing the inspection team to two persons: the author and one reviewer.
Gilb inspections are similar to Fagan inspections, but introduce a root cause analysis meeting right after the inspection meeting. This step enables process improvement through studying and discussing the causes of the defects found at the inspection to find positive recommendations for eliminating them in the future. These recommendations may affect the technical, organizational, and political environment in which the developers work.
Many people believe that most defects are identified during the inspection meeting. However, several recent studies have indicated that most defects are actually found during the preparation step[55, 62]. Humphrey states that ``three-quarters of the errors found in well-run inspections are found during preparation.'' Votta suggests replacing inspection meetings with depositions, where the author and, optionally, the moderator meet separately with each of the reviewers to get their inspection results.
Parnas and Weiss present active design reviews (ADR). The authors believe that in conventional design reviews, reviewers are given too much information to examine, and they must participate in large meetings which allow for limited interaction between reviewers and author. In ADR, the authors provide questionnaires to guide the inspectors. The questions are designed such that they can only be answered by careful study of the document. Some of the questions force the inspector to take a more active role than just reading passively. For example, he or she may be asked to write a program segment to implement a particular design in a low-level design document being reviewed.
Each inspection meeting is broken up into several smaller, specialized meetings, each of which concentrates on one attribute of the artifact. An example is checking consistency between assumptions and functions, i.e., determining whether assumptions are consistent and detailed enough to ensure that functions can be correctly implemented and used.
Britcher takes ADR one step further by incorporating correctness arguments into the questionnaires. The correctness arguments are based on four key program attributes: Topology (whether the hierarchical decomposition into subproblems solves the original problem), Algebra (whether each successive refinement remains functionally equivalent), Invariance (whether the correct relationships among variables are maintained before, during, and after execution), and Robustness (how well the program handles error conditions).
By applying formal verification methods informally through inspections, this approach makes a compromise between the difficulty of scaling formal methods to large systems and the benefit of using systematic detection techniques in inspection.
Knight and Myers present phased inspections, where the inspection step is divided into several mini-inspections or ``phases.'' Standard inspections check for many types of defects in a single examination. With phased inspections, each phase is conducted by one or more inspectors and is aimed at detecting one class of defects. Where there is more than one inspector, they will meet just to reconcile their defect list. The phases are done in sequence, i.e., inspection does not progress to the next phase until rework has completed on the previous phase.
Schneider, et al., developed the N-fold inspection process. This is based on the hypotheses that a single inspection team can find only a fraction of the defects in an artifact and that multiple teams will not significantly duplicate each other's efforts. In an N-fold inspection, N teams each carry out parallel, independent inspections of the same artifact. The results of each inspection are collated by a single moderator who removes duplicate defect reports.
Code reading has been proposed as an alternative to formal code inspections. In code reading, the inspector simply focuses on reading source code and looking for defects. The author hands out the source listings (1K-10K lines) to two or more inspectors who read the code at a typical rate of 1K lines per day. This is the main step. The inspectors may then meet with the author to discuss the defects, but this is optional. Removing the emphasis on meetings allows for more emphasis on individual defect discovery. In addition, the problems associated with meetings automatically disappear (including scheduling difficulties and inadequate air time).
Stepwise abstraction is a code-reading technique. The inspector decomposes the program into a set of proper subprograms where a proper subprogram is a chunk of code that performs a single function that can be conveniently documented. A proper subprogram implementing a function that cannot be decomposed further is known as a prime subprogram. The program is decomposed until only prime subprograms remain. Then their functions are composed together to determine a function for the entire program. This derived function is then compared to the original specifications of the program.
Mashayekhi, et al., describe the Collaborative Software Inspection (CSI), a software system to support inspections. Computer support is provided for the preparation and meeting steps. CSI assists with online examination of the artifact and recording of inspector comments. In addition, CSI collates the comments into a single list. The main feature of CSI is that it allows the meeting to be geographically distributed, with the artifact being displayed on each inspector's screen and a voice connection that allows people to talk to each other.
Johnson presents the Formal Technical Asynchronous review method (FTArm), a computer-aided inspection method implemented on top of a software environment called the Collaborative Software Review System (CSRS). CSRS was designed to automate the support functions required for various inspection methods without specifying any particular inspection policy. FTArm is geared towards asynchronous software inspections. All comments by reviewers are kept online. The inspection consists primarily of a private review step and a public review step. During the private review step, reviewers cannot see each other's comments. In the public review step, all comments become public and reviewers can build on each other's suggestions. They then vote on whether they agree or disagree with the comments made about each section of the artifact being inspected. This is an example of a collection meeting held in nominal fashion. If unresolved issues remain, they are handled in a conventional face-to-face group review meeting.
Software inspections are one of many techniques for improving the quality of software artifacts. Consequently, before choosing to perform inspections we should ascertain (1) the costs and benefits of individual inspection methods and (2) how the use of a given inspection method affects the costs and benefits of the entire software development process. This section discusses models for measuring the costs and benefits of software inspections and then presents examples of cost-benefit analyses from previous studies.
To measure the local costs and benefits of one or more inspection methods we can construct two models: one for calculating inspection interval and effort, and another for estimating the number of defects in an artifact. These models are depicted in Figure 2.1.
Two of the most important inspection costs are interval and effort. The inspection process begins when an artifact is ready for inspection and ends when the author finishes repairing the defects found. The elapsed time between these events is called the inspection interval.
The length of this interval depends on the time spent working (preparing, attending collection meetings, and repairing defects) and the time spent waiting (time during which the inspection is held up by process dependencies, higher priority work, scheduling conflicts, etc).
To measure inspection interval and its various subintervals, we devised an inspection time model based on visible inspection events. Whenever one of these events occurs, it is timestamped and the event's participants are recorded.
These events occur, for example, when the artifact is ready for inspection, or when a reviewer starts or finishes his or her preparation. This information is entered into a database, and inspection intervals are reconstructed by performing queries against the database. Inspection effort can also be calculated using this information.
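The reconstruction of interval and effort from timestamped events can be sketched as follows. This is an illustration under stated assumptions, not the instrumentation actually used: the event names ("ready", "prep_start", "repair_end", and so on), the in-memory list standing in for the database, and all dates are made up for the example. A collection meeting could be recorded the same way, with one start/end pair per attendee.

```python
from datetime import datetime
from typing import NamedTuple


class Event(NamedTuple):
    inspection: str
    kind: str          # hypothetical event names, e.g. "ready", "prep_start"
    when: datetime
    participant: str


def interval_days(events, inspection):
    """Elapsed calendar time from artifact ready to repair finished."""
    times = {e.kind: e.when for e in events
             if e.inspection == inspection and e.kind in ("ready", "repair_end")}
    return (times["repair_end"] - times["ready"]).total_seconds() / 86400


def effort_hours(events, inspection):
    """Total working time, pairing each *_start with its *_end per person."""
    open_starts = {}
    total = 0.0
    own = sorted((e for e in events if e.inspection == inspection),
                 key=lambda e: e.when)
    for e in own:
        if e.kind.endswith("_start"):
            open_starts[(e.kind[:-len("_start")], e.participant)] = e.when
        elif e.kind.endswith("_end"):
            start = open_starts.pop((e.kind[:-len("_end")], e.participant), None)
            if start is not None:
                total += (e.when - start).total_seconds() / 3600
    return total


# Made-up event log for a single inspection.
events = [
    Event("insp-1", "ready",        datetime(1995, 3, 1, 9, 0),   "author"),
    Event("insp-1", "prep_start",   datetime(1995, 3, 2, 10, 0),  "reviewer_a"),
    Event("insp-1", "prep_end",     datetime(1995, 3, 2, 12, 0),  "reviewer_a"),
    Event("insp-1", "prep_start",   datetime(1995, 3, 3, 13, 0),  "reviewer_b"),
    Event("insp-1", "prep_end",     datetime(1995, 3, 3, 14, 30), "reviewer_b"),
    Event("insp-1", "repair_start", datetime(1995, 3, 6, 9, 0),   "author"),
    Event("insp-1", "repair_end",   datetime(1995, 3, 6, 12, 0),  "author"),
]

print(interval_days(events, "insp-1"))  # calendar days, ready -> repair done
print(effort_hours(events, "insp-1"))   # person-hours of recorded work
```

In this made-up log the interval (a little over five days) is far longer than the recorded effort (a few person-hours); the remainder is the waiting time described above.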
The most important benefit of an inspection is its effectiveness, and one important measure of an inspection's effectiveness is its defect detection ratio - the number of defects found during the inspection divided by the total number of defects in the artifact. Because we never know exactly how many defects an artifact contains, it is impossible to make this measurement directly, and therefore we are forced to use surrogate measures.
Several methods can provide these measures. Each differs in its accuracy (how close it comes to the true measure) and its availability (how early in the software development process it can be applied).
In this section we survey previous work, showing how each study measured the costs and benefits of its proposed inspection method.
The cost-effectiveness of a method may be described anecdotally. Parnas and Weiss applied ADR on an actual review of the design document for the operational flight program of one of the Navy's aircraft.
Mashayekhi, et al. discuss a case study on the use of CSI. This was conducted with 9 student volunteers from a software engineering class and compared the effectiveness of using CSI with face-to-face inspection meetings. The participants were divided into 3 teams, each of which inspected the same 4 pieces of code for a total of 12 inspections. Of these, 5 inspection meetings were randomly selected to use CSI while the rest met face-to-face. The results showed that in only one of the 4 pieces did CSI find more defects. However, because the teams retained their relative rankings across all modules inspected (i.e., Team 1 was always first in each module, Team 2 was always second, Team 3 was always last), the authors concluded that the use of CSI did not have any positive or negative effect on any of them.
Bisant and Lyle ran an experiment using two sets of student projects in a programming language class to study the effects of using a two-person inspection team, with no moderator, on programmer productivity, or time to complete the project. The experiment used a pretest-posttest, control group design. The students were divided into an experimental group, which held inspections, and a control group, which did not. There were 13 students in the experimental group and 19 students in the control group. Neither group inspected its design or code during the first project. For the second project, the members of the experimental group were asked to inspect, along with a classmate, each other's design or code. The results showed that the programming speed of the experimental group improved significantly in the second project.
Knight and Myers carried out an experiment involving 14 graduate students and using a phased inspection with four phases. Each student was involved in exactly one of the phases. The artifact was a C program with more than 4,000 lines and 45 seeded defects, whose types were distributed across those which the four phases are expected to find. The inspections raised a total of 115 issues. (Of these, only about 26 appear to affect the execution of the program.) The inspectors also found 30 of the 45 seeded defects. The amount of effort totaled 66 person-hours. This was determined from the usage of the inspection tool, and from the meeting times of the phases using more than one inspector.
Acknowledging that they cannot make definitive comparisons, Knight and Myers found it interesting to compare their results to Russell's, which are also described in Section 2.2.2. They show that while Russell found 1 defect per hour, the phases found 1.5 to 2.75 defects per hour.
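The totals quoted above can be turned into two summary figures: the detection ratio for the seeded defects and the overall number of issues raised per person-hour. The short calculation below is only a worked illustration using the numbers in the preceding paragraph; the function names are assumptions, and the resulting figures are derived here rather than reported by the original study.

```python
def detection_ratio(found, total):
    """Fraction of the known (seeded) defects that were detected."""
    return found / total


def issues_per_person_hour(issues, person_hours):
    """Issues raised per person-hour of inspection effort."""
    return issues / person_hours


# Totals taken from the paragraph above.
seeded_ratio = detection_ratio(found=30, total=45)
rate = issues_per_person_hour(issues=115, person_hours=66)

print(f"{seeded_ratio:.1%} of seeded defects found")
print(f"{rate:.2f} issues per person-hour")
```

Seeding makes the denominator of the detection ratio known, which is what allows the ratio to be computed directly in this study.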
The N-fold inspection method is based on the idea that no single inspection team can find all the defects in a software requirements document, that N separate inspection teams do not significantly duplicate each other's efforts, and therefore that N inspections will be significantly more effective than one. Replicating the inspection process 5 or 10 times will certainly be expensive, but it might be acceptable for critical systems if the detection rate increased significantly.
To evaluate the hypothesis, they designed and ran an experiment with 27 students who were taking a graduate course in software engineering as subjects. The subjects were divided into 9 inspection teams of 3 persons each. An attempt was made to form evenly matched teams based on background experiences. These teams inspected a single requirements document that was seeded with 99 defects. After the inspections, each recorded defect was to be checked to see if it was one of the 99 seeded defects. If so it was entered into the defect database. The authors then calculated the number of defects found by exactly x teams, where x = 0...N.
The results show that the 9 teams combined found a little more than twice as many of the seeded defects as the average found by any single team (78% compared to 35%). Also, no single defect was found by every team. The authors suggest that this supports their claims that parallel teams do not duplicate each other's work. The inspection took 1.5 weeks, from distribution of the document to completion of the meetings, and used 324 person-hours.
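The tabulation described above, counting how many seeded defects were found by exactly x teams, can be sketched as follows. This is a minimal illustration and not the authors' tooling; the function names and the toy data are hypothetical.

```python
from collections import Counter

def coverage_histogram(team_findings, seeded_defects):
    """Return {x: number of seeded defects found by exactly x teams}."""
    seeded = set(seeded_defects)
    hits = Counter()
    for found in team_findings:
        # Only reports that match a seeded defect are counted.
        for defect in set(found) & seeded:
            hits[defect] += 1
    histogram = {x: 0 for x in range(len(team_findings) + 1)}
    for defect in seeded_defects:
        histogram[hits[defect]] += 1
    return histogram

def combined_detection_ratio(team_findings, seeded_defects):
    """Fraction of seeded defects found by at least one team."""
    found_by_any = set().union(*team_findings) & set(seeded_defects)
    return len(found_by_any) / len(seeded_defects)

# Hypothetical toy data: 3 teams, 5 seeded defects (the IDs are made up).
seeded = ["d1", "d2", "d3", "d4", "d5"]
teams = [{"d1", "d2"}, {"d2", "d3"}, {"d2", "d5", "x9"}]
print(coverage_histogram(teams, seeded))        # {0: 1, 1: 3, 2: 0, 3: 1}
print(combined_detection_ratio(teams, seeded))  # 0.8
```

In the toy data the entry for x = 3 is nonzero because one defect was found by every team; in the experiment reported above the corresponding entry (x = 9) was zero.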
To test the cost-effectiveness of meetingless inspections, Votta collected data from 13 inspections with meetings. He modeled the effort needed to hold depositions by the following formula:
The model suggests that depositions would always take less effort than an inspection meeting, as long as the number of reviewers is not greater than 20. His actual data showed, however, that forgoing inspection meetings would reduce the percentage of defects found by only 5%.
The rationale most often used to justify inspections is that it's cheaper to find and fix defects today than it is to do it later. Several studies have evaluated this conjecture by (1) measuring the costs and benefits of inspections (local analysis) and by (2) estimating the effect of inspections on the rest of the development process (global analysis).
Global analysis usually involves evaluating alternative scenarios (e.g., if we hadn't found those 20 defects during the inspection, how much more testing and rework would we have had to do?). This information is normally extrapolated from historical data and requires that the analyst make strong assumptions about its representativeness. As a result, any analysis of the global cost-benefits of inspections must be examined critically.
The costs of performing inspections include the local costs described in Section 2.2.1 as well as any costs that stem from including inspections in the development process, for example, duplicating inspection artifacts and maintaining inspection reports. Another significant cost comes from increasing schedule. Inspections, like other labor-intensive processes, require group meetings, which can cause delays and increase interval. Since longer intervals may incur substantial economic penalties, this cost must be considered. Extra interval can lead to:
Since these costs are difficult to quantify, we believe that the cost of inspections is often underestimated.
Inspections provide the direct benefit of finding defects. Many people believe that they also positively affect later stages of development by reducing rework, testing, and maintenance. As we mentioned earlier, measuring these benefits directly is impossible and therefore they must be estimated. Of course, any attempt to do this will involve making certain assumptions about how observed data relates to the values being estimated. This section examines several commonly made assumptions and explains why some studies may be overstating the benefits of inspections.
One of the problems with this approach is that inspections may not find the same classes of defects as testing. For example, inspections turn up many issues which do not affect the operational behavior of the system. These defects will never be found by testing. In another example, some studies have shown that almost half the defects found in testing are interface defects, suggesting that inspections are not effectively finding this class of defects, even though effort is spent looking for them.
In this section, we present examples of cost benefit analyses from previous papers on software inspections. We evaluate each one in the context of the four assumptions stated in Section 2.2.2.
The reader must be cautioned that claims on improvement cited by each study occurred within specific development environments, under the influence of many factors not directly related to inspection such as design notation, programming language, development processes, available hardware, process maturity, artifact size, etc. Also, units of measurement may have differing operational definitions.
Fagan studied the use of design and code inspections on an IBM operating system component. The data was compared against that for similar components which did not use inspections. The results showed an increase in productivity, attributed to a minimized overall amount of error rework. For instance, there was a 23% increase in coding productivity compared to projects which did not use inspections. Design and code inspections resulted in a net savings of 94 and 51 person-hours per KNCSL, respectively. This included the cost of defect rework, which was 78 and 36 person-hours per KNCSL for design and for code inspections, respectively. It should be noted that this data is 20 years old! As explained in assumption A4, the advertised benefits may have diminished over the years as technology, defect prevention methods, and software development skills improved.
In a follow-up study, Fagan summarized several industrial case studies of inspection performance. His conclusions were that inspections of a 4,000 line program at AETNA Life and Casualty and a 6,000 line program at IBM detected 82% and 93%, respectively, of all defects detected over the entire life cycle of the programs; that the inspection of a 143,000 line software project at Standard Bank of South Africa reduced corrective maintenance costs by 95%; and that inspection of test plans and test cases for a 20,000 line program at IBM saved more than 85% of programmer effort by detecting major defects through inspection instead of testing.
Russell observed inspections for a two year period at Bell-Northern Research. These inspections found about one defect for every man-hour invested in inspections. He also concluded that each defect found before it reached the customer saved an average of 33 hours of maintenance effort. As the following excerpt shows, the article assumes that the benefit of finding a defect during inspection equals the cost of fixing it after the software has been released.
Here's some more perspective on this data. Statistics collected from large BNR software projects show that each defect in software released to customers and subsequently reported as a problem requires an average of 4.5 man-days to repair. Each hour spent on inspection thus avoids an average of 33 hours of subsequent maintenance effort, assuming a 7.5-hour workday.
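The arithmetic behind this claim can be checked directly. The sketch below uses only the figures quoted above; the variable names are ours.

```python
# Figures quoted from the BNR data above.
repair_days_per_defect = 4.5       # man-days to repair a released defect
hours_per_workday = 7.5            # workday length assumed in the excerpt
defects_per_inspection_hour = 1.0  # about one defect per man-hour of inspection

repair_hours_per_defect = repair_days_per_defect * hours_per_workday
hours_avoided_per_inspection_hour = (
    defects_per_inspection_hour * repair_hours_per_defect
)
print(hours_avoided_per_inspection_hour)  # 33.75, reported as 33 hours
```

The result holds only if every defect found in inspection would otherwise have been released and reported by a customer, which is the assumption noted above.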
Using assumption A1, Doolan calculated that inspecting requirements specifications at Shell Research saved an average of 30 hours of maintenance work for every hour invested in inspections (not including rework).
Bush related the first 21 months of inspection experience at the Jet Propulsion Laboratory. In that time 300 inspections had been conducted over 10 projects. She calculated that inspections cost $105 per defect. (The effort to find, fix, and verify the correction of a defect varies between 1.5 and 2.1 hours, corresponding to a cost between $90 and $120 or an average of $105.) But this saved them $1,700 per defect in costs which would have been incurred by testing and repair. (It was not explained how this value was calculated.) The paper assumes that finding and fixing a defect during inspection costs the same as finding and fixing a defect during test (assumption A2).
Kelly, et al., report on 203 inspections at the Jet Propulsion Laboratory. They showed that inspections cost about 1.6 hours per defect, from planning, overview, preparation, meeting, root cause analysis, rework, and follow-up. This is less than the 5 to 17 hours required to fix defects found during formal testing. Although this calculation requires assumption A2, many of the defects found didn't affect the behavior of the software and wouldn't have been caught by testing.
Weller relates 3 years of inspection experience at Bull HN. In one case study, data at the end of system test showed that inspections found 70% of all defects detected up to that point. In the same project, which was to replace C code with Forth, the developers had initially decided not to do any inspections on the rewritten code, but found that testing was taking six hours per failure. After inspections were instituted, they began to find defects at the cost of less than one hour per defect. In another case study, inspections of fixes dropped the number of defective fixes to half of what it had been without inspections.
Franz and Shih report the effects of using inspection on various artifacts of a sales and inventory tracking project at Hewlett-Packard. They calculate that inspections saved a total of 618 hours (after taking into account the 90 hours needed to perform the inspections). The total time saved by inspection is the time saved in system test plus the time saved by reduced maintenance. System test time is the estimated black box testing effort needed to find each critical defect. Maintenance effort is the estimated effort saved for each noncritical defect. The cost of performing inspections (the time to do preparation, meeting, causal analysis, discussion, rework, and follow-up) is subtracted from these savings. In this particular project, inspections found 12 critical and 78 noncritical defects. Based on an estimated black box testing time of 20 hours per critical defect and 6 hours of maintenance for each noncritical defect, the total time saved amounted to 12 x 20 + 78 x 6 = 708 hours. The estimated black box testing time and noncritical defect maintenance time seem to be loose upper bounds, based also on assumptions A1, A2 and A3. Also, unit and module testing found and fixed another 51 defects at a cost of 310 hours, or about 6 hours per defect. This shows that it would take far less than 20 hours to find and fix the critical defects from inspections if they happen instead to be discovered before system testing.
Discovering defects in unit and module testing saved an estimated 710 hours in subsequent maintenance. While testing seemed to give a lower return on investment than inspections did, it should be noted again that the later the test stage, the longer it takes to find defects. Also note that the 310 hours included machine time (which may be less expensive than people time) to execute the test cases, as explained in assumption A4.
Another interesting point is that noncritical defects comprised 85% of the defects found at inspection. It is not clear how much of the 90 hours invested in inspections were spent looking for and fixing these - they might be dealt with using automated tools, as explained in assumption A4. Also, the return on investment comparison between inspection and testing might be more accurate if only the savings and costs from critical defects found at inspection were considered.
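The figures reported for the Franz and Shih project can be reproduced from the defect counts and per-defect estimates given above. This is a sketch of the arithmetic only; the variable names are ours.

```python
critical_defects = 12
noncritical_defects = 78
test_hours_per_critical = 20     # estimated black box testing time per defect
maint_hours_per_noncritical = 6  # estimated maintenance time per defect
inspection_cost_hours = 90

gross_savings = (critical_defects * test_hours_per_critical
                 + noncritical_defects * maint_hours_per_noncritical)
net_savings = gross_savings - inspection_cost_hours

# Unit and module testing: 51 defects found and fixed in 310 hours.
testing_hours_per_defect = 310 / 51

print(gross_savings)                       # 708
print(net_savings)                         # 618
print(round(testing_hours_per_defect, 1))  # 6.1
```

Because both per-defect estimates are described as loose upper bounds, the 618-hour figure is best read as an upper bound as well.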
Grady and van Slack discuss nearly 20 years of inspection experience at Hewlett-Packard. In one 50,000-line project, they report that design inspections saved a total of 1759 engineering hours in defect-finding effort. (It was not explained how this value was calculated.) The cost was 169 engineering hours in training and start-up. (The cost of performing the actual inspections was not given.) The inspections also shortened the estimated development interval by 1.8 months. Overall, they estimated that inspections saved HP $21.4 million in 1993.
Fowler summarizes the results of several studies on the use of inspections in industry. In one study, a major software organization increased its productivity by 14% from one release to the next after introduction of improved project phasing and tracking mechanisms, including inspections. It also showed a tenfold improvement in quality. Fowler acknowledges, however, that these results cannot be attributed solely to inspections. Another study gave the results of using inspections in developing AT&T's telephone switching system. It claimed that defects detected in inspections cost ten times less to fix than defects found during other development phases (assumption A2). Another study gave the results of using inspections in a project within AT&T's network services. These results showed that inspections are twenty times more effective than testing in finding bugs and make up only 2% of the total cost of testing.
Having looked at how inspection costs and benefits are measured, we now look at studies that investigate the underlying mechanisms driving those costs and benefits.
Some of the studies use student subjects to inspect nonindustrial artifacts (in vitro - in the laboratory) while others are conducted with professional software developers using industrial projects (in vivo - in the industry). Typically, it is more economical to use student subjects, but results may be more generalizable with industrial subjects. Nevertheless, using student subjects is an important first step towards eventually replicating the experiment with professional subjects because the design and instrumentation can then be refined and improved as experience is gained.
Earlier we described the attributes of different inspection methods. Supposedly, different values for these attributes produce different cost-benefit tradeoffs (``How many reviewers should we use?'', ``Do we need a collection meeting?'', etc.). In this section we describe several empirical studies that investigate some of these tradeoffs.
Votta surveyed software developers of AT&T's telephone switching system to find out what factors they believed had the largest influence on inspection effectiveness. The most frequent reason cited was synergy (mentioned by 79% of those polled).
Informally, synergy allows a team working together to outperform any individual or subgroup working alone. The Subarctic Survival Situation exercise dramatically shows this effect. (Groups outperform individuals unless the individual is an arctic survival expert.)
If synergy is fundamental to the inspection process, we would expect to see many inspection defects found only by holding a meeting. That is, few defects would be found in preparation (before the meeting), but many would be found during the meeting. Votta made this measurement as part of a study of capture-recapture sampling techniques for estimating the number of defects remaining in a design artifact after inspection. Figure 2.2 displays data showing that synergy is not responsible for inspection effectiveness (it only accounted for 5% of the defects found by inspections).
Two types of defect detection methods are most frequently used: Ad Hoc and Checklist. Ad Hoc reviewers use nonsystematic techniques and are assigned the same general responsibilities. Checklist reviewers are given a list of items to search for. Checklists embody important lessons learned from previous inspections within a specific environment or domain.
Porter et al. hypothesized that an alternative approach which assigned individual reviewers separate and distinct detection responsibilities and provided specialized techniques for meeting them would be more effective. This hypothesis is depicted in Figure 2.3.
To explore this alternative they prototyped a set of defect-specific techniques called Scenarios - collections of procedures for detecting particular classes of defects. Each reviewer executes a single Scenario and all reviewers are coordinated to achieve broad coverage of the document.
The experiment manipulated five independent variables:
For each inspection they measured four dependent variables:
They evaluated this hypothesis in a controlled experiment, using a partial factorial, randomized experimental design. Forty-eight graduate students in computer science participated in this experiment. They were assembled into 16 three-member teams. Each team inspected two software requirements specifications (SRS) using some combination of ad hoc, checklist and scenario methods.
The experimental results showed
The number of defects found in an inspection is not an adequate indicator of inspection quality because it depends on the initial number of defects in the artifact being inspected. Buck conducted a study at IBM to identify a variable, other than the number of defects found, that would differentiate high quality inspections from low quality ones.
He collected data from 106 code inspections of a single piece of COBOL source code. Next he examined several potential indicators.
The collected data showed that
Thus, the study suggests that quality inspections are a result of following a low inspection rate.
An implied requirement of inspections is understanding the artifact being reviewed. Rifkin and Deimel suggest teaching program comprehension techniques during code inspection training classes in order to improve program understanding during preparation and inspection. Using historical data they argued that while inspections reduced the number of defects discovered by testing, they did not significantly decrease the number of customer-identified defects.
Rifkin and Deimel hypothesized that introducing inspections has had little effect on reducing customer-identified defects because, although reviewers were being thoroughly trained in the group aspects of the inspection process, they were being given little guidance on how to analyze a software work product.
To test this hypothesis, they collected data from three software development groups, each composed of 30-35 professionals. Everyone was familiar with the inspection process. One group was given 1.5 days of training in program reading comprehension. The variable being measured was the number of customer-identified defects reported to each group per day.
The data showed that the number of customer-reported defects dropped by 90% after the reviewers received reading comprehension training, while results of the other two groups of reviewers showed no change.
Holding, or not holding, inspections has an effect on the cost of the overall software development process. Several factors influence the relationship between inspections and the rest of the software development process.
Testing is traditionally the most widespread method for validating software. The tester prepares several test cases and runs each one through the program, comparing actual output with expected output. Testing puts theory into practice: a program thought to work by its creator is applied to a real environment with a specific set of inputs, and its behavior is observed. Defects are normally found one at a time. When the program behaves incorrectly on certain inputs, the author carries out a debugging procedure to isolate the cause of the defect.
Inspections have an advantage over testing in that they can be performed earlier in the software development process, even before a single line of code is written. Defects can be caught early and prevented from propagating down to the source code. In terms of the amount of effort to fix a defect, inspections are more efficient since they find and fix several defects in one pass as opposed to testing, which tends to find and fix one defect at a time. Also, there is no need for the additional step of isolating the source of the defect because inspections look directly at the design document and source code. It may be argued that this additional step in testing is offset by inspection preparation and meeting effort, but testing also requires effort in preparation of test cases and setting up test environments. However, testing is better for finding defects related to execution, timing, traffic, transaction rates, and system interactions. So inspections cannot completely replace testing (although some case studies argue that unit testing may be removed) [1, 64].
The following two studies compare inspection methods with testing methods. The first is a controlled experiment while the second is a retrospective case study.
Comparing the Effectiveness of Software Testing Strategies. Basili and Selby investigated the effectiveness of 3 program validation techniques: functional (black box) testing, structural (white box) testing, and code reading by stepwise abstraction (described in Section 2.1.2).
The goals of the study were to determine which of the 3 techniques detects the most faults in programs, and which detects faults at the highest rate, and to find out if each technique finds a certain class of faults.
A controlled experiment was conducted in which both students and professionals validated 4 different pieces of software. Three independent variables were manipulated: (1) testing technique (functional testing, structural testing, code reading), (2) software type (the four programs), and (3) level of expertise (advanced, intermediate, junior).
The dependent variables measured included (1) number of faults detected, (2) percentage of faults detected (the total number of faults was predetermined), (3) total fault detection time, and (4) fault detection rate.
The experiment was carried out in three phases, the first two with student subjects and the third with professional developers. Each phase validated three of the four programs. The experiment employed a partial factorial design, assigning each subject to validate all three programs using a different technique on each. The sequence of programs and techniques was randomized.
The most interesting result is that code reading was more effective than functional and structural testing at finding faults in the first and third phases and was equally good in the second phase. With respect to fault detection rate, code reading achieved the highest rate in the third phase and the same rate as the testing techniques in the other two phases. Finally, code reading found more interface faults.
Evaluating Software Engineering Technologies. Card, et al. describe a study measuring the importance of certain technologies (practices, tools, and techniques) on software productivity and reliability.
Eight technologies were assessed:
A non-random sample of 22 software projects from NASA Goddard Space Flight Center was chosen. The selection criteria were chosen to minimize the effects of the programming language and the development environment. Variation in the sizes of projects was also minimized. The effects of nontechnological factors were removed - productivity was corrected for computer use (amount of time spent using computers) and programmer effectiveness (development teams' years of experience), while reliability was corrected for programmer experience and data complexity (number of data items per subsystem).
The results showed that no technological factor explained any of the remaining variation in productivity. However, variation in software reliability was reduced using code reading and quality assurance. The authors conclude that since reliability and productivity are positively correlated, improving reliability improves productivity.
We have presented a survey of existing research, paying attention to how each study measured the costs and benefits of holding inspections and how they explained the factors that influence these measurements, at either a local or a global level. From this, we see several deficient areas.
Many of the cost-benefit studies are at least 10 years old. And the newer ones fail to take into account that technological improvements in other areas of software development may have changed the cost-benefit tradeoffs of software inspections. For example, reviewers spend considerable amounts of time identifying and reporting issues that might be found more easily or prevented altogether with automated tools. As new tools appear, inspections may no longer be cost-effective for finding certain kinds of defects.
Most local cost-benefit analyses of one inspection method against another consisted of comparing the methods as a whole. Very little research has gone on to empirically identify the mechanisms which significantly influence inspection costs and benefits. Consequently, it is difficult to determine why a new method is better than an older one.
The effect of inspections on development interval is still poorly understood, as indicated by the lack of existing literature. Global cost-benefit studies talk about cost in terms of effort spent conducting the inspections, but not interval. While it is believed that inspections shorten interval by reducing the need for testing, we believe that the amount of testing is independent of the amount of inspection. In addition, carrying out dozens of inspections may actually lengthen the development interval. This could have serious economic consequences, especially in a highly competitive environment where being the first to introduce a new (even poorly implemented) feature to the market may mean the difference between success and failure of a product. So far there has been no attempt to model and estimate this cost.
To address these deficiencies, in particular, the second one, we started a study to identify mechanisms underlying successful inspections. Since many of the new methods propose changes to the structure of the inspection process, we began by conducting an experiment to assess the effects of manipulating process structure on inspection performance.
Very few inspection studies proposing new methods have directly investigated the effects of the underlying mechanisms independently of the inspection methods they were bundled with. As a result, little is known about what goes on inside inspections and how changing one mechanism affects the effectiveness and cost of the inspection. We believe that changes in different underlying mechanisms cause different tradeoffs between interval, effort, and effectiveness. We began by examining the effects of different process structures. We hypothesize that different process structures have different effects on inspection performance, but that any increase in effectiveness will have a corresponding increase in interval. Specifically,
To evaluate these hypotheses we designed and conducted a controlled experiment. Our purpose was to compare the tradeoffs between minimum interval, minimum effort, and maximum effectiveness of several different process structures.
We ran this experiment at Lucent Technologies on a project that was developing a compiler and environment to support developers of the 5ESS telephone switching system. The finished system contains over 55K new lines of C++ code, plus 10K which was reused from a prototype. See Appendix B for a description of the project.
Our inspector pool consisted of 11 experienced developers, each of whom had received inspection training in the last 5 years. The experiment ran for 18 months (June 1994 to December 1995), during which the team performed 88 code inspections.
The first code units were inspected from July 1994 to September 1994, at which time the first integration build delivered the compiler's front end. After this there were few inspections as the development team tested and modified the front end and continued designing the back end. By January 1995, the back end code became available and there was a steady stream of inspections throughout 1995.
The experiment manipulated 3 independent variables:
The treatments are arrived at by selecting a value for each of the independent variables and are denoted [1 or 2] sessions X [1, 2, or 4] persons [No-repair, Repair]. For example, the label 2sX1pN indicates a two-session, one-person, without-repair inspection. These distributions changed during the experiment because some of the poorly performing or excessively expensive treatments were discontinued.
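The labelling scheme can be made concrete with a small parser. This is a sketch under assumptions: the `Treatment` type and `parse_treatment` function are hypothetical names, and treating the N/R suffix as optional (for labels that carry no repair indicator) is our assumption rather than something stated above.

```python
import re
from typing import NamedTuple, Optional

class Treatment(NamedTuple):
    sessions: int
    persons: int
    repair: Optional[bool]  # None when the label carries no N/R suffix

# <sessions>sX<persons>p followed by an optional N (no repair) or R (repair).
_LABEL = re.compile(r"^([12])sX([124])p([NR])?$")

def parse_treatment(label: str) -> Treatment:
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"not a treatment label: {label!r}")
    sessions, persons, suffix = match.groups()
    repair = None if suffix is None else (suffix == "R")
    return Treatment(int(sessions), int(persons), repair)

print(parse_treatment("2sX1pN"))
# Treatment(sessions=2, persons=1, repair=False)
```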
For each inspection we measured 5 dependent variables:
This experiment used a partial factorial design to compare the interval, effort, and effectiveness of inspections with different team sizes, number of inspection sessions, and coordination strategies. We chose a partial factorial design because some treatment combinations were considered too expensive (e.g., two-session-four-person inspections with and without repair).
To measure interval, effort, and effectiveness, we applied the models described in Section 2.2.1.
The inspection process begins when a code unit is ready for inspection and ends when the author finishes repairing the defects found in the code. The elapsed time between these events is called the inspection interval.
To measure the interval, we put a timestamp on each visible inspection event (ready for inspection, package distributed to reviewers, preparation start/end, meeting start/end, repair start/end) and stored each in an event database. (In most cases this information was obtained from the manually recorded forms described in Section 3.1.4.) The interval between any two events is just the difference of their dates.
Inspection effort was calculated by summing the appropriate subintervals (sum of all preparation and meeting time).
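As an illustration of these two measurements, the sketch below computes an interval as the difference between two timestamped events and an effort as a sum of preparation and meeting time. The event names, timestamps, and durations are made up, and charging the meeting once per attendee (to obtain person-hours) is our assumption.

```python
from datetime import datetime, timedelta

# Hypothetical event log for one inspection.
events = {
    "ready_for_inspection": datetime(1995, 3, 1, 9, 0),
    "meeting_end": datetime(1995, 3, 8, 11, 30),
    "repair_end": datetime(1995, 3, 15, 17, 0),
}

def interval(events, start, end):
    """Elapsed time between two recorded inspection events."""
    return events[end] - events[start]

# Effort: each reviewer's preparation time plus meeting time per attendee.
preparation = [timedelta(hours=2), timedelta(hours=1, minutes=30)]
meeting_duration = timedelta(hours=1)
attendees = 3  # assumption: two reviewers plus the author

effort = sum(preparation, timedelta()) + meeting_duration * attendees

print(interval(events, "ready_for_inspection", "repair_end"))  # 14 days, 8:00:00
print(effort)                                                  # 6:30:00
```

Interval is calendar time, while effort is working time, so the two can differ by orders of magnitude for the same inspection.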
One important measure of an inspection's effectiveness is its defect detection ratio - the number of defects found during the inspection divided by the total number of defects in the code. Because we never know exactly how many defects an artifact contains, it was impossible to make this measurement directly, and therefore we were forced to approximate it.
The estimation procedure needed to be (a) as accurate as possible and (b) available throughout the study because we were experimenting with a live project and needed to identify and eliminate dangerously ineffective approaches as soon as possible.
We found no single approximation that met both criteria. Therefore we tried all three methods described in Section 2.2.1.
We took special care to ensure that the experimental design did not inadvertently influence subject behavior (professional developers and inspectors). Each study participant was given a simple ``bill of rights'', reminding them of their right to withdraw from the study at any time with no recriminations from the researchers or their management. Each participant acknowledged this right at the beginning of the experiment by signing a release form. No subject used this right during the experiment.
In our initial briefings with the development team, we were asked, ``What happens if a treatment costs too much or takes too long?'' They were concerned that the experiment could jeopardize the budget or schedule of the product.
We took this concern seriously and realized that if a treatment was jeopardizing the project's budget, schedule, or quality, we would have to discontinue the treatment. However, the professional developers also realized that they were gaining some valuable knowledge from the study. So our compromise was to discontinue any treatment after enough inspections had been done, and we could convince ourselves that nothing ``unlucky'' had happened. (See Appendix C for more details.)
Threats to internal validity are influences that can affect the dependent variable without the researcher's knowledge. We considered three such influences: (1) selection effects, (2) maturation effects, and (3) instrumentation effects.
Selection effects are due to natural variation in human performance. For example, if one-person inspections are done only by highly experienced people, then their greater than average skill can be mistaken for a difference in the effectiveness of the treatments. We limited this effect by randomly assigning team members for each inspection. This way individual differences were spread across all treatments.
Maturation effects result from the participants' skills improving with experience. Again we randomly assigned the treatment for each inspection to spread any performance improvements across all treatments.
Instrumentation effects are caused by the code to be inspected, by differences in the data collection forms, or by other experimental materials. In this study, one set of data collection forms was used for all treatments. Since we could not control code quality or code size, we randomly assigned the treatment for each inspection.
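The randomization used against these three threats can be sketched as follows. This is an illustration, not the procedure actually used: the reviewer names, the treatment list, and the rule that the sessions of one inspection use disjoint teams are all assumptions, and the repair variable is omitted for brevity.

```python
import random

# Hypothetical inputs: 11 reviewers and the (sessions, persons) combinations
# implied by the labelling scheme, without the two-session, four-person case.
REVIEWERS = [f"reviewer_{i}" for i in range(1, 12)]
TREATMENTS = [(1, 1), (1, 2), (1, 4), (2, 1), (2, 2)]

def assign_inspection(reviewers, treatments, rng):
    """Pick a treatment at random, then draw its reviewers at random."""
    sessions, persons = rng.choice(treatments)
    # Assumption: the sessions of one inspection use disjoint teams.
    drawn = rng.sample(reviewers, sessions * persons)
    teams = [drawn[i * persons:(i + 1) * persons] for i in range(sessions)]
    return (sessions, persons), teams

rng = random.Random(1994)  # fixed seed only to make the example repeatable
treatment, teams = assign_inspection(REVIEWERS, TREATMENTS, rng)
print(treatment, teams)
```

Because both the treatment and the team are drawn independently of the code unit, differences in reviewer skill, experience gained over time, and code quality are spread across treatments rather than concentrated in one of them.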
Threats to external validity are conditions that limit our ability to generalize the results of our experiment to industrial practice. We considered three sources of such threats: (1) experimental scale, (2) subject generalizability, and (3) subject and artifact representativeness.
Experimental scale is a threat when the experimental setting or the materials are not representative of industrial practice. We avoided this threat by conducting the experiment on a live software project.
A threat to subject generalizability may exist when the subject population is not drawn from the industrial population. This is not a concern here because our subjects are software professionals.
Threats regarding subject and artifact representativeness arise when the subject and artifact population is not representative of the industrial population. This may endanger our study because our subjects are members of a development team, not a random sample of the entire development population, and our artifacts are not representative of every type of software that professional developers write.
Our strategy for analyzing the experiment has three steps: resolution analysis, calibration, and hypothesis testing.
We performed the resolution analysis using a Monte Carlo simulation. The simulation indicates that with as few as 5 observations per treatment the experiment can reliably detect a difference as small as .075 in the defect detection rate of any two treatments. The strongest influence on the experiment's resolution is the standard deviation of the code units' defect content - the smaller the standard deviation the finer the resolution. (See Appendix C for more details.)
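The idea behind such a resolution analysis can be sketched in a few lines of Python. This is only an illustration and not the simulation documented in Appendix C: the normally distributed defect content, the independent-detection model, the exact permutation test on mean detection ratios, and every function name and default parameter below are assumptions made for the example.

```python
import itertools
import random


def mean(values):
    return sum(values) / len(values)


def simulate_ratios(rng, n_units, detect_prob, mean_defects, sd_defects):
    """Simulate detection ratios for n_units inspections of one treatment."""
    ratios = []
    for _ in range(n_units):
        # Assumption: defect content per code unit is roughly normal (min 1).
        total = max(1, round(rng.gauss(mean_defects, sd_defects)))
        # Assumption: each defect is found independently with detect_prob.
        found = sum(rng.random() < detect_prob for _ in range(total))
        ratios.append(found / total)
    return ratios


def permutation_p_value(a, b):
    """Exact two-sided permutation test on the difference of means."""
    pooled = a + b
    observed = abs(mean(a) - mean(b))
    indices = range(len(pooled))
    extreme = 0
    count = 0
    for chosen in itertools.combinations(indices, len(a)):
        chosen_set = set(chosen)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in indices if i not in chosen_set]
        # Small tolerance so that ties with the observed split are counted.
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
        count += 1
    return extreme / count


def estimated_power(delta, n_units=5, base_prob=0.3, mean_defects=20,
                    sd_defects=4, trials=200, alpha=0.05, seed=1):
    """Fraction of simulated experiments in which a true difference of
    `delta` in detection probability is declared significant."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = simulate_ratios(rng, n_units, base_prob, mean_defects, sd_defects)
        b = simulate_ratios(rng, n_units, base_prob + delta,
                            mean_defects, sd_defects)
        if permutation_p_value(a, b) < alpha:
            hits += 1
    return hits / trials
```

Calling `estimated_power` for a range of `delta` and `sd_defects` values shows how the detectable difference depends on the number of observations and on the spread of defect content; the hypothetical defaults here are not calibrated to the project data.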
We continuously calibrated the experiment by monitoring the sample mean and variance of each treatment's detection ratio and inspection interval, and the number of observed inspections. Based on this information and the resolution analysis we discontinued some treatments because their effectiveness was so low or their interval was so long that it put the project at risk. We also monitored the experiment to ensure that the distribution of treatments did not produce too few data points to identify statistically significant performance differences.
Once the data was collected we analyzed the combined effect of the independent variables on the dependent variables to evaluate our hypotheses. Once the significant explanatory variables were discovered and their magnitude estimated, we examined subsets of the data to study specific hypotheses.
We designed several instruments for this experiment: preparation and meeting forms, author repair forms, and participant reference cards.
We designed two data collection forms, one for preparation and another for the collection meeting.
The meeting form was filled in at the collection meeting. When completed, it gives the time during which the meeting was held, and a page number, a line number, and an ID for each defect.
The preparation form was filled in during both preparation and collection. During preparation, the reviewer recorded the times during which he or she reviewed, and the page and line number of each issue (``suspected'' defect). During the collection meeting the team decided which of the reviewer's issues were, in fact, real defects. At that time, real defects were recorded on the meeting form and given an ID. If a reviewer had discovered this defect during preparation, then he or she recorded this ID on the preparation form.
The author repair form captured information about each defect identified during the inspection. This information included Defect Disposition (no change required, repaired, deferred); Repair Effort ( , , , or > 8hr ), Repair Locality (whether the repair was isolated to the inspected code unit), Repair Responsibility (whether the repair required other developers to change their code), Related Defect Flag (whether the repair triggered the detection of new defects), and Defect Characteristics (whether the defect required any change in the code, was changed to improve readability or to conform to coding standards, was changed to correct violations of requirements or design, or was changed to improve efficiency).
Each participant received a set of reference cards containing a concise description of the experimental procedures and the responsibilities of the authors and reviewers.
To support the experiment, I joined the development team in the role of inspection quality engineer (IQE). I was responsible for tracking the experiment's progress, capturing and validating data, and observing all inspections. I also attended the development team's meetings, but had no development responsibilities.
When a code unit was ready for inspection, its author sent an inspection request to me. I then randomly assigned a treatment (based on the treatment distributions given in Table 3.1.3) and randomly drew the review team from the reviewer pool. These names were then given to the author, who scheduled the collection meeting. Once the meeting was scheduled, I put together the team's inspection packets.
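The assignment step described above can be illustrated with a short Python sketch. The equal treatment weights are placeholders (the real distribution is the one in Table 3.1.3), and the function name, the exclusion of the author from the pool, and the use of distinct reviewers across sessions are assumptions made for the example.

```python
import random

# Hypothetical weights; the actual treatment distribution is in Table 3.1.3.
TREATMENT_WEIGHTS = {
    "1sX1p": 1, "1sX2p": 1, "1sX4p": 1,
    "2sX1pN": 1, "2sX2pN": 1, "2sX1pR": 1, "2sX2pR": 1,
}


def assign_inspection(author, reviewer_pool, rng):
    """Randomly pick a treatment and draw the review team(s) for one request."""
    labels = list(TREATMENT_WEIGHTS)
    weights = [TREATMENT_WEIGHTS[label] for label in labels]
    treatment = rng.choices(labels, weights=weights, k=1)[0]

    # Labels have the form <sessions>sX<reviewers per session>p[N|R].
    sessions = int(treatment[0])
    team_size = int(treatment[3])

    # Assumption: the author is not eligible to review his or her own unit,
    # and no reviewer serves in both sessions of the same inspection.
    candidates = [r for r in reviewer_pool if r != author]
    drawn = rng.sample(candidates, sessions * team_size)
    teams = [drawn[i * team_size:(i + 1) * team_size] for i in range(sessions)]
    return treatment, teams
```

Passing an explicit `random.Random` instance keeps each assignment reproducible for later auditing.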
The inspection process used in this environment is similar to a Fagan inspection, but there are some differences. During preparation, reviewers analyze the code in order to find defects, not just to acquaint themselves with the code. During preparation reviewers have no specific technical roles (e.g., tester or end-user) and have no checklists or other defect detection aids. All suspected defects are recorded on the preparation form. The experiment places no time limit on preparation, but an organizational limit of 300 NCSL over a maximum of 2 hours is generally observed.
For the collection meeting one reviewer is selected to be the reader. This reviewer paraphrases the code. (Often this involves reading several lines of code at a time and emphasizing their function or purpose.) During this activity, reviewers may bring up any issues found during preparation or discuss new issues. One reviewer acts as the moderator. This person runs the meeting and makes sure all required changes are made. The code unit's author compiles the master list of all defects, and no other reviewer has a predefined role.
Self-reported data tend to have some systematic errors. We minimized these errors by employing direct observation and interviews. I attended almost every collection meeting to ensure that all the procedures were followed correctly. I also answered questions about how to fill out the forms and took extensive field notes. After the collection meeting, the author kept the master list of defects, repaired them, filled out the author repair form, and returned all paperwork to me. After the forms were returned, I interviewed the author to validate any questionable data.
Four sets of data are important for this study: the team defect summaries, the individual defect summaries, the interval summaries, and the author repair summaries. This information is captured on the preparation, meeting, and repair forms.
The team defect summary forms show all the defects discovered by each team. This form is filled out by the author during the collection meeting and is used to assess the effectiveness of each treatment. It is also used to measure the added benefits of a second inspection session by comparing the meeting reports from both halves of two-session inspections with no repair.
The individual defect summary forms show whether or not a reviewer discovered a particular defect. This form is filled out during preparation to record all suspected defects. The data is gathered from the preparation form and is compiled during the collection meeting when reviewers cross-reference their suspected defects with those that are recorded on the meeting form. This information, together with the team summaries, is used to calculate the capture-recapture estimates and to measure the benefits of collection meetings.
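To make the capture-recapture idea concrete, the sketch below estimates the total number of defects in a code unit from two reviewers' preparation reports. The text here does not say which estimator was used; the two-sample Chapman form of the Lincoln-Petersen estimator is an assumption chosen for illustration, as are the function name and the defect IDs.

```python
def chapman_estimate(found_by_a, found_by_b):
    """Estimate total defect content from two reviewers' defect ID sets.

    Assumption: the two reviewers work independently and every defect is
    equally likely to be found, as the basic capture-recapture model requires.
    """
    a = set(found_by_a)
    b = set(found_by_b)
    overlap = len(a & b)  # defects "recaptured" by the second reviewer
    return (len(a) + 1) * (len(b) + 1) / (overlap + 1) - 1
```

With 6 defects found by one reviewer, 4 by the other, and 2 in common, the estimate is (7 x 5) / 3 - 1, or about 10.7 defects in total. The larger the overlap relative to the individual counts, the closer the estimate is to the number of defects already found.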
The interval summaries describe the amount of calendar time that was needed to complete the inspection process. This information is used to compare the average inspection interval and the distribution of subintervals for each treatment.
The author repair summaries characterize all the defects and provide information about the effort required to repair them.
Data reduction is the manipulation of data after its collection. We have reduced our data in order to (1) remove data that is not pertinent to our study, and to (2) adjust for systematic measurement errors.
The preparation and meeting forms capture the set of issues that were raised during each inspection. The reduction we made was to remove duplicate reports from 2-session-without-repair inspections. I performed this task with the help of the code unit's author.
Another reduction was made because, in practice, many issues, even if they went unrepaired, would not lead to incorrect system behavior, and they are therefore of no interest to our analysis.
Although defect classifications are usually made during the collection meeting, we feel that authors understand the issues better after they have attempted to repair them, and therefore, can make more reliable classifications. Consequently, we use information in the repair form and interviews with each author to classify the issues into one of three categories:
The distribution of defect classifications for each treatment appears in Figure 3.1. Across all inspections, 22% of the issues are False Positives, 60% involve Soft Maintenance, and 18% are True Defects. We consider only True Defects in our analysis of estimated defect detection ratio (a dependent variable).
The preparation, meeting, and repair forms show the dates on which important inspection events occur. This data is used to compute the inspection intervals.
We made two reductions to this data. First, we observed that some authors did not repair defects immediately following the collection meeting. Instead, they preferred to concentrate on other development activities, and fix the defects later, during slow work periods. To remove these cases from the analysis, we use only the pre-meeting interval (the calendar period between the submission of an inspection request and the completion of the collection meeting) as our initial measure of inspection interval.
When this reduction is made, two-session inspections have two inspection subintervals - one for each session. The interval for a two-session inspection with no repair is the longer of its two subintervals, since both of them begin at the same time. For a two-session inspection with repair, it is the two sessions placed end-to-end, excluding the repair from the second session.
Next, we removed all non-working days from the interval. Non-working days are defined as either (1) weekend days during which no inspection activities occur, or (2) days during which the author is on vacation and no reviewer performs any inspection activities. We use these reduced intervals as our measure of inspection interval.
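The two interval reductions can be sketched in Python as follows. The counting convention (start date inclusive, end date exclusive), the single `activity_days` set used for both rules, and the function names are assumptions made for the example.

```python
from datetime import date, timedelta


def working_days(start, end, activity_days=(), author_vacation=()):
    """Count working days from start (inclusive) to end (exclusive).

    A day is dropped if it is a weekend day with no inspection activity, or
    a day on which the author is on vacation and no inspection activity
    is recorded.
    """
    activity = set(activity_days)
    vacation = set(author_vacation)
    count = 0
    day = start
    while day < end:
        idle_weekend = day.weekday() >= 5 and day not in activity
        idle_vacation = day in vacation and day not in activity
        if not (idle_weekend or idle_vacation):
            count += 1
        day += timedelta(days=1)
    return count


def inspection_interval(session_intervals, with_repair):
    """Combine the subintervals of a multi-session inspection.

    Without repair the sessions run in parallel, so the longest one governs.
    With repair they run end-to-end; the first value is assumed to already
    include that session's repair period, and the last one excludes repair.
    """
    if with_repair:
        return sum(session_intervals)
    return max(session_intervals)
```

For example, a request made on a Monday with the collection meeting held two Mondays later spans 14 calendar days but only 10 working days if nothing happened on the two weekends.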
Figure 3.2 is a boxplot showing the number of working days from the issuance of the inspection request to the collection meeting (Pre-Meeting), from the collection meeting to the completion of repair (Repair), and the total (Total). The total inspection interval has a median of 21 working days, 10.5 before and 10.5 after the collection meeting.
Table 3.2.2 shows the number of observations for each treatment. Figure 3.3 is a contrast plot showing the interval, effort, and effectiveness of all inspections and for every setting of each independent variable. This information is used to determine the amount of the variation in the dependent variables that is explained by each independent variable. We also show another variable, total number of reviewers (the number of reviewers per session multiplied by the number of sessions). This variable provides information about the relative influence of team size vs. number of sessions.
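The derived variable can be computed directly from a treatment label. The sketch below assumes labels of the form used in this chapter (for example 1sX4p or 2sX2pR); the class and function names are hypothetical.

```python
import re
from typing import NamedTuple, Optional


class Treatment(NamedTuple):
    sessions: int
    reviewers_per_session: int
    repair: Optional[bool]  # None for 1-session inspections


def parse_treatment(label):
    """Parse labels such as '1sX4p', '2sX2pN', or '2sX1pR'."""
    match = re.fullmatch(r"(\d)sX(\d)p([NR]?)", label)
    if match is None:
        raise ValueError("unrecognized treatment label: %r" % label)
    sessions, reviewers, repair = match.groups()
    repair_flag = None if repair == "" else (repair == "R")
    return Treatment(int(sessions), int(reviewers), repair_flag)


def total_reviewers(label):
    """Reviewers per session multiplied by the number of sessions."""
    t = parse_treatment(label)
    return t.sessions * t.reviewers_per_session
```

Under this definition 1sX4p, 2sX2pN, and 2sX2pR all have four reviewers in total, which is what lets team size be separated from number of sessions.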
During preparation, reviewers analyze the code units to discover defects. After all reviewers are finished preparing, a collection meeting is held. These meetings are believed to serve at least two important functions: (1) suppressing unimportant or incorrect defect reports, and (2) finding new defects. In this section we analyze how defect discovery is distributed across the preparation and collection meeting activities.
One input to the collection meeting is the list of defects found by each reviewer during his or her preparation. Figure 3.4 shows the percentage of defects reported by each reviewer that are eventually determined to be true defects. Across all 233 preparation reports, only 13% of all issues turn out to be true defects. We can find no clear relationship between the independent variables and preparation effectiveness ( ).
It is generally assumed that collection meetings suppress unimportant or incorrect defect reports, and that without these meetings, authors would have to process many spurious reports during repair. As the previous section shows, an average of 87% of reviewer reports (100% - 13%) do not involve true defects.
Figure 3.5 shows the suppression rates for all 233 reviewer reports. Across all inspections about 26% of issues are suppressed. This appears to be independent of the treatment ( ).
Another function of the collection meeting is to find new defects in addition to those discovered by the individual reviewers. Defects that are first discovered at the collection meeting are called meeting gains.
Figure 3.6 shows the meeting gain rates for all 131 collection meetings. Across all inspections, 30% of all defects discovered are meeting gains. The data suggests meeting gains are independent of treatment ( ).
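The two meeting measures can be sketched as simple set computations. The sketch assumes that an issue counts as suppressed when it was raised in preparation but not recorded on the meeting form, and that issues are keyed by a page-and-line string; both choices, and the function names, are assumptions made for the example.

```python
def suppression_rate(prep_issues, meeting_defects):
    """Fraction of one reviewer's preparation issues not recorded at the meeting."""
    prep = set(prep_issues)
    if not prep:
        return 0.0
    return len(prep - set(meeting_defects)) / len(prep)


def meeting_gain_rate(prep_reports, meeting_defects):
    """Fraction of recorded defects first discovered at the collection meeting."""
    recorded = set(meeting_defects)
    if not recorded:
        return 0.0
    found_in_prep = set().union(*prep_reports)
    return len(recorded - found_in_prep) / len(recorded)
```

Suppression is computed per reviewer report, while meeting gains are computed per collection meeting, which matches the 233 reports and 131 meetings counted above.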
The common measure of inspection cost is total effort - the number of hours spent in preparation and meeting by each reviewer and author. Figure 3.7 shows the effort spent per KNCSL for each inspection by treatment and for all treatments. Across all treatments, the median effort is about 22 person-hours per KNCSL.
The data suggest that effort increases in direct proportion with the total number of reviewers while the number of sessions and the repair between sessions have no effect. That is, inspections involving 4 reviewers (1sX4p, 2sX2pN, and 2sX2pR) required significantly more effort than inspections involving 2 reviewers ( ). Likewise, inspections involving 2 reviewers (1sX2p, 2sX1pN, and 2sX1pR) required significantly more effort than inspections involving 1 reviewer ( ).
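The effort measure is a simple normalization, sketched below. The function name and the example figures are hypothetical.

```python
def effort_per_kncsl(person_hours, ncsl):
    """Total preparation and meeting hours per thousand NCSL.

    person_hours: hours spent by each participant (reviewers and author).
    ncsl: size of the inspected code unit in non-commentary source lines.
    """
    if ncsl <= 0:
        raise ValueError("ncsl must be positive")
    return sum(person_hours) / (ncsl / 1000.0)
```

For instance, four participants spending 1.5, 2.0, 1.0, and 1.0 hours on a 250-NCSL unit amount to 5.5 person-hours, or 22 person-hours per KNCSL. Because the numerator grows with every added participant, effort scaling with the total number of reviewers is what this measure would lead one to expect.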
Inspection interval is another important, but often overlooked cost. Figure 3.8 shows the inspection interval (pre-meeting only) by treatment and for all treatments.
The cost of increasing team size is suggested by comparing 1-session inspections (1sX1p, 1sX2p, and 1sX4p). Since there is no difference between the intervals ( ), team size alone did not affect interval.
The additional cost of multiple inspection sessions can be seen by comparing 1-session inspections with 2-session inspections (1sX2p and 1sX1p with 2sX2p and 2sX1p inspections). We find that 2sX1p inspections didn't take longer to conduct than 1sX1p inspections ( ), but that 2sX2p inspections took longer to complete than 1sX2p inspections ( ). (This effect is caused solely by the 2sX2pR treatment, since there was no difference between 1sX2p and 2sX2pN inspections ( ). )
The cost of serializing two inspection sessions is suggested by comparing 2-session-with-repair inspections to 2-session-without-repair inspections (2sX2pN and 2sX1pN with 2sX2pR and 2sX1pR inspections). When the teams had only 1 reviewer we found no difference in interval ( ); however, we did see a difference for 2-reviewer teams ( ). This indicates that requiring repair between sessions only increases interval as the team size grows.
Another interesting observation is that the median interval for the 2sX2pR treatment is extremely long (20 days), while all others have a median of only 10 days. Since this treatment took twice as long to complete as the others, we discontinued it early in the experiment. Consequently, we conducted only four of these inspections. Nevertheless, we are convinced that this finding warrants further study, because it suggests that relatively straightforward changes to a process can have dramatic, negative effects on interval.
The primary benefit of inspections is that they find defects. This benefit varied with different inspection treatments. Figure 3.9 shows the observed defect density for all inspections and for each treatment separately.
The effect of increasing team size is suggested by comparing the effectiveness of all 1-session inspections (1sX1p, 1sX2p, and 1sX4p inspections). There was no difference between 2- and 4-person inspections ( ), but both performed better than 1-person inspections (1sX1p vs. 1sX2p: ; 1sX1p vs. 1sX4p: ).
The effect of multiple sessions is suggested by comparing 1-session inspections with 2-session inspections. When team size is held constant (1sX2p vs. 2sX2pN and 1sX1p vs. 2sX1pN inspections), 2-session inspections were more effective than 1-session inspections only for 1-person teams (1sX1p vs. 2sX1pN: ; 1sX2p vs. 2sX2pN: ). However, when total number of reviewers is held constant (1sX2p vs. 2sX1pN and 1sX4p vs. 2sX2pN) there were no differences in effectiveness (1sX2p vs. 2sX1pN: ; 1sX4p vs. 2sX2pN: ).
The effect of serializing multiple sessions is suggested by comparing 2-session-with-repair inspections to 2-session-without-repair inspections (2sX2pN and 2sX1pN with 2sX2pR and 2sX1pR inspections). The data show that repairing defects between multiple sessions didn't increase effectiveness when the team size was one ( ), but did when the team size was two ( ). This result should be viewed with caution, however, because there are only four 2sX2pR and five 2sX1pR inspections, respectively. Also, during the time in which the with-repair treatments were used they performed no differently than did without-repair treatments (2sX1pN vs. 2sX1pR: ; 2sX2pN vs. 2sX2pR: ), and furthermore the overall mean dropped steadily as the experiment progressed possibly exaggerating the differences between the 2sX2pR and 2sX2pN treatments. (See Appendix D for more details.)
We draw several observations from this data: (1) increasing the number of reviewers did not necessarily lead to increased defect discovery, (2) increasing the number of sessions did not always improve performance, (3) splitting one large team into two smaller teams did not increase effectiveness, and (4) repairing defects in between 2-session inspections doesn't guarantee increased effectiveness.
Several software inspection researchers have proposed changes to the structure of the process, hoping to improve its performance. For example, originally researchers claimed that large teams would bring a wide diversity of expertise to an inspection, and, therefore find more defects than would smaller teams. But later researchers believed that smaller teams would be better because they would minimize the inefficiencies of large team meetings. Some argued further that multiple sessions with small teams would be more effective than a single session with a larger team because the small teams would be nearly as effective as large ones, wouldn't duplicate each other's effort and would have more effective collection meetings. Finally, some claim that repairing defects in between multiple sessions would be more effective than two sessions without repair because repair would improve the ability of the second team to find defects.
Our initial analysis suggests, however, that many of these changes have little or no effect on observed defect density. For example,
One possible explanation is that the assumptions driving inspection process changes didn't hold in practice (e.g., repairing defects between multiple sessions didn't improve the ability of the second team to find defects). Another possible explanation is that the treatments had unintended, negative side effects (i.e., the treatment improved some aspect of the inspection while degrading another).
To evaluate these potential explanations we examined the effect of each treatment on several inspection sub-activities.
As long as additional reviewers find some new defects and don't negatively affect collection meeting performance, we would expect larger teams to find more defects than smaller teams, yet we found that 1sX2p inspections performed the same as 1sX4p inspections. Somewhere the supposed advantage of having more reviewers didn't materialize, so we investigated how team size affected both preparation and meeting performance.
First, we investigated two aspects of preparation performance: individual preparation and amount of overlap in the defects found by the reviewers.
Figure 3.10(b) shows the number of defects per NCSL found in preparation by each reviewer in 1sX2p and 1sX4p inspections. There was no difference between the two treatments ( ).
Then we examined the amount of overlap in the reviewers' defect reports. This is the number of defects found by more than one reviewer divided by the total number found in preparation. There was no difference in overlap between 1sX2p and 1sX4p inspections ( ) and both distributions had a median of 0. (See Figure 3.10(c)).
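The overlap measure defined above can be written as a short Python function. The function name and the defect IDs in the example are hypothetical.

```python
from collections import Counter


def preparation_overlap(prep_reports):
    """Defects found by more than one reviewer, divided by all defects
    found in preparation.

    prep_reports: one collection of defect IDs per reviewer.
    """
    counts = Counter(d for report in prep_reports for d in set(report))
    if not counts:
        return 0.0
    shared = sum(1 for n in counts.values() if n > 1)
    return shared / len(counts)
```

If one reviewer finds defects d1, d2, and d3 and another finds d3 and d4, one of the four distinct defects is shared, giving an overlap of 0.25. A median of 0 therefore means that in at least half of the inspections no defect was found by more than one reviewer.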
Next we examined two aspects of meeting performance: defect suppression and meeting gains. We found that defect suppression rates were higher for 1sX4p than for 1sX2p inspections, but not significantly ( ). (See Figure 3.5).
Finally, Figure 3.10(a) shows that there is no difference in the meeting gains per NCSL for 1sX2p and 1sX4p inspections ( ).
One interpretation of these results is that larger teams don't improve inspection performance because meeting gains do not increase as the number of reviewers increases, and because larger teams may suppress a large number of (possibly true?) defects.
Aside from looking at the performance of small teams versus large teams, we want to see the effect of using more than one team of the same size. It is argued that, as long as teams do not significantly duplicate each other's efforts, then adding more teams will result in finding more defects.
However, we saw that while 2sX1pN was better than 1sX1p, 2sX2pN was not better than 1sX2p.
To investigate why 2sX2pN was not better than 1sX2p, we first tested if each session of the 2sX2pN inspections performed as well as 1sX2p. We found that (after removing the outliers) 2sX2pN was significantly lower ( , see Figure 3.11(a)).
Examining the inspection process further into its subtasks, we found that (after removing outliers) the combined preparation per session of 2sX2pN is significantly lower ( , see Figure 3.11(b)) while meeting gains per session are not significantly different ( , see Figure 3.11(c)).
A possible explanation of the lower preparation for 2sX2pN may again be found by comparing the suppression rates. We found that the suppression rates for 2sX2pN are higher than for 1sX2p, but not significantly ( ). (See Figure 3.5).
These results suggest that increasing the number of sessions does not improve inspection performance because the number of defects found per session goes down, possibly because the author, who attends all sessions, ends up suppressing more issues.
Another recommendation that has appeared in the literature is to substitute several small (1- or 2-person) teams for one larger team. This approach should be more effective if the combined defect detection of the smaller teams is greater than that of the single larger team, and if the small teams don't significantly duplicate each other's efforts.
Nevertheless we saw that 2sX2p (2sX1p) inspections did not perform better than 1sX4p (1sX2p) inspections.
To evaluate this explanation we compared the distribution of observed defect densities for 1-session inspections with the sum of the defect densities found in both sessions of the 2-session inspections (defects found by both teams are counted twice). We found that combined defect densities of 2sX1p (2sX2p) inspections are not greater than the defect densities of 1sX2p (1sX4p) inspections. (Compare 2sX1pN ``with dups'' with 1sX2p ( ) and 2sX2pN ``with dups'' with 1sX4p ( ) in Figure 3.12.) We also found that there was effectively no overlap in the defects found by the two sessions. (Compare 2sX1pN ``no dups'' with 2sX1pN ``with dups'' ( ) and 2sX2pN ``no dups'' with 2sX2pN ``with dups'' ( ) in Figure 3.12.)
This data suggests that for our experimental setting overlap among reviewers is a rare occurrence, but that splitting teams did not improve performance because the two smaller teams found no more defects than the one larger team.
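The ``with dups'' and ``no dups'' densities compared above differ only in how defects found by both sessions are counted. A minimal sketch, with hypothetical function and variable names:

```python
def defect_densities(session_defects, ncsl):
    """Return (with_dups, no_dups) defect densities for one inspection.

    session_defects: one collection of defect IDs per session.
    with_dups counts a defect once per session that found it;
    no_dups counts each distinct defect once.
    """
    if ncsl <= 0:
        raise ValueError("ncsl must be positive")
    sessions = [set(s) for s in session_defects]
    with_dups = sum(len(s) for s in sessions) / ncsl
    no_dups = len(set().union(*sessions)) / ncsl
    return with_dups, no_dups
```

When the two values are nearly equal, as reported here, the sessions found almost entirely different defects, so duplication cannot account for the lack of improvement.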
Repairing defects between sessions of a multiple session inspection should result in greater defect detection than not repairing if (1) the teams in the with-repair inspections perform as well as the teams in the without-repair inspections, (2) there are significantly more defects than one team can find alone, and (3) the teams doing without-repair inspection find many of the same defects.
However, we saw that during the period in which with-repair inspections were conducted they did not perform better than without-repair inspections. One or more of the assumptions may have been violated.
To test whether with-repair teams perform as well as without-repair teams we compared defect densities per session of with-repair inspections with those of without-repair inspections completed before the with-repair inspections were discontinued. We found no differences in the performances (2sX1pN vs. 2sX1pR ( ) and 2sX2pN vs. 2sX2pR ( ), see Figures 3.13(a) and 3.14(a)), suggesting that the with-repair teams perform no differently than without-repair teams.
To test whether there are enough defects to warrant two inspection teams we compared the performance of with-repair teams inspecting the same unit. If the second team (inspecting after the repair) consistently found fewer defects than the first team (i.e., the difference between the first and the second is significantly higher than 0), then the first team may have found most of the defects that can be found with current inspection techniques. If not, this suggests that there are more than enough defects to be found by two teams, and that on average, one team is as good as the other. We found some drop-off in defect density for the second team of 2sX1pR inspections ( , see Figure 3.13(b)), but none for the second team of 2sX2pR inspections ( , see Figure 3.14(b)).
To test whether overlap has a significant influence on without-repair inspections we first calculated the number of defects identified by the first team that were subsequently rediscovered by the second team. If we assume that an equal number of new defects would have been found had repair been done prior to the second inspection, then an approximation for the total number of defects that would have been found by the two sessions would be just the sum of the two sessions. We found that this approximate defect density was not different than defect density of the actual without-repair inspections (2sX1pN ``no dups'' vs. 2sX1pN ``with dups'' ( ) and 2sX2pN ``no dups'' vs. 2sX2pN ``with dups'' ( ), see Figures 3.13(c) and 3.14(c)).
These results are based on a very small number of observations and should be viewed with considerable caution. Tentatively, they suggest that multiple-session inspections will improve performance only when there is an excess of defects to be found, and that repairing defects in between multiple sessions may not improve the performance of a second inspection team.
We have run an 18-month experiment in which we applied different software inspection methods to all the code units produced during a professional software development. We assessed the methods by randomly assigning different team sizes, numbers of inspection sessions, author repair activities, and reviewers to each code unit.
In the following Section we summarize our specific results and discuss their implications from points of view of both practitioners and researchers.
For practitioners this suggests that a good deal of effort is currently being expended on issues that might better be handled by automated tools or standards.
For researchers this suggests that developing better defect detection techniques may be much more important than any of the organizational issues discussed in this experiment.
For practitioners this suggests that reducing the default number of reviewers from 4 to 2 may significantly reduce effort without increasing interval or reducing effectiveness.
The implications of this result for researchers are unclear. We need to develop a better understanding of why 4-reviewer teams weren't more effective than 2-reviewer teams. The low level analysis suggests that larger teams may suppress a larger number of possibly true defects. It is also possible that the code was relatively defect-free, so that there weren't too many defects to find. This needs further investigation.
In practice this indicates that 2-session inspections aren't worth their extra effort.
These results are significant for researchers as well. Multiple session methods such as active design reviews (ADR) and phased inspections (PI) rely on the assumption that several one person teams using specially developed defect detection techniques can be more effective than a single large team without special techniques. Some of our experimental treatments mimic the ADR and PI methods (without special defect detection techniques). This suggests that any improvement offered by these techniques will not come just from the structural organization of the inspection, but will depend heavily on the development of defect detection techniques.
In practice, we see no reason to repair defects between multiple sessions. Furthermore, some of the developers in our study felt that the 2-session-with-repair treatments caused the greatest disruption in their schedule. For example, they had to explicitly schedule their repairs although they would normally have used repair to fill slow work periods.
This result also provides some information about the recently proposed phased inspection method. This method requires small teams each using specialized defect detection techniques to perform several inspections in sequence, repairing defects between each session. Our data shows no improvement due solely to the presence of repair. Consequently, without special defect detection techniques the approach is unlikely to be effective.
In Chapter 3, we experimented with different process structures - achieved by manipulating the sizes of the teams, the number of sessions, and by repairing or not repairing defects in between multiple sessions. We determined that changing the independent variables we controlled had little effect on defect detection, but some of the changes dramatically increased the inspection interval.
However, regardless of the treatment used, both defect detection and interval data seemed to vary widely. It is important to study and explain this variance. Not only would it strengthen the credibility of our experiment, but it may also lead to new hypotheses regarding the workings of the inspection process.
There are two questions we want to answer. Is the ``signal'' being swamped by a lot of ``noise''? Are we looking at the wrong mechanisms? (Signal refers to the effect due to our treatments, while noise is caused by other sources of variation).
First, the high variance in defect and interval data lowers the power of our significance tests, i.e., some treatments may actually have a significant effect but the tests showed otherwise. If we can remove the effects of some of the other sources of variation, then the effects of our treatments may become more obvious. Also, by studying the extent to which these sources of variation affect the data in each treatment, we may be able to evaluate how well our experimental design controlled for external variation. This, in turn, may guide the development of better experimental designs.
Second, it is clear from the defect data that other factors affect inspection effectiveness more strongly than the process structure. This suggests that inspection effectiveness may be improved by properly manipulating these other factors. Identifying and understanding these factors may guide future research towards the development of better inspection techniques.
Therefore, we have extended the results of our experiment by studying the variance, identifying the sources, and modeling their influence on inspection effectiveness and interval.
Figure 4.1 is a diagram of the inspection process and associated inputs, e.g., the code unit, the reviewers, and the author. It shows how these inputs interact with each process step. The number and types of issues raised in the preparation step are influenced by which reviewer is selected and by the number of defects originally in the code unit (which in turn may be affected by the author of the code unit). The number and types of issues recorded in the collection step are influenced by the reviewers on the inspection team and the author (who joins the collection meeting), the number of issues raised in preparation, and the number remaining undetected in the code unit.
In looking at the inspection process and its inputs, we see that differences in code unit and reviewers would directly affect inspection outcomes. We must separate the effects of these inputs from the effects of the process structure. Therefore, our focus will be on estimating the amount of variation contributed by these inputs.
Our first question may be refined as ``How will our previous results change when we eliminate the contributions due to variability in the process inputs?'' and ``Did our experimental design spread the variance in process inputs uniformly across treatments?''
Our second question then becomes ``Are the differences due to process inputs significantly larger than the differences in the treatments?'' and ``If so, what factors or attributes affecting the variability of these process inputs have the greatest influence?''
The analysis approach we took in explaining the variance in the data was to build statistical models of the inspection process, guided by what we know about it. Model building involves formulating the model, fitting the model, and checking that the model adequately characterizes the process.
Before we could build a model of the inspection defect detection process, we had to choose between the observed defect density and the actual number of observed defects as the variable to be modeled. The choice depended on the modeling technique we would use. The defect density is widely used in inspection literature. It is also approximately normally distributed, so we can use simple and well-known linear models to analyze the data. On the other hand, the actual number of defects is a more natural response variable because we can think of the inspection process as a process for counting the number of defects. In this case, we can apply the methods of generalized linear modeling, in particular, the Poisson family of generalized linear models.
Figure 4.2 shows the number of defects versus size. The dashed straight line and the curved solid line show the fitted data when we try to explain the variance with just the size variable, using linear modeling and generalized linear modeling, respectively. We can see that they give approximately the same fit so we can use either one.
One problem with linear models is that they may yield negative estimates. So we decided to use the generalized linear model because it is more natural for our problem (fitted values will always be nonnegative). Hence we used the actual number of defects as our dependent variable.
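The contrast between the two kinds of fit can be illustrated with a small sketch. The data below are invented for illustration and are not the data from our experiment; the function names and the use of log(size) as the predictor in the Poisson fit are likewise assumptions made for this example. The sketch fits a straight line by least squares and a Poisson model with a log link by Newton's method, then estimates the defect count for a very small code unit.

```python
import math

def fit_linear(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

def fit_poisson(x, y, iterations=50):
    """Poisson regression log(mu) = a + b*x, fitted by Newton's method."""
    a, b = math.log(sum(y) / len(y)), 0.0  # start from the intercept-only fit
    for _ in range(iterations):
        mu = [math.exp(a + b * xi) for xi in x]
        # Gradient of the log-likelihood with respect to (a, b)
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))
        # Entries of the 2x2 information matrix
        h00 = sum(mu)
        h01 = sum(mi * xi for xi, mi in zip(x, mu))
        h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2x2 system by the explicit inverse
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

sizes = [50, 100, 200, 400, 800, 1600]  # NCSL, made-up values
defects = [0, 1, 1, 3, 6, 14]           # made-up defect counts
log_sizes = [math.log(s) for s in sizes]

a_lin, b_lin = fit_linear(sizes, defects)
a_poi, b_poi = fit_poisson(log_sizes, defects)

small_unit = 20  # a hypothetical 20-line code unit
linear_estimate = a_lin + b_lin * small_unit
poisson_estimate = math.exp(a_poi + b_poi * math.log(small_unit))
print(linear_estimate, poisson_estimate)
```

On these made-up data the straight line extrapolates to a negative defect count for the 20-line unit, whereas the Poisson estimate, being an exponential, stays positive.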
We modeled the defects found in inspection with a generalized linear model (GLM) from the Poisson family. To fit the model, we thought about which factors affecting reviewer and author performance and code unit quality might systematically influence the outcome of the inspection. Some of these are shown in the augmented inspection model in Figure 4.3.
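For reference, a Poisson GLM can be written as follows. This is the textbook form with the canonical log link, which we assume here; the symbols $y_i$, $\mu_i$, $x_{ij}$, and $\beta_j$ are generic notation introduced for this illustration and are not names used elsewhere in this chapter.

```latex
\begin{align*}
y_i &\sim \mathrm{Poisson}(\mu_i), \\
\log \mu_i &= \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij},
\qquad\text{so that}\qquad
\mu_i = \exp\Bigl(\beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}\Bigr) > 0 .
\end{align*}
```

Here $y_i$ stands for the number of defects found in inspection $i$ and the $x_{ij}$ for the explanatory variables (indicator variables for categorical factors). Because the mean is an exponential of the linear predictor, the fitted values cannot be negative.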
Some of the possible variables affecting the number of defects in the code unit include: size, author, time period when it was written, and functionality. Here we examine each one and explain how they might influence the number of defects.
Code Size. The size of a code unit is given in terms of non-commentary source lines (NCSL). It is natural to think that the larger the code unit, the more defects it will contain. From Figure 4.4 we see that there is only a modest correlation between size and number of defects found (cor = 0.4).
Author. The author of the code may inadvertently inject defects into the code unit, some authors possibly more likely than others. There were 6 authors in the project. Figure 4.5 is a boxplot showing the number of defects found, grouped according to the code unit's author. The number of defects depends on the author's level of understanding and implementation experience.
Development Phase. The performance of the reviewers and the number of defects in the code unit at the time of inspection might well depend also on the state of the project when the inspection was held. Figure 4.6 is a plot of the total defects found in each inspection, in chronological order. Each point was plotted in the order the code unit became available for inspection. There are two distinct distributions in the data. The first calendar quarter of the project (July - September 1994) - which has about a third of the inspections - has a significantly higher mean than the remaining period. This coincided with the project's first integration build. With the front end in place, the development team could incrementally add new code units to the system, possibly with a more precise idea of how the new code is supposed to interact with the integrated system, resulting in fewer misunderstandings and defects. In our data, we tagged each code unit as being from ``Phase 1'' if it was written in the first quarter and ``Phase 2'' otherwise.
At the end of Phase 1, we met with the developers to evaluate the impact of the experiment on their quality and schedule goals. We decided to discontinue the 2-session treatments with repair because they effectively have twice the inspection interval of 1-session inspections of the same team size. We also dropped the 1-session, 1-person treatment because inspections using it found the lowest number of defects.
Figure 4.7 shows a time series plot of the number of issues raised for each code unit inspection. While the number of true defects being raised dropped as time went by, the total number of issues did not. This shows that either the reviewers' defect detection performance was deteriorating over time, or the authors were learning to prevent the true defects but not the other kinds of issues being raised.
Functionality. Functionality refers to the compiler component to which the code unit belongs, e.g., parser, symbol table, code generator, etc. Some functionalities may be more straightforward to implement than others, and, hence, will have code units with fewer defects. Figure 4.8 is a boxplot showing the number of defects found, grouped according to functionality.
Table 4.3.2 shows the number of code units each author implemented within each functional area. Because of the way the coding assignments were partitioned among the development team, the effects of functionality are confounded with the author effect. For example, we see in Figure 4.8 that SymTab has the lowest number of defects found. However, Table 4.3.2 shows that almost all the code units in SymTab were written by author 6, who has the lowest number of reported defects. Nevertheless, we may still be able to speculate about the relative impact of the two factors by examining those functionalities with more than one author (CodeGen) and authors implementing more than one functionality (author 6).
In addition, functionality is also confounded with development phase as Phase 1 had most of the code for the front end functionalities (input-output, parser, symbol table) while Phase 2 had the back end functionalities (code generation, report generation, libraries).
Because author, phase, and functionality are related, they cannot all be considered in the model as they account for much of the same variation. In the end, we selected functionality as it is the easiest to understand.
Pre-inspection Testing. The code development process employed by the developers allowed them to perform some unit testing before the inspection. Performing this would remove some of the defects prior to the inspection. Figure 4.9 is a scatter plot of pre-inspection testing effort against observed defects in inspection. One would suspect that the number of observed defects would go down as the amount of pre-inspection testing goes up, but this pattern is not observed in Figure 4.9.
A possible explanation for this is that testing patterns during code development may have changed over time. As the project progressed and a framework for the rest of the code was set up, it may have become easier to test the code incrementally during coding. This may result in code which has different defect characteristics compared to code that was written straight through. It would be interesting to do a longitudinal study to see if these areas had high maintenance cost.
Here we examine how different reviewers affect the number of defects detected. Note that we only look at their effect on the number of defects found in preparation, because their effect as a group is different in the collection meeting's setting.
Reviewer. Reviewers differ in their ability to detect defects. Figure 4.10 shows that some reviewers find more defects than others. Even for the same code unit, different reviewers may find different numbers of defects. This may be because they were looking for different kinds of issues. Reviewers may raise several kinds of issues, which may either be suppressed at the meeting, or classified as true defects, soft maintenance issues (issues which required some non-behavior-affecting changes in the code, like adding comments, enforcing coding standards, etc.), or false positives (issues which were not suppressed at the meeting, but which the author later regarded as non-issues). Figure 4.12 shows the mean number of issues raised by each reviewer as well as the percentage breakdown per classification. We see that some of the reviewers with low numbers of true defects (see Figure 4.10), like Reviewers H and I, simply do not raise many issues in total. Others, like Reviewers J and K, raise many issues but most of them are suppressed. Still others, like Reviewers E and G, raise many issues but most turn out to be soft maintenance issues.
The members of the development team (Reviewers A to F) raise significantly more total issues, though a very high percentage turn out to be soft maintenance issues, possibly because, as authors of the project, they have a higher concern for its long-term maintainability than the rest of the reviewers. An exception is Reviewer F, who found almost as many true defects as soft maintenance issues.
Preparation Time. The amount of preparation time is a measure of the amount of effort the reviewer put into studying the code unit. For this experiment, the reviewers were not instructed to follow any prescribed range of preparation time, but to study the code for as much time as they thought they needed. Figure 4.13 plots preparation time against defects found. Generally, there is a positive trend but little correlation. However, even if there were a high correlation, the amount of preparation time depends not only on the amount of effort the reviewer is planning to put into the preparation, but also on the code unit itself. In particular, it is influenced by the number of defects existing in the code, i.e., the more defects the reviewer finds, the more additional time he spends in preparation. Hence, high preparation time may be considered partly a consequence of detecting a large number of defects, rather than a causal factor. Further investigation is needed to quantify the effect of preparation time on defects found as well as the effect of defects found on preparation time. Because there is no way to tell how much of the preparation time was due to reviewer effort and how much to the number of defects, we decided not to include it in the model.
Team-specific variables also add to the variance in the number of meeting gains.
Team Composition. Since different reviewers have different abilities and experiences, and possibly interact differently with each other, different teams also differ in combined abilities and experiences.
Apparently, this mix tended to form teams with nearly the same performance. This is illustrated in Figure 4.14 which shows number of defects found by different 2-person teams in each 2sX2pN inspection. Most of the time, the two teams found nearly the same number of defects. This may be due to some interactions going on between team members. However, because teams are formed randomly, there are only a few instances where teams composed of the same people were formed more than once, not enough to study the interactions.
We incorporated the team composition into the model by representing it as a vector of boolean variables, one variable per reviewer in the reviewer pool. When a particular reviewer is in that collection meeting, his corresponding variable is set to ``True''.
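The encoding described above can be sketched as follows. The pool of eleven reviewers labeled A through K is an assumption based on the reviewer labels mentioned in this chapter, and the function name `team_indicators` is hypothetical.

```python
# Assumed reviewer pool: Reviewers A through K (labels as used in this chapter).
REVIEWER_POOL = list("ABCDEFGHIJK")

def team_indicators(team, pool=REVIEWER_POOL):
    """Return one boolean per reviewer in the pool: True if on the team."""
    unknown = set(team) - set(pool)
    if unknown:
        raise ValueError("reviewers not in pool: " + ", ".join(sorted(unknown)))
    return [reviewer in team for reviewer in pool]

# A hypothetical 2-person collection meeting attended by Reviewers B and F.
row = team_indicators({"B", "F"})
print(dict(zip(REVIEWER_POOL, row)))
```

Each collection meeting then contributes one such row to the model's data, so that the coefficient fitted for a reviewer's variable reflects the change in expected meeting gains when that reviewer is present.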
Meeting Duration. The meeting duration is the number of hours spent in the meeting. In the meeting, one person is appointed the reader, and he reads out the code unit, paraphrasing each chunk of code. The meeting revolves around him. At any time, reviewers may raise issues related to the particular chunk being read and a discussion may ensue. All these contribute toward the pace of the meeting. Generally, the meeting duration is positively correlated with the number of meeting gains, as shown in Figure 4.15. As with the case of preparation time, the meeting duration is partly dependent on the number of defects found, as detection of more defects would trigger more discussions, thus lengthening the duration. It is also dependent on the complexity or readability of the code. Further investigation is needed to determine how much of the meeting duration is due to the team effort independent of the complexity and quality of the code being inspected. For similar reasons as with preparation time, we did not include this in the model.
Combined Number of Defects Found in Preparation. The number of defects already found going into the meeting may also affect the number of defects found at the meeting. Each reviewer gets a chance to raise each issue he found in preparation as a point of discussion, possibly resulting in the detection of more defects. Figure 4.16 shows little correlation between number of defects found in the preparation and in the meeting.
A generalized linear model (GLM) from the Poisson family was constructed from the factors described above. We started with a model which had Functionality, Size, all reviewers, and the original treatment variables, TeamSize, Sessions, Repair. With the help of stepwise model selection, we selected those factors that significantly affect the variance in the defect data. We then added and dropped terms until we found a physically explainable model.
The final set of variables is Functionality, Size, and the presence of Reviewers B and F. This is represented by the model formula:

    Defects ~ Functionality + log(Size) + ReviewerB + ReviewerF
In this model, Defects is the number of defects found in each of the 88 inspections. Note that the presence of certain reviewers in the inspection team strongly affects the outcome of the inspection (Reviewers B and F). The logarithmic transformation was applied to the Code Size because it gives the best fit for the model. The resulting model explains of the variance using just 10 degrees of freedom.
A model is adequate if it estimates the data reasonably and its residuals are just ``white noise,'' i.e., its residuals have no detectable pattern of data. Figure 4.17 gives a graphical test of these two conditions. The left plot shows the values estimated by the model compared to the original values. The presence of a correlation suggests that the model reasonably estimates the original data. The right plot shows the values estimated by the model compared to the residuals. There appears to be no discernible pattern in the plot, suggesting that the residuals are random.
The inspection model is a high level description of the inspection defect detection process. It is useful for comparing the effects of the process inputs against the effects of the treatment factors on the variance of the overall data. But we also know that defect detection in inspections is performed in two steps, preparation and collection. These two may be considered as independent processes which can be modeled separately. This has several advantages. We can understand the resulting models of the simpler, separate processes better than the model for the composite inspection process. In addition, there are more data points to fit - 233 individual preparations and 130 collection meetings, as opposed to 88 inspections.
Using stepwise model selection, we selected the variables that significantly affect the variance in the preparation data. These are Functionality, Size, and Reviewers B, E, F, and J. This is represented by the model formula:

    PrepDefects ~ Functionality + log(Size) + ReviewerB + ReviewerE + ReviewerF + ReviewerJ
In this model, PrepDefects is the number of defects found in each of the 233 preparation reports. The presence of all the significant factors from the overall model at this level gives us more confidence in the validity of the overall model.
Using stepwise model selection to select the variables that significantly affect the meeting data, we ended up with Functionality, Size, and the presence of Reviewers B, F, H, J, and K. This is represented by the model formula:

    MeetingGains ~ Functionality + log(Size) + ReviewerB + ReviewerF + ReviewerH + ReviewerJ + ReviewerK
In this model, MeetingGains is the number of defects found in each of the 130 collection meetings. This is again consistent with the previous two models.
We are now in a position to answer the questions raised in Section 4.2 with respect to inspection effectiveness.
In this analysis, we build a GLM composed of the significant process input factors plus the treatment factors and check if their contributions to the model would be significant.
The effect of increasing team size is suggested by plotting the residuals of the overall inspection model, grouped according to Team Size (Figure 4.18(a)). We observe no significant difference in the distributions. When we included the Team Size factor into the model, we saw that its contribution was not significant (p = 0.6, see Table 4.3.5).
The effect of increasing sessions is suggested by plotting the residuals of the overall inspection model, grouped according to Session (Figure 4.18(b)). We observe no significant difference in the distributions. When we included the Session factor into the model, we saw that its contribution was not significant (p = 0.5).
The effect of adding repair is suggested by plotting the residuals of the overall inspection model (for those inspections that had 2 sessions), grouped according to Repair policy (Figure 4.18(c)). We observe no significant difference in the distributions. When we included the Repair factor into the model, we saw that its contribution was not significant (p = 0.2).
We want to determine if the factors of the process inputs which significantly affect the variance are spread uniformly across treatments. This is useful in evaluating our experimental design. We took each of the significant factors in the overall inspection model and tested if they are independent of the treatments. For each factor, we built a contingency table, showing the frequency of occurrence of each value of that factor within each treatment. We then used Pearson's chi-squared test for independence[8, pp. 145-150]. If the result is significant, then the factor is not independently distributed across the treatments.
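The test of independence on a contingency table can be sketched as follows. The counts in the example table are invented for illustration and do not come from our experiment; the function names are hypothetical. The p-value is obtained from the series expansion of the regularized lower incomplete gamma function, so the sketch needs only the standard library.

```python
import math

def chi_squared_statistic(table):
    """Pearson's statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the factor were independent of treatment
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof

def chi2_sf(stat, dof, terms=200):
    """Upper-tail probability P(X >= stat) for a chi-squared variable."""
    a, x = dof / 2.0, stat / 2.0
    if x <= 0:
        return 1.0
    # Series: gamma(a, x) = x^a e^-x * sum_n x^n / (a (a+1) ... (a+n))
    term = 1.0 / a
    series = term
    for n in range(1, terms):
        term *= x / (a + n)
        series += term
    lower = series * math.exp(-x + a * math.log(x) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

# Rows: a reviewer present / absent; columns: three treatments (made-up counts).
table = [[3, 6, 5],
         [9, 8, 11]]
stat, dof = chi_squared_statistic(table)
p_value = chi2_sf(stat, dof)
print(stat, dof, p_value)
```

A small p-value would indicate that the factor is unevenly spread across the treatments. With sparse tables the chi-squared approximation is rough, so the result should be read with some caution.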
Results show that the distribution of Reviewer B is independent of treatment (p = 0.6) while Functionality (p = 0.05) and Reviewer F (p = 0.06) may be unevenly assigned to treatments. Examining further shows us that Reviewer F never got to do any 1sX1p inspections, and that Functionality was not distributed evenly because some functionalities were implemented earlier than others, when there were more treatments.
Contingency tables only work with response variables that have discrete values. To test the independence of log(Size) from treatment, we modeled it instead with a linear model, log(Size) ~ Treatment, to determine if the treatment contribution to log(Size) is significant. The ANOVA result (p = 0.7) shows that it is not, indicating that there is no dependence between code sizes and treatment.
Table 4.3.5 shows our original treatment factors as well as the identifiable factors affecting reviewer performance and code unit quality. A generalized linear model was fitted using these factors. The sum of squares given for each factor is calculated by fitting a GLM without that particular factor and taking the difference in the residual sum of squares between the GLM with the factor removed and the one with all factors included. These sums of squares give a measure of the amount of variance explained by each factor relative to the other factors. We can clearly see that the process inputs are very significant while the treatment factors do not contribute significantly to the variance. This shows that differences in code units and reviewers drive inspection performance more than differences in any of our treatment variables. This suggests that relatively little improvement in effectiveness can be expected of additional work on manipulating the process structure.
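The drop-one comparison described above can be sketched as follows. To keep the example short, the sketch uses ordinary least squares as a stand-in for the GLM fit, which is a simplification and not the procedure we used; the factor names `useful` and `weak`, the function names, and the data are all invented for illustration.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def residual_ss(columns, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, r)) for r in rows]
    return sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

def drop_one(factors, y):
    """Increase in residual sum of squares when each factor is left out."""
    full = residual_ss(list(factors.values()), y)
    result = {}
    for name in factors:
        reduced = [col for other, col in factors.items() if other != name]
        result[name] = residual_ss(reduced, y) - full
    return result

factors = {
    "useful": [0.0, 1.0, 2.0, 3.0, 4.0, 5.0],  # strongly related to y
    "weak": [1.0, 0.0, 1.0, 0.0, 1.0, 0.0],    # only weakly related to y
}
y = [2.1, 4.9, 8.2, 10.8, 14.1, 16.9]
contributions = drop_one(factors, y)
print(contributions)
```

A factor whose removal barely changes the residual sum of squares explains little of the variance that the other factors do not already explain, which is the sense in which the treatment factors contribute little in Table 4.3.5.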
The dominance of process inputs over process structure in explaining the variance also suggests that more improvements in effectiveness can be expected by studying the factors associated with reviewers and code units that drive inspection effectiveness.
The fact that differences in code units strongly affect defect detection effectiveness suggests that inspections may be effective only for certain types of code units. It is important to study the attributes that influence the number of defects in the code unit. Of the code unit factors we studied, code size was the most important in all the models. This is consistent with the accepted practice of normalizing the defects found by the size of the code. The next most important factor is functionality. This may indicate that code functionalities have different levels of implementation difficulty, i.e., some functionalities are more complex than others. Because functionality is confounded with authors, it may also be explained by differences in authors. And because it is also confounded with development phase, another possible explanation is that code functionalities implemented later in the project may have fewer defects due to improved understanding of requirements and familiarity with the implementation environment.
The choice of people to use as reviewers strongly affects the defect detection effectiveness of the inspection. The presence of certain reviewers (in particular, Reviewer F) is a major factor in all the models. It suggests that improvements in effectiveness may be expected by selecting the right reviewers or by studying the characteristics and background of the best reviewers and the implicit techniques by which they study code and detect defects.
Using the same set of factors, we also built a statistical model of the interval data. We measured the interval from submission of the code unit for inspection up to the holding of the collection meeting. Unlike defect detection though, we do not see any further decomposition of the inspection process that drives the interval. The author schedules the collection meeting with the reviewers and the reviewers fit in some time before this to do their preparation. So instead of splitting the inspection process into preparation and collection, we just modeled the overall inspection process, with our dependent variable being the interval from submission to meeting.
A linear model was constructed from the factors described in the previous section. We started with a model which had Author, Size, all reviewers, and the original treatment variables, TeamSize, Sessions, Repair. With the help of stepwise model selection, we selected Author, Reviewer I, and the treatment factor Repair. This is represented by the model formula:

    Interval ~ Author + ReviewerI + Repair
In this model, Interval is the number of days from availability of code unit for inspection up to the last collection meeting. The presence of Reviewer I in the inspection team strongly affects the length of the inspection interval. The model explains of the variance using 7 degrees of freedom.
Figure 4.19 shows the fitted values plotted against the original values and against the residuals. The left plot shows the values estimated by the model compared to the original values. The presence of a correlation suggests that the model reasonably estimates the original data. The right plot shows the values estimated by the model compared to the residuals. The residuals appear to be independent of the fitted values, suggesting that the residuals are randomly distributed.
We are now in a position to answer the questions raised in Section 4.2, with respect to inspection interval.
In this analysis, we build a linear model, composed of the significant process input factors plus the treatment factors and check if their contributions to the model would be significant.
The effect of increasing team size is suggested by plotting the residuals of the interval model consisting only of input factors, grouping them according to Team Size (Figure 4.20(a)). We observe no significant difference in the distributions. When we included the Team Size factor into the model, we saw that its contribution was not significant (p = 0.4, see Table 4.4.2).
The effect of increasing sessions is suggested by plotting the residuals of the interval model consisting only of input factors, grouping them according to Session (Figure 4.20(b)). We observe no significant difference in the distributions. When we included the Session factor into the model, we saw that its contribution was not significant (p = 0.3).
The effect of adding repair is suggested by plotting the residuals of the interval model consisting only of input factors (for those inspections that had 2 sessions), grouping them according to Repair policy (Figure 4.20(c)). We have already seen that Repair has a significant contribution (p = 0.04) to the model in the previous section and this is supported by the plot.
As with the defect detection model, we built contingency tables, showing the frequency of occurrence of each significant factor within each treatment and performed Pearson's chi-squared test for independence between the factor and treatment. Results show that both Author (p = 0.27) and Reviewer I (p = 0.12) are independent of the treatment.
Table 4.4.2 shows the factors affecting inspection interval and the amount of variance in the interval that they explain. We can see that some treatment factors and some process input factors contribute significantly to the interval. Among treatment factors, Repair contributes significantly to the interval. This shows that while changes in process structure do not seem to affect defect detection, they do affect interval.
In developing the model, we found that the author of the code unit has a significant effect on the inspection interval. This is plausible, since the author is the one who schedules the inspection.
These results also show that while process inputs explain a good part of the variance in inspection interval, they do not explain all. Even accounting for process structure factors, only of the variance is explained. Clearly, other factors, apart from the process structure and inputs affect inspection interval. Some of these may have to do with interactions between multiple inspections, developer and reviewer calendars, and project schedule, i.e., the process environment. This deserves further investigation.
Our proposed models of the inspection process proved useful in explaining the variance in the data gathered from our experiment, enabling us to show it is caused mainly by factors other than the treatment variables.
When the effects of these other factors are removed, the result is a data set with significantly reduced variance across all of the treatments, improving the resolution of our experiment. Even accounting for the variance (noise) caused by the process inputs, we showed that the results of our experiment do not change (we see the same signal).
Our results also show that when process inputs are accounted for, it made little difference in defect detection which process structure was followed. This reinforces initial findings that inspection methods which propose to modify the structure are largely ineffective in improving the defect detection rate. This is especially important considering that a significant percentage of the research conducted to improve inspection techniques has proposed modifications to the process structure[49, 38, 59].
We believe that to develop better inspection methods, we must investigate instead the technical processes being used to carry out the steps in the inspection. We must develop better reading techniques[3, 55] for the preparation step and asynchronous techniques for the collection step[19, 48, 34, 45].
With respect to interval, our results support the suggestion in our experiment that doing repair between sessions of a two-session inspection can significantly increase the inspection interval.
We have conducted an experiment to investigate the effects of changes in the structure of the software inspection process (team size, number of sessions, and repair between multiple sessions) upon the effectiveness (defect detection rate) and the inspection interval. We have extended the analysis by studying the effects of process inputs on these two dependent variables.
Our results showed that altering the 3 independent variables related to process structure was largely ineffective in improving the effectiveness of the inspection. We found that 1-person inspections performed poorly compared to 2-person inspections, but on increasing the team size from 2 to 4 persons (1sX2p to 1sX4p), we found that the increase in observed defect density owing to the additional reviewers was not enough to make a significant difference. Similarly, on increasing the number of sessions (1sX1p to 2sX1pN and 1sX2p to 2sX2pN), we found that the increase in observed defect density was not enough to make a significant difference. Finally, there was not much overlap between sessions to be eliminated, so performing repair in between two sessions would not have significantly increased the observed defect density. From this discussion, the 1sX2p inspection is the best, since it requires the least effort among the treatments that performed well.
On the other hand, our results showed that inputs (code units, reviewers) into the inspection process may significantly affect effectiveness. Improving the quality of the incoming code units is, obviously, out of the scope of inspections. So we concentrate on the reviewers. In particular, the presence of certain reviewers was an important factor. This suggests that training people to be better reviewers by having them emulate the practices of the best reviewers, or by teaching them program comprehension techniques[3, 56] as well as systematic detection techniques, may have more effect on increasing the number of defects found than any changes in the structure of the inspection process.
Hence, from the standpoint of cost-effectiveness, we recommend the 1sX2p inspection. But from the standpoint of improving effectiveness, we recommend improving the technical processes by which reviewers study code and detect defects.
Our results showed that certain combinations of the 3 independent variables dramatically increased the inspection interval. In particular, adding repair in between 2-person inspections resulted in inspections which took twice as long to finish. Repair was confirmed to be a significant variable in subsequent analysis using statistical models.
Hence, we recommend that to improve interval, attention must be given to the structural organization of the inspection process.
This study and its results have several implications for the design and analysis of industrial experiments.
Because this was an experiment done in vivo, on a project dealing with real deadlines, budgets, and customers, we would have put the project at risk if any one of our treatments turned out to be too costly or too ineffective. We agreed that treatments like these must be terminated. At the same time, we wanted to be reasonably certain that we had enough points to determine that the ineffectiveness or costliness was due to the treatment and not random chance. So we ran a simulation of the experiment (see Appendix C) and found that, with as few as 5 points, we could tell if two treatments were different. At the end of the first calendar quarter of the experiment, we did discontinue several treatments because they were either significantly less effective or costlier than other treatments. Industrial experimenters must be aware of the risks they introduce to the projects under study.
Variances in reviewer and author performance do, of course, affect the variance in the defect detection data, but far less than previously suggested. Figure 4.5 shows that the highest author median and the lowest author median differ by only 3 defects. Figure 4.11 shows that, for the same code unit, the most successful reviewer found at most 8 more defects than the least successful reviewer. This difference appears to be even smaller between teams (see Figure 4.14). This contradicts previous studies which suggested that the performance of software developers might differ by orders of magnitude, and which had therefore discouraged the use of experiments in empirical software studies.
The overall drop in defect data across time (see Figure 4.6) underscores the fact that researchers doing long-term studies must be aware that some characteristics of the processes they are examining may change during the study.
Although a significant amount of software inspection research has focused on making structural changes (team size, number of sessions, etc.) to the process, these changes had little or no effect in our experiment. Consequently, we believe that to develop better inspection methods we must investigate and improve not the way the steps in the inspection process are organized (structure), but the way they are carried out by reviewers (technique). We need to learn the implicit techniques by which good reviewers study code and detect defects. We need to continue studies on the effect of different defect detection techniques available for the preparation step. We also need to continue studies on the effects of reading comprehension techniques. Further research is also needed to quantify the costs and benefits of asynchronous collection techniques[19, 48, 34, 45].
We have tried to measure the value added by having inspection meetings. Since meetings take time and effort to schedule and hold, it is important to understand what goes on inside them and determine if the benefits associated with meetings can be achieved in some other, less expensive way. We also found that the meeting gain rates we measured are higher than those reported in an earlier study by Votta. It is extremely important that contradictory findings be examined and resolved.
We believe that the inspections in the two studies had been using different process techniques in the sense that there were different intentions in the manner each step is carried out. In our study, we think the reviewers used the preparation step literally as a time to prepare for the meeting, familiarizing themselves with the artifact by making one pass at detecting defects, and the collection step as the central defect detection step of the inspection (Preparation-Inspection). In Votta's study, we think the reviewers used the preparation step as the time to detect defects and the collection step as merely a defect collection step (Detection-Collection). Furthermore, we think the driving factor why our experiment yielded higher meeting gains was the fact that the reviewers get a second intensive pass at the artifact during the meeting. We hypothesize that such a second pass need not be done in a meeting, but the same effect can be achieved by having a second individual detection step (Detection-Detection). We are currently conducting another experiment to test this hypothesis.
The 2sX2pR treatment had an interval twice that of the other treatments. Although we were able to gather only four observations, the magnitude of this difference surprises us. Furthermore, it highlights the fact that although researchers frequently argue for changes to software development processes, we have no reliable methods for predicting the effect of these changes on development interval.
When we studied the sources of variation affecting interval data, we found that process inputs explain some of the variance, but much of it is still unaccounted for. We believe that no credible experiment can be considered complete until the variance has been adequately explained. The author of the code unit is one of the significant inputs, suggesting that his personal schedule and deadlines may be a significant influence on inspection interval. This must be examined further. To address this, we must study the process environment including interactions between inspections, developer and reviewer calendars, and project schedule.
It is also important to study how all these interactions involving inspections affect the development interval as a whole. So far there has been little attempt to investigate this cost, which we believe to be significantly higher than has been realized.
Future research will need to find better models for both the inspection and development intervals, perhaps utilizing queueing and other simulation models[2, 9].
An alternative explanation for the results of the experiment is that the code being inspected was relatively defect-free, so that there weren't too many defects to find, regardless of treatment. And even if there were, it must be noted that observed defect density is an imprecise estimate of the defect detection ratio. In either case, it is important to get a better measure of effectiveness. We attempted to use capture-recapture estimation to derive more precise estimates, but for several reasons this method failed to produce useful estimates. Therefore, the only method we can find to improve our estimates is to track the artifact through testing and field deployment, in the process, counting the number of additional defects found. We are instrumenting the development process to capture this data.
Finally, we feel it is important that others attempt to replicate our work, and we are preparing materials to facilitate this. These will be available, along with the entire data set, at http://www.cs.umd.edu/users/harvey/thesis.html. Although we have rigorously defined our experiment and tried to remove the external threats to validity, it is only through replication that we can be sure all of them have been addressed.
Capture-recapture (CRC) methods are sampling-resampling schemes for estimating the size of a population, in our case the total number of defects in the artifact (which we denote by N). The idea is to compare the defect reports of several reviewers. Assuming reviewers are independent and that defects have identical detection probabilities, if several reviewers find many of the same defects then we conclude that most of the defects have been found. On the other hand, if every reviewer finds many defects which were not found by the other reviewers, we conclude that many undiscovered defects remain. Capture-recapture methods translate these intuitive ideas into statistical estimation procedures.
From the preparation, collection meeting, and author repair we gather enough information to support a large number of CRC methods. Our choice of a specific capture-recapture method will depend on how well our specific application conforms to the assumptions underlying several available methods.
Three assumptions are made when deriving the simplest estimators: (A) reviewer performances are statistically independent, (B) all reviewers are equally effective at finding defects, and (C) all defects have the same probability of being found.
If assumption (A) is violated, it will not be possible to derive a reliable estimator. Assumption (A) may be violated because of collusion (reviewers working together, causing N to be underestimated) and/or specialization (reviewers looking for disjoint sets of defects, causing N to be overestimated). We will use a statistical test constructed by Eick, et al. to establish whether or not assumption (A) holds for our data.
If assumption (B) holds, we can use a jackknife estimator for N. This allows us to ignore assumption (C), since it does not enter into the estimator's derivation. The jackknife estimator for N of order k (k <= m) has the form
$$\hat{N}_{Jk} = n + \sum_{j=1}^{k} a_{j,k} f_j \qquad (\text{A.1})$$
where m is the number of reviewers, n is the total number of defects found by all reviewers during their preparation, and $f_j$ denotes the number of defects discovered by exactly j reviewers. The constants $a_{j,k}$ depend on m and k. Full details as well as a method for selecting k are given in Burnham and Overton.
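As a concrete illustration of how the capture frequencies feed the estimator, the following sketch computes only the first-order (k = 1) jackknife, which in Burnham and Overton's formulation is n + ((m - 1)/m) f_1. The function name, the input format, and the sample defect reports are hypothetical; higher orders and the procedure for selecting k are not shown.

```python
from collections import Counter


def jackknife_first_order(reports):
    """First-order jackknife estimate of the total number of defects.

    reports: one set of defect identifiers per reviewer
             (a hypothetical input format chosen for this sketch).
    """
    m = len(reports)
    # For each distinct defect, count how many reviewers reported it.
    times_found = Counter(d for report in reports for d in set(report))
    n = len(times_found)               # distinct defects found by anyone
    f = Counter(times_found.values())  # f[j] = defects found by exactly j reviewers
    return n + (m - 1) / m * f[1]


# Hypothetical preparation reports from three reviewers.
reports = [{"d1", "d2", "d3"}, {"d2", "d3", "d4"}, {"d3", "d5"}]
estimate = jackknife_first_order(reports)
```

The more defects that are seen by only one reviewer (large f_1), the further the estimate rises above the observed count n, which matches the intuition that little overlap between reviewers signals many undiscovered defects.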
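If assumption (B) is violated, but assumption (C) holds, we can compute a maximum likelihood estimate, $\hat{N}$, for N. This will be the value of X that maximizes the following equation:
$$L(X) = \binom{X}{n} \prod_{j=1}^{m} \left(\frac{n_j}{X}\right)^{n_j} \left(1 - \frac{n_j}{X}\right)^{X - n_j} \qquad (\text{A.2})$$
where $n_j$ is the number of defects found by the j-th reviewer in his or her preparation.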
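One way to carry out such a maximization is a direct search over candidate totals. The sketch below assumes the standard closed-population likelihood in which each reviewer has his or her own detection probability, estimated as the fraction of the candidate total that the reviewer found; that likelihood form, the function names, the search cap, and the sample counts are assumptions made for illustration rather than the procedure used in this study.

```python
import math


def profile_log_likelihood(x, n, per_reviewer):
    """Log-likelihood of a candidate total x, up to an additive constant.

    n: number of distinct defects found by all reviewers together.
    per_reviewer: number of defects found by each reviewer.
    """
    value = math.lgamma(x + 1) - math.lgamma(x - n + 1)
    for n_j in per_reviewer:
        p_j = n_j / x  # detection probability implied by the candidate total
        if 0 < p_j < 1:  # the term is zero when p_j is exactly 0 or 1
            value += n_j * math.log(p_j) + (x - n_j) * math.log(1 - p_j)
    return value


def mle_total_defects(n, per_reviewer, upper=None):
    """Integer candidate total with the highest likelihood."""
    # The search cap is an arbitrary assumption; with no overlap between
    # reviewers the likelihood keeps rising and the cap is returned.
    upper = upper if upper is not None else 20 * max(n, 1)
    candidates = range(max(n, 1), upper + 1)
    return max(candidates, key=lambda x: profile_log_likelihood(x, n, per_reviewer))


# Hypothetical data: two reviewers found 10 and 12 defects, 16 distinct in total.
estimate = mle_total_defects(n=16, per_reviewer=[10, 12])
```

The estimate can never fall below n, and it grows as the overlap between reviewers shrinks.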
When neither assumption (B) nor assumption (C) holds, an estimator cannot always be derived. However, Vander Wiel, et al. found that if the defects can be partitioned into a small number of groups, where the defects in each group have similar detection probabilities, then Equation A.2 can be applied to each group separately and the results added together to calculate an estimate for N.
An analysis of our data indicates that assumption (A) holds, but assumptions (B) and (C) do not. However, partitioning the defects would result in several empty partitions because of the relatively small number of defects. Therefore, we decided that none of the techniques discussed here is applicable to our experiment.
In this appendix, the compiler project is presented in some detail. Some of the nuances of the development are pointed out and their possible influence on the inspections data is discussed. This is so that the inspection researcher and the experimentalist reading the results may be aware that the code units being inspected in the experiment are not uniform and may vary in number of injected defects in a systematic manner, as when certain functionalities are tied to certain developers. For an experiment to gain credibility, it must be replicated in many different settings, which may or may not give consistent results. In either case, a description of the original experimental setting may be used to strengthen corroborating results or give non-corroborating results some alternative explanation.
The 5ESS is Lucent Technologies' flagship local/toll switching system, containing an estimated 10 million lines of code in product and support tools. At the heart of the 5ESS software is a distributed relational database with information about hardware connections, software configuration, and customers. For the switch to function properly, this data must conform to certain integrity constraints.
PRL5, a declarative language based on first-order predicate logic, was created to specify these integrity constraints. PRL5 specifications were to be translated automatically into C. For this purpose, a PRL5-to-C compiler (P5CC) was needed. (For more details on the history of PRL5, see Ladd and Ramming.)
The basic compilation scenario for P5CC is shown in Figure B.1. The assignment of the development team is to implement P5CC as well as the P5CC runtime library.
Figure B.2 is a diagram depicting the components of the compiler. For our discussion, the parser refers to the code which checks both the syntax and semantics of the input program. It also includes code to implement the preprocessor and lexical analyzer. The optimizer refers to code which performs query optimization and also to all code that performs transformations on the parse tree, "massaging" it into a form suitable for processing by the code generator. The code generator refers to code that converts the transformed tree into opcodes, and eventually to C language statements. It also generates source file-dependent code needed for runtime support, such as memory management. The symbol table manager refers to all code which interfaces with the symbol table, handling access to all of the source program's data definitions. The main driver and utilities component contains the main program which calls each routine, as well as miscellaneous utilities used by the other major functions.
Figure B.3 is a SeeSoft™ visualization showing the functional division of the compiler code. Each source file is associated with a particular function, as shown by the colors. We can see that the major pieces are code implementing the symbol table manager, parser, and code generator. The parser unit makes up 40% of the code. The syntax checker is in the yacc file, prl.y, and the rest of the code mainly implements the semantic checker (the output of yacc is not counted as part of the source). The optimizer unit makes up 5% of the code. The code generator makes up 15% of the code. The symbol table manager makes up 25% of the code, including a huge file holding descriptors of the language's possible data types.
Figure B.4 shows the lines of code written or ported by each developer. Note that each developer primarily worked on one or two of the subsystems and that each subsystem has one or two primary authors. This confounding suggests that each subsystem may have a significantly different mean defect density than the others.
The simulation involves just two treatments, $T_1$ and $T_2$, whose defect detection probabilities are $P_1$ and $P_2$.
The simulation comprises three distinct steps:
This process is repeated a hundred times for each experimental setting. Even though the two treatments have different detection probabilities, under some conditions the test may fail to recognize the difference. Running the simulation in a wide variety of experimental settings helps us to determine when and how confidently we can say that two treatments are different.
We created 600 experimental settings consisting of 25 different combinations of means (53, 67, 80, 93, 107) and standard deviations (3, 7, 13, 27, 40) to generate defect densities, and 24 different pairs of $P_1$ and $P_2$.
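A single experimental setting can be sketched in code as follows. Because the three steps of the simulation are not reproduced in this text, the sketch assumes one plausible reading: the number of defects in each inspected unit is drawn from a normal distribution with the given mean and standard deviation, each treatment finds each defect independently with its own detection probability, and the two samples are compared with a permutation test on the difference in means. The permutation test is a stand-in chosen for this sketch, and all function names, the 0.1 significance threshold, and the default parameters are hypothetical.

```python
import random


def simulate_inspection(rng, mean, sd, p_detect):
    """Number of defects found in one simulated inspection."""
    total = max(0, round(rng.gauss(mean, sd)))  # defects present in the unit
    return sum(rng.random() < p_detect for _ in range(total))


def permutation_p_value(rng, a, b, rounds=200):
    """Two-sided permutation test on the difference of sample means."""
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(sum(left) / len(left) - sum(right) / len(right)) >= observed:
            hits += 1
    return hits / rounds


def rejection_rate(mean, sd, p1, p2, points=5, runs=100, alpha=0.1, seed=1):
    """Fraction of simulated experiments in which the treatments are told apart."""
    rng = random.Random(seed)
    rejected = 0
    for _ in range(runs):
        t1 = [simulate_inspection(rng, mean, sd, p1) for _ in range(points)]
        t2 = [simulate_inspection(rng, mean, sd, p2) for _ in range(points)]
        if permutation_p_value(rng, t1, t2) < alpha:
            rejected += 1
    return rejected / runs


# One hypothetical setting: mean 80, standard deviation 7, 5 points per treatment.
rate_when_different = rejection_rate(80, 7, 0.3, 0.7)
rate_when_equal = rejection_rate(80, 7, 0.5, 0.5)
```

Sweeping `rejection_rate` over a grid of means, standard deviations, and probability pairs would give a rough picture of how large a true difference must be before a handful of observations can reveal it.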
Figure C.1 shows some (36 out of 600 settings) of the simulation results. The x-axis shows the true difference between $P_1$ and $P_2$, and the y-axis shows the probability that the null hypothesis ($P_1 = P_2$) will be rejected. Each combination of a symbol and a line segment represents the outcomes of 100 simulation runs of one experimental setting. The symbol indicates the median, and the line segment through the symbol spans the .25 through the .75 quantiles.
We define the experimental resolution as the true difference at which more than 50% of the 100 outcomes have a significance greater than .9 (the symbol in Figure C.1 lies above the resolution line) while, at the next smaller true difference, fewer than 50% of the 100 outcomes have a significance greater than .9 (the symbol in Figure C.1 lies below the resolution line).
Initially, the experiment involved seven treatments. At the beginning of 1995, we evaluated the existing results and discussed them with the project's management. Although we would have preferred to gather more data, it would have been risky for the project to continue performing expensive or ineffective treatments. Therefore, we discontinued three treatments: 1sX1p, 2sX1pR, and 2sX2pR.
The 1sX1p treatment was the least effective, while the two with-repair treatments (2sX1pR and 2sX2pR) were no more effective than the without-repair treatments. In addition, the 2sX2pR treatment was, by far, the most expensive treatment in terms of interval. Figure D.1 confirms that the last instances of these discontinued treatments were held in the first quarter of 1995.
Our primary concern is that discontinuing treatments may compromise the experiment's internal validity (i.e., factors that affected all treatments early in the experiment, will affect only the remaining treatments later in the experiment). Consequently, we must be careful when we compare treatments that were discontinued with those that were not.
Figures D.1 and D.3 show inspection effectiveness and interval over time, with observations sorted according to the time at which the code unit became available for inspection.
The data presented in Figure D.1 suggest that there are two distinct performance distributions: the first quarter (July - September, 1994), during which about one-third of the inspections occurred, has a significantly higher mean and variance than the remaining quarters (October, 1994 - December, 1995).
One reason for this may be that the end of the first quarter coincides with the system's first integration build. Our records show that with the compiler's front end in place, the developers were able to do more thorough unit testing for the back end code than they did for the front end code itself.
Other factors may be that the reviewers had become more familiar with the programming language as the project progressed, or that the requirements for the front end (language definition, parsing, and intermediate code generation) were more prone to misinterpretation than those for the final code generation and optimization.
In particular, this suggests to us that had we continued using the 2sX2pR treatment its effectiveness would have dropped in a manner consistent with the other treatments.
Figure D.3 is a time series plot showing inspection interval as the project progressed. We see that the mean inspection interval did not vary significantly throughout the project, although there is a gradual increase as the project nears completion.
Although there were only four 2sX2pR inspections, the stability of the interval for the other treatments suggests that had we continued the treatment, its interval would not have changed significantly.
A statistical model takes the general form $y = f(x; \beta) + \epsilon(z)$, where y is the vector of observed data, $f$ is a function taking as input a set of factors $x$ with associated coefficients $\beta$, describing the process and giving as output $\hat{y}$, the expected value of y, and $\epsilon$ is a function giving the difference between y and $\hat{y}$, with $z$ being the set of factors in the process whose effects are ignored or whose presence is unknown to us. Model formulation deals mainly with describing $f$, specifying factors and the interactions between them. Model fitting deals with moving factors to and from $x$ and $z$ and adjusting the coefficients $\beta$ to give the best fit to y. A model may be considered adequate when $\epsilon$ is just white noise, i.e., the residuals form a vector of independently distributed values having zero mean and constant variance.
S is a programming environment for data analysis[6, 16]. In this appendix, we will outline our approach in using S to build the statistical models and analyze the data.
The possible factors to be incorporated into the model are usually determined from prior knowledge of the process being modeled. The initial model is normally specified with the full set of available factors.
Note that factors may also depend on each other, i.e., have interactions with each other. Each set of possibly interacting factors is represented as an additional factor. (Since we had a limited number of observations, we avoided fitting interaction between factors.)
With our data, the linear model and the generalized linear model appear to perform equally well. For the defect data, the generalized linear model was used because it is more natural to think of the defect detection process as a counting process, resembling a Poisson process more than the normal process needed by linear models. For the interval data, the linear model was used because the intervals approximated a normal distribution.
S offers two functions for specifying models, lm() for linear models and glm() for generalized linear models. Both take as basic parameters a model specification and the data for the model. In addition, in glm(), we can specify a distribution family (Poisson, Gaussian, etc.).
Model fitting is done by iteratively adding or dropping factors and adjusting the coefficients to give the best fit with the given data. In each iteration, a new factor is added to the model if it significantly reduces the residual variance. Conversely, a factor may be dropped if its removal does not significantly increase the residual variance.
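A single step of this procedure can be illustrated for the simplest case: deciding whether one categorical factor belongs in a model that so far contains only the overall mean. The sketch below compares residual sums of squares with and without the factor and forms the usual F ratio. It is a plain least-squares illustration, not the lm()/glm() fitting done in S, and the function names and data are hypothetical.

```python
from collections import defaultdict


def residual_sum_of_squares(values, groups=None):
    """RSS around the overall mean, or around per-group means if groups is given."""
    if groups is None:
        groups = ["all"] * len(values)
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    rss = 0.0
    for members in by_group.values():
        mean = sum(members) / len(members)
        rss += sum((v - mean) ** 2 for v in members)
    return rss


def f_ratio_for_factor(values, groups):
    """F statistic for adding one categorical factor to a mean-only model."""
    levels = len(set(groups))
    n = len(values)
    rss_null = residual_sum_of_squares(values)
    rss_full = residual_sum_of_squares(values, groups)
    explained = (rss_null - rss_full) / (levels - 1)  # variance removed per added parameter
    unexplained = rss_full / (n - levels)             # residual variance that remains
    return explained / unexplained


# Hypothetical inspection intervals (days) tagged with a hypothetical two-level factor.
intervals = [10, 12, 11, 20, 22, 21]
repair = ["N", "N", "N", "R", "R", "R"]
f_stat = f_ratio_for_factor(intervals, repair)
```

A large ratio indicates that the factor removes far more variance than would be expected by chance, so it would be retained; a ratio near 1 would argue for dropping it.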
While it is desirable to include as many explanatory factors in the model as possible, there is the danger of adding too many factors. This is known as overfitting. The problem is that while the model might be a good fit to the data it is modeled on, it may be uninterpretable or may not make physical sense. In addition, it cannot reliably characterize and predict additional data. One way to check this is to partition the data, build the model on one set, and test its reliability on the other set. However, in our case there were too few data points to do this.
We looked for a parsimonious model with the help of stepwise model selection. In stepwise model selection, we start with an existing model and iteratively add or drop one term, minimizing the number of parameters while maximizing the fit according to some specified criterion. In S, we used the function step(), increasing the scale parameter until the number of factors in the model is sufficiently reduced.
The model selection algorithm may not give the best model since it does not know the physical meaning of the factors it manipulates. At the end, we must use our prior knowledge of the process in order to fine-tune the model to one that is physically interpretable.
To calculate the significance of a factor's contribution to the model, we used the summary.aov() function to perform analysis of variance, passing the model specification into it, with the factor of interest at the end of the formula. For example, if we have a model y ~ a + b + c, we perform the analysis of variance on y ~ b + c + a, y ~ a + c + b, and y ~ a + b + c to calculate the significance of the contributions of a, b, and c to the model. Essentially, this is how step() determines which factor to retain and which to drop.
Once a model has been specified and fitted, it is checked to see if it is an adequate model. The model is adequate when it reasonably estimates y, i.e., there is a high linear correlation between y and the fitted values $\hat{y}$, and when it has sufficiently explained the variance, i.e., the residuals are reduced to a patternless set of data when plotted against y and against $\hat{y}$.
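The adequacy check is mainly graphical, but its two criteria can be approximated numerically. The sketch below computes the linear correlation between observed and fitted values, along with two simple summaries of the residuals: their mean, and their correlation with the fitted values as a crude proxy for a visible trend in a residual plot. The function names and data are hypothetical, and these summaries do not replace inspecting the plots.

```python
import math


def pearson_correlation(xs, ys):
    """Sample linear correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def adequacy_summary(observed, fitted):
    """Numeric stand-ins for the graphical adequacy checks."""
    residuals = [y - f for y, f in zip(observed, fitted)]
    return {
        # High values mean the model tracks the observed data.
        "fit_correlation": pearson_correlation(observed, fitted),
        # Should be close to zero for an unbiased fit.
        "residual_mean": sum(residuals) / len(residuals),
        # Should be close to zero if residuals show no linear trend.
        "residual_trend": pearson_correlation(fitted, residuals),
    }


# Hypothetical observed values and the values fitted by some model.
observed = [10, 12, 11, 20, 22, 21]
fitted = [11, 11, 11, 21, 21, 21]
summary = adequacy_summary(observed, fitted)
```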
Identifying the Mechanisms to Improve Code Inspection
Costs and Benefits
The translation was initiated by 11265-Harvey Siy on Fri Aug 9 15:37:50 CDT 1996