Software Quality

Dr. Dobb's Journal March 1998

Measuring software quality to support testing

By Shari Lawrence Pfleeger

Shari is president of Systems/Software and author of Software Engineering: Theory and Practice (Prentice Hall, 1998). Shari can be contacted at s.pfleeger@ieee.org.

Sidebar: Hewlett-Packard's Fault Classification
Sidebar: Measuring Test Effectiveness

In an ideal situation, we become so good at our craft that every program works properly every time it is run. Unfortunately, this is not reality. For one thing, many software systems deal with large numbers of states and complex formulae, activities, and algorithms. In addition, we have to implement a customer's conception of a system even though the customer may be uncertain of exactly what is needed. Finally, the size of a project and number of people involved can add complexity. Thus, the presence of faults is not merely due to errors in the software, but also depends on user and customer expectations.

What does it mean when we say software has failed? Usually, it means that the software does not do what the requirements describe. For example, the specification may state that the system must respond to a particular query only when users are authorized to see the data. If the program responds to unauthorized users, we say it has failed. There can be several reasons for failure, including:

The specification may be wrong. It may not state exactly what the customer wants or needs. For example, the customer may actually want to have several categories of authorization, with each category having a different kind of access, but that requirement may not have been stated explicitly.
The specification may contain a requirement that is impossible to implement, given the prescribed hardware and software.
The system design may contain a fault. Perhaps the database and query language designs make it impossible to authorize users.
The program design may contain a fault. The component descriptions may contain an access-control algorithm that does not handle a case correctly.
The program code may be wrong. It may implement the algorithm improperly or incompletely.

No matter how capably you write programs, you should ensure that components are coded correctly. Many programmers view the object of testing as a demonstration of their programs performing properly. However, demonstrating correctness is actually antithetical to what testing is about. We test a program to demonstrate the existence of a fault. Because the goal is to discover faults, we consider a test successful only when a fault is discovered or a failure occurs. Fault identification is the process of determining what fault(s) cause a failure, and fault correction (or fault removal) is the process of making changes to the system so that the fault is removed.

Orthogonal Defect Classification

It is useful to categorize and track the types of faults you find, not just in code but anywhere in a software system. Historical information can help you predict what types of faults your code is likely to have (which helps direct testing efforts), and clusters of certain types of faults can warn you that it may be time to rethink designs or requirements. Many organizations perform statistical fault modeling and causal analysis, both of which depend on understanding the number and distribution of types of faults. For example, IBM's Defect Prevention Process (described in "Experiences with Defect Prevention," R. Mays et al., IBM Systems Journal, 29, 1990) seeks out and documents the root cause of every problem that occurs. The information, which is used to suggest the types of faults testers should look for, has reduced the number of faults injected in the software.

In "Orthogonal Defect Classification: A Concept for In-process Measurements" (IEEE Transactions on Software Engineering, November 1992), Ram Chillarege et al. at IBM developed an approach to fault tracking called "orthogonal defect classification," in which faults are placed in categories that collectively paint a picture of which parts of the development process need attention because they are responsible for spawning many faults. Thus, the classification scheme must be product- and organization-independent and must be applicable to all stages of development. Table 1 lists the types of faults that make up IBM's classification. When using the classification, developers identify not only the type of fault, but whether it involves something that is missing or is incorrect.

A classification scheme is orthogonal if any item being classified belongs to exactly one category. In other words, you want to track the faults in the system in an unambiguous way, so that the summary information about the number of faults in each class is meaningful. You lose the meaning of the measurements if a fault belongs to more than one class. Likewise, the fault classification must be clear, so that any two developers can classify a particular fault in the same way.

Fault classification, such as IBM's and Hewlett-Packard's (see the accompanying text box entitled "Hewlett-Packard's Fault Classification"), improves the entire development process by telling you which types of faults are found in which development activities. You can build a profile of the types of faults located by each of the fault-identification or testing techniques you used while building the system. It is likely that different methods will yield different profiles. Then you can build your fault prevention and detection strategy based on the kinds of faults you expect in the system and the activities that will root them out. Chillarege et al. illustrate IBM's use of this concept by showing that the fault profile for design review is very different from the fault profile for code inspection.

Software Quality

Software quality can be measured in many ways. One way to assess the "goodness" of a component is by the number of faults it contains. It seems natural to assume that the most elusive software faults are also the most difficult to correct. It also seems reasonable to believe that faults that are easy to fix are detected early on, while the more difficult faults are located later in the testing process.

However, M.L. Shooman and M. Bolsky found this is not the case (see "Types, Distribution and Test and Correction Times for Programming Errors," Proceedings of the 1975 International Conference on Reliable Software, IEEE Computer Society Press, 1975). It can take some time to find trivial faults, and many such problems are overlooked or do not appear until well into the testing process. Moreover, Glenford Myers reports in The Art of Software Testing (John Wiley & Sons, 1979) that as the number of detected faults increases, the probability of the existence of more undetected faults increases. If there are many faults in a component, you want to find them as early as possible in the testing process. However, if you find a large number of faults at the beginning, then you are likely to have a large number of yet undetected faults.

In addition to being contrary to intuition, these results also make it difficult to know when to stop looking for faults during testing. You must estimate the number of remaining faults, not only to know when to stop searching for more, but also to give some degree of confidence in the code you are producing. The number of faults also indicates the maintenance effort that will probably be needed if faults are left to be detected after the system is delivered.

Harlan Mills developed a technique known as "fault seeding" or "error seeding" to estimate the number of faults in a program (see "On the Statistical Validation of Computer Programs," Technical Report FSC-72-6015, IBM Federal Systems Division, 1972). His basic premise is that one member of the test team intentionally inserts ("seeds") a known number of faults into a program. Then, other team members locate as many faults as they can. Assuming that the ratio of seeded faults detected to total seeded faults is the same as the ratio of nonseeded faults detected to total nonseeded faults, we can estimate the number of indigenous faults remaining in the program. Thus, if a program is seeded with 100 faults and the test team finds only 70, it is likely that 30 percent of the indigenous faults remain in the code.

To express this ratio more formally, let S be the number of seeded faults placed in a program, and let N be the number of indigenous (nonseeded) faults. If n is the actual number of faults detected during testing, and s is the number of seeded faults detected during testing, then an estimate of the total number of indigenous faults is:

N=Sn/s

Although simple and useful, this approach assumes that the seeded faults are of the same kind and complexity as the actual faults in the program. But you do not know what the typical faults are before you've found them, so it is difficult to make the seeded faults representative of the actual ones. One way to increase the likelihood of representativeness is to base the seeded faults on historical records for code from similar past projects. However, this approach is useful only when you have previously built similar systems. And things that seem similar may, in fact, be quite different in unanticipated ways.

To overcome this obstacle, you can use two independent test groups to test the same program -- Test Group 1 and Test Group 2, for instance. Let x be the number of faults detected by Test Group 1, and y the number detected by Test Group 2. Some faults will be detected by both groups; call this number of faults q, so that q£x and q£y. Finally, let n be the total number of all faults in the program; you want to estimate n.

The effectiveness of each group's testing can be measured by calculating the fraction of faults found by each group (see the accompanying text box "Measuring Test Effectiveness"). Thus, the effectiveness E₁of Group 1 can be expressed as:

E₁=x/n

and the effectiveness E₂ of Group 2 as:

E₂=y/n

The group effectiveness measures the group's ability to detect faults from among a set of existing faults. Thus, if a group can find half of all faults in a program, its effectiveness is 0.5. Consider faults detected by both Group 1 and Group 2. Assuming that Group 1 is equally effective at finding faults in all parts of the program, you can look at the ratio of faults found by Group 1 from the set of faults found by Group 2. That is, Group 1 found q of the y faults that Group 2 found, so Group 1's effectiveness is q/y. In other words:

E₁=x/n=q/y

However, you know that E₂ is y/n, so you can derive the following formula for n:

n=q/(E₁E₂)

You have a known value for q, and you can use estimates of q/y for E₁ and q/x for E₂, so you have enough information to estimate n.

To see how this method works, suppose two groups test a program. Group 1 finds 25 faults. Group 2 finds 30 faults, 15 of which are faults also found by Group 1. So, you have:

x=25
y=30
q=15

The estimate, E₁, of Group 1's effectiveness is q/y, or 0.5, since Group 1 found 15 of the 30 faults found by Group 2. Similarly, the estimate, E₂, of Group 2's effectiveness is q/x, or 0.6. Thus, your estimate of n (the total number of faults in the program) is 15/(.5.6), or 50 faults.

The test strategy defined in the test plan can use this estimating technique to determine when testing is complete.

Confidence in the Software

You can use fault estimates to tell how much confidence you can place in the software being tested. Confidence, usually expressed as a percentage, indicates the likelihood that the software is free of faults. That is, if you say a program is fault free with a 95 percent level of confidence, then you mean that the probability that the software has no faults is 0.95.

Suppose you have seeded a program with S faults, and claim that the code has only N actual faults. You test the program until you've found all S of the seeded faults. If, as before, n is the number of actual faults discovered during testing, then the confidence level can be calculated as:

For example, suppose you claim that a component is fault free, meaning that N is zero. If you seed the code with ten faults and find all ten without uncovering an indigenous fault, then you can calculate the confidence level with S=10 and N=0. Thus, C is 10/11, for a confidence level of 91 percent. If the requirements or contract mandate a confidence level of 98 percent, you would need to seed S faults, where S/(S-0+1) = 98/100. Solving this equation, you see that you must use 49 seeded faults and continue testing until all 49 faults were found (with no indigenous faults discovered).

This approach presents a problem: You cannot predict the level of confidence until all seeded faults are detected. In his paper "Computer Software: Testing, Reliability Models, and Quality Assurance" (Technical Report NPS-55RH74071A, Naval Postgraduate School, 1974), F.R. Richards suggests a modification, where the confidence level can be estimated using the number of detected seeded faults, whether or not all have been located. In this case, C is:

These estimates assume that all faults have an equal probability of being detected, which is not likely to be true. However, many other estimates take these factors into account. Such estimation techniques not only give you some idea of the confidence you can place in your programs, but also provide a side benefit. Programmers may be tempted to conclude that each fault discovered is the last one. If you estimate the number of faults remaining, or if you know how many faults you must find to satisfy a confidence requirement, you have incentive to keep testing for one more fault.

These techniques are also useful in assessing confidence in components that are about to be reused. You can look at the fault history of a component, especially if fault seeding has taken place, and use such techniques to decide how much confidence to place in reusing the component without testing it again. Or, you can seed the component and use these techniques to establish a baseline level of confidence.

Identifying Fault-Prone Code

Many techniques for identifying fault-prone code use fault histories in similar applications. For example, some researchers track the number of faults found in each component during development and maintenance. They also collect measurements about each component, such as size, number of decisions, number of operators and operands, or number of modifications. Then, they generate equations to suggest the attributes of the most fault-prone modules. These equations can be used to suggest which of your components should be tested first and which should be given extra scrutiny.

In "Empirically Guided Software Development using Metric-based Classification Trees" (IEEE Software, March 1990), Adam Porter and Richard Selby suggest using classification trees to identify fault-prone components. Classification-tree analysis is a statistical technique that sorts through large arrays of measurement information, creating a decision tree to show which measurements are the best predictors of a particular attribute. For instance, suppose you collect measurement data about each component built in the organization. You include size (in lines of code), number of distinct paths through the code, number of operators, depth of nesting, degree of coupling and cohesion (rated on a scale from one as lowest to five as highest), time to code the component, number of faults found in the component, and more. You use a classification-tree analysis tool (such as C4, CART, or the Trellis graphics capabilities of S-Plus; see "Object-Oriented Programming in S," by Richard Calaway, DDJ, October 1995) to analyze the attributes of the components that had five or more faults, compared with those that had less than five faults. The result is a decision tree like that in Figure 3.

The tree is used to help you decide which components in your current system are likely to have a large number of faults. According to the tree, if a component has between 100 and 300 lines of code and has at least 15 decisions, then it may be fault prone. Or, if the component has over 300 lines of code, has not had a design review, and has been changed at least five times, then it, too, may be fault prone. You can use this type of analysis to help target testing when resources are limited. Or, you can schedule inspections for such components, to help catch problems before testing begins.

Measure for Measure

The metrics described here represent only some of the ways that measurement can assist you in testing your software. Measurement is useful in at least three ways:

Understanding your code and its design.
Understanding and controlling the testing process.
Predicting likely faults and failures.

You can measure processes, products, and resources to make you a better-educated tester. For example, measures of the effectiveness of the testing process suggest areas where there is room for improvement. Measures of code or design structure highlight areas of complexity where more understanding and testing may be needed. And measuring the time it takes to prepare for tests and run them helps you understand how to maximize your effectiveness by investing resources where they are most needed. You can use the goals of testing and development to suggest what to measure, and even to define new measurements tailored to your particular situation. The experience reported in the literature is clear: Measurement is essential to doing good testing. It is up to you to find the best ways and best measures to meet your project's objectives.

DDJ