The myth of the sample size for evaluations

October 25, 2009


Virtually any evaluation protocol has a recommended sample size, or at least a procedure for calculating a sample size. This entry explores some problems with sample sizes.

Assume that one wants to calculate a sample size for an error grid evaluation (1). Reference 1 gives the details, but in brief, one performs a method comparison, plots the results in the grid, and calculates the percentage of results in each error grid zone. An error grid has at least two areas of interest: the innermost zone (called “A” here), which contains most of the differences, and an outer zone (called “C” here), which should contain no results because differences of that size have a high potential for serious patient harm.
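The zone tally can be sketched as follows. This is a simplified illustration and not the EP27 procedure: the zone limits (15% and 50% relative difference), the intermediate zone "B", and the function names are all hypothetical, and a real error grid defines its zones as regions in the reference-versus-candidate plane.

```python
from collections import Counter

def classify(reference, candidate, a_limit=0.15, c_limit=0.50):
    """Assign one paired result to a zone by relative difference.

    a_limit and c_limit are hypothetical limits chosen for illustration.
    """
    rel_diff = abs(candidate - reference) / reference
    if rel_diff <= a_limit:
        return "A"   # innermost zone: small differences
    if rel_diff >= c_limit:
        return "C"   # outer zone: differences large enough to risk harm
    return "B"       # assumed intermediate zone between A and C

def zone_percentages(pairs):
    """Return the percentage of (reference, candidate) pairs in each zone."""
    counts = Counter(classify(ref, cand) for ref, cand in pairs)
    return {zone: 100.0 * counts[zone] / len(pairs) for zone in ("A", "B", "C")}

# Made-up method comparison data: (reference result, candidate result)
pairs = [(100.0, 104.0), (100.0, 90.0), (200.0, 260.0), (50.0, 52.0), (80.0, 130.0)]
percentages = zone_percentages(pairs)
print(percentages)
```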

I will skip the discussion of calculating the sample size for zone A. To calculate the sample size for zone C, one needs a goal; assume that the goal is fewer than one result per million in zone C. It can be shown (2) that proving this with 95% confidence requires running about 3,000,000 samples and observing no results in zone C. Additionally, the candidate assay has to be run in the method comparison in a way that is representative of routine use. Since this number of samples is a bit much, how can one be confident that the goal for zone C will be achieved? The answer is to use risk management techniques. The clinical laboratory has to perform FMEA/fault tree analysis (3) to ensure that user errors don’t cause zone C results, and the manufacturer has to perform FMEA/fault tree analysis to ensure that the system itself doesn’t cause zone C results.
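One common way to arrive at a zero-event sample size is the exact binomial bound: if the true zone C rate were as high as the goal, the chance of seeing no zone C results in n samples must be no more than 5%. The sketch below assumes independent samples and is not necessarily the exact procedure in reference 2; the function name is hypothetical.

```python
import math

def zero_event_sample_size(max_rate, confidence=0.95):
    """Smallest n for which observing 0 events in n independent samples
    supports, at the given confidence, an event rate below max_rate."""
    alpha = 1.0 - confidence
    # Require (1 - max_rate)**n <= alpha, so n >= ln(alpha) / ln(1 - max_rate).
    return math.ceil(math.log(alpha) / math.log1p(-max_rate))

n = zero_event_sample_size(1e-6)
print(n)
```

For a goal of one per million at 95% confidence this bound comes to roughly three million samples, in line with the "rule of three" approximation n ≈ 3/p.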


  1. CLSI/NCCLS. How to Construct and Interpret an Error Grid for Diagnostic Assays; Proposed Guideline. CLSI/NCCLS document EP27-P. Wayne, PA: NCCLS; 2009.
  2. Hahn GJ, Meeker WQ. Statistical Intervals: A Guide for Practitioners. New York: Wiley; 1991, pp 103-105.
  3. CLSI/NCCLS. Risk Management Techniques to Identify and Control Laboratory Error Sources; Proposed Guideline, Third Edition. CLSI/NCCLS document EP18-P3. Wayne, PA: NCCLS; 2009.

Biological Variation and Assay Performance Standards

October 10, 2009


I had occasion to read about the suggestion that some fraction (often 50%) of biological variation should play a role in setting assay performance standards. This makes no sense to me. Here’s why.

The most fundamental measure of assay performance is diagnostic accuracy, that is, sensitivity and specificity. Sensitivity is the percentage of people with the disease whose assay value is above the cutoff, and specificity is the percentage of people without the disease whose assay value is below the cutoff.

Biological variation decreases diagnostic accuracy. If a person who does not have the disease has a spike in the assay value due to biological variation, and this pushes the value beyond the cutoff, the result is a false positive. The more biological variation, the lower the diagnostic accuracy. Analytical error does the same thing: the more error, the lower the observed diagnostic accuracy. From a diagnostic accuracy standpoint, there is no difference between biological variation and analytical error. Thus, it makes no sense that the analytical error of an assay should be allowed to reach 50% of the biological variation.
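The interchangeability of the two sources can be illustrated with a small calculation. This is a sketch under stated assumptions, not a model from any guideline: each observed value is taken to be a group mean plus independent Gaussian biological and analytical deviations, higher values indicate disease, and all numbers and names are hypothetical.

```python
import math

def normal_cdf(x, mean, sd):
    """Cumulative probability of a normal distribution at x."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def diagnostic_accuracy(cutoff, healthy_mean, diseased_mean,
                        biological_sd, analytical_sd):
    """Return (sensitivity, specificity) under the Gaussian assumption."""
    # Independent variance sources add, so only the combined SD matters.
    total_sd = math.hypot(biological_sd, analytical_sd)
    sensitivity = 1.0 - normal_cdf(cutoff, diseased_mean, total_sd)
    specificity = normal_cdf(cutoff, healthy_mean, total_sd)
    return sensitivity, specificity

# Hypothetical assay: cutoff 10, healthy mean 8, diseased mean 12.
no_error = diagnostic_accuracy(10.0, 8.0, 12.0, 1.0, 0.0)
with_error = diagnostic_accuracy(10.0, 8.0, 12.0, 1.0, 0.5)
swapped = diagnostic_accuracy(10.0, 8.0, 12.0, 0.5, 1.0)

print(no_error, with_error, swapped)
```

Under these assumptions, adding analytical error lowers both sensitivity and specificity, and swapping the biological and analytical standard deviations leaves them unchanged, because only the combined spread enters the calculation.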

How should performance standards be set?

Use error grids to define limits where no results should occur (e.g., errors large enough to have high potential to cause patient harm). These limits are called limits of erroneous results (LER) by the FDA. These are the most important limits and are set using clinical judgment.

The area in an error grid that should contain most of the results (often 95%) is less important and can be set using the performance achieved by existing technology, with the caveat that consideration must be given to special circumstances such as cost, turnaround time when it’s important, and so on.