Outliers in Quality Control and Proficiency Testing

February 20, 2006
I have been interested in ways to compare different assays. The recent guest essay on the Westgard web site on this topic stimulated the following comments.

The title of the essay is “The Quality Goal Index – Its Use in Benchmarking and Improving Sigma Quality Performance of Automated Analytic Tests.” The essay starts with:

“The long-term goal of six sigma quality management is to achieve an error rate of 3.4 or less per million opportunities for all laboratory processes. In percent terms, that’s an error rate of less than 0.001%.”

A few sentences later, the author is describing how to calculate “sigma performance” as a measure of quality performance and states that for the CV estimate:

“Data integrity can be assured only when procedures are in place and rigorously practiced to exclude erroneous quality control results due to procedural blunders and statistical outliers.”

This sentence leads me to view whatever follows with suspicion. The author excludes two types of errors: blunders and statistical outliers. From a clinician’s point of view, the interest is in obtaining the correct answer (meaning “correct enough”). If an incorrect answer is produced by a blunder, it is nevertheless wrong. So right off the bat, one knows that whatever the author is measuring is a subset of real quality performance. One has also no doubt heard that the majority of clinical laboratory errors come from pre- or post-analytical problems.

But perhaps the author intends to measure a subset of quality performance – that due to the analytical process. Then I have a problem with excluding statistical outliers. There is no justification for this exclusion. From a simple numbers view, assume that excluded data are greater than 3 standard deviations from the average. This means excluding about 0.3% of the data, if the data are normally distributed. Well, that’s one way to get to an error rate of less than 0.001%! One might argue that exclusion of a specific outlier is ok because the outlier must have been due to a blunder, but that is speculation and it is possible that the outlier occurred as part of the analytical process – one simply does not know.
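To put the two numbers side by side, here is a minimal sketch in Python that computes the fraction of normally distributed data lying more than 3 standard deviations from the average and compares it with the six sigma goal quoted above. The function name `fraction_beyond` is hypothetical, and the normality assumption is the same one made in the paragraph above.

```python
import math

def fraction_beyond(k_sd: float) -> float:
    """Two-sided fraction of a normal distribution more than k_sd SDs from the mean."""
    return math.erfc(k_sd / math.sqrt(2.0))

excluded = fraction_beyond(3.0)   # roughly 0.0027, i.e. about 0.3% of the data
six_sigma_goal = 3.4e-6           # 3.4 errors per million opportunities

print(f"Excluded by a 3 SD rule: {excluded:.4%}")
print(f"Six sigma goal:          {six_sigma_goal:.5%}")
print(f"Ratio:                   {excluded / six_sigma_goal:.0f} to 1")
```

Under these assumptions, a 3 standard deviation exclusion rule discards several hundred times more data than the error rate the six sigma goal allows.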

From another point of view, if one constructs a Clarke or Parkes type of error grid (e.g., similar to that used for glucose), then dangerous errors will be (by definition of this grid) large errors in certain regions and are likely to be outliers (in a statistical sense) and could easily be excluded. But these are the very errors that should be measured (see essay on FDA’s new waiver guidance).

Along these lines, note that serious assay errors are associated with patient harm. These rare values:

  1. are likely to do the most harm
  2. are likely to be called a statistical outlier

The same arguments apply to proficiency testing, which often has automated outlier rules that “clean” the data.

This same trend to exclude outliers has been prevalent in discussions about publishing a CLSI standard for GUM (Guide to the Expression of Uncertainty in Measurement).

But not excluding outliers messes up our analysis

Welcome to the real world. That’s true. This is why I recommend assessing quality control data with an analysis that:

  1. does not exclude data
  2. measures distance from target
  3. reports the number of values outside of medically acceptable limits (either out low or out high)
  4. reports the estimated total analytical error
  5. reports the lower and upper 95% uncertainty limits (non-parametric estimation)
  6. reports the contribution of bias as a percent of total analytical error
  7. reports the contribution of imprecision as a percent of total analytical error
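Part of this list can be sketched in a few lines of Python. Everything named here is an assumption for illustration: the function `summarize_qc`, the made-up QC values, the target of 100, the allowed error of 10, and the nearest-rank rule used for the non-parametric percentiles. The sketch reports bias and standard deviation but does not split total analytical error into percent contributions, since that split depends on an error model not specified here.

```python
import math
import statistics

def summarize_qc(values, target, allowed_error):
    """Summarize QC results for one control level without excluding any value."""
    diffs = sorted(v - target for v in values)  # distance from target, all data kept
    n = len(diffs)

    # Count results outside the medically acceptable limits, low and high separately.
    out_low = sum(1 for d in diffs if d < -allowed_error)
    out_high = sum(1 for d in diffs if d > allowed_error)

    # Non-parametric 95% limits: empirical 2.5th and 97.5th percentiles,
    # taken by the nearest-rank rule (an assumption of this sketch).
    lower = diffs[max(0, math.ceil(0.025 * n) - 1)]
    upper = diffs[min(n - 1, math.ceil(0.975 * n) - 1)]

    return {
        "n": n,
        "out_low": out_low,
        "out_high": out_high,
        "lower_95": lower,
        "upper_95": upper,
        "bias": statistics.mean(diffs),
        "sd": statistics.stdev(diffs),
    }

# Hypothetical QC results; the last value would be discarded by a typical outlier rule.
qc_values = [98, 101, 99, 102, 100, 97, 103, 100, 99, 101,
             100, 98, 102, 101, 99, 100, 100, 101, 99, 125]

summary = summarize_qc(qc_values, target=100, allowed_error=10)
print(summary)
```

Because nothing is excluded, the one large value shows up both in the count of results outside the limits and in the upper uncertainty limit.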

Even if outliers are included

Remember that even if outliers are included, the quality performance measured is still a subset of the analytical performance. Random biases, especially those due to patient interferences, will be missed because:

  1. quality control material, rather than patient samples, is being tested
  2. the usual frequency of quality control testing will miss some intermittent errors

This is described in more detail in another essay.

In the list of essays, I used Outlier……………..s. I first saw this in: Beckman, R. J., and R. D. Cook, (1983). Outlier…s. Technometrics, vol. 25, pp. 119-149.

FMEA and Validation – 2/2006

February 13, 2006

In conducting a FMEA, one goes through the steps of

  • Modeling the process (often with help of process flowcharts)
  • Postulating all potential errors
  • Classifying all potential errors
  • Ranking the classified errors
  • Proposing mitigations for the top errors
  • Performing a FMEA on the new process (i.e., the process as changed by the mitigations)

If any of these steps is likely to be neglected, it is the last one: that of performing yet another FMEA! (It sounds recursive too, since a subsequent FMEA can cause more changes.) The purpose of this essay is to consider validation of a FMEA, which could be thought of as part of the task of performing a FMEA on the new process (i.e., as changed by mitigations).

An Example

Recall a model used in FMEA, namely the error, detection, recovery model (see figure), in which one tries to prevent the effect of an error, given that an error has occurred (see also the near miss essay).

For the example, consider the process steps when a sample arrives for analysis at a hospital laboratory (1). One of the steps is to examine the sample visually for lipemia and, if this condition is observed, to perform a “recovery”, often by notifying the source that sent the sample and/or by further processing the sample. Assume that the original error occurred outside of the laboratory that is responsible for analyzing the sample. This is a common situation, although the hospital laboratory that analyzes the sample may also be responsible for preparing it.

To put some numbers on this example, assume that the hospital laboratory receives 100,000 samples per year and that 1% of these samples should fail the criteria for lipemia. This means that 1,000 samples are lipemic. Now one may reason that all lipemic samples will be detected and a recovery performed because detection and recovery steps are in place. However, consider what would happen if these steps did not always work. Assume that the detection step was 95% effective and the recovery step was 99% effective. This means that of the 1,000 samples that are lipemic, 50 will not be detected and they will be analyzed in error. In addition, of the 950 samples that are detected, 9.5 will fail recovery, meaning that the total number of samples subject to the error effect is (on average) 50 + 9.5 = 59.5 per year, or 59.5/100,000 = 0.0595%.

To summarize:

  • the error event frequency is 1% (0.01 × 100,000 = 1,000 samples per year), with the error event being that a lipemic sample arrives for analysis
  • the error event effect frequency is 59.5/100,000 = 0.0595%, with the error event effect being that a lipemic sample is analyzed

Assume also that lipemia would cause a result error in 2% of samples. This means that for the original 100,000 samples, the frequency of the higher level observed error effect (a wrong answer) is the combined probability (59.5/100,000) × (2,000/100,000) × 100,000 = 1.19, or about 1.2 samples on average every year. This error could in turn result in anything from no patient harm to a patient death, but the point of this essay is to go back to the FMEA steps that have been put in place to detect and recover from the original error (rather than to focus on outcomes).
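The arithmetic of this example can be laid out as a short Python sketch. The function and variable names are hypothetical; the default values are the assumed numbers from the example.

```python
def lipemia_example(samples_per_year=100_000,
                    lipemic_fraction=0.01,
                    detection_success=0.95,
                    recovery_success=0.99,
                    result_error_fraction=0.02):
    """Walk through the error, detection, recovery arithmetic of the example."""
    lipemic = samples_per_year * lipemic_fraction            # error events per year
    missed = lipemic * (1 - detection_success)               # never detected
    detected = lipemic - missed
    failed_recovery = detected * (1 - recovery_success)      # detected, recovery failed
    analyzed_in_error = missed + failed_recovery             # error event effects
    wrong_results = analyzed_in_error * result_error_fraction
    return {
        "lipemic": lipemic,
        "missed": missed,
        "failed_recovery": failed_recovery,
        "analyzed_in_error": analyzed_in_error,
        "effect_rate_percent": 100 * analyzed_in_error / samples_per_year,
        "wrong_results": wrong_results,
    }

example = lipemia_example()
for name, value in example.items():
    print(f"{name:>20}: {value:.4f}")
```

Changing `detection_success` or `recovery_success` shows how strongly the number of lipemic samples analyzed depends on these two assumed rates.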


In this example, I arbitrarily set detection success at 95% and recovery success at 99%. The laboratory person responsible for quality might argue that both steps are failsafe and hence virtually 100% effective. If there is a valid criterion for lipemia, it might be hard to imagine how one could miss detecting it or fail to initiate a recovery. Nevertheless, validation provides objective evidence that detection and recovery meet their goals. To set up a validation experiment for detection, one might have an independent observer rate all samples for lipemia, in a way that does not interfere with the routine process in place for examining the sample. One can then tally results as:

Independent Observer | Routine Observer – Lipemic | Routine Observer – Not Lipemic
Lipemic              | Match                      | Error
Not Lipemic          | Error                      | Match

In this experiment, one is assuming that the independent observer is correct. An additional part of the validation experiment is the sample size. That is, say the independent observer has checked 100 consecutive samples and found no mismatches. The table might look like:

Independent Observer | Routine Observer – Lipemic | Routine Observer – Not Lipemic
Lipemic              | 1                          | 0
Not Lipemic          | 0                          | 99

The observed error rate for each of the two possible error types is zero, but the upper 95% confidence limits (2) for the two mismatch error rates are:

Independent Observer | Routine Observer – Lipemic | Routine Observer – Not Lipemic
Lipemic              | -                          | 95%
Not Lipemic          | 2.98%                      | -

The problem is that there has only been 1 opportunity to misclassify a lipemic sample so the confidence interval actually says that this error rate could be as high as 95%! Say one goes back and rigs the experiment to include 10% lipemic samples and runs the experiment for 500 samples and gets the following results.

Independent Observer | Routine Observer – Lipemic | Routine Observer – Not Lipemic
Lipemic              | 50                         | 0
Not Lipemic          | 0                          | 450

The observed error rate for either error type is again zero, but the upper 95% confidence limits for the two mismatch error rates are now:

Independent Observer | Routine Observer – Lipemic | Routine Observer – Not Lipemic
Lipemic              | -                          | 5.8%
Not Lipemic          | 0.66%                      | -

So even with all of this work, one has only “proved” (i.e., with 95% confidence) that the success rate for detecting lipemic samples is about 94% or better. Of course, it is also possible that mismatch rates will be nonzero. The same arguments apply to recovery.
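The limits quoted in the tables above can be reproduced with the standard one-sided upper confidence bound for a binomial proportion when zero errors are observed in n opportunities: 1 − (1 − confidence)^(1/n). The sketch below assumes that this is the formula behind the tabulated values, since it matches them; the function name `upper_limit_zero_errors` is hypothetical.

```python
def upper_limit_zero_errors(n: int, confidence: float = 0.95) -> float:
    """One-sided upper confidence bound on an error rate when 0 errors are seen in n trials."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / n)

# Opportunities for each error type in the two experiments described above.
for label, n in [("lipemic, first experiment", 1),
                 ("not lipemic, first experiment", 99),
                 ("lipemic, second experiment", 50),
                 ("not lipemic, second experiment", 450)]:
    print(f"{label:>32}: n = {n:3d}, upper limit = {upper_limit_zero_errors(n):.2%}")
```

The bound shrinks only slowly as n grows, which is why 50 lipemic samples with no mismatches still leave room for a miss rate of almost 6%.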

Errors and Outcomes

I assumed the initial error rate caused by missed detection and recovery to be 59.5 samples per year, but this error rate leads to a wrong result for only about 1 sample per year. This may lead the hospital laboratory into a false sense of security: their current process may be flawed yet not lead to customer complaints. Hence, one should exclude outcomes from the analysis, since the hospital laboratory can control the outcome rate only through its detection and recovery rates.

Making up examples is difficult but there are real problems

Validation should lead to a case where no errors are found, which may make one exclaim they have been forced to do something for which they already knew the outcome. However, consider the following real cases:

  • Detection – Detection was missed when organs of the wrong blood type were selected to be transplanted, the transplant occurred and the patient died (3).
  • Detection – Airline pilots repeat air traffic controller orders to detect miscommunication. Yet, miscommunication detection failed and caused one of the largest air disasters ever (4).
  • Recovery – It was detected that the wrong leg was scheduled to be amputated but the recovery (change the operating room schedules) failed. Not all operating room schedules were changed (5) and the wrong leg was amputated.

Hence, even though it might be hard to envision how things can go wrong, there are real cases where seemingly simple detection and recovery process steps have failed. Validation is suggested as a means to help to ensure that new or existing mitigations work – and should be considered as a tool to help with performing a FMEA on mitigations.

The quality of validation – Equivalent QC

CMS has proposed equivalent QC for clinical laboratories. In changing the QC process, CMS requires validation (of use of equivalent QC). I have commented on the inadequacy of this validation (see equivalent QC essay). This leaves the question of what is an adequate validation. In some cases, people conducting a FMEA might assume perfect detection and recovery. Some level of validation beyond this assumption is warranted but must one conduct experiments that contain thousands of samples to prove that rare events haven’t happened? This topic will be pursued in a future essay.
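To give a feel for the sample sizes involved, here is a rough Python sketch. It assumes a validation in which zero errors are observed and a one-sided 95% confidence bound on a binomial error rate; the function name `samples_needed` and the example rates are hypothetical. It asks how many consecutive error-free opportunities are needed before one can claim that the error rate is below a given value.

```python
import math

def samples_needed(max_error_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that 0 errors in n trials bounds the error rate below max_error_rate."""
    if not 0.0 < max_error_rate < 1.0:
        raise ValueError("max_error_rate must be between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - max_error_rate))

for rate in (0.05, 0.01, 0.001):
    print(f"to show an error rate below {rate:.1%}: {samples_needed(rate)} error-free samples")
```

Under these assumptions, demonstrating an error rate below 0.1% already takes close to 3,000 error-free opportunities, and each of those opportunities must be a sample in which the error could actually have occurred.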


  1. Application of a Quality System Model for Laboratory Services; Approved Guideline—Second Edition GP26-A3 NCCLS 2004 Wayne, PA.
  2. Hahn GJ and Meeker WQ. Statistical intervals. A guide for practitioners. Wiley: New York, 1991, p. 104
  3. See http://www.cbsnews.com/stories/2003/02/18/health/main540907.shtml
  4. Fatal Words: Communication Clashes and Aircraft Crashes by Steven Cushing University of Chicago Press, 1997, Chicago, IL
  5. Scott D. Preventing medical mistakes. RN. 2000 Aug;63(8):60-4