Just published

May 8, 2019

The article “Getting More Information From Glucose Meter Evaluations” has just been published in the Journal of Diabetes Science and Technology.

Our article makes several points. In the ISO 15197 glucose meter standard (2013 edition), one is supposed to prepare a table showing the percentage of system accuracy results within 5, 10, and 15 mg/dL. Our recommendation is to graph these results in a mountain plot – it is a perfect example of when a mountain plot should be used.

Now I must confess that until we prepared this paper, I had not read ISO 15197 (2013). But based on some reviewer comments, it was clear that I had to bite the bullet, send money to ISO and get the standard. Reading it was an eye opener. The accuracy requirement is:

95% within ± 15 mg/dL (< 100 mg/dL) and within ± 15% (≥ 100 mg/dL) and
99% within the A and B zones of an error grid

I knew this. But what I didn’t know until I read the standard is that user error from the intended population is excluded from this accuracy protocol. Moreover, even the healthcare professionals performing this study could exclude any result if they thought they had made an error. I can imagine how this might work: That result can’t be right…

In any case, as previously mentioned in this blog, in the section where users are tested, the requirement for 99% of the results to be within the A and B zones of an error grid was dropped.

In the section where results may be excluded, failure to obtain a result is listed since if there’s no result, you can’t get a difference from reference. But there’s no requirement for the percentage of times a result can be obtained. This is ironic since section 5 is devoted to reliability. How can you have a section on reliability without a failure rate metric?


Big errors and little errors

May 27, 2018

In clinical assay evaluations, most of the time, focus is on “little” errors. What I mean by little errors are average bias and imprecision that exceed goals. Now I don’t mean to be pejorative about little errors since if bias or imprecision don’t meet goals, the assay is unsuitable. One of the reasons to distinguish between big and little errors is that often in evaluations, big errors are discarded as outliers. This is especially true in proficiency surveys but even for a simple method comparison, one is justified in discarding an outlier because the value would otherwise perturb the bias and imprecision estimates.

But big errors cause big problems and most evaluations focus on little errors, so how are big errors studied? Other than running thousands of samples, a valuable technique is to perform an FMEA (Failure Mode Effects Analysis). This can and should cover user error, software, and interferences, in addition to the usual items. An FMEA study is often not very enthusiastically received but it is a necessary step in trying to ensure that an assay is free from both big and little errors. Of course, even with a completed FMEA, there are no guarantees.


New publication about interferences

April 20, 2018

My article “Interferences, a neglected error source for clinical assays” has been published. This article may be viewed using the following link: https://rdcu.be/L6O2

Two examples of why interferences are important and a comment about a “novel approach” to interferences

September 29, 2017

I had occasion to read an open access paper “full method validation in clinical chemistry.” So with that title, one expects the big picture and this is what this paper has. But when it discusses analytical method validation, the concept of testing for interfering substances is missing. Precision, bias, and commutability are the topics covered. Now one can say that an interference will cause a bias and this is true but nowhere do these authors mention testing for interfering substances.

The problem is that eventually these papers are turned into guidelines, such as ISO 15197, which is the guideline for glucose meters. And this guideline allows 1% of the results to be unspecified (it used to be 5%). This means that an interfering substance could cause a large error resulting in serious harm in 1% of the results. Given the frequency of glucose meter testing, this translates to one potentially dangerous result per month for an acceptable (according to ISO 15197) glucose meter. If more attention were paid to interfering substances and the fact that they can be large and cause severe patient harm, the guideline might not have allowed 1% of the results to remain unspecified.
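The “one per month” figure can be checked with simple arithmetic. The sketch below assumes a user who tests three times a day; that frequency is a hypothetical value chosen for illustration and does not come from ISO 15197.

```python
# Assumed testing frequency (hypothetical; not taken from ISO 15197)
tests_per_day = 3
days_per_month = 30

# Fraction of results that the standard leaves unspecified
unspecified_fraction = 0.01

tests_per_month = tests_per_day * days_per_month
unspecified_per_month = tests_per_month * unspecified_fraction

print(f"{tests_per_month} tests per month, "
      f"about {unspecified_per_month:.1f} unspecified result(s) per month")
```

With more frequent testing the expected number rises proportionally.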

I attended a local AACC talk given by Dr. Inker about GFR. The talk, which was very good, had a slide about a paper on creatinine interferences. After the talk, I asked Dr. Inker how she dealt with creatinine interferences on a practical level. She said there was no way to deal with this issue, which was echoed by the lab people there.

Finally, there is a paper by Dr. Plebani, who cites the paper: Vogeser M, Seger C. Irregular analytical errors in diagnostic testing – a novel concept. (Clin Chem Lab Med 2017, ahead of print). Ok, since this is not an open access paper, I didn’t read it, but from what I can tell from Dr. Plebani’s comments, the cited authors have discovered the concept of interfering substances and think that people should devote attention to it. Duh! And particularly irksome is the suggestion by Vogeser and Seger of “we suggest the introduction of a new term called the irregular (individual) analytical error.” What’s wrong with interference?

Discrepant Analysis in Every Day Situations

January 18, 2006

Abnormal patient sample results are often repeated, whereas non-abnormal results are usually not repeated. The same is true for QC; namely, values that are “out” are often repeated, whereas values that are “in” are usually not repeated. This essay considers some attributes of this practice.

Discrepant analysis for evaluations

Before considering the above cases, consider discrepant analysis for evaluations (1-2). For example, consider an assay evaluation, where one compares an assay result to a gold standard diagnosis (3). Usually, most assay results will agree with the gold standard diagnosis, but there will be a few exceptions. In discrepant analysis, assay results that don’t agree are rerun. This results in either confirmation or non confirmation of the discrepancy. This is shown graphically below.

| | Candidate assay positive | Candidate assay negative |
|---|---|---|
| Reference method positive | Result agrees | Discrepant |
| Reference method negative | Discrepant | Result agrees |

Repeat discrepants only

| | Candidate assay positive | Candidate assay negative |
|---|---|---|
| Reference method positive | Result agrees | Discrepant |
| Reference method negative | Discrepant | Result agrees |

In the second table, the number of discrepant results is either the same as or lower than in the first table (with the number of “result agrees” either the same or higher).

Before continuing, one may ask what is the root cause for the discrepant result. Whereas there can be many root causes, consider two generic top level effects – the result upon rerun either gives essentially the same value (e.g., remains discrepant) or gives a different value (for the sake of argument, assume the new value is no longer discrepant). The first case is consistent with a fixed bias, whereas the second case is consistent with a random error.

So far, there is nothing wrong with the above practice and it is natural to try to explore discrepant samples. Where people get into trouble is estimating things with the results of the study (which of course is almost always the case, or else why do the study). In such an evaluation, one wants to estimate the analytical sensitivity and analytical specificity of the assay. The problem is, constructing these estimates using the discrepancy procedure above (e.g., results from the second table) is biased, and results in diagnostic accuracy estimates that are too optimistic (1-2). One can think of this bias intuitively. That is, consider a sample that agrees with the gold standard diagnosis. Were this sample rerun, it might yield a discrepant result upon rerun (due to statistical variation and especially if the initial result were close to being a discrepant result in the first case). But there is no chance for this to occur because in the above procedure, only samples that are initially discrepant are rerun. Hence, the estimates are too optimistic.

To summarize, one could of course run replicates for all samples, but this might add too much expense to the study. It is reasonable to try to resolve only the discrepant results by rerunning them as long as one calculates sensitivity and specificity correctly (see references 1-2).
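The optimistic bias can be illustrated with a small simulation. This is only a sketch under stated assumptions: the gold standard is treated as the truth, every assay run is an independent random draw (the “random error” case above), and the function name, prevalence, sensitivity, and specificity values are all hypothetical.

```python
import random

def discrepant_analysis_demo(n=20000, prevalence=0.3, sens=0.90, spec=0.95, seed=1):
    """Compare first-pass estimates with estimates after rerunning only discrepants."""
    rng = random.Random(seed)

    def run_assay(diseased):
        # One assay run: positive with probability sens (diseased) or 1 - spec (healthy).
        p_positive = sens if diseased else 1.0 - spec
        return rng.random() < p_positive

    counts = {"first": {"tp": 0, "fn": 0, "tn": 0, "fp": 0},
              "resolved": {"tp": 0, "fn": 0, "tn": 0, "fp": 0}}

    def tally(table, diseased, positive):
        if diseased:
            table["tp" if positive else "fn"] += 1
        else:
            table["fp" if positive else "tn"] += 1

    for _ in range(n):
        diseased = rng.random() < prevalence
        first = run_assay(diseased)
        tally(counts["first"], diseased, first)
        # Only results that disagree with the gold standard are rerun;
        # the rerun result then replaces the initial result.
        resolved = run_assay(diseased) if first != diseased else first
        tally(counts["resolved"], diseased, resolved)

    def sens_spec(t):
        return t["tp"] / (t["tp"] + t["fn"]), t["tn"] / (t["tn"] + t["fp"])

    return sens_spec(counts["first"]), sens_spec(counts["resolved"])

(first_sens, first_spec), (res_sens, res_spec) = discrepant_analysis_demo()
print(f"first pass:     sensitivity {first_sens:.3f}, specificity {first_spec:.3f}")
print(f"after resolving: sensitivity {res_sens:.3f}, specificity {res_spec:.3f}")
```

Because agreeing results never get a chance to become discrepant, the resolved estimates can only move upward, even though the assay itself has not changed.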

General Comment about Discrepant Analysis in Every Day Situations

In every day situations, discrepant analysis is used to inform an action to be taken. For a patient sample result, the result of discrepant analysis will inform the choice between two treatment alternatives (typically corresponding to those associated with a “normal” or “abnormal” result). In QC, the result of discrepant analysis will inform whether or not to rerun a block of patient samples and to troubleshoot the assay.

Discrepant analysis for Patient Samples

In routine practice, patient sample results are often repeated according to rules set up by the clinical laboratory and in a typical case, only abnormal results are rerun.

This is not the same as the discrepant analysis discussed in the section on evaluations because there is no reference method. However, the same arguments apply, because only selected patient sample results are repeated. One could perhaps consider the result slated to be repeated as discrepant from a working hypothesis that the result was “normal”.

The practice of repeating only selected samples was questioned in another essay with respect to troponin I.

To recall, in that study a point of care assay result for troponin I was repeated only if the result was above the cutoff. The study was used to support a reduced length of stay in the emergency department.

The bottom line is: what is the performance of the assay (e.g., its analytical sensitivity and analytical specificity) given the clinical laboratory’s specific practice of repeating only selected samples? This is likely to differ from the analytical sensitivity and analytical specificity of the assay as determined by an evaluation.

Discrepant analysis for QC

The same arguments apply to QC. That is, when QC is out, one of the first (troubleshooting) steps is to repeat the QC to see if it is repeatedly out (see cases 1 and 2 below). Yet, a QC that is “in” is not repeated. Note that for multiple rule QC programs, nothing really changes, because if one needs three observations to fulfill a criterion before QC is considered out, then one can simply consider that set of observations as a case. So, again, the bottom line is: what is the performance of the QC procedure (e.g., the equivalent of analytical sensitivity and analytical specificity) given the practice of only repeating discrepant samples?

QC is different than the patient case above, because the result of QC affects a block of patient samples. For example, if a patient sample result was initially abnormal, but normal upon several repeats, then it is likely that random error caused the initial result to be abnormal and that the result is normal and can be so reported. This type of argument does not apply to QC. If a QC sample that is “out” does not repeat as “out”, one has no way of knowing whether the cause of the initial “out” result affected one or more patient sample results.

Moreover, with QC one can distinguish between the following cases:

Case 1 – QC is out. Rerun the QC sample. If the rerun is in, declare the run OK. This is the discrepant analysis case under discussion.

Case 2 – QC is out. Declare the run out and rerun the patient samples. Rerun the QC sample as part of a procedure to troubleshoot the assay. This is not discrepant analysis since the decision to take action (rerun the patient samples and troubleshoot the assay) has already been made.

Acknowledgement: Helpful comments were provided by Sten Westgard.


1. Lipman HB, Astles JR. Quantifying the bias associated with use of discrepant analysis. Clinical Chemistry 1998;44:108-115.

2. User protocol for evaluation of qualitative test performance: Approved Guideline EP12A. 2002. CLSI, 940 West Valley Road, Suite 1400, Wayne, PA 19087.

3.       There are several variations to this scheme with respect to the accuracy of the gold standard diagnosis and whether the repeat assay uses a different (e.g., better reference procedure). These are beyond the scope of this essay.


How to specify and estimate outlier rates

July 17, 2004

Outliers are often distinguished from other error sources because the root cause of the outlier may differ from other error sources or because some authors recommend different disposition of outliers once they are detected (often as in “don’t worry about that result – it’s an outlier”). Unfortunately, some of these practices have led to the neglect of outliers. Outliers are errors just like all other errors, only larger. Moreover, outliers are often the source of medical errors, since a large assay result error can lead to an incorrect medical treatment (1).

Setting outlier goals

An outlier goal is met if the percentage of observations in region A in Figure 1 is below a specified rate. A total error goal is met if the percentage of observations in region B is at or greater than a specified percentage (often 95% or 99%) – see for example the NCCLS standard EP21A (2). The space that is between regions A and B is specified to contain the percentage of observations equal to B – A.

Figure 1. Outlier and Total Error Limits [figure not available; labels: Outlier limits, Total error]


Estimating outlier rates

The difficulty in estimating outlier rates is that one is trying to prove that an unlikely event does not happen. There are two possible ways to do this and each has its advantages and disadvantages. Moreover, outliers are often the result of a different distribution than most of the other results. This makes it impossible to estimate outlier rates by simply assuming that all results come from a normal distribution.

| Method | Advantage | Disadvantage |
|---|---|---|
| Modeling | Requires fewer samples | Modeling is difficult (and time consuming) – if wrong, the estimated outlier rate will be wrong |
| Counting | No modeling is required | Requires a huge number of samples |



There are several types of modeling methods. One is to create a cause and effect or fishbone diagram of an assay and simulate assay results by selecting random observations from each assumed or observed distribution of assay variables to create an “assay result” and subtracting an assumed reference value from this result to obtain an assay error. The distribution of these differences allows one to estimate outlier rates.
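A minimal sketch of this kind of simulation follows. The error sources, their distributions, the reference value, and the outlier limit are all hypothetical values chosen for illustration; a real model would take them from the fishbone diagram and from observed data.

```python
import random

def simulate_assay_errors(n, seed=0):
    """Draw n simulated assay errors (result minus reference) from assumed distributions."""
    rng = random.Random(seed)
    reference = 100.0  # assumed reference value, mg/dL
    errors = []
    for _ in range(n):
        calibration_bias = rng.gauss(0.0, 2.0)   # assumed lot-to-lot calibration error
        imprecision = rng.gauss(0.0, 3.0)        # assumed random measurement error
        # Assumed rare interference: affects 0.2% of samples and adds 40 mg/dL
        interference = 40.0 if rng.random() < 0.002 else 0.0
        result = reference + calibration_bias + imprecision + interference
        errors.append(result - reference)
    return errors

def outlier_rate_percent(errors, limit):
    """Percentage of simulated errors whose magnitude exceeds the outlier limit."""
    n_outliers = sum(1 for e in errors if abs(e) > limit)
    return 100.0 * n_outliers / len(errors)

errors = simulate_assay_errors(100_000)
rate = outlier_rate_percent(errors, limit=20.0)
print(f"estimated outlier rate: {rate:.2f}%")
```

In this sketch nearly all of the outliers come from the rare interference term, which is exactly the kind of source that a normal-distribution assumption would miss.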

The “GUM method” (Guide to the Expression of Uncertainty in Measurement) also starts with a cause and effect or fishbone diagram of an assay. In the GUM method, a mathematical model is used to link all random and systematic error sources. All systematic errors are either corrected by adjustment or, when the error is unexplained, converted into random errors. All (resulting) random errors are combined using the mathematical model, following the rules of the propagation of error, to yield a standard deviation that expresses the combined uncertainty of all error sources. A multiple of this standard deviation (the multiplier is the coverage factor) provides a range for the differences between an assay and its reference for a percentage of the population of results. By selecting a suitable multiplier, one may estimate the magnitude of this range of differences (e.g., the outlier limits) for the desired percentage of the population (e.g., the outlier rate) that corresponds to the outlier goal. A concern with use of the GUM method is that it requires modeling all known errors. If an error is unknown, it won’t be modeled and the GUM standard deviation will be underestimated (3).
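The combination step can be sketched as a root-sum-of-squares calculation. This sketch assumes independent error sources that each contribute directly to the result, and a normal distribution when choosing the coverage factor; the component names and values are hypothetical.

```python
import math
from statistics import NormalDist

def combined_standard_uncertainty(components):
    """Root-sum-of-squares of independent standard uncertainties."""
    return math.sqrt(sum(u ** 2 for u in components.values()))

def coverage_factor(coverage):
    """Two-sided normal multiplier for the requested coverage (0.95 gives about 1.96)."""
    return NormalDist().inv_cdf(0.5 + coverage / 2.0)

# Hypothetical standard uncertainties, in mg/dL
modeled = {"calibration": 3.0, "imprecision": 4.0}
u_c = combined_standard_uncertainty(modeled)
limit_95 = coverage_factor(0.95) * u_c

# A hypothetical error source that was never modeled
with_unknown = dict(modeled, interference=5.0)
u_c_full = combined_standard_uncertainty(with_unknown)

print(f"modeled sources only: u_c = {u_c:.2f}, 95% limits = +/- {limit_95:.2f}")
print(f"with the unmodeled source: u_c = {u_c_full:.2f}")
```

The second figure is larger than the first, which is the underestimation concern in numerical form: whatever is left out of the model is also left out of the combined standard deviation.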


An FMEA (Failure Mode Effects Analysis) seeks to identify all possible failure modes, and for those modes that are ranked as most important, mitigations are implemented to reduce risk. Thus, at the end of an FMEA, one has the potential to quantify outlier rates, although in practice in clinical chemistry final outlier risk is rarely quantified. FMEA is important because risk is assessed for non continuous variables, such as the risk of reporting an assay value for the wrong patient.


In the counting method, outliers are considered as discrete events. That is, each assay result is judged independently from every other result to be either an outlier or not, based on the magnitude of the difference between the result and reference. Of course, the choice of reference method is important. If the reference method is not a true reference method but a comparison method (another field method), then there is no way to know whether a large difference that is being called an outlier is due to the new method or the existing method.

The outlier rate is simply the number of outliers found divided by the total number of samples assayed, expressed as a percentage.

Outlier rate = (x/n) * 100

where x = the number of outliers found and n = the total number of samples assayed

This rate is not exact because it is estimated from a sample. Hahn and Meeker present a method to account for this uncertainty (4). The table shows, for various numbers of total observations and outliers found, the maximum percentage outlier rate with a stated level of confidence. This gives one an idea of the sample sizes required to prove the maximum outlier rate.

| Sample size | Number of outliers found | Maximum percent outlier rate (95% confidence) | Maximum percent outlier rate (99% confidence) | ppm outlier rate (95%) | ppm outlier rate (99%) |
|---|---|---|---|---|---|
| 10 | 0 | 25.9 | 36.9 | 259,000 | 369,000 |
| 100 | 0 | 3.0 | 4.5 | 30,000 | 45,000 |
| 1,000 | 0 | 0.3 | 0.5 | 3,000 | 5,000 |
| 1,000 | 1 | 0.5 | 0.7 | 5,000 | 7,000 |
| 10,000 | 0 | 0.03 | 0.05 | 300 | 500 |
| 10,000 | 1 | 0.05 | 0.07 | 500 | 700 |
| 10,000 | 10 | 0.2 | 0.2 | 2,000 | 2,000 |
| 881,000 | 0 | 3.40037E-04 | 5.23E-04 | 3.4 | 5.2 |

The last entry (881,000 samples) is a “six sigma” process.


Understanding the table entries

Using the third row as an example, 1,000 samples have been run and no outliers have been found. The estimated outlier rate is zero. However, this is only a sample and subject to sampling variation. Using properties of the binomial distribution allows one to state with 95% confidence that there could be no more than 0.3% outliers for the true rate. This is equivalent to saying that in 1,000,000 samples there could be no more than 3,000 outliers. 
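The table entries are consistent with a one-sided upper confidence bound on a binomial proportion. The sketch below is one way to compute such a bound with only the standard library; it is an assumption that this matches the Hahn and Meeker procedure in detail, and the function names are hypothetical.

```python
import math

def binom_cdf(x, n, p):
    """P(X <= x) for X ~ Binomial(n, p), with terms computed in log space."""
    total = 0.0
    for k in range(x + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
                    + k * math.log(p) + (n - k) * math.log1p(-p))
        total += math.exp(log_term)
    return min(total, 1.0)

def max_outlier_rate(n, x, confidence=0.95):
    """One-sided upper confidence bound on the true outlier rate, as a fraction."""
    alpha = 1.0 - confidence
    if x >= n:
        return 1.0
    if x == 0:
        # Closed form: solve (1 - p)**n = alpha for p.
        return -math.expm1(math.log(alpha) / n)
    # Otherwise find p where P(X <= x) = alpha by bisection (the CDF falls as p rises).
    lo, hi = x / n, 1.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if binom_cdf(x, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi

for n, x in [(10, 0), (1000, 0), (1000, 1), (10000, 10), (881000, 0)]:
    p95 = max_outlier_rate(n, x, 0.95)
    p99 = max_outlier_rate(n, x, 0.99)
    print(f"n={n:>7}, outliers={x:>2}: max {100 * p95:.4g}% (95%), {100 * p99:.4g}% (99%)")
```

For zero observed outliers the bound is close to 3/n at 95% confidence, which is why each tenfold increase in sample size lowers the provable rate by about a factor of ten.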

“Six sigma” and outliers

The popular six sigma paradigm assumes that if one has a process with a 1.5 standard deviation shift and variation of 6 standard deviations, the number of defects will be 3.4 per million. Defects per million for 1 to 6 sigma are shown in the following table.

| Sigma level (SL) | NORMSDIST(SL) | NORMSDIST(SL+1.5) | NORMSDIST(1.5-SL) | Prob. good | Prob. defect | Defects per million |
|---|---|---|---|---|---|---|
| 1 | 0.84134474 | 0.99379 | 0.691462 | 0.302328 | 0.697672 | 697672.1 |
| 2 | 0.977249938 | 0.999767 | 0.308538 | 0.69123 | 0.30877 | 308770.2 |
| 3 | 0.998650033 | 0.999997 | 0.066807 | 0.933189 | 0.066811 | 66810.6 |
| 4 | 0.999968314 | 1 | 0.00621 | 0.99379 | 0.00621 | 6209.7 |
| 5 | 0.999999713 | 1 | 0.000233 | 0.999767 | 0.000233 | 232.7 |
| 6 | 0.999999999 | 1 | 3.4E-06 | 0.999997 | 3.4E-06 | 3.4 |


These results assume a normal distribution. In a diagnostic assay, it would be difficult if not impossible to prove that all results are normally distributed. However, the entry at the bottom of the first table (881,000 samples with no outliers) corresponds to a six sigma process of 3.4 defects per million.
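The six sigma table can be reproduced from the standard normal cumulative distribution. The sketch below takes the probability of a good result as the normal CDF at SL + 1.5 minus the normal CDF at 1.5 - SL, which is what the table values imply; the function names are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution (the spreadsheet NORMSDIST function)."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def defects_per_million(sigma_level, shift=1.5):
    """Defects per million for a process at the given sigma level with a mean shift."""
    prob_good = normal_cdf(sigma_level + shift) - normal_cdf(shift - sigma_level)
    return (1.0 - prob_good) * 1_000_000

for sigma_level in range(1, 7):
    print(f"{sigma_level} sigma: {defects_per_million(sigma_level):,.1f} defects per million")
```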


Laboratories are not going to run 10,000 samples (nor should they) to prove that there are no outliers. Unfortunately, there are proposals to get laboratories to perform a limited type of GUM modeling which is totally inadequate and would prove nothing (3). Manufacturers could (and do) run large numbers of samples during assay development but don’t want to include estimation of outlier rates in their product labeling.

Thus, outliers remain an ignored topic and only surface when they cause problems. One possible remedy would be a uniform way for manufacturers to report outlier studies as part of their product labeling.


  1. Cole LA, Rinne KM, Shahabi S, Omrani A. False-positive hCG assay results leading to unnecessary surgery and chemotherapy and needless occurrences of diabetes and coma. Clin Chem 1999;45:313-314.
  2. National Committee for Clinical Laboratory Standards. Estimation of total analytical error for clinical laboratory methods; approved guideline. NCCLS document EP21-A. Villanova, PA: NCCLS; 2003.
  3. Krouwer JS. Critique of the Guide to the Expression of Uncertainty in Measurement method of estimating and reporting uncertainty in diagnostic assays. Clin Chem 2003;49:1818-1821.
  4. Hahn GJ, Meeker WQ. Statistical intervals: a guide for practitioners. New York: Wiley, 1991, p. 104.