Review of total error

February 6, 2019

History – Total error has probably been around for a long time but the first mention that I found is from Mandel (1). In talking about a measurement error, he wrote:

error = x – R = (x – mu) + (mu – R), where x is a measurement, mu is the long-run mean of such measurements, and R is the reference value

The term (x – mu) is the imprecision and (mu – R) is the inaccuracy. An implied assumption is that the errors are IIDN = independently and identically distributed in a normal distribution with mean zero and variance sigma squared. With laboratory assays of blood, this is almost never true.

Westgard model – The Westgard model of total error (2) is the same as Mandel's, namely that

Total error TE = bias + 2 times imprecision.
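A minimal sketch of this calculation in Python, assuming that bias is estimated as the mean difference from the reference value and imprecision as the SD of replicate measurements. The function name, the use of the absolute value of the bias, and the data are illustrative assumptions, not taken from reference 2.

```python
import statistics

def westgard_total_error(replicates, reference, k=2.0):
    """Return (TE, bias, SD) with TE = |bias| + k * SD; k = 2 as in the text."""
    bias = statistics.mean(replicates) - reference   # average bias
    sd = statistics.stdev(replicates)                # imprecision (sample SD)
    return abs(bias) + k * sd, bias, sd

# Made-up replicate results for a sample whose reference value is 100
replicates = [101.0, 103.0, 102.0, 104.0, 100.0]
te, bias, sd = westgard_total_error(replicates, reference=100.0)
print(f"bias = {bias:.2f}, SD = {sd:.2f}, TE = {te:.2f}")
```

Note that nothing in this calculation looks at individual samples, which is why sample-specific interferences are invisible to it.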

The problem with this model is that it neglects other errors, with interfering substances affecting individual samples as perhaps the most important. Note that it is not just rare, large interferences that are missed in this model. I described a case where small interferences inflate the total error (3).

Lawton model – The Lawton model (4) adds interfering substances affecting individual samples.

Other factors – I added (5) to the Lawton model by including other factors such as drift, sample carryover, and reagent carryover.

Here’s an example of a problem with the Westgard model. This model suggests that average bias accounts for systematic error and imprecision accounts for random error. Say you have an assay with linear drift over a 30-minute calibration cycle. The assay starts out with a negative bias, has zero bias at 15 minutes, and ends with a positive bias. The Westgard model would estimate zero bias for the systematic error and attribute all of the observed variation to imprecision, that is, to random error. But this is not right. There is clearly systematic bias (as a function of time), and the calculated imprecision (the SD of the observations) is inflated by the drift, so it is not equal to random error.
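The drift example can be simulated in a few lines of Python. This is only a sketch: the size of the drift (from -2 to +2 units), the one-reading-per-minute schedule, and the complete absence of random error are assumptions chosen to make the point visible.

```python
import statistics

# Hypothetical assay: bias drifts linearly from -2 to +2 units over a
# 30 minute calibration cycle, and there is no random error at all.
reference = 100.0
times = list(range(0, 31))                              # minutes 0..30
results = [reference + (-2.0 + 4.0 * t / 30.0) for t in times]

diffs = [r - reference for r in results]
mean_bias = statistics.mean(diffs)   # what the model reports as systematic error
sd = statistics.stdev(diffs)         # what the model reports as random error

print(f"mean bias = {mean_bias:.3f}")
print(f"SD        = {sd:.3f}")
```

The mean bias comes out at essentially zero and the SD at about 1.2 units, even though every bit of the error here is systematic.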

The problem with Bland Altman Limits of Agreement – In this method, one multiplies the SD of the differences between the candidate method and reference by a factor (usually 2). This is an improvement since interferences or other error sources are included in the SD of differences. But the differences must be normally distributed and outliers are allowed to be discarded. If outliers are discarded, one cannot claim total error.
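A sketch of the limits of agreement calculation, assuming the limits are centered on the mean difference and use a multiplier of 2. The function name and the paired results are made up for illustration.

```python
import statistics

def limits_of_agreement(candidate, reference, k=2.0):
    """Return (lower, upper) = mean difference -/+ k * SD of differences."""
    diffs = [c - r for c, r in zip(candidate, reference)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)
    return mean_diff - k * sd_diff, mean_diff + k * sd_diff

# Made-up paired results (candidate method vs. reference)
candidate = [5.1, 6.0, 7.4, 8.2, 9.9, 11.0]
reference = [5.0, 6.2, 7.0, 8.0, 10.0, 11.3]

lower, upper = limits_of_agreement(candidate, reference)
print(f"limits of agreement: {lower:.2f} to {upper:.2f}")
```

Any pair removed from the two lists before this call narrows the limits, which is exactly why discarding outliers undermines a claim of total error.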

The problem with measurement uncertainty – The GUM method (Guide to the Expression of Uncertainty in Measurement) is a bottom-up approach that adds all errors as sources of imprecision. I have critiqued this method (6): bias is not allowed in the method, which does not match what happens in the real world, and errors that cannot be modeled will not be captured.

The problem with probability models – Any one of the above models paradoxically cannot account for 100% of the results which makes the term “total” in total error meaningless. The above probability models will never account for 100% of the results as the 100% probability error limits stretch from minus infinity to plus infinity (7).
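To see how the limits widen with coverage, here is a small sketch that assumes Gaussian errors and uses Python's standard library normal distribution. The function name is hypothetical.

```python
from statistics import NormalDist

def coverage_multiplier(coverage):
    """Two-sided multiplier k so that +/- k * SD contains `coverage` of a
    Gaussian distribution (coverage must be strictly between 0 and 1)."""
    return NormalDist().inv_cdf(0.5 + coverage / 2.0)

for coverage in (0.95, 0.99, 0.999, 0.999999):
    k = coverage_multiplier(coverage)
    print(f"{coverage:.6f} coverage -> +/- {k:.2f} SD")
```

The multiplier keeps growing as the coverage approaches 100%, and there is no finite multiplier for exactly 100%.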

Errors that cannot be modeled – An additional problem is that there are errors that can occur but really can’t be modeled, such as user errors, software errors, manufacturing mistakes, and so on (7). The Bland Altman method does not suffer from this problem, while all of the other methods above do.

A method to account for all results – The mountain plot (8) is simply a plot (or table) of differences of the candidate method from reference. No data are discarded. This is a nonparametric estimate of total error. A limitation is that error sources that are not part of the experiment may lead to an underestimate of total error.
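A sketch of how the points of a mountain plot (a folded empirical cumulative distribution of the differences) might be computed. The percentile convention used here (rank divided by n) and the function name are assumptions; reference 8 gives the original description.

```python
def mountain_plot_points(candidate, reference):
    """Return (difference, folded percentile) pairs, sorted by difference.

    Percentiles above 50 are folded back as 100 - percentile, so the
    plot peaks near the median difference and tails off on both sides.
    """
    diffs = sorted(c - r for c, r in zip(candidate, reference))
    n = len(diffs)
    points = []
    for rank, d in enumerate(diffs, start=1):
        pct = 100.0 * rank / n
        folded = pct if pct <= 50.0 else 100.0 - pct
        points.append((d, folded))
    return points

# Made-up paired results; every pair is kept, none discarded
candidate = [5.1, 6.0, 7.4, 8.2]
reference = [5.0, 6.2, 7.0, 8.0]
for d, folded in mountain_plot_points(candidate, reference):
    print(f"{d:+.2f}  {folded:5.1f}")
```

Printing the pairs gives the table form mentioned above; plotting folded percentile against difference gives the mountain.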

Error Grid Analysis – One overlays a scatterplot from a method comparison on an error grid. The analysis is simply to tally the proportions of observations in each error grid zone. This analysis also accounts for all results.
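The tally itself is simple, as this sketch shows. The zone boundaries below (A within 15% of reference, B within 30%, C otherwise) are invented purely for illustration and do not correspond to any published error grid, which would define zones by clinical consequences.

```python
from collections import Counter

def zone(candidate, reference):
    """Assign a result to a hypothetical zone by percent difference."""
    pct = abs(candidate - reference) / reference * 100.0
    if pct <= 15.0:
        return "A"
    if pct <= 30.0:
        return "B"
    return "C"

def tally_zones(pairs):
    """Proportion of (candidate, reference) pairs falling in each zone."""
    counts = Counter(zone(c, r) for c, r in pairs)
    n = len(pairs)
    return {z: counts.get(z, 0) / n for z in ("A", "B", "C")}

# Made-up method comparison data: (candidate, reference)
pairs = [(100, 100), (110, 100), (120, 100), (90, 100), (160, 100)]
proportions = tally_zones(pairs)
print(proportions)
```

Because every pair lands in some zone, the proportions always sum to 1, which is what it means to account for all results.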

The CLSI EP21 story – The original CLSI total error standard used the Westgard model but had a requirement that outliers could not be discarded. Thus, if outliers were present that exceeded limits, the assay would fail the total error requirement – 100% of the results had to meet goals. In the revision of EP21, the statements about outliers were dropped and this simply became the Westgard model. The mountain plot, which was an alternative method in EP21, was also dropped in the revision.

Moreover, I argued that user error had to be included in the experimental setup. This too was rejected and the proposed title change from total analytical error to total error was rejected.


  1. Mandel J. The statistical analysis of experimental data Dover, New York 1964 p 105.
  2. Westgard, JO, Carey, RN, Wold, S. Criteria for judging precision and accuracy in method development and evaluation. Clin Chem. 1974;20:825-833
  3. Krouwer JS The danger of using total error models to compare glucose meter performance. Journal of Diabetes Science and Technology, 2014;8:419-421
  4. Lawton, WH, Sylvester, EA, Young-Ferraro, BJ. Statistical comparison of multiple analytic procedures: application to clinical chemistry. Technometrics. 1979;21:397-409.
  5. Krouwer JS Setting Performance Goals and Evaluating Total Analytical Error for Diagnostic Assays. Clin. Chem., 48: 919-927 (2002).
  6. Krouwer JS A Critique of the GUM Method of Estimating and Reporting Uncertainty in Diagnostic Assays Clin. Chem., 49:1818-1821 (2003)
  7. Krouwer JS The problem with total error models in establishing performance specifications and a simple remedy. Clinical Chemistry and Laboratory Medicine, 2016;54:1299-1301.
  8. Krouwer JS and Monti KL A Simple Graphical Method to Evaluate Laboratory Assays, Eur. J. Clin. Chem. and Clin. Biochem., 33, 525-527 (1995)

Calculating measurement uncertainty and GUM

October 16, 2017

A recent article (subscription required) suggests how to estimate measurement uncertainty for an assay to satisfy the requirements of ISO 15189.

As readers may know, I am a fan of neither ISO nor measurement uncertainty. The formal document, GUM – The Guide to the Expression of Uncertainty in Measurement, will make most clinical chemists’ heads spin. Let’s review how to estimate uncertainty according to GUM.

  1. Identify each item in an assay that can cause uncertainty and estimate its imprecision. For example a probe picks up some patient sample. The amount of sample taken varies due to imprecision of the sampling mechanism.
  2. Any bias found must be eliminated. There is imprecision in the elimination of the bias. Hence bias has been transformed into imprecision.
  3. Combine all sources of imprecision into a BHE (big hairy equation – my term, not GUM’s).
  4. The final estimate of uncertainty is governed by a coverage factor. Thus, an uncertainty interval for 99% is wider than one for 95%. Remember that an uncertainty interval for 100% is minus infinity to plus infinity.
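The steps above can be sketched for the simplest possible case, assuming that all components are independent, enter the result with unit sensitivity, and are already expressed as standard uncertainties (SDs) in the same units. The component names and values are made up; a real GUM budget is far more involved.

```python
import math

def expanded_uncertainty(components, k=2.0):
    """Combine standard uncertainties by root-sum-of-squares, then apply
    the coverage factor k. Returns (combined, expanded)."""
    combined = math.sqrt(sum(u * u for u in components.values()))
    return combined, k * combined

# Hypothetical uncertainty budget (all values are SDs in assay units)
components = {
    "sampling volume": 0.3,
    "reagent dispensing": 0.4,
    "bias correction": 1.2,   # step 2: the corrected bias now appears as imprecision
}

combined, expanded_95 = expanded_uncertainty(components, k=2.0)     # roughly 95%
_, expanded_99 = expanded_uncertainty(components, k=2.58)           # roughly 99%
print(f"combined = {combined:.2f}, U(95%) = {expanded_95:.2f}, U(99%) = {expanded_99:.2f}")
```

The coverage factors of 2 and 2.58 assume a Gaussian distribution; the interval for 99% is wider than the one for 95%, as step 4 says.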

The above Clin Chem Lab Med article calculates uncertainty by mathematically summing imprecision of controls and bias from external surveys. This is of course light years away from GUM. The fact that the authors call this measurement uncertainty could confuse some to think that this is the same as GUM.

Remember that in the authors’ approach, there are no patient samples. Thus, the opportunity for errors due to interferences has been eliminated. Moreover, patient samples can have errors that controls do not. Measurement uncertainty must include errors from the entire measurement process, not just the analytical error.

Perhaps the biggest problem is that a clinician may look at such an uncertainty interval as truth, when the likely true interval will be wider and sometimes much wider.

Antwerp talk about total error

March 12, 2017

Looking at my blog stats, I see that a lot of people are reading the total analytical error vs. total error post. So, below are the slides from a talk that I gave at a conference in Antwerp in 2016 called The “total” in total error. The slides have been updated. Because it is a talk, the slides are not as effective as the talk.




Published – my one man Milan Conference

March 23, 2016


Having read the consensus statement and all the papers from the Milan conference (available without subscription), I prepared my version of this for the Antwerp conference. This talk contained the following:

  • A description of why the Westgard model for total error is incomplete (with of course Jim Westgard sitting in the audience)
  • A description of why expanded total error models are nevertheless also incomplete
  • A critique of Boyd and Bruns’ glucose meter performance simulations using the Westgard model
  • A critique of the ISO and CLSI glucose meter specifications, both based on total error
  • A description of what the companies with most of the market share in glucose meters did, when they started to lose market share
  • How Ciba Corning specified and evaluated performance
  • What I currently recommend

I submitted a written version of this talk to Clin Chem and Lab Medicine, with recommended reviewers being Milan authors with whom I disagreed. (The journal asks authors to recommend reviewers). Now I don’t know who the reviewers were, but suffice it to say that they didn’t like my paper at all. So after several revisions, I scaled back my paper to its current version, which is here (subscription required).

Whining rewarded?

August 29, 2014


Looking at the table of contents of Clinical Chemistry for September, there is a list of the most downloaded point / counterpoint articles and I am number one on this list for my discussion of GUM (The guide to the expression of uncertainty in measurement).

Why GUM will never be enough

April 7, 2014


I occasionally come across articles that describe a method evaluation using GUM (Guide to the expression of Uncertainty in Measurement). These papers can be quite impressive with respect to the modeling that occurs. However, there is often a statement that relates the results to clinical acceptability. Here’s why there is a problem.

Clinical acceptability is usually not defined but is often implied to be a method’s performance that will not cause patient harm due to assay error.

A GUM analysis usually specifies the location for 95% of the results. But if the analysis shows that the assay just meets limits, then 5% of the results fall outside the limits and can cause patient harm. Now according to GUM models, the 5% will be close to limits because the data are assumed to be Gaussian, so this is a minor problem.

A bigger problem is that GUM analysis often ignores rare but large errors, such as a rare interference or something more insidious such as a user error that results in a large assay error. (Often GUM analyses don’t assess user error at all.) These large errors, while rare, are associated with major harm or death.

The remedy is to conduct an FMEA or fault tree analysis in addition to GUM, to brainstorm how large errors could occur and whether mitigations are in place to reduce their likelihood. Unless risk analysis is added to GUM, talking about clinical acceptability is misleading.

AACC 2012

July 19, 2012

I went early to AACC 2012 on a consulting assignment which ended two days before AACC. So the highlights of my trip were a helicopter tour of LA in an R22, a tour of Warner Bros. studios, and a walk on the Santa Monica pier.

At the meeting, I got to hear two plenary lectures, both on genomics. They were very interesting talks (Eric Green and Robert Roberts) and it was humbling to realize how little I know about genomics.

I also attended the Evaluations Protocol Area Committee, although I think it is now called a consensus committee. There were two projects that I had started: EP27 (error grids) and the revision to EP21A2 (total error). I had proposed and completed EP21A. As of January, I had been unexpectedly and rather unceremoniously kicked off both projects. There was little discussion about EP27 – it should be available around September 2012 with little changed. It will be interesting to see who the authors will be.

There was more discussion about EP21, including a proposal to drop it completely in favor of a GUM uncertainty analysis, which is covered by a CLSI document (C51). This is about when I had had enough and bolted from the meeting. Yet I did take comfort in knowing that the way projects are valued financially is something I put in place a while back, and is probably unknown to the current group.

I went to a talk about medical error. One thing that is always missing from these talks is a measure of overall patient harm – maybe the subject of a future blog entry.

I also talked with a very nice Spanish lady about her poster. She assayed a sample on each of two analyzers of the same type and compared the results to various total error goals including biological variation, CLIA, and others. The results were often outside of goals, especially biological variation goals, which makes one wonder if such goals are meaningful.