Minimum system accuracy performance criteria – part 2

February 13, 2019

I had occasion to read the ISO 15197:2013 standard on blood glucose meters, specifically Section 6.3.3, “minimum system accuracy performance criteria.”

Note that this accuracy requirement is what is typically cited as the accuracy requirement for glucose meters.

But the two Notes in this section say that testing meters with actual users is covered elsewhere in the document (Section 8). Thus, because of the protocol used, the system accuracy estimate does not account for all errors, since user errors are excluded. Hence, the system accuracy requirement is not the total error of the meter but rather a subset of total error.

Moreover, in the user test section, the acceptance goals are different from the system accuracy section!

OK, I get it. The authors of the standard want to separate two major error sources: error from the instrument and reagents (the system error) and error caused by users.

But there is no attempt to reconcile the two estimates. And if one considers the user test to be a total error test, which is reasonable (it includes both system accuracy and user error), then the percentage of results that must meet goals is 95%. The 99% requirement went poof.



Minimum system accuracy performance criteria

February 13, 2019

I had occasion to read the ISO 15197:2013 standard about blood glucose meters and was struck by the words “minimum system accuracy performance criteria” (6.3.3).

This reminds me of the movie “Office Space”, where Jennifer Aniston, who plays a waitress, is chastised for wearing just the minimum number of pieces of flair (buttons on her uniform). Sorry if you haven’t seen the movie.

Or of the time I participated in an earlier version of the CLSI method comparison standard EP9. The discussion at the time was about arriving at a minimum sample size. The A3 version says at least 40 samples should be run. I pointed out that 40 would become the default sample size.

Back to glucose meters. No one will report that they have met the minimum accuracy requirements. They will always report they have exceeded the accuracy requirements.


Review of setting goals (to determine if the estimated total error is acceptable)

February 7, 2019

The last post described ways to estimate total error. But the reason total error is estimated is to determine if it meets goals. This post describes how to set goals.

Consider the following scenario. A clinician is deciding on a treatment for a patient. Among the criteria used to make that decision are the patient’s history, the physical exam, and one or more blood tests or images. Given the other criteria and a specific blood test with value A, the clinician will decide on a treatment (which may include no treatment). Now assume the blood test’s value keeps diverging from value A. At some point, call it value B, the clinician will make a different treatment decision. If value B is an error, then it is reasonable to assume that an error of magnitude B−A is enough to cause the clinician to make the wrong medical decision. Thus, just under the magnitude B−A is a reasonable error limit. There are a bunch of other assumptions…

  1. The clinician’s decision conforms to acceptable medical practice.
  2. A wrong decision usually causes harm to the patient.
  3. Larger errors may cause different decisions leading to greater harm to the patient.
  4. Although all patients are unique, one can describe a “typical” patient for a disease.
  5. Although all clinicians are unique, most clinicians will make the same decision within a narrow enough distribution of errors so that one can use the average error as the limit.
  6. Given the X-Y space for the range of the test, where X=truth and Y=the candidate medical test, the entire space can be designated with error limits.
  7. It is common (given #6) that there will be multiple limits with different levels of patient harm throughout the range of the medical test.

All of the above can be satisfied by an error grid such as the glucose meter error grid. The error grid should work for any assay.

Note that many conventional error limits are not as comprehensive because …

  1. They use one limit for the entire range of the assay.
  2. They do not take into account greater harm for larger errors.
  3. They are not always based on patient results but on controls (e.g., CLIA limits).

Given the above discussion, setting limits using biological variability or state of the art is not relevant to answering the question of what magnitude of error will cause a clinician to make an incorrect medical decision. The only reasonable way to answer the question is to ask clinicians. An example of this was done for glucose meters (1).

A total error specification could easily be improved by adding to it:

  1. A limit for the average bias (2)
  2. A limit (greater than the total error limit) where there should be no observations, making the total error specification similar to an error grid.

Adding a limit for the average bias would also improve an error grid (3).
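As a sketch, the augmented total error specification described above could be checked as follows. The limit values and data are hypothetical illustrations, and spec_check is an invented helper, not an established procedure:

```python
# Hypothetical check of a total error specification augmented with an
# average-bias limit and an outer "no observations" limit.
# All limit values below are made-up illustrations, not published goals.

def spec_check(diffs, te_limit, outer_limit, bias_limit, coverage=0.95):
    """diffs: candidate-minus-reference differences (same units as the assay)."""
    n = len(diffs)
    within_te = sum(abs(d) <= te_limit for d in diffs) / n
    no_outliers = all(abs(d) <= outer_limit for d in diffs)
    mean_bias = sum(diffs) / n
    return {
        "meets_total_error": within_te >= coverage,   # e.g., 95% within the TE limit
        "meets_outer_limit": no_outliers,             # no result beyond the outer zone
        "meets_bias_limit": abs(mean_bias) <= bias_limit,
    }

result = spec_check([1.2, -0.8, 0.5, 3.9, -1.1, 0.2, 0.9, -0.4, 1.8, -2.0],
                    te_limit=4.0, outer_limit=8.0, bias_limit=1.0)
```

The outer limit makes the specification behave like a simple two-zone error grid: results beyond it fail outright, no matter how few they are.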


  1. Klonoff DC, Lias C, Vigersky R, et al. The surveillance error grid. J Diabetes Sci Technol. 2014;8:658-672.
  2. Klee GG, Schryver PG, Kisbeth RM. Analytic bias specifications based on the analysis of effects on performance of medical guidelines. Scand J Clin Lab Invest. 1999;59:509-512.
  3. Krouwer JS, Cembrowski GS. The chronic injury glucose error grid: a tool to reduce diabetes complications. J Diabetes Sci Technol. 2015;9:149-152.

Review of total error

February 6, 2019

History – Total error has probably been around for a long time, but the first mention that I found is from Mandel (1). In discussing measurement error, he wrote:

error = x – R = (x – mu) + (mu – R), where x = a measurement and R = the reference value

The term (x – mu) is the imprecision and (mu – R) is the inaccuracy. An implied assumption is that the errors are IIDN = independently and identically distributed in a normal distribution with mean zero and variance sigma squared. With laboratory assays of blood, this is almost never true.

Westgard model – The Westgard model of total error (2) is the same as Mandel’s; namely,

Total error TE = bias + 2 × imprecision.
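As a minimal numeric sketch of this model (the data and reference value are illustrative only):

```python
# Minimal sketch of the Westgard total error model: TE = |bias| + 2 * SD.
# The results and reference value are made-up numbers.
import statistics

def westgard_te(results, reference_value):
    bias = statistics.mean(results) - reference_value
    sd = statistics.stdev(results)
    return abs(bias) + 2 * sd

te = westgard_te([98, 101, 99, 102, 100], reference_value=100)
```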

The problem with this model is that it neglects other errors, with interfering substances affecting individual samples as perhaps the most important. Note that it is not just rare, large interferences that are missed in this model. I described a case where small interferences inflate the total error (3).

Lawton model – The Lawton model (4) adds interfering substances affecting individual samples.

Other factors – I added (5) to the Lawton model by including other factors such as drift, sample carryover, reagent carryover.

Here’s an example of a problem with the Westgard model. The model suggests that average bias accounts for systematic error and imprecision accounts for random error. Say you have an assay with linear drift within a 30-minute calibration cycle. The assay starts out with a negative bias, has zero bias at 15 minutes, and ends with a positive bias. The Westgard model would estimate zero bias for the systematic error and assign imprecision to the random error. But this is not right. There is clearly systematic bias (as a function of time), and the calculated imprecision (the SD of the observations) is not equal to the random error.
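This drift scenario can be simulated. The drift magnitude (±3 units) and random-error SD (1 unit) below are assumed values for illustration:

```python
# Sketch of the drift example: results drift linearly across a 30-minute
# calibration cycle. Drift magnitude and noise SD are assumed values.
import random
import statistics

random.seed(1)
truth = 100.0
# bias ramps linearly from -3 to +3 over the cycle (one result per minute)
drift = [-3 + 6 * t / 29 for t in range(30)]
results = [truth + d + random.gauss(0, 1) for d in drift]

errors = [r - truth for r in results]
mean_bias = statistics.mean(errors)      # near zero: the drift cancels on average
observed_sd = statistics.stdev(errors)   # well above 1: drift inflates "imprecision"
```

The average bias is near zero even though every single result is biased, and the SD of the observations mixes drift with random error, just as the text argues.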

The problem with Bland-Altman limits of agreement – In this method, one multiplies the SD of differences between the candidate method and reference (usually by 2). This is an improvement, since interferences and other error sources are included in the SD of differences. But the differences must be normally distributed, and outliers are allowed to be discarded. By discarding outliers, one cannot claim total error.
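A minimal sketch of the limits-of-agreement calculation, with made-up data:

```python
# Sketch of Bland-Altman limits of agreement: mean difference +/- 2 SD
# of the candidate-minus-reference differences. Data are illustrative.
import statistics

def limits_of_agreement(candidate, reference, k=2):
    diffs = [c - r for c, r in zip(candidate, reference)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)
    return mean_diff - k * sd_diff, mean_diff + k * sd_diff

low, high = limits_of_agreement([102, 98, 105, 99, 101],
                                [100, 100, 100, 100, 100])
```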

The problem with measurement uncertainty – The GUM method (Guide to the Expression of Uncertainty in Measurement) is a bottom-up approach that adds all errors as sources of imprecision. I have critiqued this method (6): bias is not allowed in the method, which does not seem to match what happens in the real world, and errors that cannot be modeled will not be captured.

The problem with probability models – Paradoxically, none of the above models can account for 100% of the results, which makes the term “total” in total error meaningless. These probability models will never account for 100% of the results because the 100% probability error limits stretch from minus infinity to plus infinity (7).

Errors that cannot be modeled – An additional problem is that there are errors that can occur but really can’t be modeled, such as user errors, software errors, manufacturing mistakes, and so on (7). The Bland-Altman method does not suffer from this problem, while all of the other methods above do.

A method to account for all results – The mountain plot (8) is simply a plot (or table) of differences of the candidate method from reference. No data are discarded. This is a nonparametric estimate of total error. A limitation is that error sources that are not part of the experiment may lead to an underestimate of total error.
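A rough sketch of the idea: read a nonparametric 95% interval directly off the sorted differences, discarding nothing. The data are made up, and the simple index-based percentile below is just one of several conventions:

```python
# Sketch of a nonparametric total error estimate in the spirit of the
# mountain plot: sort all differences (none discarded) and read off
# empirical percentiles. Data are illustrative.

def empirical_interval(diffs, coverage=0.95):
    """Read an empirical interval off the sorted differences."""
    s = sorted(diffs)
    n = len(s)
    lo_idx = int((1 - coverage) / 2 * (n - 1))
    hi_idx = int((1 + coverage) / 2 * (n - 1))
    return s[lo_idx], s[hi_idx]

diffs = [-9, -3, -2, -1, -1, 0, 0, 1, 1, 2, 2, 3, 4, 5, 14]  # outliers included
low, high = empirical_interval(diffs)
```

Because outliers stay in the data, they widen the interval instead of silently disappearing.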

Error Grid Analysis – One overlays a scatterplot from a method comparison on an error grid. The analysis is simply to tally the proportions of observations in each error grid zone. This analysis also accounts for all results.
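A sketch of the tally, assuming hypothetical zone boundaries (not the published Clarke or surveillance error grid definitions):

```python
# Sketch of error grid analysis: tally the fraction of (reference, meter)
# pairs falling in each zone. The zone cutoffs below are hypothetical.

def zone(ref, meas):
    """Assign a zone by relative error; cutoffs here are assumed."""
    rel_err = abs(meas - ref) / ref
    if rel_err <= 0.15:
        return "A"   # clinically accurate (assumed cutoff)
    if rel_err <= 0.30:
        return "B"   # benign error (assumed cutoff)
    return "C+"      # potentially harmful

pairs = [(100, 104), (60, 75), (200, 150), (90, 92), (150, 148)]
tally = {}
for ref, meas in pairs:
    z = zone(ref, meas)
    tally[z] = tally.get(z, 0) + 1
```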

The CLSI EP21 story – The original CLSI total error standard used the Westgard model but had a requirement that outliers could not be discarded; thus, if outliers were present that exceeded limits, the assay would fail the total error requirement – 100% of the results had to meet goals. In the revision of EP21, the statements about outliers were dropped, and the standard simply became the Westgard model. The mountain plot, which was an alternative method in EP21, was dropped in the revision.

Moreover, I argued that user error had to be included in the experimental setup. This too was rejected, and the proposed title change from “total analytical error” to “total error” was rejected as well.


  1. Mandel J. The Statistical Analysis of Experimental Data. Dover, New York, 1964, p 105.
  2. Westgard JO, Carey RN, Wold S. Criteria for judging precision and accuracy in method development and evaluation. Clin Chem. 1974;20:825-833.
  3. Krouwer JS. The danger of using total error models to compare glucose meter performance. J Diabetes Sci Technol. 2014;8:419-421.
  4. Lawton WH, Sylvester EA, Young-Ferraro BJ. Statistical comparison of multiple analytic procedures: application to clinical chemistry. Technometrics. 1979;21:397-409.
  5. Krouwer JS. Setting performance goals and evaluating total analytical error for diagnostic assays. Clin Chem. 2002;48:919-927.
  6. Krouwer JS. A critique of the GUM method of estimating and reporting uncertainty in diagnostic assays. Clin Chem. 2003;49:1818-1821.
  7. Krouwer JS. The problem with total error models in establishing performance specifications and a simple remedy. Clin Chem Lab Med. 2016;54:1299-1301.
  8. Krouwer JS, Monti KL. A simple graphical method to evaluate laboratory assays. Eur J Clin Chem Clin Biochem. 1995;33:525-527.

New FDA Glucose meter draft guidelines (November 2018)

January 31, 2019

The FDA continues to dis the ISO 15197 standard in both their POC and lay user (over the counter) proposed guidelines:

POC: “Although many manufacturers design their BGMS validation studies based on the International Standards Organizations document 15197: In vitro diagnostic test systems—Requirements for blood glucose monitoring systems for self-testing in managing diabetes mellitus, FDA believes that the criteria set forth in the ISO 15197 standard do not adequately protect patients using BGMSs in professional settings, and does not recommend using the criteria in ISO 15197 for BGMSs.”

The POC accuracy criteria are:

95% within ±12 mg/dL for results <75 mg/dL and within ±12% for results ≥75 mg/dL
98% within ±15 mg/dL for results <75 mg/dL and within ±15% for results ≥75 mg/dL
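As a sketch, these two-tier criteria could be checked like this. The data and the meets_poc_criteria helper are illustrative, and the treatment of results exactly at 75 mg/dL is an assumption:

```python
# Sketch of a check against the draft FDA POC accuracy criteria quoted above.
# The (reference, meter) pairs are made-up data.

def within(ref, meas, abs_limit, pct_limit, cutoff=75):
    if ref < cutoff:
        return abs(meas - ref) <= abs_limit            # mg/dL limit below 75
    return abs(meas - ref) <= pct_limit / 100 * ref    # percent limit at/above 75

def meets_poc_criteria(pairs):
    """pairs: (reference, meter) glucose values in mg/dL."""
    n = len(pairs)
    tier1 = sum(within(r, m, 12, 12) for r, m in pairs) / n
    tier2 = sum(within(r, m, 15, 15) for r, m in pairs) / n
    return tier1 >= 0.95 and tier2 >= 0.98

ok = meets_poc_criteria([(50, 55), (80, 85), (120, 110), (200, 190), (65, 60)])
```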

Over the counter: “FDA believes that the criteria set forth in the ISO 15197 standard are not sufficient to adequately protect lay-users using SMBGs; therefore, FDA recommends performing studies to support 510(k) clearance of a SMBG according to the recommendations below.”

The over the counter accuracy criteria are:

95% within +/- 15% over the entire claimed range
99% within +/- 20% over the entire claimed range

To recall, the ISO 15197:2013 accuracy criteria are:

95% within ±15 mg/dL for results <100 mg/dL

95% within ±15% for results ≥100 mg/dL
99% within the A and B zones of a glucose meter error grid

Establishing QC mean on multiple instruments

January 24, 2019

I sent my first post to the AACC Artery on the title’s topic. My post concerns some of the comments I saw, such as: “QC limits were established based on total allowable error for each analyte.”

Two important questions for a lab are:

  1. Is the process in control?
  2. Are the patient results medically acceptable?

QC can answer the first question; it cannot answer the second.

Results can be viewed in a 2×2 table:

|                               | Patient results are medically acceptable | Patient results are not medically acceptable |
| The process is in control     | Case 1                                   | Case 2                                       |
| The process is not in control | Case 3                                   | Case 4                                       |
An example of case 2 is the first generation of troponin assays in the 1990s. The American and European cardiology societies specified performance for troponin assays, and no assay met that performance. Hence, these were assays that were in control but failed to provide medically acceptable results.

Changing QC limits does nothing.

Review of why FMEA is so difficult

January 20, 2019

FMEA stands for Failure Mode Effects Analysis
FRACAS stands for Failure Reporting and Corrective Action System

Definitions – FMEA is a process to reduce the risk of undesirable events. FRACAS is a process to reduce the rate of undesirable events. Hence FMEA failures haven’t happened yet, while FRACAS failures have. FMEA failures are often much more severe than FRACAS failures.

Not a FMEA – When someone says they have reduced the rate of failures for a process, they have performed FRACAS, not FMEA.

Lack of management support – Companies make money by selling products and services and also by reducing costs. FMEA does neither. It consumes resources while reducing the risk of future costs (those caused when failures occur). A nonprofit has to worry about the same issues: if it loses money, it will have to reduce services.

The following is not a compelling appeal to management for resources – “There are some catastrophic failures that have never happened. We can spend some money to make these events even less likely”.

I participated in a company FMEA that was always held during lunch. The reason given by management was that people had more important tasks to do during regular working hours.

Lack of a facilitator – The purpose of a FMEA is to question the design of a product or process. Often, the designer is present. A facilitator can prevent an adversarial confrontation.

Lack of interest – People who design medical instruments like to design. It’s a challenge to motivate them to perform tasks other than design.

Insufficient detail – FMEA is a bottom-up approach and requires listing all process steps. (Fault trees are a top-down approach.) Mapping out a process requires inclusion of all relevant events, and providing insufficient detail is a problem. For example, a technician examines a clinical sample before it is analyzed; an additional branch could cover how this person is hired, trained, and so on.

Stopping at the status quo – As one conducts a FMEA, one lists the possible failure modes for each step, the effect of each failure, and the mitigation for each failure. Example: a process step is to test whether a sample is hemolyzed, the effect (of a failure) is that potassium is artificially elevated, and the existing mitigation is to rely on the instrument’s automatic flagging of hemolyzed samples. One might conclude that’s it – no need to do anything further. But all that’s been done is to describe the existing process. This is not FMEA. One must ask questions such as: how can the instrument’s flagging system fail?

Acceptable risk is hard to quantify – One can never have a zero failure risk. For example, for blood gas, which is a stat assay, one remedy to mitigate an instrument failure is to use two instruments. One can estimate the probability of both instruments failing at the same time. One can add a third instrument and so on, always lowering the risk, but it is never zero. Mitigations cost money, so one must make a tradeoff between cost and risk.
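The redundancy argument can be made concrete: assuming instruments fail independently with per-instrument probability p (an assumed value), the risk that all n fail together is p raised to the nth power:

```python
# Sketch of the redundancy tradeoff: with independent failures at
# probability p per instrument (assumed value), the chance that all n
# instruments are down at once is p**n - small, but never zero.

def simultaneous_failure_risk(p, n):
    """Risk that all n independent instruments are failed at the same time."""
    return p ** n

p = 0.01  # assumed per-instrument failure probability at any given moment
risks = [simultaneous_failure_risk(p, n) for n in (1, 2, 3)]
# each added instrument cuts the risk by a factor of p, but it never reaches zero
```

Each added instrument buys a factor-of-p risk reduction at roughly linear cost, which is exactly the cost-versus-risk tradeoff described above.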