An article with the same title as this post has been accepted for publication in the journal Accreditation and Quality Assurance. The article describes common biases and how they might be avoided.
I had a chance to look at the revision of EP21 – the document about total error that I proposed and chaired. So after 12 years, here are the major changes.
In the original EP21, I realized that even if 95% of the results met goals, the remaining 5% might not, so there was a table that accounted for them. An acceptable assay had to have 100% of its results within goals. The revised EP21 – call it A2 – talks only about 95% of results (similar to the 2003 ISO glucose meter standard). There is no longer any mention of the remaining 5%; these results are left unspecified. This fits my view that manufacturers will refuse to talk about assay results that can cause severe injury or death. Thus, if 95% of the results just meet goals, a portion of the remaining 5% could cause severe injury or death, and even if that portion is a small percentage, it could be a big number of results (as one example, there are 8 billion glucose meter results each year in the US).
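To make the scale concrete, here is a back-of-the-envelope sketch. The 8 billion annual results and the 95% figure come from the paragraph above. The shares of the unspecified results assumed to be dangerous are purely hypothetical placeholders chosen for illustration, not estimates.

```python
# Back-of-the-envelope count of results that a 95% requirement leaves unspecified.
ANNUAL_RESULTS = 8_000_000_000  # US glucose meter results per year (from the text)
FRACTION_SPECIFIED = 0.95       # share of results the goal applies to (from the text)


def unspecified_results(total, fraction_specified):
    """Number of results about which a 95%-style goal says nothing."""
    return round(total * (1 - fraction_specified))


def potentially_harmful(total, fraction_specified, harmful_share):
    """Results that could cause serious harm, given a HYPOTHETICAL share
    of the unspecified results that are dangerous."""
    return round(unspecified_results(total, fraction_specified) * harmful_share)


print("unspecified results per year:",
      unspecified_results(ANNUAL_RESULTS, FRACTION_SPECIFIED))

# Hypothetical shares, for illustration only.
for share in (1e-3, 1e-5):
    count = potentially_harmful(ANNUAL_RESULTS, FRACTION_SPECIFIED, share)
    print(f"if {share:.3%} of those are dangerous: {count} results")
```

Even with a very small assumed share, the count stays in the thousands, which is the point of the paragraph above.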
The mountain plot and all references to it are gone in A2. To recall, the mountain plot is ideal for visualizing outlier observations. In fact, there could be 10,000 observations, but if there were 5 outliers, they would be clearly visible. In place of the mountain plot, there is a histogram, illustrated by an example with normal-looking results; the example that had outliers is gone. And the histogram has only 9 bins, so if there were outliers, they would disappear. So again, this is a way to minimize talking about results which can cause major problems.
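The difference between the two displays can be sketched in a few lines. This is a minimal illustration, assuming the usual construction of a mountain plot as a folded empirical cumulative distribution (percentiles above 50 are replaced by 100 minus the percentile). The data are simulated and the five outlier values are made up; nothing here is taken from EP21 itself.

```python
import random


def mountain_points(differences):
    """Return (difference, folded percentile) pairs for a mountain plot.

    Percentiles above 50 are folded to 100 - percentile, so the curve
    peaks near the median and every extreme value is its own point in a tail.
    """
    ordered = sorted(differences)
    n = len(ordered)
    points = []
    for rank, value in enumerate(ordered, start=1):
        pct = 100.0 * rank / n
        points.append((value, pct if pct <= 50 else 100.0 - pct))
    return points


def histogram_counts(differences, bins=9):
    """Counts for an equal-width histogram spanning min..max of the data."""
    lo, hi = min(differences), max(differences)
    width = (hi - lo) / bins
    counts = [0] * bins
    for value in differences:
        index = min(int((value - lo) / width), bins - 1)
        counts[index] += 1
    return counts


random.seed(1)
# Simulated differences: 9,995 well-behaved results plus 5 made-up outliers.
diffs = [random.gauss(0.0, 5.0) for _ in range(9_995)]
diffs += [60.0, 75.0, -80.0, 90.0, 100.0]

counts = histogram_counts(diffs, bins=9)
points = mountain_points(diffs)

print("histogram counts:", counts)
print("left tail of mountain plot:", points[:1])
print("right tail of mountain plot:", points[-4:])
```

In the 9-bin histogram the outliers land in bars holding one or two results next to bars holding thousands, so on a linear axis they have effectively no height. In the mountain plot each outlier is a separate point far out along the horizontal axis, so the tails visibly stretch to -80 and 100.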
Somehow, sigma metrics have become part of A2. How this happened is a mystery. Perhaps someone can explain it to me. I understand the equation Total error = |bias| + 2 × imprecision, but the total error in EP21 is the difference between candidate and comparison assays, and this difference can’t be separated into bias and imprecision.
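The contrast can be sketched as follows. The first function is the equation quoted above and needs separate bias and imprecision estimates. The second works only from paired candidate and comparison results and returns limits that enclose the central 95% of the differences. The nearest-rank percentile rule and all the numbers are assumptions made for illustration; this is not presented as the exact procedure in EP21.

```python
import math


def modeled_total_error(bias, imprecision_sd):
    """Total error = |bias| + 2 x imprecision (the equation quoted above)."""
    return abs(bias) + 2.0 * imprecision_sd


def nearest_rank_percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted list (0 < pct <= 100)."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100.0))
    return sorted_values[rank - 1]


def empirical_error_limits(candidate, comparison, coverage=95.0):
    """Limits enclosing the central `coverage` percent of the paired
    differences (candidate - comparison). No bias or imprecision term
    appears anywhere: only the observed differences are used."""
    diffs = sorted(c - r for c, r in zip(candidate, comparison))
    tail = (100.0 - coverage) / 2.0
    return (nearest_rank_percentile(diffs, tail),
            nearest_rank_percentile(diffs, 100.0 - tail))


# Made-up inputs for illustration only.
print(modeled_total_error(bias=-2.0, imprecision_sd=3.0))

comparison = [100.0] * 40
candidate = [100.0 + d for d in range(-20, 20)]
print(empirical_error_limits(candidate, comparison))
```

The second function has nothing to plug into a sigma metric, since its only inputs are the differences themselves.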
And then there is the section on distinguishing between total error and total analytical error. This is part of the reason I was booted out of CLSI. A2 is constrained to include only analytical error.
Total error, including all sources of variation, is the only thing that matters to clinicians. The total error experiment (e.g., EP21) will include errors from only those sources that are sampled. Practically speaking, the sources will be limited, even for analytical error. For example, even if more than one reagent is used, this is not the same as randomly sampling from the population of all reagents during the lifetime of the device, which is impossible since it involves future reagents that don’t yet exist. The same is true for pre- and post-analytical error, but the point is that one should not exclude pre- and post-analytical error sources from the experiment.
There is a section on various ways to establish goals. Examples shown are the ISO, CLSI, and NACB glucose meter standards, which have performance goals for glucose meters. A2 talks about the strengths and weaknesses of using expert bodies to create these standards. Now A2 has a reference from May of 2015, but somehow they missed the FDA draft guidance on glucose meters (January 2014), which, unlike the examples cited in A2, wants evaluators to account for 100% of the data. And FDA’s opinion about the ISO glucose meter standard is pretty clear:
Although many manufacturers design their BGMS validation studies based on the International Standards Organizations document 15197, FDA believes that the criteria set forth in the ISO 15197 standard do not adequately protect patients using BGMS devices in professional settings, and does not recommend using these criteria for BGMS devices.
I have published a critique of the CLSI glucose meter standard, which is available here.
When I was chair holder of the Evaluations Protocol Committee, there were battles between regulatory affairs people, who populated the manufacturing contingent, and the rest of the committee. For example, I remember one such battle over EP6, the linearity document. The proposed new version finally had a sensible statistical method to evaluate nonlinearity, but one regulatory affairs member insisted on having an optional procedure where one could just graph the data and look at it to declare whether it was linear. After many delays, this optional procedure was rejected.
Judging from the new version of EP21, my sense is that the regulatory affairs view now dominates these committees.
Having mentioned in my first blog entry, “Total Error and Milan”, that clinician surveys were dropped as a means of constructing performance specifications, I looked at the published paper on this topic. Many of the citations are from the 80s. There’s nothing wrong with that, but I was surprised to see that a recent paper on glucose meter performance specifications, which is here and was available before the Milan conference, was not cited. In this glucose paper, 206 clinicians were surveyed using 4 scenarios and asked for the range of glucose levels that would correspond to each of 5 types of action: (A) emergency treatment for low BG; (B) take oral glucose; (C) no action needed; (D) take insulin; and (E) emergency treatment for high BG.
Maybe if the Milan conference had been aware of this work, they would have added clinician surveys as a primary means to establish performance specifications.
So one of the articles of interest to me was the one that describes using simulation to set performance goals. It is here.
And sure enough, this article refers to the glucose meter simulations originally published by Boyd and Bruns and continued by them and others, which I have critiqued over the years.
An article that I wrote which shows why such a model can be misleading is now available without subscription and is here.
And another letter by me – published after the Milan conference – is here (subscription required). This makes three articles I have published showing that the Boyd and Bruns model is incomplete and misleading.
Recently, I talked to someone who attended a conference on total error in Milan. Had I known about the conference or been invited, I would have attended. Searching the web, the Westgard web site has summaries and links about this conference. So here are my comments:
- The use of allowable performance specifications implies a set of limits that demarcate no harm from harm. This further implies that for many analytes, results that just exceed the limit will cause minor harm. But for many analytes, harm increases as the error increases (such as for glucose meter errors). Thus, small errors may result in minor harm and large errors can be fatal. This can be accounted for by using an error grid (such as a glucose meter error grid) which has separate zones for increasing error and harm.
- The allowable performance specifications are for analytical performance. Although pre- and post-analytical errors are mentioned, there is no attempt to present allowable performance specifications that include all sources of error. Instead, the consensus statement says, “The SPC encourages users to expand those specifications [referring to analytical performance specifications] to the total examination process.” This is not something that should be a user exercise.
- The primary method for establishing allowable analytical performance specifications is: “Based on the effect of analytical performance on clinical outcomes.” For this item, it is interesting to compare the unofficial summary from the Westgard site with the official summary. Note that IMHO, the most important method, a clinician survey, has been dropped in the official version.
- Also problematic is the suggestion in #2 below to use simulations. For glucose meter modeling, I have published on how misleading these simulations have been.
In order to develop quality specifications using outcomes, you must complete one of the following:
- an Outcome study investigating the impact of analytical performance on clinical outcomes
- a Simulation study investigating the impact of analytical performance on the probability of clinical outcomes
- a Survey of clinicians’ and/or experts’ opinion investigating the impact of analytical performance on medical decisions
This can, in principle, be done using different types of studies:
- Direct outcome studies – investigating the impact of analytical performance of the test on clinical outcomes;
- Indirect outcome studies – investigating the impact of analytical performance of the test on clinical classifications or decisions and thereby on the probability of patient outcomes, e.g., by simulation or decision analysis.