FDA has issued a guidance document on waiver applications. This is still a proposed guideline – however, it is in many ways a radical departure from previous regulations. Overall it is a tremendous guideline. One simple comment is that it could be envisioned to apply to all diagnostic assays, not just waiver assays.
Note: The released version of the guidance is here.
Regression analysis is basically relegated to a section that comes right after descriptive statistics. So rather than regression being the main analysis method, it is simply part of the background.
Total error and outlier analysis
This is the main addition relative to previous guidance and follows my recommendations (1) as well as relying on the CLSI (formerly NCCLS) standard EP21A (2).
CLSI does not like authors to be listed with their standards because the standards are consensus based. I proposed the EP21A standard and was the primary author. The EP21 standard was improved by the consensus process over something that I could have written by myself.
Basically, I have argued that asking for estimates of average bias (from regression) and imprecision are not sufficient to evaluate assay performance since there can be large positive and negative differences from reference which will average to little bias and hence these errors will not be detected. Since these erroneous results may replicate well, imprecision also does not detect these large errors. The use of total analytical error and outlier analysis does capture these problems.
FDA calls allowable total error limits ATE and the limits for outliers LER (limits for erroneous results).
FDA goes one step further than my recommendations; namely, it requests a Parkes (or Clarke) type glucose error grid for the assay being evaluated. This makes the LER zones specific to those cases that are likely to cause patient harm.
The guidance could be made a little clearer by noting that besides the ATE and LER zones, other zones are implied; namely, the zones that are neither ATE nor LER. In the glucose example in the guidance document, these are zones B and C. Thus, one wants a high proportion of results (often 95%) in the ATE zone and a low or zero proportion of results in the LER zone. This means that about 5% of results could fall in this unnamed zone (B and C in the glucose figure). Whereas these errors fall outside of the medically acceptable limits of the ATE zone, they are close enough to the ATE zone that harm to the patient should be limited.
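To make the zone bookkeeping concrete, here is a minimal sketch. The flat difference cutoffs below are illustrative placeholders of my own, not the guidance's actual boundaries; real ATE and LER zones (as in a glucose error grid) vary with concentration.

```python
from collections import Counter

def classify(cm_value, wm_value, ate_limit=10.0, ler_limit=30.0):
    """Zone for one (comparative method, waiver method) result pair.

    ate_limit and ler_limit are hypothetical flat cutoffs on the
    absolute difference; an actual study would use error-grid zones.
    """
    diff = abs(wm_value - cm_value)
    if diff <= ate_limit:
        return "ATE"          # within medically acceptable limits
    elif diff < ler_limit:
        return "in-between"   # outside ATE, but harm should be limited
    else:
        return "LER"          # likely to cause patient harm

# Four illustrative (CM, WM) result pairs
pairs = [(100, 105), (100, 118), (100, 140), (100, 96)]
counts = Counter(classify(cm, wm) for cm, wm in pairs)
# counts holds the tally per zone: 2 ATE, 1 in-between, 1 LER
```

One then reports the proportion of results in each zone, wanting the ATE proportion high and the LER proportion at or near zero.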
Some problems with the guidance document
Specimen Collection and Sample Preparation p20
Although it is desirable and logical to use an intended population of patients, this can also cause a poor distribution of results. The guidance says (paraphrased but almost verbatim):
Get 360 samples from consecutive patients with the patient samples spanning the measuring range of the device such that when samples are divided into low, medium and high medically relevant intervals (according to the CM [reference] values) there are approximately the same number of patients in each range. Actual patient specimens provide the best assessment. However, in some cases one may substitute, or supplement, up to 60 actual patient specimens with spiked or otherwise contrived matrix-specific specimens.
This makes little sense as “unspiked samples” and “spanning the range” are often mutually exclusive. I think it could be argued that few assays provide the FDA desired distribution of concentrations with unaltered patient specimens. It is much more common that most samples will be below the medical decision point accompanied by a few high samples. Spiking is only allowed for 60/360 = 16.7% of the samples and in many cases, this won’t be enough to get the desired distribution.
Spiking is either ok or not, meaning that spiking either does or does not cause bias in the study. (If spiking causes bias, no spiking should be allowed.) If spiking is ok, the FDA should allow enough spiking to produce the desired distribution; for example, the allowed level of spiking should be at least 60%, or 216 of the 360 samples.
The guidance talks about “contrived matrix-specific specimens.” I don’t know what this means. To me, spiking means taking individual patient samples and adding analyte in a suitable manner. In this way, any interferences that are in individual patient samples will still have the same chance to influence results as if there were no spiking.
This section ends with – if more than 60 samples need to be spiked, contact the FDA – which seems burdensome.
(d) For each interval (low, medium, and high), calculate the mean of these differences. p22
Whereas total error (ATE) and outlier percentages (LER) are estimated, one element is missing; namely, even if an assay passes total error (ATE) and outlier (LER) analysis, as Klee has shown (3), small shifts in average bias can cause diagnostic misclassifications. Therefore, a more complete guidance would ask for average difference goals and use the mean differences calculated here to determine whether those goals are met.
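As a sketch of step (d), the per-interval mean differences could be computed as follows. The interval cutoffs are hypothetical; a real protocol would define the low, medium, and high medically relevant intervals per analyte.

```python
def mean_difference_by_interval(cm, wm, low_cutoff, high_cutoff):
    """Mean of the WM - CM differences within each interval,
    where intervals are assigned by the CM (reference) value.
    Cutoffs are illustrative placeholders."""
    sums = {"low": [0.0, 0], "medium": [0.0, 0], "high": [0.0, 0]}
    for c, w in zip(cm, wm):
        if c < low_cutoff:
            key = "low"
        elif c < high_cutoff:
            key = "medium"
        else:
            key = "high"
        sums[key][0] += w - c   # accumulate the signed difference
        sums[key][1] += 1       # count samples in the interval
    return {k: (s / n if n else None) for k, (s, n) in sums.items()}

# Illustrative paired results (CM reference values, WM values)
cm = [60, 65, 100, 110, 200, 210]
wm = [62, 64, 103, 108, 205, 215]
bias = mean_difference_by_interval(cm, wm, low_cutoff=80, high_cutoff=150)
# bias gives the average difference per interval, e.g. 5.0 for "high"
```

These per-interval means are exactly the quantities one would compare against average difference goals.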
Establishing the ATE zone p24
The guidance document cites CLIA proficiency survey limits as a means for establishing the ATE limits. These limits may or may not be applicable. For those assays that have CLIA limits stated as target value ± 3 SD (such as TSH), using the CLIA limits makes no sense. This is because the limits depend on the precision of the assay, not some medical criteria. From a proficiency survey standpoint, this is not a problem since the goal of a proficiency survey is to identify outlier labs from peers. Putting things another way, the ATE limits are not statistical limits; they are medically acceptable limits and could be based on many criteria, and actual assay data might not meet those criteria. This has been described in process capability terms (4), whereby for some assays, regardless of the distribution of assay results, there can be a significant percentage of assay results that are medically unacceptable. Note that troponin assays, for example, fail ESC / ACC guidelines for imprecision.
ATE reports p24
It is requested to report the percentage of observations for the WM (waiver method) that fall within the ATE zone for all data, with an exact 95% lower confidence bound.
The FDA suggests that the 95% lower confidence bound should exceed 93%. Actually, using exact methods (5), if the estimated proportion is 0.95, and there are 360 observations, I get:
342/360 = 95.000%, lower 95% confidence bound = 92.7%
343/360 = 95.278%, lower 95% confidence bound = 93.0%
So 92.7% rounds to 93%, but why is this calculation needed – the FDA has already done this and presumably this is part of the reason why 360 samples are requested.
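These exact bounds can be reproduced with a short calculation. The sketch below solves for the one-sided exact (Clopper-Pearson) lower confidence bound by bisection on the binomial upper tail, using only the standard library; it is my own implementation, not anything prescribed by the guidance.

```python
import math

def exact_lower_bound(k, n, conf=0.95):
    """One-sided exact (Clopper-Pearson) lower confidence bound for a
    binomial proportion: the p at which P(X >= k | n, p) = 1 - conf."""
    alpha = 1.0 - conf

    def upper_tail(p):
        # P(X >= k) for X ~ Binomial(n, p); increasing in p
        return sum(math.comb(n, x) * p**x * (1 - p)**(n - x)
                   for x in range(k, n + 1))

    lo, hi = 0.0, 1.0
    for _ in range(60):           # bisection to float precision
        mid = (lo + hi) / 2
        if upper_tail(mid) < alpha:
            lo = mid
        else:
            hi = mid
    return lo

lb_342 = exact_lower_bound(342, 360)   # about 0.927
lb_343 = exact_lower_bound(343, 360)   # about 0.930
```

Rounded to one decimal in percent, these match the 92.7% and 93.0% figures above.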
LER reports p25
The guidance document is at best confusing here. Unlike the ATE zone, where one wants a high percentage of results to fall, in the LER zone one wants zero or as few results as possible. Therefore, the estimated percentage of results within the LER zone will often be zero. For a confidence bound, one wants an upper, not a lower, bound!
Alternatively, based on what the guidance says, the document should ask for the percentage of results that fall into all zones except the LER zone. Then one would want a lower bound as one would often have 100%.
A final comment about LER results
The guidance document really does not protect against serious harm to patients from incorrect assay results. This is not a flaw in the document but rather a consequence of the fact that it is extremely difficult to prove that rare events don’t happen. Thus, for 360 samples, one may have shown (estimated) that the rate of results that fall within the LER zone is zero. However, the upper 95% confidence bound for this result is 1%. This means that for each million results reported, it is possible that up to 10,000 results will be in the region that is likely to cause patient harm (1). Some sample sizes needed to reduce this 10,000 figure are given in reference 6.
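For zero observed LER results out of n samples, the exact one-sided upper bound has a closed form, 1 - alpha^(1/n); for n = 360 it comes to about 0.83%, which rounds up to the ~1% cited above. A sketch, including the sample size one would need (still with zero observed LER results) to tighten the bound to a given target rate:

```python
import math

def upper_bound_zero_events(n, conf=0.95):
    """Exact one-sided upper confidence bound on the event rate
    when 0 events are observed in n trials: 1 - alpha**(1/n)."""
    alpha = 1.0 - conf
    return 1.0 - alpha ** (1.0 / n)

def n_for_upper_bound(target, conf=0.95):
    """Samples needed (with 0 observed events) to push the exact
    upper bound below a target rate."""
    alpha = 1.0 - conf
    return math.ceil(math.log(alpha) / math.log(1.0 - target))

ub = upper_bound_zero_events(360)      # about 0.0083, i.e. ~1%
n_needed = n_for_upper_bound(0.0001)   # ~30,000 samples for a 0.01% bound
```

The steep growth of `n_needed` as the target rate shrinks is exactly why proving that rare harmful results don’t happen is so difficult.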
Total error and GUM
There is much talk about GUM – the guide to the expression of uncertainty in measurement – including using it for commercial diagnostic assays, to which I have objected (7). Estimating uncertainty intervals using GUM is extremely difficult, yet what the FDA is asking for – the proportion of results that fall within the ATE zone – is the same as a GUM uncertainty interval, although the methods of arriving at that proportion differ. Not only is the FDA’s method of calculating that proportion easy, it requires none of the modeling and assumptions used in the GUM method, which may be wrong.
- Krouwer JS. Setting Performance Goals and Evaluating Total Analytical Error for Diagnostic Assays. Clin Chem 2002;48:919-927.
- Estimation of Total Analytical Error for Clinical Laboratory Methods; Approved Guideline. NCCLS EP21A, NCCLS, 940 West Valley Road, Suite 1400, Wayne, PA, 2003.
- Klee GG, Schryver PG, Kisbeth RM. Analytic bias specifications based on the analysis of effects on performance of medical guidelines. Scand J Clin Lab Invest 1999;59:509-512.
- Assay Development and Evaluation: A Manufacturer’s Perspective. Jan S. Krouwer, AACC Press, Washington DC, 2002 pp 96-100.
- Hahn GJ, Meeker WQ. Statistical Intervals: A Guide for Practitioners. Wiley, New York, 1991, pp. 104-105.
- How to specify and estimate outlier rates. Web essay at http://krouwerconsulting.com/Essays/Outliers.htm
- Krouwer JS. A Critique of the GUM Method of Estimating and Reporting Uncertainty in Diagnostic Assays. Clin Chem 2003;49:1218-1221.