Ideally, the goal in evaluating an assay is to determine the population of differences between the candidate assay and truth for the analyte over the life of the candidate assay. This is not directly attainable, because it would require assaying every patient sample with a definitive reference method as well. So one takes a small sample (say, 100 patient samples) to estimate these differences, and one usually uses a comparative assay rather than a definitive reference assay.
So far there is nothing wrong with the above, but here’s where things go bad. In many cases, people run the evaluation experiment in a way that is far from how the assay will be run routinely. To a certain extent this is unavoidable. For example, the results of an evaluation experiment are not sent to clinicians, because the assay is not yet in use. However, it is easy to depart from routine use in ways that could have been avoided. For example, a glucose meter that is designed for nurses to use with fingerstick samples might instead be run with venous samples, perhaps because the fingerstick procedure would cause more error and one wishes to observe just the “analytical” properties of the assay. But the experiment then no longer answers the question set forth in the goal, because a potential source of error has been removed from the evaluation.
Another problem is how the results are analyzed. I have argued that the only meaningful analysis is an error grid analysis, yet other analyses persist, such as estimating total error by adding two times the imprecision to the average bias, or calculating six sigma metrics.
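For readers unfamiliar with the two alternative analyses named above, here is a minimal Python sketch of how they are commonly calculated. The function names and all numbers (bias, SD, and the allowable total error `tea`) are hypothetical and chosen only for illustration. The sigma metric uses the commonly quoted form (TEa - |bias|) / SD, which is an assumption about what "six sigma metrics" means here.

```python
def total_error(bias, sd, k=2.0):
    """Total error estimate: |average bias| plus k times the imprecision (SD)."""
    return abs(bias) + k * sd


def sigma_metric(bias, sd, tea):
    """Sigma metric: number of SDs between the bias and the allowable limit."""
    return (tea - abs(bias)) / sd


# Hypothetical glucose example, all values in mg/dL (assumed, not from any study).
bias = 2.0   # average difference from the comparative assay
sd = 3.0     # imprecision
tea = 15.0   # allowable total error chosen by the laboratory

te = total_error(bias, sd)
sigma = sigma_metric(bias, sd, tea)
print(f"total error = {te:.1f} mg/dL, sigma metric = {sigma:.2f}")
```

Both calculations reduce the evaluation to one number built from an average and a standard deviation. Neither reports where an individual result falls relative to clinically harmful zones, which is the information an error grid provides.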
However, there is an even bigger issue. Say one runs 100 patient samples and estimates that the candidate assay will be used for one million patient samples. The experiment then samples 0.01% of the population. The issue is how to interpret the results of this 100-sample experiment. If the results are bad, one should question the acceptability of the assay. If the results are good, however, one cannot say much. The experiment should still be done, and it is nice to know the results are good, but more is needed.
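To put a rough number on "one cannot say much", the sketch below computes the one-sided 95% upper confidence bound on the rate of unacceptable results when none are observed in 100 samples. It assumes the samples are independent and representative of routine use, which is exactly what the preceding paragraphs call into question, so the bound is optimistic. The function name and the 95% confidence level are illustrative choices, not part of any standard.

```python
def upper_bound_failure_rate(n_samples, confidence=0.95):
    """Exact one-sided upper bound on the failure rate when zero failures
    are observed in n_samples independent trials.

    Solves (1 - p) ** n_samples = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_samples)


n_evaluated = 100          # samples in the evaluation
n_lifetime = 1_000_000     # estimated samples over the life of the assay

fraction_sampled = n_evaluated / n_lifetime
upper = upper_bound_failure_rate(n_evaluated)
possible_bad_results = upper * n_lifetime

print(f"fraction of population sampled: {fraction_sampled:.2%}")
print(f"95% upper bound on failure rate: {upper:.2%}")
print(f"bad results still consistent with a clean study: {possible_bad_results:,.0f}")
```

Under these assumptions, a study with no unacceptable results in 100 samples is still consistent with a failure rate of nearly 3%, or tens of thousands of unacceptable results over one million samples.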
To understand what else is needed, consider elements that have definitely or probably not been tested in the 100-sample experiment, using glucose as an example:
- Different interfering substances (some may have been present) including extremes of hematocrit
- Different lots of reagents, age of reagents, storage of reagents
- Different environmental conditions (temperature, humidity)
- Different operators with representative skill levels
- Evaluating the software
- Determining the percentage of times the assay fails to provide a result
- And so on
There are two ways this information can be assessed. The first is for the manufacturer to perform special studies such as factorial experiments, software evaluation, FMEA, FRACAS, and so on.
The second way is for the clinical laboratory to perform its own FMEA and FRACAS to deal with conditions in its laboratory, since the manufacturer cannot anticipate all laboratory procedures. Since 85% of laboratory error is pre- and post-analytical rather than analytical, one should not underestimate the effect of laboratory procedures.
To summarize:
- The 100-sample evaluation (often fewer samples) performed by the clinical laboratory is not much more than a cursory check to make sure nothing has gone wrong with the assay in the hands of the laboratory.
- The manufacturer performs most of the analytical validation of the assay and some (often simulated) user validation, with the FDA evaluating the results.
- The laboratory performs FMEA and FRACAS in the context of its own procedures.
CLSI documents that support this approach are EP27 (error grids) and EP18 (risk management).