Calculating measurement uncertainty and GUM

October 16, 2017

A recent article (subscription required) suggests how to estimate measurement uncertainty for an assay to satisfy the requirements of ISO 15189.

As readers may know, I am a fan of neither ISO nor measurement uncertainty. The formal document, GUM – the Guide to the Expression of Uncertainty in Measurement – will make most clinical chemists' heads spin. Let's review how to estimate uncertainty according to GUM.

  1. Identify each item in an assay that can cause uncertainty and estimate its imprecision. For example, a probe picks up some patient sample; the amount of sample taken varies due to imprecision of the sampling mechanism.
  2. Any bias found must be eliminated. There is imprecision in the elimination of the bias, so bias has been transformed into imprecision.
  3. Combine all sources of imprecision into a BHE (big hairy equation – my term, not GUM's).
  4. The final estimate of uncertainty is scaled by a coverage factor. Thus, a 99% uncertainty interval is wider than a 95% one. Remember that a 100% uncertainty interval is minus infinity to plus infinity. (A sketch of steps 3 and 4 appears after this list.)
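To make steps 3 and 4 concrete, here is a minimal sketch in Python. The component uncertainties are hypothetical values of my own, and I assume independent components with unit sensitivity coefficients; a real GUM budget propagates each component through the full measurement equation.

```python
import math

# Hypothetical standard uncertainties, as CV fractions -- illustrative
# values only, not from any real uncertainty budget.
u_sampling = 0.010         # probe sample-volume imprecision
u_reagent = 0.015          # reagent dispensing imprecision
u_bias_correction = 0.008  # imprecision left over from eliminating a bias

# Step 3: the "BHE" -- with independent components and unit sensitivity
# coefficients, it collapses to a root sum of squares.
u_combined = math.sqrt(u_sampling**2 + u_reagent**2 + u_bias_correction**2)

# Step 4: expanded uncertainty U = k * u_combined, where k is the
# coverage factor (k ~ 2 gives ~95% coverage, k ~ 2.58 gives ~99%).
for k, coverage in [(2.0, "95%"), (2.58, "99%")]:
    print(f"k = {k}: U = {k * u_combined:.4f} ({coverage} interval)")
```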

The Clin Chem Lab Med article cited above calculates uncertainty by mathematically summing the imprecision of controls and the bias from external surveys. This is of course light years away from GUM, and the fact that the authors call the result measurement uncertainty could lead some to think it is the same as GUM.
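For contrast, here is a sketch of that top-down style of calculation with made-up numbers. Since the article is behind a paywall, I am assuming the common quadrature combination of imprecision and bias; the article's exact formula may differ.

```python
import math

# Made-up inputs -- not the article's data.
cv_qc = 0.020   # long-term imprecision estimated from QC materials (CV)
u_bias = 0.015  # bias, treated as an uncertainty, from external surveys

# Combine imprecision and bias in quadrature, then expand with k = 2.
u_combined = math.sqrt(cv_qc**2 + u_bias**2)
U = 2 * u_combined
print(f"Expanded uncertainty: {U:.3f} (as a fraction of the result)")
```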

Remember that in the authors' approach there are no patient samples, so errors due to interferences never appear in the estimate. Moreover, patient samples can suffer errors that controls do not. Measurement uncertainty must include errors from the entire measurement process, not just the analytical error.

Perhaps the biggest problem is that a clinician may look at such an uncertainty interval as truth, when the likely true interval will be wider and sometimes much wider.


Comparison of company vs. standards organization specifications

April 11, 2017

For almost all of my career, I’ve been working to determine performance specifications for assays, including the protocol and data analysis methods to see if performance has been met. This work has been performed mainly for companies but occasionally also for standards groups. There are some big differences.

Within a company, the specifications are very important:

If the product is released too soon, before the required performance has been met, the product may be recalled, patients may suffer harm, and overall the company may suffer financially.

If the product is released too late, the company will definitely suffer financially as “time to market” has been shown in financial models to be a key success factor in achieving profit goals.

Company specifications are built around two main factors: what performance is competitive, and how the company can be sure that no patients will be harmed. In my experience this leads to two goals – 95% of the differences between the company assay and reference should fall within limits that guarantee a competitive assay, and no difference should be large enough to cause patient harm (a clinical standard).
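As a sketch of how such a two-goal specification might be checked against method-comparison differences (the limits and the data are hypothetical):

```python
# Hypothetical two-tier specification for differences (assay - reference).
COMPETITIVE_LIMIT = 5.0  # 95% of differences must be within +/- this
CLINICAL_LIMIT = 15.0    # no difference may exceed this (patient harm)

def meets_spec(differences):
    """Return True if both the 95% goal and the clinical goal are met."""
    within = sum(abs(d) <= COMPETITIVE_LIMIT for d in differences)
    pct_within = 100 * within / len(differences)
    worst = max(abs(d) for d in differences)
    return pct_within >= 95 and worst <= CLINICAL_LIMIT

print(meets_spec([1.2, -0.8, 3.9, -4.6, 2.1, 0.3, -2.2, 1.0]))  # True
```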

Standards groups seem to have a different outlook. Without being overly cynical, the standards adopted often seem designed to guarantee that no company's assay will fail the specification: 95% of differences between the assay and reference should be within the stated limits. There is almost never any mention of the larger errors that may cause patient harm.

Thus, it is somewhat ironic that company specifications are usually more difficult to achieve than specifications published by the standards organizations.


EFLM – after three years it’s disappointing

February 15, 2017


Thanks to Sten Westgard, whose website alerted me to an article about analytical performance specifications. Thanks also to Clin Chem Lab Med for making this article available without a subscription.

To recap, the EFLM task group was going to fill in the details of the performance specifications framework initially described at the Milan conference held in 2014.

Basically, what this paper does is assign analytes (not all analytes that can be measured, but a subset) to one of three categories for how to arrive at analytical performance specifications: clinical outcomes, biological variation, or state of the art. Note that no specifications are provided – only which analytes are in which categories. That doesn't seem like it should take three years.

And I don’t agree with this paper.

For one, talking about “analytical” performance specifications implies that user error and other mishaps that cause errors are not part of the deal. This is crazy, because the preferred option is based on the effect of assay error on clinical outcomes, and it makes no sense to exclude errors just because their source is not analytical.

I don’t agree with the second and third options ever playing a role (biological variation and state of the art). My reasoning follows:

If a clinician orders an assay, the test must have some use for the clinician to decide on treatment. If this is not the case, the only reason a clinician would order such an assay is that he has to make a boat payment and needs the funds.

So, for example, say the clinician will provide treatment A (often no treatment) if the result falls within X1 to X2, and treatment B if the result is greater than X2. Of course this is oversimplified, since factors other than the assay result are involved. But if the assay result is 10 times X2 while the true value is between X1 and X2, the clinician will make the wrong treatment decision based on laboratory error. I submit that this model applies to all assays, and that if one assembles clinician opinion, one can construct error specifications (see the last sentence at the bottom).
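A sketch of that decision model in code; the thresholds and values are hypothetical:

```python
# Hypothetical decision thresholds for an assay result.
X1, X2 = 70.0, 100.0

def decide(result):
    """Map an assay result to a (simplified) treatment decision."""
    if result > X2:
        return "treatment B"
    if result >= X1:
        return "treatment A (often no treatment)"
    return "other workup (below X1, not part of the example)"

truth = 85.0          # true value lies within X1-X2
reported = 10 * X2    # gross laboratory error
print(decide(truth))     # treatment A -- the right decision
print(decide(reported))  # treatment B -- wrong decision, caused by the error
```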

Other comments:

In the event that outcome studies do not exist, the authors encourage double-blind randomized controlled trials. Get real, people – these studies would never be approved! (e.g., feeding clinicians the wrong answer to see what happens).

The authors also suggest simulation studies, but as I have previously commented, the premier simulation study they cite (the Boyd and Bruns glucose meter simulations) was flawed.

The Milan 2014 conference rejected the use of clinician opinion to establish performance specifications. I don’t see how clinical chemists and pathologists trump clinicians.


Revisiting Bland Altman plots and a paranoia

February 13, 2017


Over 10 years ago I submitted a paper critiquing Bland Altman plots. Since the original publication of Bland Altman plots was the most cited paper ever in The Lancet, I submitted my paper with some temerity.

Briefly, the issue is this. When one compares two methods, Bland and Altman suggest plotting the difference (Y-X) against the average of the two methods, (X+Y)/2. Bland and Altman also stated in a later paper (1) that even if the X method is a reference method (they use the term gold standard), one should still plot the difference against the average, and that not doing so is misguided and will lead to spurious correlations. They attempted to prove this with formulas.

Not being so great in math, but doubting their premise, I did some simulations. The results are shown in the table below. Basically, this says that when you have two field methods you should plot the difference vs. (X+Y)/2, as Bland and Altman suggest. But when you have a field and a reference method, you should plot the difference vs. X. The values in the table are the correlation coefficients for Y-X vs. X and Y-X vs. (X+Y)/2, from repeated simulations where Y is always a field method and X is either a field method or a reference method (a sketch of the simulation appears below the table).

 

Case                    Y-X vs. X    Y-X vs. (X+Y)/2
X = reference method    ~0           ~0.1
X = field method        ~-0.12       ~0
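For the curious, here is a minimal sketch of the kind of simulation involved. The distributions and parameters are my arbitrary choices, so the magnitudes will not exactly match the table, but the pattern of which correlation is near zero should:

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sims = 100, 1000  # samples per comparison, number of simulations

def mean_correlations(x_is_reference):
    """Average corr(Y-X, X) and corr(Y-X, (X+Y)/2) over repeated simulations."""
    r_vs_x, r_vs_avg = [], []
    for _ in range(n_sims):
        truth = rng.normal(100, 20, n)
        y = truth + rng.normal(0, 5, n)  # Y is always a field method
        x = truth if x_is_reference else truth + rng.normal(0, 5, n)
        d = y - x
        r_vs_x.append(np.corrcoef(d, x)[0, 1])
        r_vs_avg.append(np.corrcoef(d, (x + y) / 2)[0, 1])
    return np.mean(r_vs_x), np.mean(r_vs_avg)

for is_ref in (True, False):
    rx, ravg = mean_correlations(is_ref)
    label = "reference" if is_ref else "field"
    print(f"X = {label} method: Y-X vs. X: {rx:.2f}, Y-X vs. (X+Y)/2: {ravg:.2f}")
```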

 

The paranoia

I submitted my paper as a technical brief to Clin Chem and included my simulation program as an appendix. After I was told to recast the paper as a Letter, it was rejected. I submitted it to another journal (I think it was Clin Chem Lab Med) and it was also rejected. I then submitted my letter to Statistics in Medicine (2), where it was accepted.

Now, in the lab medicine field I am known by the other statisticians, and I have sometimes published papers not to their liking. At Statistics in Medicine, I am an unknown, and lab medicine is a small part of that journal. So maybe my paper was judged solely on merit, or maybe I'm just paranoid.

References

  1. Bland JM, Altman DG. Comparing methods of measurement: why plotting difference against standard method is misleading. Lancet 1995;346:1085-1087.
  2. Krouwer JS. Why Bland-Altman plots should use X, not (Y+X)/2 when X is a reference method. Statistics in Medicine 2008;27:778-780.

Letter to be published

November 15, 2016


Recently, I alerted readers to the fact that the updated FDA POCT glucose meter standard no longer includes a specification covering 100% of the results.

So I submitted a letter to the editor to the Journal of Diabetes Science and Technology.

This letter has been accepted. It seemed to take a long time for the editors to decide about my letter, and I can think of several possible reasons:

  1. I was just impatient – the time to reach a decision was average.
  2. The editors were exceptionally busy due to their annual conference which just took place.
  3. By waiting until the conference, the editors could ask the FDA if they wanted to respond to my letter.

I’m hoping that #3 is the reason so I can understand why the FDA changed things.


Glucose meter QC – too much?

November 13, 2016


A colleague has been corresponding with me about glucose meter QC and recently sent me this paper. Basically, the authors are concerned about the high cost of QC for glucose meters and have data in their paper to show that their glucose meters are very reliable.

Actually, they state their meters are better than highly reliable; to quote from their paper: “no meter failure is detected by current QC testing procedure”. Well, not so fast: in some cases a repeated QC failure was corrected by using a new vial of strips. To me, this indicates a basic misunderstanding of glucose meters. One can think of the glucose meter testing process as having three components:

 

  1. The user – who must correctly use the meter
  2. The reagent strip – this is where the chemistry occurs
  3. The meter – hardware and software with the outcome being a glucose result

 

It seems as if these authors consider the meter to be the item being controlled. Yet it is highly unlikely that the meter itself would provide incorrect results – if a meter's hardware failed, it would more likely provide no results at all. The reagent strip is where the action occurs, and a bad strip can cause wrong answers; so in the authors' study, QC did detect bad strips and presumably prevented wrong results.

I will comment at a later date about QC and user error.

What if the authors had shown no failures detected by QC? Would that justify reducing QC to monthly (as suggested by CMS) or even less often? Cost is an important issue, but the purpose of QC is to detect errors. QC is not useless just because no errors are found.

The purpose of QC is to detect errors so that an incorrect result is never reported. This in turn prevents a clinician from making a wrong medical decision, based on test error, that harms the patient. Hence, an assumption is that a correct result is needed to prevent patient harm. (If this is not the case, then one can argue that no QC is needed – nor is the test needed in the first place.)

But the frequency with which QC actually detects errors is not important, as long as it can be shown that QC can detect errors. If the system is reliable, the error rate will be low.
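A back-of-the-envelope calculation makes the point; all of the rates here are hypothetical:

```python
# Hypothetical rates -- for illustration only.
tests_per_day = 200
failures_per_test = 1 / 5000   # chance a problem (e.g., bad strip vial) begins
qc_runs_per_day = 2            # QC brackets the testing

# Even a "reliable" system produces some failures over a year...
expected_failures_per_year = 365 * tests_per_day * failures_per_test

# ...and QC caps how many wrong results escape per failure: on average
# about half the results between consecutive QC runs.
wrong_results_capped = tests_per_day / qc_runs_per_day / 2

print(f"Expected failures per year: {expected_failures_per_year:.0f}")
print(f"Wrong results per failure, with QC: ~{wrong_results_capped:.0f}")
print("Without QC, a persistent failure keeps producing wrong results.")
```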

The message is that one would never remove safeguards just because of low error rates. For example, in hospitals and nuclear power plants, monitoring for radiation is a QC-like practice and costs money. The frequency of problems is not relevant.


Comparison of Parkes glucose meter error grid with CLSI POCT 12-A3 Standard

November 8, 2016


The graph below shows the Parkes error grid in blue. Each zone of the Parkes error grid represents increasing patient harm, with the innermost zone A representing no harm. The (unlabeled) zones run from A (innermost) to D or E.

The red lines are the POCT 12-A3 standard limits. The innermost red lines should contain 95% of the results. Since no more than 2% of results may fall outside the outermost red lines, those lines should contain at least 98% of the data.

[Figure: Parkes error grid zones (blue) overlaid with the POCT 12-A3 limits (red)]

The red lines correspond roughly with the A zone of the Parkes error grid – the region of no patient harm.

Of course, the problem is that under the CLSI guideline, up to 2% of the results are allowed to fall in the higher zones of the Parkes error grid, zones corresponding to severe patient harm.
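As a sketch of the POCT 12-A3 containment arithmetic: the limit functions below are placeholders I made up to show the structure; consult the standard for the actual numeric limits. Note that a containment check like this says nothing about where the non-contained 2% of results land, which is exactly the problem the Parkes grid zones expose.

```python
# Placeholder limit functions -- consult POCT 12-A3 for the real limits.
def inner_limit(ref):   # limit that should contain 95% of results
    return 12.0 if ref < 100 else 0.125 * ref  # assumed form, mg/dL

def outer_limit(ref):   # limit that no more than 2% may exceed
    return 1.6 * inner_limit(ref)              # assumed widening

def complies(pairs):
    """Check the 95%/98% containment rules for (meter, reference) pairs."""
    n = len(pairs)
    within_inner = sum(abs(m - r) <= inner_limit(r) for m, r in pairs)
    beyond_outer = sum(abs(m - r) > outer_limit(r) for m, r in pairs)
    return within_inner / n >= 0.95 and beyond_outer / n <= 0.02

print(complies([(98, 100), (140, 150), (255, 250), (70, 75)]))  # True
```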