New book update

April 24, 2024

Revision 12 of my book https://www.amazon.com/world-laboratory-testing-statistical-consultant/dp/B09S86RJMF is now available. One of the main changes is that the detailed explanation of Bland-Altman plots, which appeared in my last blog entry, is now in the book.


Bland Altman plots (from a recent revision to my book – #11*)

March 27, 2024

In 2008, I submitted a paper (2) critiquing Bland-Altman plots (1). Since the original publication of Bland-Altman plots was one of the most cited papers ever to appear in The Lancet, I submitted my paper with some temerity.

Briefly, the issue is this. When one is comparing two methods, Bland and Altman suggest plotting the difference (Y-X) on the Y axis vs. the average of the two methods, (Y+X)/2, on the X axis. They stated that even if the X method is a reference method (they use the term gold standard), one should still plot the difference against the average; not doing so is misguided and will lead to spurious correlations. They attempted to prove this with formulas. Doubting their premise, I ran some simulations. But first, some background.

There are three types of clinical chemistry methods: definitive, reference, and field. In method comparisons, one compares either two field methods or a field method and a reference method. The latter comparison is performed to evaluate the accuracy of the field method. A key difference is that the field method is usually less precise than the reference method. Sometimes, the reference method precision is improved by replication; in clinical use, field methods are run just once.
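As a quick reminder of why replication helps (standard statistics, not something specific to the post), averaging n replicates reduces the standard deviation by a factor of the square root of n. A minimal sketch in Python, with an illustrative reference method SD:

    import math

    sd_single = 0.5                      # illustrative SD of one reference measurement (mmol/L)
    for n in (1, 2, 4, 9):
        # SD of the mean of n replicates = SD of a single measurement / sqrt(n)
        print(f"n = {n}: SD of mean = {sd_single / math.sqrt(n):.3f}")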

A method comparison analysis tool is the difference plot. The Bland-Altman suggestion to use the average of the two methods, (Y+X)/2, on the X axis for the field vs. reference case did not seem logical, so I explored this through simulations in Excel using Visual Basic for Applications.

I simulated a sodium assay with a variable labeled truth going from 100 to 200 mmol/L in increments of 1. I made eight cases. Four cases compared two field assays; within each case both assays had the same precision, which ranged across cases from 2 to 10 (mmol/L). Four cases compared a field assay with a reference assay, using the same field assay precisions, but in each case the reference assay had very good precision (0.5 mmol/L). Thus, in the field vs. field comparisons, the variance difference between the two field assays was close to zero. In the field vs. reference comparisons, the variance difference ranged from 2 to 100.

To construct each dataset, I used the RANNOR function in Excel to create 100 observations each of X and Y data with the standard deviation chosen for that case. Each of the eight cases was replicated 40 times, giving 320 datasets.

I performed 3 linear regressions on each dataset:

  1. Y vs. X
  2. Y-X vs. (X+Y)/2 (BA = Bland-Altman)
  3. Y-X vs. X (no BA = no Bland-Altman)

In each regression, X is either a field or a reference method; Y is always a field method.
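The original simulations were done in Excel with Visual Basic for Applications. Purely as an illustration of the idea (not the original code, and with illustrative standard deviations and seed), a minimal Python sketch of the same three regressions, expressed as correlation coefficients:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    truth = np.arange(100, 201, 1.0)              # sodium "truth" values, mmol/L

    def one_dataset(sd_x, sd_y):
        # add normally distributed measurement error to truth to create X and Y
        x = truth + rng.normal(0, sd_x, truth.size)
        y = truth + rng.normal(0, sd_y, truth.size)
        return x, y

    def correlations(x, y):
        d = y - x
        return {
            "Y vs X":         stats.pearsonr(x, y)[0],
            "Y-X vs (X+Y)/2": stats.pearsonr((x + y) / 2, d)[0],   # BA
            "Y-X vs X":       stats.pearsonr(x, d)[0],             # no BA
        }

    x, y = one_dataset(sd_x=5.0, sd_y=5.0)        # field vs. field (equal imprecision)
    print("field vs field:    ", correlations(x, y))
    x, y = one_dataset(sd_x=0.5, sd_y=5.0)        # field (Y) vs. reference (X, very precise)
    print("field vs reference:", correlations(x, y))

With equal imprecision, the BA correlation is near zero and the no-BA correlation is not; with a precise reference, the pattern reverses – which is the point of the table below.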

The data are summarized in the tables and plots. For the plots, the Y axis is the variance difference, the X axis is the correlation coefficient for the regression of Y vs. X, and the contours are the correlation coefficients for either Y-X vs. X or Y-X vs. (X+Y)/2.

  1. The ellipse on the lower right shows that for two field methods, BA has no correlation.
  2. The ellipse on the upper right shows that for a field vs. reference, BA has a correlation.
  3. The ellipse on the lower left shows that for two field methods no BA has a correlation.
  4. The ellipse on the upper left shows that for a field vs. reference no BA has no correlation.

Cases 1 and 3 confirm the original Bland-Altman argument – for two field methods, plot Y-X vs. (X+Y)/2.
Cases 2 and 4 confirm my argument – when X is a reference method, plot Y-X vs. X.

The results are shown in Table 2 and Figure 1. Basically, this says that when you have two field methods, you should plot the difference vs. (Y+X)/2 as Bland and Altman suggest. But when you have a field and a reference method, you should plot the difference vs. X. The values in the table are the correlation coefficients for Y-X vs. (Y+X)/2 and Y-X vs. X (after repeated simulations where Y is always a field method and X is either a field method or a reference method).

Table 2
Correlation coefficient of Y-X vs. either X or (X+Y)/2 for errors due to imprecision in X and Y

Case | X axis = X | X axis = (X+Y)/2
X = Reference method | ~0 | ~0.1
X = Field method | ~0.12 | ~0

Figure 1

  1. Bland JM, Altman DG. Comparing methods of measurement – why plotting difference against standard method is misleading. Lancet. 1995;346:1085-1087.
  2. Krouwer JS. Why Bland-Altman plots should use X, not (Y+X)/2 when X is a reference method. Statistics in Medicine. 2008;27:778-780.

*Revision 11 is available here.


Adverse event data informs on clinical outcomes when test results are bad

December 15, 2023

In a clinical trial (1) for a continuous glucose monitor (CGM), the data had two notable attributes:

  1. The data were mainly in the “A” zone of an error grid and
  2. The data covered most of the glucose range.

The trial was conducted with people with diabetes – they were wearing the CGMs – with medical decisions made many times over the course of the trial. But since most of the data were in the A zone (meaning that the difference between CGM glucose and reference was small), one can say that the medical decisions were based on good glucose data.

In CGM adverse event data where inaccuracy was a complaint, most data were in higher error grid zones, with almost no data in the A zone (2). Here, medical decisions were based on bad glucose data (the CGM and reference differed by a lot).

This is one of the values of adverse event data. One can observe clinical outcomes with medical decisions based on bad glucose data.

References

  1. Shah VN, Laffel LM, Wadwa RP, Garg SK. Performance of a Factory-Calibrated Real-Time Continuous Glucose Monitoring System Utilizing an Automated Sensor Applicator. Diabetes Technol Ther. 2018;20(6):428–433.
  2. Krouwer JS. An Analysis of 2019 FDA Adverse Events for Two Insulin Pumps and Two Continuous Glucose Monitors. Journal of Diabetes Science and Technology. 2022;16(1):228-232. doi:10.1177/1932296820951872

A surprising diabetes result

November 28, 2023

I have been analyzing continuous glucose monitor (CGM) adverse events. By querying the database, I found a little over 18,000 adverse events where the complaint contained some form of the word inaccurate. I was able to extract 4,483 pairs of data (Y = CGM, X = glucose meter) where the result falls in Parkes error grid zone B or higher. The following table shows this breakdown and includes the event types (M = malfunction, IN = injury).

Zone | Count | Event = M | Event = IN | Pct M | Pct IN
B | 2,418 | 2,393 | 25 | 99.0% | 1.0%
C | 1,050 | 1,039 | 11 | 99.0% | 1.0%
D | 1,004 | 856 | 146 | 85.3% | 14.5%
E | 11 | 11 | 0 | 100.0% | 0.0%
Totals | 4,483 | 4,299 | 182 | 95.9% | 4.1%
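For readers who want to reproduce this kind of breakdown, a minimal sketch (my illustration; it assumes the extracted pairs are in a pandas DataFrame with columns zone and event_type, names I chose for this example):

    import pandas as pd

    # toy data standing in for the 4,483 extracted pairs
    df = pd.DataFrame({"zone":       ["B", "B", "C", "D", "D", "E"],
                       "event_type": ["M", "IN", "M", "M", "IN", "M"]})

    counts = pd.crosstab(df["zone"], df["event_type"], margins=True, margins_name="Totals")
    pct = (pd.crosstab(df["zone"], df["event_type"], normalize="index") * 100).round(1)
    print(counts)   # counts of malfunction (M) and injury (IN) per zone
    print(pct)      # row percentages, as in the Pct M / Pct IN columns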

What is surprising is that so many events in the C zone or higher were listed as malfunction and not injury. I thought that a D zone result would almost always cause a medical decision that led to a bad clinical outcome.


The data is gone

November 23, 2023

I wanted to look at some continuous glucose monitor (CGM) vs glucose meter data in the adverse event database. But, in the most recent database (2022), there is no comparison data. The events are there, just without the comparison data.

Possibly, this is my fault! In adverse event databases from previous years, there were thousands of comparison data pairs available. I published a graph of this data on an error grid (1), which doesn't look very good (e.g., almost no data in the A zone).

It is possible that since this graph appeared, the manufacturers decided to stop providing comparison data.

Reference

  1. Krouwer JS. An Analysis of 2019 FDA Adverse Events for Two Insulin Pumps and Two Continuous Glucose Monitors. Journal of Diabetes Science and Technology. 2022;16(1):228-232. doi:10.1177/1932296820951872

CLIA Waiver requirements for Error Grids

October 20, 2023

Last night I attended an NEAACC lecture in Boston about regulatory affairs. Soon, it will be NEADLM but not yet.

I asked the question: does the CLIA waiver guidance still require error grid analysis? Since the speaker didn't seem to know what an error grid was, I got my answer. So, I looked at the most current FDA guidance on CLIA waiver requirements – it is here. I searched this document for the term "error grid." It appeared only once, as a footnote to the title of the CLSI document EP27, which is about error grids. The previous guidance required an error grid to be prepared.

This is a step backward, IMHO. For many assays, error size matters. My colleague and I wrote about this (1).

References

  1. Krouwer JS, Cembrowski GS. Towards more complete specifications for acceptable analytical performance – a plea for error grid analysis. Clinical Chemistry and Laboratory Medicine. 2011;49(7):1127-1130. https://doi.org/10.1515/CCLM.2011.610

Comments about risk analysis for quality control

January 9, 2023


The January 2023 issue of J Applied Lab Med contains three articles about “risk analysis for quality control.”

One feature of these articles is that there is a link to a slide show presentation of the concepts in the article. This is a fabulous idea – the slide show is well done – and slide shows like this would be a great addition to other articles. The link is here: ARUP Scientific Resource for Research and Education: QC Optimization | University of Utah

I went through the first slide show – about concepts – and have two comments.

The authors spend some time talking about acceptance limits and loss. I won't try to reproduce the slide show, but briefly, the authors describe a "threshold" loss function, where any value inside the limit has no loss and every value outside the limit has equal loss. They suggest a quadratic loss function is better. But this is not new! These concepts are in the glucose meter error grid, the FDA guidance on CLIA-waived assays, and the CLSI document on error grids (EP27). A reference to prior work would have been nice.
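To make the two loss functions concrete, here is a minimal sketch (my illustration, not taken from the slide show; the acceptance limit and error values are chosen only for the example):

    def threshold_loss(error, limit):
        # no loss inside the limit, equal loss for any value outside it
        return 0.0 if abs(error) <= limit else 1.0

    def quadratic_loss(error, limit):
        # loss grows with the size of the error; equals 1.0 exactly at the limit
        return (error / limit) ** 2

    for e in (0, 5, 9, 11, 20):
        print(e, threshold_loss(e, limit=10), quadratic_loss(e, limit=10))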

The second comment refers to their block diagram, which I will paraphrase as follows: assays have variation that leads to shifts in the mean, which is an out-of-control condition. These shifts have an error distribution. QC can detect some of these shifts, which can result in a narrower error distribution. The error distribution and the impact of error inform about the level of patient risk. Thus, it is implied that patient risk is a function of QC. While this statement is true, it is misleading: patient risk is not affected by QC alone.

I have mentioned before that QC helps to determine whether a process is or is not in control. The problem is that a process in control can still lead to bad outcomes regardless of the QC program. For example, QC is blind to patient interferences and to most preanalytical errors, both of which occur while a process is in control. One would like to know how much risk to patients exists for an in-control condition.


More thoughts about Six Sigma for clinical lab tests – updated

November 14, 2022

From the cited reference, the following adjectives are attached to various sigma values, along with the number of defects per million samples.

Sigma | Quality | Defects per million samples
4 | Good | 6,210
5 | Excellent | 233
6 | World Class | 4

Additionally, the defects are described as clinical defects, implying that a clinician will make an incorrect medical decision due to a defect in the reported result. Here are the problems with this:

  1. It is not uncommon for a lab to report a million results a year for an assay. This means that a "good" assay will have 17 defects per day and an excellent assay 19 defects per month! This can't be right. A clinical defect is serious, and either rate (17 per day or 19 per month) is frequent. Frequent events with high severity would not be tolerated. Even if the number of assays were smaller, the defect rate would still be too high.
  2. One possible explanation is that the total allowable error goals are not right. For example, if TEa = 12, bias = 0, and SD = 3, sigma is 4; but if TEa = 18, sigma is 6 (see the sketch after this list). One can ask: who is setting the goals? The only assay goals set by clinicians that I am aware of are those in the glucose meter surveillance error grid. In the preliminary Milan conference recommendation, it was stated that goals should be based on clinical outcomes as surveyed by clinicians. But in the final Milan conference recommendation, the "surveyed by clinicians" phrase was dropped. In an accompanying paper, one sees why, as the paper said that clinicians seem "uninformed" regarding important principles. An example: on the Westgard website (Consolidated Comparison of Chemistry Performance Specifications – Westgard), glucose allowable total error ranges from 7% to 13%, with 8% being common. But for a Parkes glucose meter error grid, values from 77 to 127 for a target of 100 will not cause patient harm. This is a -23% to +27% total allowable error.
  3. Not all defects are clinical defects! If an analyzer does not produce a result due to failure of a system, this is a defect, but not a clinical defect (assuming that this is not a stat assay).
  4. And defects such as the analyzer failure just described, or preanalytical or postanalytical errors, will not be captured by the sigma metric (TEa – bias)/precision.
  5. One can also argue that whatever the goals are, the reported sigma metric’s claim about clinical defects is unverified. Thus, the sigma model has never been verified. How would one do this? One would have to review a sample of patient records to determine if incorrect treatment decisions were made.
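A minimal sketch of the arithmetic behind points 1 and 2 (my illustration; the TEa, bias, and SD values are the ones used in the example above, and the one-million-results-per-year volume is the assumption stated in point 1):

    def sigma_metric(tea, bias, sd):
        # sigma metric in the form used above: (TEa - bias) / precision
        return (tea - abs(bias)) / sd

    print(sigma_metric(tea=12, bias=0, sd=3))   # 4.0 ("good")
    print(sigma_metric(tea=18, bias=0, sd=3))   # 6.0 ("world class")

    results_per_year = 1_000_000
    defects_per_million = {"good (4 sigma)": 6210, "excellent (5 sigma)": 233, "world class (6 sigma)": 4}
    for label, dpm in defects_per_million.items():
        defects_per_year = results_per_year * dpm / 1_000_000
        print(f"{label}: {defects_per_year / 365:.1f} defects/day, {defects_per_year / 12:.1f} defects/month")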

Reference

Westgard S, Bayat H, Westgard JO. Analytical Sigma metrics: A review of Six Sigma implementation tools for medical laboratories. Biochem Med (Zagreb). 2018;28(2):020502. https://doi.org/10.11613/BM.2018.020502


Difference in performance between clinical trials and real-world use for continuous glucose monitors

June 30, 2022

This is the paper I mentioned in the last post.

Abstract

Background Recent clinical trials have been performed to evaluate the safety and accuracy of continuous glucose monitors (CGMs).

Methods Adverse event data from the FDA website were downloaded into a database, and queries were performed to produce adverse event data for the two CGMs studied in the clinical trials.

Results The clinical trial method comparisons used CGM glucose results vs. the YSI reference. The adverse event data compared CGM glucose results vs. a glucose meter as the reference. The clinical trial data were mainly in the "A" zone of a glucose meter error grid, whereas the adverse event data were mainly in higher zones.

Differences in results can be partly explained by carefully controlled clinical trials vs. real world use, where user error is possible. This same difference in results was also seen for clinical trials with marketed products. Whereas the rate of CGM adverse events is small compared to the number of glucose determinations, the rate of CGM adverse events relative to the number of people using insulin is 5%.

Conclusions For marketed products, clinical trials should also examine the adverse event database, which is a valuable source of performance data, and programs should be developed to try to reduce the rate of adverse events.

Background

New diabetes devices are offered on a regular basis, including continuous glucose monitors (CGMs). The CGM sensor lies under the user's skin and measures glucose every 5 minutes in interstitial fluid. Method comparison evaluations are used to establish the safety and accuracy of CGMs.

The purpose of this paper is to show that the results in these studies often differ from real world use. Here, real world use is defined as the CGM device being used by people with diabetes to manage their glucose levels (e.g., not as part of an experiment). Two recent clinical trials evaluated the Dexcom G6 (NCT02880267) (1) and the Senseonics Eversense (NCT03808376) (2). The Dexcom G6 sensor is inserted by the user and replaced by the user every 10 days (3). The Eversense sensor is inserted by a physician and replaced by a physician every 180 days (4).

One can view these trials as carefully controlled method comparison studies taking a very small sample, intended to be representative, of the population of possible CGM results. The reference method in these studies was the YSI 2300 glucose analyzer (Yellow Springs Instrument). The Dexcom G6 trial had 3,532 CGM-YSI pairs, whereas the Eversense trial had 49,613 CGM-YSI pairs. Once the devices are available to the public, another set of data becomes available – adverse events that have been reported to the FDA. The adverse events are a very small percentage of the population of CGM results. These adverse events are reported by users and represent real world user data.

Methods

Adverse event data from 2021 were downloaded from the FDA website (5) into a SQL Server database. The key fields in the database are GENERIC_NAME (used to separate CGMs from other devices), BRAND_NAME, EVENT_TYPE (malfunction, injury, or death), and FOI_TEXT (a description of the event). The total number of adverse events across all devices for 2021 was 2,030,159. Queries were performed on the field GENERIC_NAME to isolate CGM events from other devices. There were 349,672 CGM adverse events in 2021 (17.4% of all devices). Queries on the field BRAND_NAME and EVENT_TYPE produced the results shown in Table 1.

Table 1
Events for two CGM devices for 2021

CGM | Malfunction | Injury | Death | Totals
Dexcom G6 | 286,401 | 6,202 | 3 | 292,606
Eversense | 72 | 54 | 0 | 126
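As an illustration of the kind of query described above (not the actual SQL Server code; the table name, the toy rows, and the GENERIC_NAME filter are placeholders chosen for the sketch):

    import sqlite3   # stand-in for SQL Server in this sketch

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE events (GENERIC_NAME TEXT, BRAND_NAME TEXT, EVENT_TYPE TEXT, FOI_TEXT TEXT)")
    con.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", [
        ("GLUCOSE SENSOR, CONTINUOUS", "DEXCOM G6", "Malfunction", "CGM READ HIGH ..."),   # toy rows
        ("GLUCOSE SENSOR, CONTINUOUS", "EVERSENSE", "Injury", "..."),
        ("INSULIN PUMP", "SOME PUMP", "Malfunction", "..."),                               # not a CGM
    ])

    sql = """
        SELECT BRAND_NAME, EVENT_TYPE, COUNT(*) AS n
        FROM events
        WHERE GENERIC_NAME LIKE '%GLUCOSE%SENSOR%'   -- illustrative filter to isolate CGM events
        GROUP BY BRAND_NAME, EVENT_TYPE
        ORDER BY n DESC
    """
    for brand, event_type, n in con.execute(sql):
        print(brand, event_type, n)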

Results

Dexcom G6

For the Dexcom G6 records, a further query on the field FOI_TEXT produced records where any form of the word inaccurate appeared, which resulted in 18,630 records (7% of all Dexcom G6 adverse events). These results were exported to Excel, where 4,611 pairs of CGM and matched glucose meter results were extracted using Visual Basic for Applications. In this case, the reference method was a blood glucose meter, not the YSI.
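The extraction itself was done in Visual Basic for Applications. Purely as an illustration of the approach (the pair-matching pattern below is hypothetical, since the FOI_TEXT narratives vary widely and need more careful parsing):

    import re

    INACCURATE = re.compile(r"inaccura\w*", re.IGNORECASE)                  # inaccurate, inaccuracy, ...
    PAIR = re.compile(r"CGM\D*(\d+)\D*METER\D*(\d+)", re.IGNORECASE)        # hypothetical phrasing

    def extract_pair(foi_text):
        # return (CGM value, meter value) if the record mentions inaccuracy and a pair can be found
        if not INACCURATE.search(foi_text):
            return None
        m = PAIR.search(foi_text)
        return (int(m.group(1)), int(m.group(2))) if m else None

    print(extract_pair("PATIENT REPORTED CGM READ 250 WHILE METER READ 100, INACCURATE READING"))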

A plot of the Dexcom G6 adverse event data on a Parkes error grid is shown in Figure 1. The Parkes error grid (developed for glucose meters) assigns method comparison data to zones. The A zone represents clinically unimportant differences. Higher zones represent increasing clinical harm, with the highest zones representing life-threatening errors.
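For readers who want to do this themselves, zone assignment is a point-in-polygon test against the published zone boundaries. A minimal sketch (the polygon below is a placeholder, not the real Parkes A zone; the actual vertex coordinates must be taken from the Parkes publication):

    from matplotlib.path import Path

    # PLACEHOLDER polygon roughly following the identity line; NOT the published Parkes A zone
    zone_a = Path([(0, 0), (50, 0), (550, 450), (550, 550), (450, 550), (0, 50)])

    def in_zone(reference, cgm, zone=zone_a):
        # x = reference (glucose meter) value, y = CGM value, both in mg/dL
        return zone.contains_point((reference, cgm))

    print(in_zone(100, 110))   # True for this placeholder polygon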

Figure 1
Parkes error grid of Dexcom G6 adverse event data

The percentage of data in each zone is shown for the adverse event and clinical trial data in Table 2.

Table 2
percentage of data in each zone

Zone | Adverse Events | Clinical Trial Adults | Clinical Trial Children
A | 4.6% | 91.9 | 95.7
B | 25.6% | 8.05 | 4.1
C | 31.8% | 0.05 | 0.2
D | 34.9% | |
E | 3.1% | |

The adverse event data is for a Parkes error grid and the clinical trial data (1) is for a Clarke error grid.

Eversense

As for the Dexcom G6, a query on the FOI_TEXT field for the Eversense records yielded 19 records which contained a form of the word inaccuracy (15.1% of all Eversense records). Unlike the Dexcom G6 data, the Eversense CGM YSI data pairs were not plotted. The percentages of differences less than 15% and less than 20% were reported, with an average of 92.9% of differences being 20% or less. A stark difference between the Dexcom G6 and Eversense records was that the Dexcom G6 records had a much more detailed description of each adverse event; the Eversense records were often just one sentence.

Limitations

The adverse event data are unverified. Rarely were the units returned and analyzed by the manufacturer. Although the results are matched pairs of CGM and glucose meter data, there could have been a time delay between the two readings. There is no guarantee that the glucose meter values were correct. There is no way to know whether all adverse events that should have been reported were reported and correctly classified. One should not try to compare the absolute numbers of adverse events. There is no information about usage, which is required for rate estimation.

Discussion

Clearly, the results of the clinical trial and adverse event data differ. One can view the adverse event data with large inaccuracies as events that could have resulted in injury or events that did result in injury (usually hospitalization). Both the clinical trial results and the Dexcom G6 adverse events are very small percentages of the total number of yearly CGM glucose determinations.

To estimate the number of CGM glucose determinations in the US: assuming that 48% of people use CGMs (4) out of the 7.4 million people who use insulin (5) gives 373 billion glucose results each year. The number of Dexcom G6 adverse events in 2021 was 252,646. The clinical trial results are thus a tiny sample of the population of glucose results. The Dexcom G6 adverse events are not a sample – they are the population of adverse events within the population of all 373 billion glucose results. Thus, it is not surprising that no serious adverse events were observed in the clinical trial. It would be like one needle in a haystack finding another needle.
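The arithmetic behind the 373 billion estimate (a sketch; it assumes one CGM reading every 5 minutes, as described in the Background):

    insulin_users = 7.4e6
    cgm_users = 0.48 * insulin_users                  # ~3.55 million CGM users
    readings_per_year = (60 / 5) * 24 * 365           # one reading every 5 minutes = 105,120 per user per year
    print(cgm_users * readings_per_year / 1e9)        # ~373 (billion glucose results per year)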

The clinical trials, which are sponsored by the manufacturer, are carefully controlled as expected, whereas the adverse event data represent real world use, which includes user error (and user misuse). For example, Dexcom specifies approved locations to place the G6 sensor, but some users do not follow the recommendation. Note that an adverse event caused by user error must nevertheless be reported to the FDA (as either a malfunction, injury, or death). The adverse event database is an outcome-based database, not a database of causes.

When product evaluations were carried out with marketed products, the same differences were seen between those evaluations and the adverse event database (6,7).

The conclusion of the clinical trial was that the device is safe and accurate. This is a valid conclusion based on the trial results but does not consider real world use.

Adverse events occur with all diabetes devices, and the harm that would result if these products were not available is likely much greater than the harm due to the adverse events. The rate of CGM adverse events per glucose determination is very small (hundreds of thousands of adverse events divided by hundreds of billions of glucose determinations), but the rate of total CGM adverse events (349,672) divided by the number of people using CGMs (3.6 million) = 9.8%, which is not so small.
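The two rates compared above, spelled out (a sketch; 3.55 million CGM users is 48% of 7.4 million insulin users, rounded in the text to 3.6 million):

    adverse_events = 349_672
    glucose_results = 373e9
    cgm_users = 0.48 * 7.4e6                    # ~3.55 million
    print(adverse_events / glucose_results)     # ~9.4e-07: a tiny per-result rate
    print(adverse_events / cgm_users)           # ~0.098, i.e. about 9.8% per CGM user per year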

Conclusions

The adverse event database is a neglected, useful source of performance data. For marketed products, clinical trials should also examine the adverse event database, and programs should be developed to try to reduce the rate of adverse events (9).

References

  1. Shah VN, Laffel LM, Wadwa RP, Garg SK. Performance of a Factory-Calibrated Real-Time Continuous Glucose Monitoring System Utilizing an Automated Sensor Applicator. Diabetes Technol Ther. 2018;20(6):428–433.
  2. Satish K. Garg, David Liljenquist, Bruce Bode, Mark P. Christiansen, Timothy S. Bailey, Ronald L. Brazg, Douglas S. Denham, Anna R. Chang, Halis Kaan Akturk, Andrew Dehennis, Katherine S. Tweden, and Francine R. Kaufman. Evaluation of Accuracy and Safety of the Next-Generation Up to 180-Day Long-Term Implantable Eversense Continuous Glucose Monitoring System: The PROMISE Study Diabetes Technology & Therapeutics 2022 24:2, 84-92
  3. Dexcom G6 system. Dexcom G6 Continuous Glucose Monitoring (CGM) System | Zero Fingersticks. Accessed March 11, 2022.
  4. Eversense System. Home | Senseonics (eversensediabetes.com). Accessed March 11, 2022.
  5. About Manufacturer and User Facility Device Experience (MAUDE) | FDA accessed March 11, 2022.
  6. Freckmann G, Link M, Kamecke U, Haug C, Baumgartner B, Weitgasser R. Performance and Usability of Three Systems for Continuous Glucose Monitoring in Direct Comparison. Journal of Diabetes Science and Technology. 2019;13(5):890-898.
  7. Krouwer JS. The world of laboratory testing as seen by a statistical consultant. Kindle Direct Publishing Seattle, Washington 2022 pp 109-110.
  8. Cefalu WT, Dawes DE, Gavlak G, Goldman D, Herman WH, et al. Insulin Access and Affordability Working Group: Conclusions and Recommendations. Diabetes Care. 2018;41(6):1299–1311. https://doi.org/10.2337/dci18-0019
  9. Krouwer JS. Reducing Glucose Meter Adverse Events by Using Reliability Growth with the FDA MAUDE Database. Journal of Diabetes Science and Technology. 2019;13:959-962.

New book is now available

February 14, 2022

My new book is now available at Amazon.com. It comes in two versions: a Kindle eBook and a paperback. Initially, for a short time (this week), the eBook is available at no charge.

The title is: The world of laboratory testing as seen by a statistical consultant
Most lab tests are accurate – why a few are highly inaccurate and what to do about it.

Description

This book describes the laboratory testing world as seen through the eyes of a statistical consultant. How does industry set product specifications? Why most evaluations are biased. How to work efficiently with the FDA to get your product approved. When things go wrong and why projects are almost always late. Standards have a hierarchy, and the question of whether to accept or reject a new product is complicated. The CLSI standards are influenced by industry – the Uniformity of Claims standard was cancelled due to industry pressure. The total error standard was revised according to industry wishes. Other standards such as ISO 9001 and ISO 15189 promote quality improvement but are mainly about documentation. The guide to the expression of uncertainty in measurement has been simplified for clinical chemists to the extent that it has little value. Likewise, a commutability standard, based on bad experiments, is suspect. The Milan conference proposed a hierarchy for setting performance standards but fell short. Quality control based on risk management has a long way to go. AACC got in bed with Theranos – this marriage was doomed from the start. There are ways to improve evaluations: multifactor designs, mountain plots, and improving instrument reliability. The FDA adverse event database is a valuable source of information about diabetes device failures.

After going through the self-publishing process, I can now appreciate the work that goes into publishing!