Review of why FMEA is so difficult

January 20, 2019

FMEA stands for Failure Mode Effects Analysis
FRACAS stands for Failure Reporting and Corrective Action System

Definitions – FMEA is a process to reduce the risk of undesirable events. FRACAS is a process to reduce the rate of undesirable events. Hence FMEA failures haven’t happened while FRACAS failures have. FMEA failures are often much more severe than FRACAS failures.

Not a FMEA – When someone says they have reduced the rate of failures for a process, they have performed FRACAS, not FMEA.

Lack of management support – Companies make money by selling products and services and also by reducing costs. FMEA does neither. It consumes resources while reducing the risk of costs (caused when failures occur). A nonprofit has to worry about the same issues. If they lose money, they will have to reduce services.

The following is not a compelling appeal to management for resources – “There are some catastrophic failures that have never happened. We can spend some money to make these events even less likely”.

I participated in a company FMEA that was always held during lunch. The reason given by management was that they had more important tasks to do during regular working hours.

Lack of a facilitator – The purpose of a FMEA is to question the design of a product or process. Often, the designer is present. A facilitator can prevent an adversarial confrontation.

Lack of interest – People who design medical instruments like to design. It’s a challenge to motivate them to perform tasks other than design.

Insufficient detail – FMEA is a bottom-up approach and requires listing all process steps. (Fault trees are a top-down approach.) Mapping out a process requires inclusion of all relevant events. Providing insufficient detail is a problem. For example, a technician examines a clinical sample before it is analyzed. An additional branch could cover how this person is hired, trained, etc.

Stopping at the status quo – As one conducts a FMEA, one lists the possible failure modes for each step, the effect of each failure, and the mitigation for each failure. Example: a process step is to test whether a sample is hemolyzed, the effect (of a failure) is that potassium is artificially elevated, and the existing mitigation is to rely on the instrument’s automatic flagging of hemolyzed samples. One might conclude that’s it – no need to do anything further. But all that’s been done is to describe the existing process. This is not FMEA. One must ask questions such as how the instrument’s flagging system can fail.

Acceptable risk is hard to quantify – One can never have a zero failure risk. For example, for blood gas, which is an emergent assay, one remedy to mitigate against an instrument failure is to use two instruments. One can estimate the probability of both instruments failing at the same time. One can add a third machine and so on, always lowering the risk, but it is never zero. Mitigations cost money so one must make a tradeoff between cost and risk.
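
The redundancy arithmetic can be sketched in a few lines. This is only an illustration: the per-instrument failure probability of 0.01 is a made-up number, and the calculation assumes the instruments fail independently, which common causes (power loss, a bad reagent lot, the same operator) would violate.

```python
def prob_all_fail(p_single: float, n_instruments: int) -> float:
    """Probability that every one of n instruments is down at the same time.

    Assumes each instrument fails independently with the same probability
    p_single (an illustrative simplification, not a measured value).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be between 0 and 1")
    if n_instruments < 1:
        raise ValueError("need at least one instrument")
    return p_single ** n_instruments


# Hypothetical input: each instrument is unavailable 1% of the time.
for n in range(1, 5):
    print(n, "instrument(s):", prob_all_fail(0.01, n))
```

Each added instrument lowers the estimate, but the result never reaches zero, so the stopping point has to come from a cost versus risk decision rather than from the arithmetic.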


IHI revisited with respect to FMEA

August 20, 2018

Some years ago, I suggested that the FMEA tool in use at the Institute for Healthcare Improvement was not very good. So I went back to see if anything had changed. The answer is no.

Somewhere on the IHI site, you can find (you probably have to be logged in) a success story: East Alabama Medical Center Opelika, Alabama, USA. If you click on this link you will see that the “RPN” for a chemotherapy medication process has been greatly reduced. Here’s the problem:

To review: in a FMEA process, one lists potential failure modes as:

Failure Mode – a description of the failure
Cause – a description of the cause
Effects – a description of the effect

For each item, one lists the following in a traditional FMEA

Probability – the likelihood that the failure will occur
Severity – the consequences should the failure occur

Probability and severity are each given numerical values on a scale of 1-10 (typically) and multiplied together to get a risk. All of this is explained in the ISO document on risk management for medical devices ISO 14971.

There are certain features of FMEA which readily become apparent. The failure modes with the highest severity usually have the lowest probability of occurrence (10×1) = 10. If one concentrates on the total of these high severity low probability items, one sees that there is no way to reduce the number – severity will always be at 10 and probability is already at 1.

IHI introduces a third term, R, the likelihood that the failure will not be detected, and gives this a number on the 1-10 scale, so the “RPN” is the product of three values. This is totally bogus because detection is already contained in the likelihood of occurrence. But with the IHI scheme, one can now get a reduction in R, and hence a reduction in the total value, and claim that the process has been improved.
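
A small sketch of the arithmetic, using made-up scores for a single high severity, low probability failure mode. The function names and the example values (8 and 2 for the non-detection term) are assumptions for illustration and are not taken from the IHI tool or from ISO 14971.

```python
def _check(*scores: int) -> None:
    """Reject scores outside the usual 1-10 scale."""
    for s in scores:
        if not 1 <= s <= 10:
            raise ValueError("scores must be on a 1-10 scale")


def risk_score(severity: int, probability: int) -> int:
    """Traditional two-factor risk: severity x probability."""
    _check(severity, probability)
    return severity * probability


def rpn(severity: int, probability: int, non_detection: int) -> int:
    """Three-factor RPN: severity x probability x likelihood of non-detection."""
    _check(severity, probability, non_detection)
    return severity * probability * non_detection


# Hypothetical failure mode: worst severity (10), lowest probability (1).
print("two-factor risk:", risk_score(10, 1))   # cannot go any lower
print("RPN before:", rpn(10, 1, 8))            # assumed non-detection score of 8
print("RPN after: ", rpn(10, 1, 2))            # only the third term was changed
```

The two-factor score stays at 10 in both cases, while the three-factor number falls from 80 to 20 by changing only the added term.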

Big errors and little errors

May 27, 2018

In clinical assay evaluations, most of the time, focus is on “little” errors. What I mean by little errors are average bias and imprecision that exceed goals. Now I don’t mean to be pejorative about little errors since if bias or imprecision don’t meet goals, the assay is unsuitable. One of the reasons to distinguish between big and little errors is that often in evaluations, big errors are discarded as outliers. This is especially true in proficiency surveys but even for a simple method comparison, one is justified in discarding an outlier because the value would otherwise perturb the bias and imprecision estimates.

But big errors cause big problems and most evaluations focus on little errors, so how are big errors studied? Other than running thousands of samples, a valuable technique is to perform a FMEA (Failure Mode Effects Analysis). This can and should cover user error, software, and interferences, in addition to the usual items. A FMEA study is often not very enthusiastically received but it is a necessary step in trying to ensure that an assay is free from both big and little errors. Of course, even with a completed FMEA, there are no guarantees.


Consulting and loss of control

March 22, 2018

For most of my career, I’ve been either an internal or external consultant. Consultants are always trying to get work by convincing someone that the consultant can solve a problem. Whereas this might seem to be a win / win situation, to some clients, using a consultant can be frightening as the client fears loss of control. Here’s a real example.

At Ciba Corning in the 90s, our instrument reliability was not very good. Our group thought that we could help using data analysis methods. A key success factor was someone we hired – without him the project would not have been successful.

But an additional problem was that engineering management didn’t want us to help. From their perspective, it’s easy to see why. We proposed using some of their staff to collect data, the engineers would have to attend meetings that we ran, we would report to management on instrument reliability, direct which problems to work on, and advise management as to when the instrument reliability goal would be met… All of this amounts to a partial loss of control for the engineering manager.

Of course, losing control was never discussed. Their objections came in the form of resistance. “That’s a great idea, let’s try it on the next project.” Or, “yes let’s do it,” but the first meeting could never be scheduled (i.e., yes meant no). And so on.

The project was allowed to proceed because the engineering manager and I had the same boss, who overruled the engineering manager’s attempts to decline participating.

The project was a big success and the engineering manager’s response was to take back control. With things now in place, he could run the new programs himself. What irked me was that not only did he never credit us for the work, but it was also suggested that the success occurred in spite of our group. But this is simply an example of the successful consulting cycle.

IQCP – waste of time? No surprise

July 30, 2016


Having looked at a blog entry by the Westgards, which is always interesting, here are my thoughts.

Regarding IQCP, they say it’s mostly been a “waste of time”, an exercise of paperwork to justify current practices, with very little change occurring in QC practices.

This is no surprise to me – here’s why.

There are two ways to reduce errors.

FMEA (or similar programs) reduces the likelihood of rare but severe errors.

FRACAS (or similar programs) reduces the error rate of actual errors, some of which may be severe.

Here are the challenges with FMEA

  1. It takes time and personnel. There’s no way around this. If sufficient time is not provided with all of the relevant personnel present, the results will suffer. When the Joint Commission required every hospital to perform at least one FMEA per year, people complained that performing a FMEA took too much time.
  2. Management must be committed. (I was asked to facilitate a FMEA for a company – the meetings were scheduled during lunch. I asked why and was told they had more important things to do). Management wasn’t committed. The only reason this group was doing the FMEA was to satisfy a requirement.
  3. FMEA requires a facilitator. The purpose of FMEA is to challenge the way things are done. Often, this means challenging people in the room (e.g., those who have put systems in place or who manage the way things are done). This can create an adversarial situation where subordinates will not speak up. Without a good facilitator, results will suffer.
  4. The guidance to perform a FMEA (such as EP23) is not very good. Example: Failure mode is a short sample. The mitigation is to have someone examine each tube to ensure the sample volume is adequate. The group moves on to the next failure mode. The problem is that the mitigation is not new – it’s existing laboratory practice. Thus, as the Westgards say – all that has happened is the existing process has been documented. That is not FMEA. (A FMEA would enumerate the many ways that someone examining each sample could fail to detect the short sample).
  5. Pareto charts are absent in the guidance. But real FMEAs require Pareto charts.
  6. I have seen reports where people say their error rate has been reduced after they conducted a FMEA. But there are no error rates in a FMEA (error rates are in a FRACAS). So this means no FMEA was carried out.
  7. And how could anyone say they have conducted a FMEA and conclude that it is OK to run QC monthly?

Here are the challenges with FRACAS

  1. FRACAS requires a process where errors are counted in a structured way (severity and frequency) and reports issued on a periodic basis. This requires knowledge and commitment.
  2. FRACAS also requires periodic meetings to review errors whereby problems are assigned to corrective action teams. Again, this requires knowledge and commitment.
  3. Absence of a Pareto chart is a flag that something is missing (no severity classification, for example).
  4. People don’t like to see their error rates.
  5. FRACAS requires a realistic (error rate) goal.
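
As a rough sketch of the counting and ranking described above, the following tallies logged errors by type and produces the rows behind a Pareto chart (count, then cumulative percent). The error names, severity classes, and the log itself are invented for illustration; a real FRACAS would define its own classification.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def pareto_rows(error_types: Iterable[str]) -> List[Tuple[str, int, float]]:
    """Rank error types by count and add a cumulative percentage.

    Returns (error_type, count, cumulative_percent) rows, most frequent first.
    """
    counts = Counter(error_types)
    total = sum(counts.values())
    rows = []
    running = 0
    for name, n in counts.most_common():
        running += n
        rows.append((name, n, 100.0 * running / total))
    return rows


# Hypothetical error log for one reporting period: (error type, severity class).
log = [
    ("hemolyzed sample", "minor"),
    ("short sample", "minor"),
    ("hemolyzed sample", "minor"),
    ("mislabeled tube", "severe"),
    ("hemolyzed sample", "minor"),
    ("short sample", "minor"),
]

print("All errors:")
for row in pareto_rows(name for name, _ in log):
    print(row)

print("Severe errors only:")
for row in pareto_rows(name for name, sev in log if sev == "severe"):
    print(row)
```

Ranking within each severity class keeps a rare but severe error from being buried under frequent minor ones.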

There are FRACAS success stories:

Dr. Peter Pronovost performed a FRACAS-type approach on placing central lines and dropped the infection rate from 10% to 0 by the use of checklists.

In the 70s, the use of a FRACAS-type approach reduced the error rate in anesthesiology instruments.

And FMEA failures:

A Mexican teenager came to the US for a heart lung transplant. The donated organs were not checked to see if they were the right type. The patient died.

More Comments about IQCP

August 27, 2015


The Westgard website has some comments about IQCP.

Here are mine.

  1. There is no distinction between potential errors and errors that have occurred. This is non-standard. In traditional risk management different methods are used for potential errors vs. errors that have occurred. For example on page 12 of the IQCP book which focuses on specimen risks, “Kim” reviewed log books and noted errors. Yet on the same page, Kim is instructed to ask “What could go wrong.” The problem is that there are clearly errors that have occurred yet there could be potential new errors that have never occurred.
  2. The mitigation steps to reduce errors look phony. For example, an error source is: “Kim noted some specimens remained unprocessed for more than 60 minutes without being properly stored.” The suggested mitigation is: Train testing personnel to verify and document: Collection time and time of receipt in laboratory and proper storage and processing of specimen. The reason the mitigation sounds phony is that most labs would already have this training in place. The whole point of risk management is to put in place mitigations that don’t already exist.
  3. There is no measurement of error rates. Because there is no distinction between potential errors vs. errors that have occurred, there is a missed opportunity to measure error rates. In the real world, when errors occur and mitigations are put in place, the error rate is measured to determine the effectiveness of the mitigations.
  4. The word “Pareto” cannot be found in IQCP. Here is why this is a problem. In IQCP, for each section, a few errors are mentioned. In the real world, for either potential errors or those that have occurred, the number of errors is much larger. So much larger that there are not enough resources to deal with all errors. That is why the errors are classified and ranked (the ranking is often displayed as a Pareto chart). The errors at the top of the chart are dealt with. In the naïve IQCP, there is no need to classify or rank errors because all are dealt with. The same problem occurs in CLSI EP23 and ISO 14971.

Conclusion: One might infer that no one who participated in the writing of IQCP has ever performed actual risk management using standard methods or perhaps any methods.

Six comments about risk management for labs

March 24, 2014


Inspired by a post by Sten Westgard, here is my list on risk management for labs.

  1. One can apply simple risk management to before and after EQC. Before, many patient results were protected from many process faults because twice daily QC would pick up the fault in time for the results to be repeated. After EQC, the risk of reported wrong patient results was higher because there could be a month before a fault was detected. Thus, EQC never made sense.
  2. The comments in Sten’s posting that “it’s up to the lab director” are similar to CLSI statements about requirements in many of their evaluation protocol documents.
  3. The CLSI EP23 document about risk management for the lab was written by a group that was largely untrained in risk management. (This group had high expertise in other areas.) Hence, the document is non-standard with respect to FMEA and fault trees. Moreover, it focuses on analytical faults that have been largely validated by the manufacturer but neglects lab user error.
  4. Hospitals are required (at least they used to be) to perform at least one FMEA per year. In my experience in trying to provide software for this, the hospitals had little interest in actually performing a FMEA. Without guidance, training, and some prescriptive methods, risk management in labs is suspect.
  5. The situation wasn’t much different for in vitro diagnostic manufacturers. I’ve never met an engineer who willingly participated in risk management activities.
  6. The IHI (Institute for Healthcare Improvement) has a method for implementing FMEAs that is almost guaranteed to cause problems since it looks for a numerical reduction in “risk”. Take surgery as an example and I simplify things for illustration. You score severity and probability of occurrence of each event, multiply the severity x probability and add up for all events. For example, wrong site surgery would get severity=5 (the highest), probability=1 (the lowest) for a 5. Waiting more than an hour for an appointment would get severity=1 (the lowest), probability=5 (the highest) for a 5. BUT, in general you can’t change severity, only probability so in this case, you would try to change the appointment process and ignore the wrong site surgery. (The wrong site surgery probability is already at the lowest value of 1.) Your overall number would improve (in this case the initial 10 would be reduced) and you would declare victory. But in spite of the universal protocol (to prevent wrong site surgery), there is still room for improvement, so this IHI program focuses on less severe items and ignores the important ones.
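
The simplified surgery example in item 6 can be written out as a short calculation. The scores for the two events come from the example above (1-5 scale); the improved appointment probability of 2 is an assumed value added only to show the effect.

```python
from typing import Dict, Tuple

# event name -> (severity, probability), each on the simplified 1-5 scale
Scores = Dict[str, Tuple[int, int]]


def total_score(events: Scores) -> int:
    """Sum of severity x probability over all events."""
    return sum(sev * prob for sev, prob in events.values())


before: Scores = {
    "wrong site surgery": (5, 1),
    "wait over an hour for appointment": (1, 5),
}

# Assumed outcome of improving only the appointment process.
after: Scores = dict(before)
after["wait over an hour for appointment"] = (1, 2)

print("total before:", total_score(before))
print("total after: ", total_score(after))
print("wrong site surgery unchanged:",
      before["wrong site surgery"] == after["wrong site surgery"])
```

The total drops from 10 to 7 even though nothing was done about the most severe event.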

What’s needed is training on standard methods in risk management for labs.