Review of why FMEA is so difficult

January 20, 2019

FMEA stands for Failure Mode and Effects Analysis.
FRACAS stands for Failure Reporting and Corrective Action System.

Definitions – FMEA is a process to reduce the risk of undesirable events. FRACAS is a process to reduce the rate of undesirable events. Hence FMEA failures haven’t happened while FRACAS failures have. FMEA failures are often much more severe than FRACAS failures.

Not a FMEA – When someone says they have reduced the rate of failures for a process, they have performed FRACAS not FMEA.

Lack of management support – Companies make money by selling products and services and also by reducing costs. FMEA does neither. It consumes resources while reducing the risk of costs (caused when failures occur). A nonprofit has to worry about the same issues. If they lose money, they will have to reduce services.

The following is not a compelling appeal to management for resources: “There are some catastrophic failures that have never happened. We can spend some money to make these events even less likely.”

I participated in a company FMEA that was always held during lunch. The reason given by management was that they had more important tasks to do during regular working hours.

Lack of a facilitator – The purpose of a FMEA is to question the design of a product or process. Often, the designer is present. A facilitator can prevent an adversarial confrontation.

Lack of interest – People who design medical instruments like to design. It’s a challenge to motivate them to perform tasks other than design.

Insufficient detail – FMEA is a bottom-up approach and requires listing all process steps. (Fault trees are a top-down approach.) Mapping out a process requires inclusion of all relevant events, so providing insufficient detail is a problem. For example, a technician examines a clinical sample before it is analyzed. An additional branch could cover how this person is hired, trained, etc.

Stopping at the status quo – As one conducts a FMEA, one lists the possible failure modes for each step, the effect of each failure, and the mitigation for each failure. Example: a process step is to test whether a sample is hemolyzed, the effect (of a failure) is that potassium is artificially elevated, and the existing mitigation is to rely on the instrument’s automatic flagging of hemolyzed samples. One might conclude that’s it, with no need to do anything further. But all that’s been done is to describe the existing process. This is not FMEA. One must ask questions such as: how can the instrument’s flagging system fail?

Acceptable risk is hard to quantify – One can never have a zero failure risk. For example, for blood gas, which is an emergent assay, one remedy to mitigate against an instrument failure is to use two instruments. One can estimate the probability of both instruments failing at the same time. One can add a third machine and so on, always lowering the risk, but it is never zero. Mitigations cost money so one must make a tradeoff between cost and risk.
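The two-instrument estimate can be sketched in a few lines. This is only an illustration under a strong assumption: the instruments fail independently (no shared power, reagent lot, or operator), so the probability that all are down at once is the single-instrument probability raised to the number of instruments. The downtime probability and the cost per instrument below are hypothetical numbers, not data from any real laboratory.

```python
def prob_all_fail(p_single: float, n_instruments: int) -> float:
    """Probability that every instrument is down at the same time.

    Assumes independent failures, which real instruments sharing power,
    reagents, or operators may not satisfy.
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability between 0 and 1")
    if n_instruments < 1:
        raise ValueError("need at least one instrument")
    return p_single ** n_instruments


P_DOWN = 0.01               # hypothetical: one instrument is down 1% of the time
COST_PER_INSTRUMENT = 50_000  # hypothetical purchase cost

for n in range(1, 5):
    risk = prob_all_fail(P_DOWN, n)
    print(f"{n} instrument(s): cost {n * COST_PER_INSTRUMENT:>7}, "
          f"P(all down) = {risk:.2e}")
```

Each added instrument multiplies the risk by the same factor while the cost grows linearly, and the risk never reaches zero, which is the tradeoff described above.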

Dealing with user error is not new

June 25, 2018

A few blog posts ago, I reported that a committee had suggested that total error include all phases of testing. I had battled unsuccessfully during the revision of EP27 to include user error as an error source.

Back in 1978, or 40 years ago, anesthesiology was beset with many serious injuries and deaths. One of the causes was user error. But as a classic paper showed, many of the user errors were in part caused by bad design of the instrumentation. (In flying, on some planes in the past, the gear and flap levers were next to each other, which resulted in pilots raising the gear instead of the flaps after landing.) So there are ways to change the design to decrease the rate of user error.

The process used in improving anesthesiology was a FRACAS-like process (Failure Reporting and Corrective Action System), although the term FRACAS is not used in the article.

Consulting and loss of control

March 22, 2018

For most of my career, I’ve been either an internal or external consultant. Consultants are always trying to get work by convincing someone that the consultant can solve a problem. While this might seem to be a win-win situation, to some clients, using a consultant can be frightening because the client fears loss of control. Here’s a real example.

At Ciba Corning in the 90s, our instrument reliability was not very good. Our group thought that we could help using data analysis methods. A key success factor was someone we hired – without him the project would not have been successful.

But an additional problem was that engineering management didn’t want us to help. From their perspective, it’s easy to see why. We proposed using some of their staff to collect data, the engineers would have to attend meetings that we ran, we would report to management on instrument reliability, direct which problems to work on, and advise management as to when the instrument reliability goal would be met… All of this amounts to a partial loss of control for the engineering manager.

Of course, losing control was never discussed. Their objections came in the form of resistance: “That’s a great idea, let’s try it on the next project.” Or, “Yes, let’s do it,” but the first meeting could never be scheduled (i.e., yes meant no). And so on.

The project was allowed to proceed because the engineering manager and I had the same boss, who overruled the engineering manager’s attempts to decline participating.

The project was a big success, and the engineering manager’s response was to take back control: with things now in place, he could run the new programs himself. What irked me was that he never credited us for the work, and it was even suggested that the success occurred in spite of our group. But this is simply an example of the successful consulting cycle.

Do it right the first time – not always the best strategy

December 14, 2017

In a remarkable video, wing suit flyers jump into the open door of a descending plane. It appears that they had attempted this feat 100 times before succeeding.

On page four of a document that summarizes the quality gurus Crosby, Deming, and Juran, Crosby’s “Do it right the first time” appears. Clearly, this would have been a problem for the wing suit flyers. Crosby’s suggestion is appropriate if the state of knowledge is high. For the wing suit flyers, there were many unknowns, hence the state of knowledge was low. Because the state of knowledge was meager at Ciba Corning when we were designing in vitro diagnostic instruments, we used the test, analyze, and fix (TAAF) strategy as part of reliability growth management and FRACAS. This sounds like the opposite of a sane quality strategy, but in fact it was the fastest way to achieve reliability goals for our instruments.

IQCP – waste of time? No surprise

July 30, 2016


I have looked at a blog entry by the Westgards, whose blog is always interesting. Here are my thoughts.

Regarding IQCP, they say it’s mostly been a “waste of time”, an exercise of paperwork to justify current practices, with very little change occurring in QC practices.

This is no surprise to me – here’s why.

There are two ways to reduce errors.

FMEA (or similar programs) reduces the likelihood of rare but severe errors.

FRACAS (or similar programs) reduces the error rate of actual errors, some of which may be severe.

Here are the challenges with FMEA

  1. It takes time and personnel. There’s no way around this. If sufficient time is not provided with all of the relevant personnel present, the results will suffer. When the Joint Commission required every hospital to perform at least one FMEA per year, people complained that performing a FMEA took too much time.
  2. Management must be committed. (I was asked to facilitate a FMEA for a company – the meetings were scheduled during lunch. I asked why and was told they had more important things to do). Management wasn’t committed. The only reason this group was doing the FMEA was to satisfy a requirement.
  3. FMEA requires a facilitator. The purpose of FMEA is to challenge the ways things are done. Often, this means challenging people in the room (e.g., those who have put systems in place or who manage the ways things are done). This can create an adversarial situation where subordinates will not speak up. Without a good facilitator, results will suffer.
  4. The guidance to perform a FMEA (such as EP23) is not very good. Example: Failure mode is a short sample. The mitigation is to have someone examine each tube to ensure the sample volume is adequate. The group moves on to the next failure mode. The problem is that the mitigation is not new – it’s existing laboratory practice. Thus, as the Westgards say – all that has happened is the existing process has been documented. That is not FMEA. (A FMEA would enumerate the many ways that someone examining each sample could fail to detect the short sample).
  5. Pareto charts are absent in the guidance. But real FMEAs require Pareto charts.
  6. I have seen reports where people say their error rate has been reduced after they conducted a FMEA. But there are no error rates in a FMEA (error rates are in a FRACAS). So this means no FMEA was carried out.
  7. And how could anyone say they have conducted a FMEA and conclude that it is OK to run QC monthly?

Here are the challenges with FRACAS

  1. FRACAS requires a process where errors are counted in a structured way (severity and frequency) and reports issued on a periodic basis. This requires knowledge and commitment.
  2. FRACAS also requires periodic meetings to review errors whereby problems are assigned to corrective action teams. Again, this requires knowledge and commitment.
  3. Absence of a Pareto chart is a flag that something is missing (no severity classification, for example).
  4. People don’t like to see their error rates.
  5. FRACAS requires a realistic (error rate) goal.
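The counting step in item 1 and the Pareto chart in item 3 can be sketched as a small tabulation. This is a minimal illustration, not a prescribed FRACAS format: the severity classes, their weights, and the logged events are all hypothetical, and weighting each event by severity before ranking is only one of several reasonable ways to combine severity and frequency.

```python
from collections import Counter

# Hypothetical severity classes and weights; a real program would define its own.
SEVERITY_WEIGHT = {"minor": 1, "major": 5, "critical": 25}


def pareto_table(events, weights):
    """Rank error categories by severity-weighted count.

    events: iterable of (category, severity) pairs, one per observed error.
    Returns rows of (category, score, cumulative_percent), largest score first.
    """
    scores = Counter()
    for category, severity in events:
        scores[category] += weights[severity]

    total = sum(scores.values())
    rows, running = [], 0
    # Sort by descending score; break ties alphabetically for a stable order.
    for category, score in sorted(scores.items(), key=lambda kv: (-kv[1], kv[0])):
        running += score
        rows.append((category, score, 100.0 * running / total))
    return rows


# Hypothetical error log for one reporting period.
events = (
    [("short sample", "minor")] * 6
    + [("hemolyzed sample", "major")] * 3
    + [("mislabeled tube", "critical")] * 1
    + [("clot in probe", "minor")] * 2
)

for category, score, cum_pct in pareto_table(events, SEVERITY_WEIGHT):
    print(f"{category:<18} score={score:>3}  cumulative={cum_pct:5.1f}%")
```

In this made-up log, the most frequent error (short sample) is not at the top once severity is counted, which is why a missing severity classification shows up as a missing or misleading Pareto chart.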

There are FRACAS success stories:

Dr. Peter Pronovost performed a FRACAS-type approach on placing central lines and dropped the infection rate from 10% to 0 by the use of checklists.

In the 70s, the use of a FRACAS-type approach reduced the error rate in anesthesiology instruments.

And FMEA failures:

A Mexican teenager came to the US for a heart lung transplant. The donated organs were not checked to see if they were the right type. The patient died.

More Comments about IQCP

August 27, 2015


The Westgard website has some comments about IQCP.

Here are mine.

  1. There is no distinction between potential errors and errors that have occurred. This is non-standard. In traditional risk management, different methods are used for potential errors vs. errors that have occurred. For example, on page 12 of the IQCP book, which focuses on specimen risks, “Kim” reviewed log books and noted errors. Yet on the same page, Kim is instructed to ask “What could go wrong.” The problem is that errors that have clearly occurred are mixed together with potential new errors that have never occurred.
  2. The mitigation steps to reduce errors look phony. For example, an error source is: “Kim noted some specimens remained unprocessed for more than 60 minutes without being properly stored.” The suggested mitigation is: Train testing personnel to verify and document: Collection time and time of receipt in laboratory and proper storage and processing of specimen. The reason the mitigation sounds phony is that most labs would already have this training in place. The whole point of risk management is to put in place mitigations that don’t already exist.
  3. There is no measurement of error rates. Because there is no distinction between potential errors vs. errors that have occurred, there is a missed opportunity to measure error rates. In the real world, when errors occur and mitigations are put in place, the error rate is measured to determine the effectiveness of the mitigations.
  4. The word “Pareto” cannot be found in IQCP. Here is why this is a problem. In IQCP, for each section, a few errors are mentioned. In the real world, for either potential errors or those that have occurred, the number of errors is much larger. So much larger that there are not enough resources to deal with all errors. That is why the errors are classified and ranked (the ranking is often displayed as a Pareto chart). The errors at the top of the chart are dealt with. In the naïve IQCP, there is no need to classify or rank errors because all are dealt with. The same problem occurs in CLSI EP23 and ISO 14971.
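The measurement described in item 3 is simple arithmetic, sketched below. The specimen counts are hypothetical, and the function names are only illustrative. A real program would also ask whether the observed change is larger than period-to-period noise, which this sketch does not attempt.

```python
def error_rate(errors: int, opportunities: int) -> float:
    """Observed error rate: errors divided by opportunities (e.g., specimens)."""
    if opportunities <= 0:
        raise ValueError("opportunities must be positive")
    if not 0 <= errors <= opportunities:
        raise ValueError("errors must be between 0 and opportunities")
    return errors / opportunities


def relative_reduction(rate_before: float, rate_after: float) -> float:
    """Fraction by which the error rate fell (negative if it rose)."""
    if rate_before <= 0:
        raise ValueError("rate_before must be positive to compute a reduction")
    return (rate_before - rate_after) / rate_before


# Hypothetical counts of specimens left unprocessed too long,
# before and after a mitigation was put in place.
before = error_rate(errors=40, opportunities=10_000)
after = error_rate(errors=10, opportunities=8_000)

print(f"before: {before:.4%}  after: {after:.4%}  "
      f"relative reduction: {relative_reduction(before, after):.1%}")
```

Without a count of errors that have actually occurred, neither rate can be computed, so there is no way to tell whether a mitigation did anything.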

Conclusion: One might infer that no one who participated in the writing of IQCP has ever performed actual risk management using standard methods or perhaps any methods.

I’m not an expert in risk management

April 4, 2012

I was at the Quality in the Spotlight conference in Antwerp, Belgium, which as always was enjoyable. There were several talks about risk management, and the CLSI guideline EP23. (My main talk was about error grids). I found it strange that several people referred to me as the expert in risk management. Now I have studied risk management techniques such as FMEA, fault trees, and FRACAS, attended conferences such as RAMS (Reliability and Maintainability Symposium), practiced all of these techniques for years, and consider myself competent, but not an expert.

I think the problem is that many people in clinical chemistry have little knowledge or experience with formal risk management techniques so relatively speaking I appear as an expert to them.

This reminds me of an EP23 phone conference several years ago, where one of the subcommittee members said, “Now let me get this straight, when you’re performing risk management, you’re ….” This person tried to go through the steps of a FMEA much like a person trying to understand football: “So if you make 10 yards, then you keep the ball, right?” Of course, the problem was that this person was a member of the committee, and most of the other members were at a similar knowledge level, yet such committees are generically called committees of experts.

If there is a subcommittee on a statistical topic, then it is understandable that not all committee members are competent in the statistics at hand but risk management is different. There is nothing complicated about risk management – anyone can learn it.

But anyone can also not learn it with the result that a committee can easily go astray. So the CLSI risk management documents are:

EP18A2 – the formal techniques of FMEA, fault trees and FRACAS. The examples in EP18 are poor because no one could contribute real examples, and that is what is needed.

EP23A – is a deviation from the formal techniques, IMHO because no one knew enough about the formal techniques and hence they did what they thought seemed OK. The example was also poor because it was constructed.

And now there is the EP23 workbook – the book to explain the book – always a bad sign, although I have not seen this yet.