John Flaig


Rethinking Failure Mode and Effects Analysis

A better method takes into account economic loss

Published: Wednesday, June 24, 2015 - 14:54

The classic version of FMEA is an engineering tool for quality and reliability improvement through project prioritization. It was formally released by the U.S. government with MIL-P-1629 in 1949 and updated in 1980 as MIL-STD-1629A. The classic FMEA methodology has proven to be a reasonably effective tool for product, service, and process improvement over the years, but it’s by no means optimal.

Each part of the system, subsystem, or component is analyzed for potential failure modes, possible causes, and the possible effects. The possible failure mode is given a rank score from 1 to 10 in three categories: severity, occurrence, and detection. Multiplying these three category ranks together will yield a number called the risk priority number, or RPN, which is between 1 and 1,000. The RPN results are reviewed for each failure mode, and corrective action projects are prioritized based on the RPN (i.e., the higher the RPN, the higher the corrective action priority).
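As a minimal illustration of this arithmetic, the Python sketch below computes an RPN from the three rank scores. The function name and the rank values are hypothetical, chosen only to show the calculation, not taken from any real analysis.

```python
# Minimal sketch of the classic RPN calculation: the product of three
# 1-10 rank scores for severity, occurrence, and detection.

def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Return the classic risk priority number (an integer from 1 to 1,000)."""
    for rank in (severity, occurrence, detection):
        if not 1 <= rank <= 10:
            raise ValueError("each rank score must be between 1 and 10")
    return severity * occurrence * detection

# Hypothetical failure mode: severe (7), occasional (3), moderately detectable (5)
print(rpn(severity=7, occurrence=3, detection=5))  # 105
```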

Below we will explore some of the deficiencies in classic FMEA to see where improvements might be made.

Issue 1: Bad math
The rank scores used in FMEA as measures of severity (S), occurrence (O), and detection (D) are subjectively generated ordinal numbers. Multiplying ordinal scores together is an invalid mathematical operation because multiplication assumes that a distance metric is defined on the space, and for rank scores no such metric exists (see Donald Wheeler’s article “Problems With Risk Priority Numbers”). This means that the resultant RPN score has no invariant meaning; there’s no distance metric that can be defined on rank scores because the “distance” between the scores is purely subjective.

Issue 2: RPN prioritization ambiguity
When analyzing a system, you can get several failure modes with exactly the same RPN, even though common sense tells you they should have different corrective action priorities. For example, RPN(10, 1, 9) = RPN(1, 10, 9) = 90, but are they of equal priority? Hence, some practitioners have serious concerns that one factor may be more important than another. For example, does (10, 1, 9) have a higher corrective action priority than (1, 10, 9) because it has a higher severity score, even though the other failure mode has a higher occurrence rate?
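To see how common such ties are, the short sketch below (a hypothetical illustration, not part of the original analysis) enumerates every (S, O, D) combination that produces an RPN of 90.

```python
# Count the distinct (S, O, D) rank combinations that all yield RPN = 90.
from itertools import product

target = 90
collisions = [(s, o, d) for s, o, d in product(range(1, 11), repeat=3)
              if s * o * d == target]

print(len(collisions))                                      # 21 distinct rank profiles tie at RPN = 90
print((10, 1, 9) in collisions, (1, 10, 9) in collisions)   # True True
```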

Issue 3: The weakness of detection
Detection is really composed of two components: control and containment. Controls are tools and techniques to prevent the failure from being created, and containment consists of tools and techniques to prevent the failure from going downstream or out to the customer. However, the meaning of the word “detection” in common discourse seems to be linked more to inspection (containment) than to prevention (control). So when asked to think about reducing risk by improving detection, many FMEA teams think only about containment techniques and not at all about control. Breaking detection into its two components forces the team to think about both issues and thus ensures a more robust analysis, leading to greater failure rate reduction.

Issue 4: Lack of independence
The factors in the RPN model should be independent. However, when we examine the model, we find that D = f1(S), O = f2(D), and D = f3(C1, C2), where C1 and C2 denote the control and containment components of detection. For example:
• A very severe scratch is easier to detect than a light one.
• The performance of the detection system determines our estimate of the occurrence frequency.
• The level of control and containment activity establishes the detection capability.

Issue 5: Model problems
The RPN model claims to estimate risk, but the form of the equation doesn’t appear to do this in a commonsense way. The criticality of the failure is given by S x O, and then to get the RPN, we multiply by D to estimate engineering “risk.” This doesn’t make sense, because we should actually be dividing by the complement of D, denoted DC (i.e., DC = 11 – D). The actual risk estimation equation should be S x O/DC.

For example, if there’s no control or containment, then D = 10 and DC = 1, from which we see that the engineering risk is S x O/1. Now, if changes are made to improve detection so that, say, D = 5, then DC = 6 and the risk estimate becomes S x O/6, reducing the engineering “risk” to reflect the effect of the mitigation. As D goes from 10 to 1, DC goes from 1 to 10. So if D = 1, then DC = 10, and the estimated risk is reduced to one-tenth its former value. This formula seems like a much more realistic way to estimate risk and risk reduction.
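A small sketch of this adjusted formula, using hypothetical severity and occurrence ranks, shows how the risk estimate shrinks as detection improves:

```python
# Sketch of the proposed engineering-risk formula S x O / DC, where DC = 11 - D.

def adjusted_risk(severity: int, occurrence: int, detection: int) -> float:
    dc = 11 - detection              # D = 10 (no detection) -> DC = 1; D = 1 -> DC = 10
    return severity * occurrence / dc

# Hypothetical failure mode with S = 8 and O = 6
print(adjusted_risk(8, 6, detection=10))  # 48.0 -- no control or containment
print(adjusted_risk(8, 6, detection=5))   # 8.0  -- detection improved, risk divided by 6
print(adjusted_risk(8, 6, detection=1))   # 4.8  -- risk reduced to one-tenth of its former value
```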

Issue 6: Economic risk
As indicated above, there are many issues with the classic RPN approach to risk assessment, including the fact that the formula just doesn’t seem to make mathematical and intuitive sense. There are several factors that generate our personal sense of risk, and an important one is the possibility of economic loss. However, as we have seen, the RPN risk metric doesn’t include this component in its formulation. So it’s probably time to consider how FMEA can be improved.

The expected cost of failure is estimated by multiplying the cost of failure by the probability of failure occurrence. Then the economic risk metric is defined by C x P/DC, where C = cost of failure, P = the estimated probability of failure, D = detection, and DC = (11 – D). Recall that the engineering risk (RPN) will be an integer between 1 and 1,000, whereas the economic risk is the adjusted expected cost of failure and can be any nonnegative dollar amount.
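A brief sketch of the economic-risk calculation follows; the cost, probability, and detection figures are hypothetical, chosen only to illustrate the arithmetic.

```python
# Sketch of the economic-risk metric C x P / DC, with DC = 11 - D.

def economic_risk(cost: float, probability: float, detection: int) -> float:
    dc = 11 - detection
    return cost * probability / dc

# Hypothetical failure: $50,000 cost, 2 percent chance of occurring
print(economic_risk(cost=50_000, probability=0.02, detection=10))  # 1000.0 -- expected cost C x P
print(economic_risk(cost=50_000, probability=0.02, detection=1))   # 100.0  -- strong detection cuts risk tenfold
```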

Improving FMEA

There are two good reasons for switching from the engineering rank scores to economic estimates of failure cost and probability. First, it eliminates the issues of bad math and RPN prioritization ambiguity discussed above. This is because cost and probability are real numbers that have a distance metric defined on them, so they can be meaningfully multiplied together. Also, since the product and quotient of real numbers is a real number, equal economic risk results become far less likely, reducing the priority ambiguity issues.

Second, management will be able to appreciate the value of FMEA much better if it is presented in a language they understand, i.e., the language of finance. For example, if there is no control or containment action applied, then D = 10, DC = 1, and the expected cost of failure is the economic risk estimate: C x P/DC = C x P/1 = C x P. When control or containment actions are taken, the expected cost of failure (i.e., economic risk) is reduced by the effect these actions have on the probability of failure occurrence. This first-order approximation model for economic risk seems to align well with our intuitive sense of how we assess risk in our daily lives.

Of course, the first-order approximation can be improved by replacing the part failure rates, which are assumed to be constant, with age-specific rates. In addition, the cost of failure could be discounted to present value, i.e., risk = E(cost of system failure) discounted to present value. In any case, the economic risk model seems to be a more accurate platform on which to build our FMEA analysis than the classic RPN model.
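As a hedged illustration of that refinement, the sketch below discounts a hypothetical failure cost to present value; the discount rate and time horizon are assumptions made up for the example, not figures from the article.

```python
# Discounting a future failure cost to present value (hypothetical figures).

def present_value(cost: float, rate: float, years: float) -> float:
    """Discount a cost expected 'years' from now at annual discount rate 'rate'."""
    return cost / (1 + rate) ** years

# A $50,000 failure cost expected three years out, discounted at 8 percent per year
print(round(present_value(50_000, rate=0.08, years=3), 2))  # about 39691.61
```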


About The Author


John Flaig

John J. Flaig, Ph.D., is a fellow of the American Society for Quality and is managing director of Applied Technology at www.e-at-usa.com, a training and consulting company. Flaig has given lectures and seminars in Europe, Asia, and throughout the United States. His special interests are in statistical process control, process capability analysis, supplier management, design of experiments, and process optimization. He was formerly a member of the Editorial Board of Quality Engineering, a journal of the ASQ, and associate editor of Quality Technology and Quantitative Management, a journal of the International Chinese Association of Quantitative Management.

Comments

Problems with Occurrence

This was a nice article, John. I especially liked the idea of linking risk to economic consequences. One other problem I have run into over the years lies with the determination of occurrence (O). When evaluating a new process or system, the true rate of failure is unknown. If the team is composed of individuals with prior understanding of similar systems or historic performance, it is easy to estimate the new occurrence rating by examining historic experience. Often, though, the default is to rank occurrence with a high number due to lack of experience. In a risk-averse environment, the result is that high risk is everywhere due to high occurrence or detection numbers. The FMEA then becomes unmanageable and loses any advantage of prioritization.

Very interesting!

I have long shared this unease with the ordinal scales used in FMEA, and in many other risk/contingency plans. You make a great case, and your proposition sounds very workable.

I have used probability estimation instead of the occurrence scale, but had never made the translation you make to get to the "DC" factor for the detection scale.

My one concern is in trying to universally equate severity with cost in dollars. Cost of production, rework, scrap, etc., can probably be estimated with reasonable precision, so boiling things down to purely economic terms probably works well in process FMEA and maybe in project FMEA. For other types of hazard (personnel safety, for example), the calculations to estimate in dollars would get much more complex, I would think.

Safety

If the failure mode is a safety issue, the severity goes to 10 and action must be taken, regardless of occurrence and detection numbers.  The cost calculation is for prioritizing everything else.

The New FMEA

RIP,

Thank you for your very thoughtful comments. I agree with you that estimating the cost of failure can sometimes be difficult. Typically, cost-of-poor-quality reporting is a good start, but as you point out, there are many costs that are unknown and unknowable. Many times people just think about the cost to their own company; however, in many cases the cost to the customer can be several times the cost to the supplier.

John

Prevention Control vs. Occurrence

Hello John:

Please can you further explain the difference between Detection Control (prevention) and Occurrence? I am confused as to what the difference is. I always looked at Detection as the ability to stop escapes and Occurrence as the likelihood of the failure happening, including the effect of prevention controls.

Quality Digest: It would be great if you could link Dr. Wheeler's article on the problems with RPN calculations with Dr. Flaig's article. They complement each other nicely.

Thank you, Dirk

The New FMEA

Dirk,

Detection has two components: control and containment. Controls are tools or techniques that prevent failures from being generated (e.g., poka-yoke, PIDs, control charts, MSA, SOPs, PM, etc.). Containment consists of tools or techniques that prevent failures from going downstream or out to the customer (e.g., inspection, test, etc.). Occurrence is the relative frequency of failures observed (i.e., the probability).

John

Still Confused

Hello Dr. Flaig:

I am still confused. It seems to me that tools or techniques that prevent failures from being generated would affect the relative frequency of failures.

And if failures are being prevented from being generated, how is that related to detecting them? There isn't anything to detect. But the occurrence has been reduced or eliminated.

I am not a fan of the AIAG FMEA method. I am not defending it. I am trying to learn and break through my mental models.

Thank you, Dirk

Never Mind!

Quality Digest: The link to Dr. Wheeler's article is right in Dr. Flaig's article. I have to adjust my Detection score from a 3 to an 8! :-)