The classic version of FMEA is an engineering tool for quality and reliability improvement through project prioritization. It was formally released by the U.S. government with MIL-P-1629 in 1949 and updated in 1980 as MIL-STD-1629A. The classic FMEA methodology has proven to be a reasonably effective tool for product, service, and process improvement over the years, but it's by no means optimal.
Each part of the system, subsystem, or component is analyzed for potential failure modes, possible causes, and the possible effects. The possible failure mode is given a rank score from 1 to 10 in three categories: severity, occurrence, and detection. Multiplying these three category ranks together will yield a number called the risk priority number, or RPN, which is between 1 and 1,000. The RPN results are reviewed for each failure mode, and corrective action projects are prioritized based on the RPN (i.e., the higher the RPN, the higher the corrective action priority).
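The classic RPN calculation described above can be sketched in a few lines. This is a minimal illustration, not a standard implementation; the function name and the example scores are invented for demonstration.

```python
def rpn(severity, occurrence, detection):
    """Classic FMEA risk priority number: S x O x D, each ranked 1-10."""
    for score in (severity, occurrence, detection):
        if not 1 <= score <= 10:
            raise ValueError("rank scores must be integers between 1 and 10")
    return severity * occurrence * detection

# Illustrative failure mode: a moderately severe, occasional, hard-to-miss defect.
print(rpn(7, 4, 3))  # 84 -- the higher the RPN, the higher the priority
```

Because each factor runs from 1 to 10, the product is always between 1 and 1,000, as the article states.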
Below we will explore some of the deficiencies in classic FMEA to see where improvements might be made.
Issue 1: Bad math
The rank scores used in FMEA as measures of severity (S), occurrence (O), and detection (D) are subjectively generated ordinal numbers. Multiplying ordinal scores together is an invalid mathematical operation because multiplication assumes a distance metric is defined on the space, and no such metric exists for rank scores (see Donald Wheeler's article "Problems With Risk Priority Numbers"). The resultant RPN score therefore has no invariant meaning: the "distance" between rank scores is purely subjective, so no distance metric can be defined on them.
Issue 2: RPN prioritization ambiguity
When analyzing a system, you can get several failure modes with exactly the same RPNs, but common sense tells you that they should have different corrective action priorities. For example, RPN (10, 1, 9) = RPN (1, 10, 9) = 90, but are they of equal priority? Hence, some people have serious concerns that one factor may be more important than another. For example, does (10, 1, 9) have a higher corrective action priority than (1, 10, 9) because it has a higher severity score, even though the other failure mode has a higher occurrence rate?
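The ambiguity is easy to demonstrate: two very different failure modes collapse to the same RPN. The scores below are the article's own example; the labels are invented for illustration.

```python
def rpn(s, o, d):
    """Classic RPN: the product of the three 1-10 rank scores."""
    return s * o * d

mode_a = rpn(10, 1, 9)  # catastrophic (S=10) but very rare (O=1)
mode_b = rpn(1, 10, 9)  # trivial (S=1) but very frequent (O=10)
print(mode_a, mode_b)   # both are 90, so classic FMEA ranks them equally
```

The product throws away which factor contributed the magnitude, which is exactly the information a prioritization decision needs.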
Issue 3: The weakness of detection
Detection is really composed of two components: control and containment. Controls are tools and techniques to prevent the failure from being created, and containment comprises tools and techniques to prevent the failure from going downstream or out to the customer. However, the meaning of the word "detection" in common discourse seems to be linked more to inspection (containment) than to prevention (control). So when asked to think about reducing risk by improving detection, many FMEA teams think only about containment techniques and not at all about control. Breaking detection out into its two components forces the team to think about both issues and thus ensures a more robust analysis, leading to greater failure rate reduction.
Issue 4: Lack of independence
The factors in the RPN model should be independent. However, when we examine the model, we find that D = f1(S), O = f2(D), and D = f3(C1, C2). For example:
• A very severe scratch is easier to detect than a light one.
• The performance of the detection system determines our estimate of the occurrence frequency.
• The level of control and containment activity establishes the detection capability.
Issue 5: Model problems
The RPN model claims to estimate risk, but the form of the equation doesn’t appear to do this in a common sense way. The criticality of the failure is given by S x O, and then to get the RPN, we multiply by D to estimate engineering “risk.” This doesn’t make sense because we should actually be dividing by the complement of D, denoted DC (i.e., DC = 11 – D). The actual risk estimation equation should be S x O/DC.
For example, if there’s no control or containment, then D = 10 and DC = 1, from which we see that the engineering risk is S x O/1. Now, if changes are made to increase detection to, say, D = 5, then DC = 6 and the risk estimate becomes S x O/6, thus reducing the engineering “risk” based on the effect of the mitigation. As D goes from 10 to 1, DC goes from 1 to 10. So if D = 1, then DC = 10, and the estimated risk is reduced to one-tenth its former value. This formula seems like a much more realistic way to estimate risk and risk reduction.
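The corrected formula can be sketched directly from the definitions above. This is a sketch of the article's proposal, with illustrative severity and occurrence scores:

```python
def engineering_risk(s, o, d):
    """Article's corrected risk estimate: S x O / DC, where DC = 11 - D."""
    dc = 11 - d
    return s * o / dc

# With no control or containment (D = 10, DC = 1), risk is simply S x O:
print(engineering_risk(8, 6, 10))  # 48.0
# Improving detection all the way to D = 1 (DC = 10) cuts risk to one-tenth:
print(engineering_risk(8, 6, 1))   # 4.8
```

Note how improving detection now lowers the risk estimate, whereas in the classic model a lower D score lowers the RPN only by multiplication, with no complement involved.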
Issue 6: Economic risk
As indicated above, there are many issues with the classic RPN approach to risk assessment, including the fact that the formula just doesn’t seem to make mathematical and intuitive sense. There are several factors that generate our personal sense of risk, and an important one is the possibility of economic loss. However, as we have seen, the RPN risk metric doesn’t include this component in its formulation. So it’s probably time to consider how FMEA can be improved.
The expected cost of failure is estimated by multiplying the cost of failure by the probability of failure occurrence. Then the economic risk metric is defined by C x P/DC, where C = cost of failure, P = the estimated probability of failure, D = detection, and DC = (11 – D). Recall that the engineering risk (RPN) will be an integer between 1 and 1,000, whereas the economic risk is the adjusted expected cost of failure and can be any real number dollar amount greater than or equal to $0.
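The economic risk metric defined above is equally simple to compute. The dollar figure and probability below are invented purely to illustrate the formula:

```python
def economic_risk(cost, probability, detection):
    """Article's economic risk metric: C x P / DC, where DC = 11 - detection."""
    dc = 11 - detection
    return cost * probability / dc

# Illustrative numbers: $50,000 failure cost, 2% chance of occurrence,
# and no control or containment (D = 10, so DC = 1).
print(economic_risk(50_000, 0.02, 10))  # 1000.0 -- the raw expected cost C x P
```

Unlike the integer-valued RPN, this result is a real dollar amount, so ties between failure modes become far less likely.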
Improving FMEA
There are two good reasons for switching from the engineering rank score for severity to the economic estimate of failure cost. First, it eliminates the issues of bad math and RPN prioritization ambiguity discussed above. This is because cost and probability are real numbers that have a distance metric defined on them so they can be meaningfully multiplied together. Also, since the product and quotient of real numbers is a real number, the chance of getting equal economic risk results becomes far less likely, thus reducing the priority ambiguity issues.
Second, management will be able to appreciate the value of FMEA much better if it is presented in the language they understand, i.e., the language of finance. For example, if there is no control or containment action applied, then D = 10, DC = 1, and the expected cost of failure is the economic risk estimate: C x P/DC = C x P/1 = C x P. When control or containment actions are taken, the expected cost of failure (i.e., economic risk) is reduced by the effect these actions have on the probability of failure occurrence. The first order approximation model for economic risk seems to align well with our intuitive feelings of how we assess risk in our daily lives.
Of course, the first order approximation can be improved by adjusting the part failure rates that are assumed to be constants to age-specific rates. In addition, the cost of failure could be discounted to present value, i.e., risk = E (cost of system failure), discounted to present value. In any case, the economic risk model seems to be a more accurate platform on which to build our FMEA analysis than the classic RPN model.
Comments
Very interesting!
I have long shared this unease with the ordinal scales used in FMEA, and in many other risk/contingency plans. You make a great case, and your proposition sounds very workable.
I have used probability estimation instead of the occurrence scale, but had never made the translation you make to get to the "DC" factor for the detection scale.
My one concern is in trying to universally equate severity with cost in dollars. Cost of production, rework, scrap, etc. can probably be estimated with reasonable precision--so boiling things down to purely economic terms probably works well in process FMEA and maybe in Project FMEA. For other types of hazard (personnel safety, for example), the calculations would get much more complex, I would think, to estimate in dollars.
The New FMEA
RIP,
Thank you for your very thoughtful comments. I agree with you that estimating the cost of failure can sometimes be difficult. Typically, Cost of Poor Quality reporting is a good start, but as you point out, there are many costs that are unknown and unknowable. Many times people just think about the cost to their company; however, in many cases the cost to the customer can be several times the cost to the supplier.
John
Prevention Control Vs Occurrence
Hello John:
Please can you further explain the difference between Detection Control (prevention) and Occurrence? I am confused as to what the difference is. I always looked at Detection as the ability to stop escapes and Occurrence as the likelihood of the failure happening, including the effect of prevention controls.
Quality Digest: It would be great if you could link Dr. Wheeler's article on the problems with RPN calculations with Dr. Flaig's article. They complement each other nicely.
Thank you, Dirk
Never Mind!
Quality Digest: The link to Dr. Wheeler's article is right in Dr. Flaig's article. I have to adjust my Detection score from a 3 to an 8! :-)
The New FMEA
Dirk,
Detection has two components: control and containment. Controls are tools or techniques that prevent failures from being generated (e.g., Poka-Yoke, PIDs, control charts, MSA, SOPs, PM, etc.). Containment comprises tools or techniques that prevent failures from going downstream or out to the customer (e.g., inspection, test, etc.). Occurrence is the relative frequency of failures observed (i.e., the probability).
John
Still Confused
Hello Dr. Flaig:
I am still confused. It seems to me that tools or techniques that prevent failures from being generated would affect the relative frequency of failures.
And if failures are being prevented from being generated, how is that related to detecting them? There isn't anything to detect. But the occurrence has been reduced or eliminated.
I am not a fan of the AIAG FMEA method. I am not defending it. I am trying to learn and break through my mental models.
Thank you, Dirk
Safety
If the failure mode is a safety issue, the severity goes to 10 and action must be taken, regardless of occurrence and detection numbers. The cost calculation is for prioritizing everything else.
Problems with Occurrence
This was a nice article, John. I especially liked the idea of linking risk to economic consequences. One other problem I have run into over the years lies with the determination of occurrence (O). When evaluating a new process or system, the true rate of failure is unknown. If the team is composed of individuals with prior understanding of similar systems or historic performance, it is easy to estimate the new occurrence rating by examining historic experience. Often, though, the default is to rank occurrence with a high number due to lack of experience. In a risk-averse environment, the result is that high risk is everywhere due to high occurrence or detection numbers. The FMEA then becomes unmanageable and loses any advantage of prioritization.