William A. Levinson

Risk Management

Replacing the Risk Priority Number

A risk assessment should clearly account for the frequency of exposure

Published: Tuesday, June 20, 2017 - 12:03

“We’ve always done it that way” explains why many suboptimal and even obsolete methods are taken for granted. The range chart for statistical process control (SPC), for example, is somewhat inferior to the sample standard deviation chart, and it is almost certainly a holdover from when all calculations had to be done manually.

It is far easier to subtract the sample’s smallest measurement from the largest one than it is to compute the sample’s standard deviation. Risk priority number (RPN) for failure mode and effects analysis (FMEA) could similarly be a suboptimal approach to risk analysis.

Risk priority number’s inherent drawbacks

The RPN is part of the body of knowledge for many ASQ certifications, and it appears in authoritative textbooks on FMEA and advanced quality planning, as well as in an Automotive Industry Action Group (AIAG) manual. AIAG may, however, be contemplating a replacement. Its nature is currently unknown, but this is a good opportunity to discuss the RPN and some existing alternatives.

The RPN is the product of a failure mode’s severity (S), occurrence (O), and detection (D) ratings, all of which are expressed on 1-to-10 scales with 10 being the worst. In “Problems With Risk Priority,” Donald J. Wheeler correctly points out that the RPN is the product of three ordinal numbers. Ordinal means, for example, that a 10 severity rating (which is reserved for failure modes that endanger human life or result in noncompliance with government regulations) is not 25-percent worse than an 8 severity rating, which means the product or service is inoperable. It is thousands of times worse, because an inoperable product or unavailable service won’t kill anybody, unless the product or service is itself emergency equipment. Severity ratings of 9 and 10 should, however, be contemplated for inoperable fire engines, ambulances, police cars, and similar equipment.

Various fixes have been applied to the severity rating to reflect its ordinal nature. Richard Harpster defines the “legal zone” to include any failure mode with a 9 or 10 severity in “How to Get More Out of your FMEAs.” Such a failure mode requires attention even if the RPN is as low as 9, which is possible with occurrence and detection ratings of 1.

The U.S. Army’s risk assessment matrix (in figure 1) makes it clear that a failure or accident with a catastrophic severity rating can have no less than a medium (M) risk rating on a scale of low, medium, high, and extremely high regardless of the frequency or likelihood of occurrence.



Figure 1: Risk assessment matrix

The issue of ordinal rather than ratio numbers also applies to occurrence ratings. An occurrence rating of 10 is, for example, not twice as bad as one of 5; it is 200 times as bad. The AIAG manual, “Potential Failure Mode and Effects Analysis,” defines a 10 as relating to a 50-percent or higher chance of occurrence, while a 5 indicates 1 in 400. Figure 2 shows the failure probabilities versus occurrence ratings, both from the AIAG manual, on a semi-logarithmic plot to emphasize what the occurrence ratings really mean.
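The 200-to-1 ratio above can be checked directly from the two quoted probabilities. A minimal sketch (the two-entry table below contains only the values cited in the text, not the full AIAG rating table):

```python
# Occurrence ratings mapped to the failure probabilities quoted above.
# Only ratings 10 and 5 are shown; the full AIAG table has all ten levels.
occurrence_prob = {10: 0.5, 5: 1 / 400}

# A rating of 10 is not "twice as bad" as a 5; it is 200 times as likely.
ratio = occurrence_prob[10] / occurrence_prob[5]
print(ratio)  # 200.0
```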


Figure 2: Failure probabilities vs. occurrence ratings

Return to the basic definition of risk

This suggests a return to the basic definition of risk, which is simply the chance of a negative occurrence multiplied by its consequences. Health insurers, for example, use methods similar to reliability statistics to estimate how many people in a given age group will develop different diseases (failure mode), multiply by the expected cost of treating each disease, and come up with a total cost. If there are n possible diseases, each with probability pi and cost Ci, then:

Expected cost = p1C1 + p2C2 + … + pnCn

The same concept applies to automobile insurance and homeowner’s insurance where, for a sufficiently large population, the expected cost per insured vehicle and home can be estimated. Quantitative criticality analysis also breaks down risk by categories where appropriate.
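The insurers’ expected-cost calculation can be sketched in a few lines; the disease labels, probabilities, and treatment costs below are invented purely for illustration:

```python
# Expected-cost risk calculation in the insurers' style described above:
# total risk = sum over failure modes of (probability x consequence cost).
# All figures below are invented for illustration.
modes = [
    ("disease_a", 0.02, 15_000.0),    # (failure mode, p_i, C_i)
    ("disease_b", 0.005, 80_000.0),
    ("disease_c", 0.001, 250_000.0),
]

expected_cost = sum(p * c for _, p, c in modes)
print(f"expected cost per insured person: ${expected_cost:,.2f}")  # $950.00
```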

Quantitative criticality analysis

As described in government technical manual TM 5-698-4, “Failure Modes, Effects and Criticality Analysis (FMECA) for Command, Control, Communications, Computer, Intelligence, Surveillance, and Reconnaissance (C4ISR) Facilities,” quantitative criticality analysis uses a fairly sophisticated assessment of the likelihood of occurrence. The failure mode criticality number is:

Cm = β α λp t

where:

• α = Failure mode ratio, or modal probability. This is the chance that a failure will take the form of a given failure mode. A failure in an air handler, for example, may result in its delivery of too much air, too little air, or no air at all. The three modal probabilities must, of course, add to one (100 percent).
• β = Conditional probability of the current failure mode’s failure effect. Beta represents “the conditional probability or likelihood that the described failure effect will result in the identified criticality classification, given that the failure mode occurs.” A generator shutdown will always result in loss of electrical power, in which case beta equals 1. If, however, there is an 80-percent chance that the generator’s motor is merely degraded and a 20-percent chance that it burns out, these are the respective beta probabilities, which likewise must add to 1.
• λp = Item failure rate, which apparently assumes an exponential distribution with a constant hazard rate
• t = Duration of the applicable mission phase (expressed in hours or operating cycles)

The item criticality number is then:

Cr = Σ Cm (summed over all of the item’s failure modes that share a severity level)

The TM 5-698-4 reference explains explicitly, “The item criticality number is a relative measure of the consequences and frequency of an item failure. This number is determined by totaling all of the failure mode criticality numbers of an item with the same severity level. The severity level was determined in the FMEA.”

Separate item criticality numbers are assigned for failure modes with different severity ratings. The reference elaborates, “If an item has three different failure modes, two of which have a severity classification of 3 and one with a classification of 5, the sum of the two ‘failure mode criticality numbers’ (Cm) with the severity classification of 3 would be one ‘item criticality number’ (Cr). The failure mode with the severity classification of 5 would have an ‘item criticality number’ equal to its ‘failure mode criticality number.’”
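The two calculations above can be sketched together, assuming Cm = β·α·λp·t and Cr as the per-severity sum of Cm values. The air-handler failure modes, failure rate, and mission time below are invented for illustration:

```python
from collections import defaultdict

def cm(alpha, beta, lambda_p, t):
    """Failure mode criticality number, per the TM 5-698-4 definitions above."""
    return beta * alpha * lambda_p * t

lambda_p = 1e-4   # item failure rate per operating hour (assumed)
t = 8760.0        # mission duration: one year of continuous operation (assumed)

# (failure mode, alpha, beta, severity classification); alphas sum to 1.0
modes = [
    ("too much air",   0.2, 0.5, 3),
    ("too little air", 0.5, 0.5, 3),
    ("no air",         0.3, 1.0, 5),
]

# One item criticality number (Cr) per severity level, as the manual specifies.
cr = defaultdict(float)
for name, alpha, beta, sev in modes:
    cr[sev] += cm(alpha, beta, lambda_p, t)

for sev, value in sorted(cr.items()):
    print(f"severity {sev}: Cr = {value:.4f}")
```

Note how the two severity-3 modes roll up into one Cr while the severity-5 mode keeps its own, mirroring the quoted example.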

It is particularly noteworthy that the failure mode criticality number accounts for the duration of use, which translates into frequency of exposure to the risk. The U.S. Army Techniques Publication, ATP 5-19 Risk Management (April 2014), which is public domain as a U.S. government publication, similarly underscores the role of frequency of exposure: “Exposure is the frequency and length of time personnel and equipment are subjected to a hazard or hazards.”

Traditional FMEA, on the other hand, defines occurrence ratings for individual items (or actions), but does not account for the number of items to be produced or the number of times a service is to be delivered. Suppose for example that the occurrence rating for “wrong medication,” a 10-severity failure, is 2. This corresponds to 1 in 150,000, which doesn’t sound too bad. If however a hospital administers a million medications annually, multiple 10-severity failures are almost a certainty. A normally-distributed 5-sigma manufacturing process that is centered on its nominal will generate 1 defect or nonconformance per 1.67 million opportunities, which delivers an occurrence rating of 1. If the output involves hundreds of millions of tire valve stems whose failure can endanger vehicle safety, there is nonetheless a serious problem.
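The arithmetic of the wrong-medication example can be made explicit. The sketch below uses only the figures quoted in the paragraph (1 in 150,000 per medication; a million medications per year):

```python
# Exposure turns a "rare" per-event probability into a near-certainty.
p = 1 / 150_000      # per-medication chance of the 10-severity failure mode
n = 1_000_000        # medications administered per year

expected_failures = p * n                  # about 6.7 failures per year
p_at_least_one = 1 - (1 - p) ** n          # assumes independent events

print(f"expected failures per year: {expected_failures:.1f}")
print(f"chance of at least one failure: {p_at_least_one:.4f}")
```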

The takeaway is that jobs that rely on worker vigilance will eventually go wrong regardless of the workers’ skill and commitment. (Whenever a Shigeo Shingo case study says that the job relied on “worker vigilance” to prevent defects, you can be sure that the job produced defects.) This leads in turn to the need for error-proofing or poka-yoke in life-critical applications, and preferably in all critical to quality activities. In addition, a risk assessment should clearly account for the frequency of exposure (in terms of number of parts produced or service actions delivered) or duration of exposure (time or cycles in reliability applications).

The F/N diagram

The F/N diagram from the article “Understand Your Vulnerabilities with Quantitative Risk Analysis” by Neil Prophet (Chemical Engineering Progress, July 2016) is another risk assessment approach; it uses a log-log plot of the annual frequency (F) of events that cause N or more fatalities against the number of fatalities (N). The application to chemical process safety looks primarily applicable to one-time events such as loss of containment of hazardous chemicals, but the takeaway of risk as a function of consequences and likelihood still applies. Tolerability criteria are defined by diagonal lines that account for both frequency (F) and consequences (N), as shown in figure 3.



Figure 3: F/N Diagram


As described in Prophet’s article, a 2 × 10⁻⁵ annual chance of killing one person is considered “low risk” and requires no further study, noting that the mean time between failures (or fatalities) is 50,000 years. On the other hand, the chance of killing 100 people in a given year must be about 10⁻⁸ to qualify as “low risk.”
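A hedged sketch of such a tolerability boundary follows. The two anchor points are the figures quoted above (2 × 10⁻⁵ per year for one fatality, about 10⁻⁸ per year for 100 fatalities); fitting a power law between them, i.e., a straight diagonal on the log-log plot, is an assumption for illustration, not Prophet’s published criterion:

```python
import math

# Anchor points taken from the quoted figures.
f1, n1 = 2e-5, 1      # low-risk threshold for 1 fatality
f2, n2 = 1e-8, 100    # low-risk threshold for 100 fatalities

# Slope of the assumed straight line on a log-log F/N plot.
slope = math.log10(f2 / f1) / math.log10(n2 / n1)

def max_tolerable_frequency(n):
    """Highest annual frequency of an event with n or more fatalities
    that still qualifies as low risk under the assumed boundary."""
    return f1 * n ** slope

print(f"slope: {slope:.2f}")
print(f"F_max(10): {max_tolerable_frequency(10):.2e}")
```

A slope steeper than -1 like this one expresses aversion to large-consequence events: a hundredfold increase in fatalities demands far more than a hundredfold reduction in frequency.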

Do we need a detection rating?

We have seen so far that FMEA is the only risk assessment approach that uses separate occurrence and detection ratings. This leads to the question as to whether detection ratings are even necessary, or if detection methods can be rolled up into the controls whose purpose is to mitigate the failure modes.

It does not matter to an internal or external customer whether poka-yoke or error-proofing controls prevent generation of a defect or a mistake, or jidoka or autonomation intercepts the defect or mistake before it goes to the next operation. It is better, of course, to avoid the problem entirely rather than intercept it and possibly stop the line to correct the root cause, but the problem will not reach the customer either way. This suggests that, at least for the purpose of risk assessment, we can define the individual chance of occurrence as the result of the series reliability calculation, “Probability of creation of the defect or nonconformance times the probability of non-detection or escape.”
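This series calculation can be sketched in a few lines; the creation probability and the miss rates of the detection controls below are invented for illustration, and the controls are assumed to fail independently:

```python
# Effective chance a defect reaches the customer:
# (chance it is created) x (chance every detection control misses it).
# All probabilities below are invented for illustration.
p_create = 1e-4                     # chance the defect is generated at all
p_miss_per_control = [0.05, 0.10]   # miss rates of two detection controls

p_nondetect = 1.0
for p_miss in p_miss_per_control:
    p_nondetect *= p_miss           # assumes independent control failures

p_reach_customer = p_create * p_nondetect
print(f"effective occurrence: {p_reach_customer:.2e}")
```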

This suggests that a replacement for RPN should reflect the cost of the failure mode as determined, first, from its severity, as reflected by likely cost and, second, by its frequency of occurrence, which is in turn a function of the individual chance of occurrence and number of chances. The severity can often be quantified from information on warranty costs, customer dissatisfaction and, in the case of actual harm to customers or others, liability costs. This is not to say that a monetary value can be placed on human life, but rather that an extremely high dollar value must be used to quantify the risk. Off-the-shelf methods for quantification of the chance of occurrence include reliability statistics for finished items and process performance indices for product realization, although we always need to recognize impacts of special or assignable causes as well.

Conclusion

The course of action is likely to depend on whether the frequency of occurrence is quantifiable. If only a subjective estimate is available, the Army’s risk assessment matrix looks ideal, and similar approaches appear elsewhere. All cross reference the severity of the failure with the likelihood of occurrence. Remember, however, that we also need to account for the frequency of occurrence; a function of the individual probability of occurrence and the exposure in terms of duration (reliability) or number of items or service actions (opportunities for nonconformity). It is often possible, however, to quantify the probability of occurrence with reliability data or process capability data. In this case, we can define a risk priority metric with basic decision theory, in which risk equals the frequency of occurrence multiplied by the likely consequences.


About The Author


William A. Levinson

William A. Levinson, P.E., FASQ, CQE, CMQOE is the principal of Levinson Productivity Systems P.C. and the author of the book The Expanded and Annotated My Life and Work: Henry Ford’s Universal Code for World-Class Success (Productivity Press, 2013).

Comments

Priority Medium

“Priority-Medium,” action to improve prevention and/or detection controls (or justification on why current controls are adequate) SHOULD be taken.

What should the criterion be for taking action? Why is action taken on some items and not on others? What criteria lie behind the word SHOULD? When action is a must and when it is not is not very clear to me.

Thank you!

Risk Priority

DD2977, Deliberate Risk Assessment Worksheet, presents only the ratings, and not what criterion should be used to make action mandatory. The ratings, like RPNs, allow the organization to prioritize the activities. Action must obviously be taken on Extremely High and High risks, and the corresponding principle for FMEA, that action must be taken on 9 and 10 severity failure modes--the "legal zone" defined by Harpster--suggests that action must be taken on a Medium risk that comes from a Catastrophic severity and an Unlikely probability rating (just as an RPN from Severity 10, Occurrence 2, and Detection 1, for a total of 40, would require attention). In any event, a Medium risk rating would take priority over a Low one, noting especially that no Catastrophic failure mode can get a Low rating.

Replacing the RPN article very well done

Bill; you have done some extensive research into the different approaches to risk for this article. I can tell you put a lot of work into providing your readers with good information. Thank you. 

Critical Issue Missing From Specified Approach

When managing risk, it is critical to consider the product's intended use.  As an example, I have recently worked on both spinal implants and a late-stage cancer treatment.  In both cases, there were five categories of harm considered: Death, Permanent Injury, Injury Requiring Medical Attention, Injury Not Requiring Medical Attention, and Inconvenience or Temporary Discomfort. Because of the differences in intended use, the definition of acceptable risk was different.  Although attempts were made to reduce all risk, the threshold of acceptable risk was much higher for the late-stage cancer treatment than for the spinal implant.

The combination of the Severity/Occurrence Risk Matrix and Intended Use should be used to determine what must be worked on and acceptable levels of risk.  The resultant strategy is known as a company's "Risk Policy" for the product.

Misquoting of "How to Get More Out of your FMEAs"

In the article you state “Various fixes have been applied to the severity rating to reflect its ordinal nature.  Richard Harpster defines the “legal zone” to include any failure mode with a 9 or 10 severity in “How to Get More Out of your FMEAs”.

The statement misrepresents what the article actually said.  The article states “Those who use FMEAs need to learn that the class value is more important than the RPN.  The class value is derived from the severity/occurrence matrix shown in Figure 2 below, puts the RPN into its proper perspective.”  When I wrote the article in 1999, the AIAG FMEA Manual 2nd Version was in effect.  In the manual, the occurrence rating of a 1 indicated a failure of at least 1 in 1.5 million.  If a company was designing automobiles and 1 out of every 1.5 million of their customers was going to be severely injured or die due to a known design issue, I believe most people would identify this combination of severity of harm and probability of exposure to indicate an unacceptable level of risk.  If one were to look at it from a financial perspective, I am quite sure the financial penalties would be quite high in a court of law if it came out in testimony that an automotive company was knowingly shipping vehicles with a design flaw that they believed would result in the death or injury at a probability of 1 in 1.5 million.  Consequently, given the limitation of using the Occurrence Rating Table in the AIAG FMEA Manual 2nd Version, the paper put the combination of Severity of 9 or 10 and Occurrence of 1 in the “legal zone” indicating unacceptable risk.  If you had included a copy of the entire matrix from the 1999 article people would be able to see the important relationship that the article defined between severity of harm, probability of exposure to harm and risk.

One of the most important improvements in the AIAG 4th Edition FMEA manual was the changing of the definition of the Occurrence Rating of 1 in the Design and Process FMEA from a numerical value of “</= .01 per 1 thousand” to the statement that “The failure is eliminated through prevention control”.  This allowed the creator of the Design and/or Process FMEA to properly identify the level of harm as a Safety or Regulatory Issue while assigning an occurrence rating of 1 that indicated that despite the high level of harm the risk is acceptable because of the actions the company had taken to reduce the probability of exposure.

 

One should never use the Severity Rating alone to decide what must be worked on in any Design or Process FMEA.

 

Richard Harpster

Class Matrix

Richard,

The entire class matrix is definitely worth attention (at the bottom of https://www.qualitydigest.com/june99/html/body_fmea.html). The takeaway I got from it is, in particular, that anything within the legal zone requires attention regardless of the occurrence rating, which is also the takeaway from the Army's Risk Management process; anything with a Catastrophic severity cannot have less than a medium risk level even if the occurrence rating is the best possible. The Warranty Zone, which I didn't mention, is definitely worth attention as well.

The point you make above about a design flaw with a 1 in 1.5 million chance of killing somebody being unacceptable also ties in with mine about the need to account for frequency of exposure to the risk.  A 1 in 1.5 million chance of death, on a 1-time basis, might be considered acceptable as shown by the F/N diagram (e.g. 2E-5 of causing one death in any given year with a MTBF of roughly 625 times the human life expectancy) but, when multiplied by thousands or millions of opportunities (e.g. several million vehicles), it is not.