Reliability activities serve one purpose: to support better decision making. That is all they do. Reliability work may reveal design weaknesses, which we can decide to address. Reliability work may estimate the longevity of a device, allowing decisions when compared to objectives for reliability.

Creating a report that no one reads is not the purpose of reliability. Running a test or analysis to simply “do reliability” is not helpful to anyone. Anything with MTBF (mean time between failures) involved... well, you know how I feel about that.

A common problem in engineering work is the desire to solve the wrong problem. I know I am guilty of working on the issues that hold my interest rather than on the challenges requiring action. A Type III error is when you solve the wrong problem.

We only have so much time and so many resources for reliability work. There are plenty of challenges and interesting aspects of sorting out the reliability of a system. However, it is the focus on solving the right problems that matters. It is solving the right issues that provides value to the team and organization.

Mean time between failures (MTBF) is a symptom of a bigger problem. It’s possibly a lack of interest in reliability (which I doubt is the case). Or it’s a bit of fear of reliability.

Many shy away from the statistics involved. Some simply don’t want to know the currently unknown. It could be the fear of potential bad news that the design isn’t reliable enough. Some don’t care to know about problems that will require solving.

Whatever the source of the uneasiness, you may know one or more co-workers who would rather not deal with reliability in any direct manner.

Maybe not directly, yet the symptoms seem to be there: a mindset of avoidance concerning the topic, a lack of focus on understanding or improving reliability, the dismissal of estimates or test results, and the rush to “put right” any life-limiting problems.

The general desire to move on or away from detailed discussions concerning reliability is a clue. This may be difficult for reliability professionals to grasp because we tend to enjoy understanding failure mechanisms. We tend to work to estimate or analyze reliability performance. It is what we do.

MTBF use and thinking is still rampant. It affects how our peers and colleagues approach solving problems, and there is a full range of problems that come from using the “mean time between failures” (MTBF) metric.

So, how do you spot the signs of MTBF thinking even when MTBF is not mentioned? Let’s explore some approaches that you can use to ferret out MTBF thinking and move your organization toward making informed decisions concerning reliability.

Really, it is just that simple. Ask what it is you really mean about durability or how long the item should work or something similar. If someone asks for MTBF, they often are interested in the probability of failure over some duration within some set of conditions.

Asking for MTBF provides an inverse of the average failure rate—not at all what they may really have wanted to know.

If they really want the inverse of the average failure rate, ask them why. What decision are they going to make using that information? Is knowing MTBF the right information to inform the pending decision? If not (and MTBF, as you know, is not generally informative at all), suggest using reliability (i.e., the probability of survival over a time period).
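To make the gap between MTBF and reliability concrete, here is a minimal Python sketch. It assumes a constant failure rate (the exponential model), which is the assumption baked into most MTBF use and is rarely valid in practice. The function name and the 50,000-hour value are illustrative only.

```python
import math


def reliability_from_mtbf(mtbf_hours: float, duration_hours: float) -> float:
    """Probability of surviving duration_hours.

    ASSUMPTION: constant failure rate (exponential model), so
    R(t) = exp(-t / MTBF). Real products rarely behave this way.
    """
    return math.exp(-duration_hours / mtbf_hours)


mtbf = 50_000.0  # hypothetical MTBF in hours, for illustration only

for t in (1_000.0, 10_000.0, mtbf):
    r = reliability_from_mtbf(mtbf, t)
    print(f"R({t:>8,.0f} h) = {r:.3f}")
```

Under this assumption, only about 37 percent of units are expected to survive to the MTBF itself, which is seldom what the person asking for MTBF has in mind.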

The term “Weibull” in some ways has become a synonym for reliability. Weibull analysis = life data (or reliability) analysis. The Weibull distribution has the capability to describe a changing failure rate, which is lacking when using just mean time between failures (MTBF). Yet, is it suitable to use Weibull as a metric?

Use reliability, the probability of successful operation over a defined duration. This typically includes a defined environment as well. It’s the definition of reliability as we use it in reliability engineering.

Instead of saying, “We want a 50,000-hour MTBF for the new system,” say instead, “We want 98 percent to survive two years of use without failure.” Be specific and include as many couplets of probability and duration as is necessary and useful for your situation. For example, you may want to specify that 99.5 percent should survive the first month of use, and that 95 percent should survive five years of use.
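The probability-and-duration couplets above can be written down as data and checked against a candidate life model. The sketch below is one way to do that, not a prescribed method. It assumes around-the-clock use (8,760 hours per year) and a Weibull life model; the helper names and the candidate parameters are hypothetical.

```python
import math

HOURS_PER_YEAR = 8_760  # ASSUMPTION: continuous, around-the-clock use


def weibull_reliability(t: float, shape: float, scale: float) -> float:
    """Weibull survival function R(t) = exp(-(t / scale) ** shape)."""
    return math.exp(-((t / scale) ** shape))


# (duration in hours, minimum fraction surviving), from the examples in the text
goals = [
    (HOURS_PER_YEAR / 12, 0.995),  # first month
    (2 * HOURS_PER_YEAR, 0.98),    # two years
    (5 * HOURS_PER_YEAR, 0.95),    # five years
]


def check_goals(shape: float, scale: float, goals):
    """Return (duration, required, predicted, met) for each couplet."""
    results = []
    for duration, required in goals:
        predicted = weibull_reliability(duration, shape, scale)
        results.append((duration, required, predicted, predicted >= required))
    return results


# Hypothetical candidate: constant failure rate (shape = 1), 50,000-hour MTBF.
for duration, required, predicted, met in check_goals(1.0, 50_000.0, goals):
    print(f"{duration:>8,.0f} h: need {required:.3f}, "
          f"model gives {predicted:.3f}, met={met}")
```

With these assumptions, a 50,000-hour MTBF falls short of every couplet, which shows how little the single number says about the stated goals.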

Weibull, lognormal, normal, exponential, and many others are names of statistical distributions. They are formulas that describe the pattern formed by time-to-failure data, repair times, and many other groups or types of data.

Our customers, suppliers, and peers seem to confuse reliability information with mean time between failures (MTBF). Why is that?

Is it a convenient shorthand? Maybe I’m the one confused. Maybe those asking for or expecting MTBF really want to use an inverse of a failure rate. Maybe they aren’t interested in reliability.

MTBF is in military standards. It is in textbooks and journals and component data sheets. MTBF is prevalent.

If one wants to use an inverse simple average to represent the information desired, maybe I have been asking for the wrong information. Given the number of references and formulas using MTBF, from availability to spares stocking, maybe people ask for MTBF because it is necessary for all these other uses.

When someone asks me for the MTBF, I ask them what they want to know.

The standard answer is they want to know the chance an item will survive over some duration. Or they say they want to know the reliability. They ask for MTBF expecting to learn something about an item’s reliability.

A conversation the other day involved how or why someone would use the mean of a set of data described by a Weibull distribution.

The Weibull distribution is great at describing a dataset that has a decreasing or increasing hazard rate over time. Using the distribution, we also do not need to determine the mean time between failures (MTBF), which is not all that useful, of course.
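As a rough illustration of how the Weibull shape parameter controls the hazard rate, the sketch below evaluates the standard Weibull hazard function at a few times. The scale value and the three shape values are arbitrary choices for illustration.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (shape / scale) * (t / scale) ** (shape - 1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


scale = 1_000.0  # hypothetical characteristic life in hours
times = (100.0, 500.0, 1_000.0)

for shape in (0.5, 1.0, 3.0):
    rates = [weibull_hazard(t, shape, scale) for t in times]
    if rates[0] > rates[-1]:
        trend = "decreasing"
    elif rates[0] < rates[-1]:
        trend = "increasing"
    else:
        trend = "constant"
    formatted = ", ".join(f"{r:.6f}" for r in rates)
    print(f"shape={shape}: hazard at {times} h = {formatted} ({trend})")
```

A shape below 1 gives a decreasing hazard rate, a shape above 1 gives an increasing one, and a shape of exactly 1 reduces to the constant rate that MTBF presumes.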

Walking up the stairs today, I wondered if the arithmetic mean of the time-to-failure data, commonly used to estimate MTBF, is the same as the mean of the Weibull distribution. Doesn’t everyone think about such things?

So, I thought, I’d check. Set up some data with an increasing failure rate, and calculate the arithmetic mean and the Weibull distribution mean.

I opened R and using the random number-generating function, rweibull, created 50 data points from a Weibull distribution with a shape (β) of 7 and scale (η) of 1,000.

Here’s a histogram of the data.
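The experiment described above used R's rweibull. A rough Python equivalent, using only the standard library, might look like the following. The seed is arbitrary, so the sample will not match the author's data, and the variable names are illustrative.

```python
import math
import random
import statistics

random.seed(1)  # arbitrary seed so the sketch is repeatable

shape, scale = 7.0, 1_000.0  # beta and eta from the text

# random.weibullvariate(alpha, beta) takes the scale first, then the shape.
times = [random.weibullvariate(scale, shape) for _ in range(50)]

# Arithmetic mean of the simulated time-to-failure data.
sample_mean = statistics.mean(times)

# Mean of the Weibull distribution itself: eta * Gamma(1 + 1 / beta).
weibull_mean = scale * math.gamma(1 + 1 / shape)

print(f"arithmetic mean of sample : {sample_mean:.1f}")
print(f"Weibull distribution mean : {weibull_mean:.1f}")
```

The distribution mean for these parameters is about 935 hours. The sample mean varies from run to run and should land near that value, while neither number says anything about how the failures are spread over time.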

If you have been a reliability engineer for a week or more, or worked with a reliability engineer for a day or more, someone has asked about test planning. The questions often include, “How many samples?” and, “How long will the test take?” No doubt you’ve heard the sample-size question.

What I continue to hear is the mistaken idea that adding another sample extends the effective time the testing represents in normal use. If I have a 1,000-hour test and add another unit, that doesn’t mean the results represent reliability for an additional 1,000 hours of use time.

The problem stems from the exponential distribution, where the chance of failure each hour for each unit is the same. Under that assumption there is no change to the hazard rate over time, so every accumulated unit-hour provides the same information about how the system will behave during any hour of use.

This rarely if ever is true. Hazard rates change as different failure mechanisms evolve, as materials settle or wear, as damage accumulates, and as the environment changes. If you check, you will find that assuming a constant failure rate is invalid for your product or system.
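A small sketch can show why unit-hours are interchangeable only under the exponential assumption. It compares one unit tested for 2,000 hours with two units tested for 1,000 hours each, assuming independent Weibull lifetimes. The scale value and the wear-out shape of 3 are hypothetical.

```python
import math


def prob_no_failures(units: int, hours_each: float,
                     shape: float, scale: float) -> float:
    """P(every unit survives hours_each), ASSUMING independent Weibull lives."""
    return math.exp(-units * (hours_each / scale) ** shape)


scale = 5_000.0  # hypothetical characteristic life in hours

for shape in (1.0, 3.0):  # 1.0 = constant hazard, 3.0 = wear-out
    one_long = prob_no_failures(1, 2_000.0, shape, scale)
    two_short = prob_no_failures(2, 1_000.0, shape, scale)
    print(f"shape={shape}: 1 unit x 2,000 h -> {one_long:.3f}, "
          f"2 units x 1,000 h -> {two_short:.3f}")
```

With a shape of 1, the two test plans are equivalent because only the total unit-hours matter. With a wear-out shape, the two short tests are much more likely to pass, because neither unit ever reaches the hours where the hazard rate has climbed.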

But you know that.

We establish reliability goals and measure reliability performance. Goals and measures can be related; however, they’re not the same, and neither do they serve the same purpose.

Recently, I’ve seen a few statements that seem to confuse the role of statistical confidence when establishing a goal. Thus, I’d like to relate how I think about the difference between goals and statistical confidence, along with how they are related.

Setting any goal provides tangible direction or a meaningful target for a team. A reliability goal is a balance of:

• What the customer expects

• What is technically possible

• An expression of business objectives

A reliability goal establishes the probability that a function will successfully operate over a specific duration, given a specific use and environment. For example, my smart phone will make and receive calls in Northern California with a 99-percent probability of successful operation over two years.
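One way to see that a goal and statistical confidence are separate things is the common zero-failure (success-run) relation, n = ln(1 - C) / ln(R). The goal fixes R; the confidence C only changes how many units must pass a test to demonstrate it. This is a sketch under the assumption that each unit is tested to the full goal duration with no failures allowed; the function name is illustrative.

```python
import math


def success_run_sample_size(reliability: float, confidence: float) -> int:
    """Units needed, all passing, to demonstrate `reliability` at `confidence`.

    ASSUMPTION: zero failures allowed, and each unit sees the full
    goal duration (or an equivalent) under representative conditions.
    """
    return math.ceil(math.log(1 - confidence) / math.log(reliability))


goal = 0.99  # the 99 percent, two-year goal from the example

for confidence in (0.60, 0.90, 0.95):
    n = success_run_sample_size(goal, confidence)
    print(f"goal {goal:.0%} at {confidence:.0%} confidence: {n} units")
```

The goal stays at 99 percent in every row. Only the amount of evidence required changes with the confidence level.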

Spending too much on reliability and not getting the results you expect? Just getting started and not sure where to focus your reliability program? Or, just looking for ways to improve your program?

There’s not one way to build an effective reliability program. The variations in industries, expectations, technology, and the many constraints shape each program. Here are three suggestions you can apply to any program at any time. These are not quick-fix solutions, and neither will you see immediate results, yet each will significantly improve your reliability program and help you achieve the results you and your customers expect.

I recommend not using mean time between failures (MTBF). The same applies to mean time to failure (MTTF), mean time between unscheduled removal (MTBUR), and the many variations of MTXXX that exist. The primary reason is MTBF is not useful. It doesn’t help you and your team make decisions that lead to improving reliability.

The focus should be on balancing what you know about the performance and your other priorities. Although you may have fantastic cost-of-goods data, reliability data are often vague. Don’t cloud that scant information by obscuring what it means using MTBF.

A fault tree analysis (FTA) is a logical, graphical diagram that starts with an unwanted, undesirable, or anomalous state of a system. The diagram then lays out the many possible faults, and combinations of faults, within the subsystems, components, assemblies, software, and parts comprising the system that may lead to the top-level unwanted fault condition.

An FTA shows the many possible cause-and-effect paths to a specific fault condition. For example, a laptop computer may have a top-level fault of not turning on. A few possible causes are a dead battery, faulty power distribution circuitry, or a broken power switch.
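The laptop example can be sketched as a single OR gate: the top-level fault occurs if any one of the listed causes occurs. The probabilities below are invented for illustration, and the calculation assumes the causes are independent, which a real analysis would need to justify.

```python
def or_gate(probabilities) -> float:
    """P(at least one input fault occurs), ASSUMING independent faults."""
    p_none = 1.0
    for p in probabilities:
        p_none *= 1.0 - p
    return 1.0 - p_none


# Hypothetical fault probabilities over some period of interest.
causes = {
    "dead battery": 0.02,
    "faulty power distribution circuitry": 0.005,
    "broken power switch": 0.001,
}

top_event = or_gate(causes.values())
print(f"P(laptop does not turn on) = {top_event:.4f}")

# Rank the causes by their individual contribution.
for name, p in sorted(causes.items(), key=lambda item: item[1], reverse=True):
    print(f"  {name}: {p:.3f}")
```

A fuller tree would add AND gates for faults that must occur together and would break each cause down into its own lower-level faults.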