Content By Fred Schenkelberg

By: Fred Schenkelberg

In 2019, Nicholas W. Eyrich, Robert E. Quinn, and David P. Fessell published an article in the Harvard Business Review titled “How One Person Can Change the Conscience of an Organization.” In it, they discuss how corporate transformations, although assumed to occur from the top down, are actually brought about by middle managers and first-line supervisors. These are the people who can make significant change happen.

The authors look at what it takes for one person to make a significant change within an organization. As reliability or quality professionals, we often have the opportunity to spot needed changes. It is then up to us to tackle those challenges to make the change happen.

Change starts with one person

You, a single person, can initiate and make change happen. You can improve the reliability of your product or system. It starts with having a clear intent and goal concerning the improvement. It also starts with the willingness to speak up about the need to make that change to achieve the improvement.

By: Fred Schenkelberg

There is a type of error that occurs when conducting statistical testing: to work very hard to correctly answer the wrong question. This error occurs during the formation of the experiment.

Despite creating a perfect null and alternative hypothesis, sometimes we are simply investigating the wrong question.

Example of a type III error

Let’s say we really want to select the best vendor for a critical component of our design. We define the best vendor as one whose solution or component is the most durable. OK, we can set up an experiment to determine which vendor provides a solution that is the most durable.

We set up and conduct a flawless hypothesis test to compare the two leading solutions. We can see very clear results. Vendor A’s solution is, statistically, significantly more durable than Vendor B’s solution.

Yet neither solution is durable enough. We should have been evaluating if either solution could meet our reliability requirements instead.


Even if we perfectly answer a question in our work, if it’s not the right question, then the work is for naught.

By: Fred Schenkelberg

Reliability activities serve one purpose: to support better decision making. That is all they do. Reliability work may reveal design weaknesses, which we can decide to address. Reliability work may estimate the longevity of a device, allowing decisions when compared to objectives for reliability.

Creating a report that no one reads is not the purpose of reliability. Running a test or analysis to simply “do reliability” is not helpful to anyone. Anything with MTBF (mean time between failures) involved... well, you know how I feel about that.

The Type III error

A common problem in engineering work is the desire to solve the wrong problem. I know I am guilty of working on the issues that hold my interest rather than on the challenges requiring action. A Type III error is when you solve the wrong problem.

We only have so much time and resources for reliability work. There are plenty of challenges and interesting aspects of sorting out the reliability of a system. However, it is the focus on solving the right problems that matters. It is solving the right issues that provides value to the team and organization.

By: Fred Schenkelberg

Mean time between failures (MTBF) is a symptom of a bigger problem. It’s possibly a lack of interest in reliability (which I doubt is the case). Or it’s a bit of fear of reliability.

Many shy away from the statistics involved. Some simply don’t want to know the currently unknown. It could be the fear of potential bad news that the design isn’t reliable enough. Some don’t care to know about problems that will require solving.

Whatever the source of the uneasiness, you may know one or more co-workers who would rather not deal with reliability in any direct manner.

Is ‘reliaphobia’ a thing?

Maybe not directly, yet the symptoms seem to be there: a mindset of avoidance concerning the topic, a lack of focus on understanding or improving reliability, the dismissal of estimates or test results, and the rush to “put right” any life-limiting problems.

The general desire to move on or away from detailed discussions concerning reliability is a clue. This may be difficult for reliability professionals to grasp because we tend to enjoy understanding failure mechanisms. We tend to work to estimate or analyze reliability performance. It is what we do.

By: Fred Schenkelberg

MTBF use and thinking is still rampant. It affects how our peers and colleagues approach solving problems, and there is a full range of problems that come from using the “mean time between failures” (MTBF) metric.

So, how do you spot the signs of MTBF thinking even when MTBF is not mentioned? Let’s explore some approaches that you can use to ferret out MTBF thinking and move your organization toward making informed decisions concerning reliability.

Ask, ‘What do you really want?’

Really, it is just that simple. Ask what they really mean: durability, how long the item should work, or something similar. If someone asks for MTBF, they often are interested in the probability of failure over some duration within some set of conditions.

Asking for MTBF provides an inverse of the average failure rate—not at all what they may really have wanted to know.

If they really want the inverse of the average failure rate, ask them why. What decision are they going to make using that information? Is knowing MTBF the right information to inform the pending decision? If not, and MTBF, as you know, is not generally informative at all, suggest using reliability (i.e., the probability of successful operation over a defined duration).
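To illustrate why the same MTBF can hide very different answers to “What do you really want?”, here is a minimal Python sketch. It assumes, purely for illustration, that life follows a two-parameter Weibull distribution, that the mean life is 50,000 hours in every case, and that a year is 8,760 hours of continuous operation. The function names and the three shape values are hypothetical choices, not taken from the article.

```python
import math


def weibull_scale_for_mean(mean_life, shape):
    """Scale (eta) that gives a two-parameter Weibull the requested mean."""
    return mean_life / math.gamma(1.0 + 1.0 / shape)


def weibull_reliability(t, shape, scale):
    """Probability of surviving to time t under a two-parameter Weibull."""
    return math.exp(-((t / scale) ** shape))


MEAN_LIFE_HOURS = 50_000  # illustrative: identical "MTBF" in every case
ONE_YEAR_HOURS = 8_760    # assumption: continuous operation

survival_by_shape = {}
for shape in (0.5, 1.0, 3.0):  # decreasing, constant, increasing hazard rate
    scale = weibull_scale_for_mean(MEAN_LIFE_HOURS, shape)
    survival_by_shape[shape] = weibull_reliability(ONE_YEAR_HOURS, shape, scale)

for shape, reliability in survival_by_shape.items():
    print(f"shape={shape}: one-year reliability = {reliability:.3f}")
```

All three cases share one mean, yet the chance of surviving the first year differs widely. That gap is the information lost when the mean is the only number reported.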

By: Fred Schenkelberg

The term “Weibull” in some ways has become a synonym for reliability. Weibull analysis = life data (or reliability) analysis. The Weibull distribution has the capability to describe a changing failure rate, which is lacking when using just mean time between failures (MTBF). Yet, is it suitable to use Weibull as a metric?

What to use instead of MTBF

Use reliability, the probability of successful operation over a defined duration. This typically includes a defined environment as well. It’s the definition of reliability as we use it in reliability engineering.

Instead of saying, “We want a 50,000-hour MTBF for the new system,” say instead, “We want 98 percent to survive two years of use without failure.” Be specific and include as many couplets of probability and duration as is necessary and useful for your situation. For example, you may want to specify that 99.5 percent should survive the first month of use, and that 95 percent should survive five years of use.
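As a rough check on how a 50,000-hour MTBF compares with goals stated as probability and duration couplets, the sketch below evaluates the couplets mentioned above. It assumes a constant failure rate (exponential distribution), which is what an MTBF figure alone implies, and continuous operation at 8,760 hours per year. The helper names are hypothetical.

```python
import math

HOURS_PER_YEAR = 8_760  # assumption: continuous operation


def exponential_reliability(t_hours, mtbf_hours):
    """Survival probability at t_hours under a constant failure rate."""
    return math.exp(-t_hours / mtbf_hours)


# (label, duration in hours, target survival probability) from the text above
goals = [
    ("first month", HOURS_PER_YEAR / 12, 0.995),
    ("two years", 2 * HOURS_PER_YEAR, 0.98),
    ("five years", 5 * HOURS_PER_YEAR, 0.95),
]


def check_goals(mtbf_hours):
    """Return (label, predicted reliability, goal met?) for each couplet."""
    results = []
    for label, duration, target in goals:
        predicted = exponential_reliability(duration, mtbf_hours)
        results.append((label, predicted, predicted >= target))
    return results


for label, predicted, met in check_goals(50_000):
    status = "meets" if met else "misses"
    print(f"{label}: predicted {predicted:.3f}, {status} the goal")
```

Under these assumptions, a 50,000-hour MTBF falls short of every couplet, which is easy to miss when only the MTBF number is on the table.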

Weibull is a distribution, one of many

Weibull, lognormal, normal, exponential, and many others are names of statistical distributions. They are formulas that describe the pattern formed by time-to-failure data, repair times, and many other groups or types of data.

By: Fred Schenkelberg

Our customers, suppliers, and peers seem to confuse reliability information with mean time between failures (MTBF). Why is that?

Is it a convenient shorthand? Maybe I’m the one confused. Maybe those asking or expecting MTBF really want to use an inverse of a failure rate. Maybe they aren’t interested in reliability.

MTBF is in military standards. It is in textbooks and journals and component data sheets. MTBF is prevalent.

If one wants to use an inverse simple average to represent the information desired, maybe I have been asking for the wrong information. Given the number of references and formulas using MTBF, from availability to spares stocking, maybe people ask for MTBF because it is necessary for all these other uses.

What I don’t get is why

When someone asks me for the MTBF, I ask them what they want to know.

The standard answer is they want to know the chance an item will survive over some duration. Or they say they want to know the reliability. They ask for MTBF expecting to learn something about an item’s reliability.

By: Fred Schenkelberg

A conversation the other day involved how or why someone would use the mean of a set of data described by a Weibull distribution.

The Weibull distribution is great at describing a dataset that has a decreasing or increasing hazard rate over time. Using the distribution, we also do not need to determine the mean time between failures (MTBF), which is not all that useful, of course.

Walking up the stairs today, I wondered if the arithmetic mean of the time-to-failure data, commonly used to estimate MTBF, is the same as the mean of the Weibull distribution. Doesn’t everyone think about such things?

So, I thought, I’d check. Set up some data with an increasing failure rate, and calculate the arithmetic mean and the Weibull distribution mean.

The data set

I opened R and using the random number-generating function, rweibull, created 50 data points from a Weibull distribution with a shape (β) of 7 and scale (η) of 1,000.

Here’s a histogram of the data.
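The comparison described above can be sketched in Python rather than R. This is an analogue under stated assumptions, not the author's session: it uses the standard library's random.weibullvariate in place of rweibull, an arbitrary fixed seed, and the generating parameters (β = 7, η = 1,000) for the distribution mean instead of parameters fitted to the sample.

```python
import math
import random

SHAPE = 7.0      # beta, as stated in the text
SCALE = 1000.0   # eta, as stated in the text
N_POINTS = 50

rng = random.Random(2024)  # arbitrary seed so the run is repeatable

# random.weibullvariate takes the scale first, then the shape
samples = [rng.weibullvariate(SCALE, SHAPE) for _ in range(N_POINTS)]

# Arithmetic mean of the simulated times to failure
arithmetic_mean = sum(samples) / len(samples)

# Mean of a two-parameter Weibull: eta * Gamma(1 + 1/beta)
weibull_mean = SCALE * math.gamma(1.0 + 1.0 / SHAPE)

print(f"arithmetic mean of sample: {arithmetic_mean:.1f}")
print(f"Weibull distribution mean: {weibull_mean:.1f}")
```

With only 50 points, the two values will generally be close but not identical. The arithmetic mean is an estimate from the sample, while the second value follows directly from the distribution's parameters.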

By: Fred Schenkelberg

If you have been a reliability engineer for a week or more, or worked with a reliability engineer for a day or more, someone has asked about test planning. The questions often include, “How many samples?” and, “How long will the test take?” No doubt you’ve heard the sample-size question.

What I continue to hear is the mistaken idea that adding another sample extends the effective time the testing represents in normal use. If I have a 1,000-hour test and add another unit, that doesn’t mean the results represent reliability for an additional 1,000 hours of use time.

The legacy of the exponential distribution

The problem stems from the exponential distribution, where the chance of failure each hour for each unit is the same. There is no change to the hazard rate over time; therefore, accumulating more individual hours provides additional information about how the system will behave during any hour of use.

This rarely, if ever, is true. Hazard rates change as different failure mechanisms evolve, as materials settle or wear, as damage accumulates, and as the environment changes. If you check, you will find that assuming a constant failure rate is invalid for your product or system.

But you know that.
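The point about adding samples can be illustrated with a small Python sketch. It assumes independent units whose life follows a two-parameter Weibull distribution with an illustrative scale of 1,000 hours, and it compares the chance of seeing zero failures when two units each run 1,000 hours with the chance when one unit runs 2,000 hours. The function name and parameter values are hypothetical.

```python
import math


def prob_zero_failures(n_units, hours_each, shape, scale):
    """Chance that n independent Weibull units all survive hours_each."""
    return math.exp(-n_units * (hours_each / scale) ** shape)


SCALE = 1000.0  # illustrative Weibull scale (eta), in hours

for shape in (1.0, 3.0):  # constant hazard vs. increasing hazard (wear-out)
    two_units_short = prob_zero_failures(2, 1000, shape, SCALE)
    one_unit_long = prob_zero_failures(1, 2000, shape, SCALE)
    print(
        f"shape={shape}: 2 units x 1,000 h -> {two_units_short:.4f}, "
        f"1 unit x 2,000 h -> {one_unit_long:.4f}"
    )
```

Only when the shape is 1 (the exponential case) do the two tests carry the same information. With an increasing hazard rate, 2,000 unit-hours spread across two units says little about what happens to a single unit between 1,000 and 2,000 hours.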

By: Fred Schenkelberg

We establish reliability goals and measure reliability performance. Goals and measures can be related; however, they’re not the same, and neither do they serve the same purpose.

Recently, I’ve seen a few statements that seem to confuse the role of statistical confidence when establishing a goal. Thus, I’d like to relate how I think about the difference between goals and statistical confidence, along with how they are related.

The purpose of a reliability goal

Setting any goal provides tangible direction or a meaningful target for a team. A reliability goal is a balance of:
• What the customer expects
• What is technically possible
• An expression of business objectives

A reliability goal establishes the probability that a function will successfully operate over a specific duration, given a specific use and environment. For example, my smart phone will make and receive calls in Northern California with a 99-percent probability of successful operation over two years.