Fred Schenkelberg


Sample Size, Duration, and Mean Time Between Failures

Understanding reliability goals

Published: Tuesday, January 3, 2017 - 14:51

If you have been a reliability engineer for a week or more, or worked with a reliability engineer for a day or more, someone has asked you about test planning. The questions often include, “How many samples?” and, “How long will the test take?” No doubt you’ve heard the sample-size question.

What I continue to hear is the mistaken idea that adding another sample extends the effective time the testing represents in normal use. If I have a 1,000-hour test and add another unit, that doesn’t mean the results represent reliability for an additional 1,000 hours of use time.

The legacy of the exponential distribution

The problem stems from the exponential distribution, where the chance of failure each hour is the same for every unit. Because the hazard rate does not change over time, every accumulated unit-hour provides the same information about how the system will behave during any hour of use.

This is rarely, if ever, true. Hazard rates change as different failure mechanisms evolve, as materials settle or wear, as damage accumulates, and as the environment changes. If you check, you will find that assuming a constant failure rate is invalid for your product or system.
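
To make the contrast concrete, here is a minimal sketch comparing a constant hazard rate with hazard rates that fall or rise over time. It assumes a Weibull model, which is my choice for illustration; the shape (beta) and characteristic life (ETA) values are hypothetical and do not come from any product data.

```python
def weibull_hazard(t, beta, eta):
    """Weibull hazard rate h(t) = (beta / eta) * (t / eta) ** (beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)


ETA = 5_000.0  # hypothetical characteristic life, in hours
TIMES = (100.0, 1_000.0, 5_000.0)  # hours at which the hazard rate is evaluated

# beta < 1: decreasing hazard; beta = 1: constant (exponential); beta > 1: increasing
CASES = ((0.5, "decreasing"), (1.0, "constant"), (3.0, "increasing"))

for beta, label in CASES:
    rates = ", ".join(f"{weibull_hazard(t, beta, ETA):.2e}" for t in TIMES)
    print(f"beta = {beta} ({label}): {rates} failures per hour")
```

Only the beta = 1 case gives the same hazard rate at every time, which is the one situation where an hour on one unit tells you as much as any other hour on any other unit.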

But you know that.

Then why, when planning a reliability test, do so many engineers continue to rely on the total time of testing (number of units in test multiplied by the number of hours under test)? They claim:
• It’s in the textbooks—describing how to plan a life test.
• It’s what we always have done.
• It’s in the XYZ standard.
• It’s what the customer requested.

Maybe it’s time to step back and consider what you are trying to accomplish with the life test.

Is the goal reliability or mean time between failures (MTBF)?

In general terms, we conduct testing to learn something. For product-life testing, we generally want to know what will fail or when it will fail. We want to understand if the current design and assembly process will create items that will work as expected, for as long as expected, within our customer’s use conditions. Basically, we want to know how many should survive for, say, two years.

Learning what will fail is often exploratory or discovery-style testing. HALT (highly accelerated life testing), HAST (highly accelerated stress test), STRIFE (stress + life test), margin testing, and similar tests are deliberate work to reveal failure mechanisms. Life testing, accelerated life testing (ALT), duration testing, endurance testing, and similar tests tend to focus on whether the units will survive long enough.

The HALT approach involves finding failure mechanisms and designing them out of the product, thus improving the reliability performance of the product. The ALT approach involves understanding the time-to-failure behavior of failure mechanisms in order to characterize the reliability performance of the product.

Are you just after estimating the MTBF?

Realize that MTBF is not reliability. It is neither a suitable nor a complete description of your product’s probability of survival over a duration within your customer’s use conditions. It’s just a representation of the average time to failure, or the inverse of the average failure rate. That is not very informative when trying to identify or eliminate failure mechanisms or make a decision on reliability performance.
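
A quick calculation shows why MTBF alone says little about survival. The sketch below assumes the constant hazard rate (exponential) model, under which reliability is R(t) = exp(-t / MTBF); the 10,000-hour MTBF is a hypothetical value chosen for illustration.

```python
import math


def exponential_reliability(t, mtbf):
    """Probability of surviving to time t when the hazard rate is constant."""
    return math.exp(-t / mtbf)


mtbf = 10_000.0  # hypothetical MTBF, in hours

for t in (1_000.0, 5_000.0, 10_000.0):
    print(f"R({t:,.0f} h) = {exponential_reliability(t, mtbf):.3f}")
```

Even when the constant hazard rate assumption holds, only about 37% of units are expected to survive to the MTBF value, so an MTBF figure is not the duration most units will last.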

If you only want the MTBF (and I’m not able to dissuade you from this folly), then you can use the idea that adding another unit to your test improves your ability to estimate MTBF. If you are assuming a constant hazard rate, then you are fine focusing only on MTBF.

The results of MTBF-based testing will provide some information on the actual reliability performance of your product. If you run your test for 1,000 hours, then you’ll learn about the reliability performance of the first 1,000 hours of your product’s use. If you run more samples, you’ll improve your understanding of the performance over those 1,000 hours.

In the vast majority of cases, where the assumption of a constant hazard rate is invalid, running 10 units for 1,000 hours doesn’t provide any meaningful information about the reliability performance out to 10,000 hours. The MTBF value may say you have achieved a 10,000-hour MTBF, yet you have really only learned something about the initial 1,000 hours of operation for your product.
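
Here is a rough sketch of how far apart the two views can be. It takes a hypothetical test of 10 units for 1,000 hours with one failure, computes the total-time MTBF estimate, and compares the survival that estimate implies with a wear-out population. The Weibull model and its parameters (BETA, ETA) are assumptions for illustration only, and the total-time calculation is simplified by counting every unit for the full duration.

```python
import math


def total_time_mtbf(units, hours_each, failures):
    """MTBF estimate from total time on test; presumes a constant hazard rate."""
    return units * hours_each / failures


def exponential_reliability(t, mtbf):
    """Survival probability under the constant hazard rate assumption."""
    return math.exp(-t / mtbf)


def weibull_reliability(t, beta, eta):
    """Survival probability under a Weibull model with shape beta and scale eta."""
    return math.exp(-((t / eta) ** beta))


# Hypothetical test result: 10 units, 1,000 hours each, 1 failure observed.
mtbf = total_time_mtbf(units=10, hours_each=1_000.0, failures=1)
print(f"Estimated MTBF: {mtbf:,.0f} hours")

BETA, ETA = 3.0, 5_000.0  # hypothetical wear-out behavior (beta > 1)

for t in (1_000.0, 5_000.0, 10_000.0):
    r_exp = exponential_reliability(t, mtbf)
    r_wear = weibull_reliability(t, BETA, ETA)
    print(f"t = {t:>8,.0f} h  MTBF-based R = {r_exp:.3f}  wear-out R = {r_wear:.4f}")
```

Under these assumed values, both models predict better than 90% survival at 1,000 hours, the only span the test actually covered. At 10,000 hours the MTBF-based figure still predicts about 37% survival, while the wear-out population has almost no survivors.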

Be prepared for proper test planning

In summary, when you hear this concept of test planning based on a constant hazard-rate assumption, help your team understand the error. Help them develop and conduct reliability testing that actually provides meaningful information. Help them create results they can use to make decisions.


About The Author

Fred Schenkelberg

Fred Schenkelberg is an experienced reliability engineering and management consultant with his firm FMS Reliability. His passion is working with teams to create cost-effective reliability programs that solve problems, create durable and reliable products, increase customer satisfaction, and reduce warranty costs. Schenkelberg is developing the site Accendo Reliability, which provides you access to materials that focus on improving your ability to be an effective and influential reliability professional.


Sample size in reliability studies

Fred, good article. Just a comment: Just as in classical statistics, where the precision of the estimate of a distribution parameter is a strong function of sample size, reliability parameter estimates behave the same way. That is, if you are trying to estimate any reliability parameter, be it the reliability at a specified time or the time associated with a specified reliability, the precision of the estimate is strongly affected by the NUMBER OF FAILURES, not by the number of units on test. In some special cases, such as in reliability demonstration tests with zero failures, we settle for a one-sided lower confidence limit on time or reliability (effectively with an infinite confidence interval width), but in general, when a two-sided confidence interval or hypothesis test is required, it's the number of failures that is going to substantially determine the confidence interval half-width or the power of the hypothesis test. This conclusion applies to all reliability distributions, exponential or otherwise. So the first step of planning any reliability study should be the determination of the required precision of the estimate so that the total number of failures required by the test can be determined. Then the number of units to be tested and their time on test follow from that.