“What’s the MTBF of a human?” A bit of a strange question I ask in my Reliability 101 course. Why ask such a weird question? I’ll tell you why. Because MTBF is the worst, most confusing, crappy metric used in the reliability discipline.
OK, maybe that statement is a smidge harsh, but it does have good intentions because the amount of damage done by misunderstanding MTBF is horrendous.
MTBF stands for “mean time between failure.” It is the inverse of failure rate. An MTBF of 100,000 hours/failure is a failure rate of 1/100,000 fails/hour = 0.00001 fails/hour. Those are numbers; what does that look like in operation?
Does it mean:
• The product lasts 100,000 hours before failing?
• Half the population fails by 100,000 hours?
Wait a minute! Our product is only supposed to last three years with a 50-percent duty cycle. That’s 13,140 hours of use. Why would we have an MTBF goal of 100,000 hours? It can’t even run that long if everything goes perfectly.
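Here is a minimal sketch of the arithmetic so far. The function names are my own, and the last step borrows the constant-failure-rate reliability equation that appears later in this article, so treat it as an illustration under those assumptions rather than a full reliability model.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def failure_rate(mtbf_hours: float) -> float:
    """Constant failure rate (failures per hour) is the inverse of MTBF."""
    return 1.0 / mtbf_hours


def use_hours(years: float, duty_cycle: float) -> float:
    """Hours of actual operation over a calendar life at a given duty cycle."""
    return years * HOURS_PER_YEAR * duty_cycle


def reliability(t_hours: float, mtbf_hours: float) -> float:
    """R(t) = exp(-t / MTBF), valid only for a constant failure rate."""
    return math.exp(-t_hours / mtbf_hours)


mtbf = 100_000            # hours, the goal quoted in the text
life = use_hours(3, 0.5)  # three years at a 50-percent duty cycle

print(failure_rate(mtbf))       # 1e-05 failures per hour
print(life)                     # 13140.0 hours
print(reliability(life, mtbf))  # roughly 0.877
```

Read that last number carefully: under these assumptions, a 100,000-hour MTBF says that roughly 88 percent of units get through their 13,140 hours of use without a random failure. It says nothing about any unit running for 100,000 hours.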
Because of all this confusion, everyone makes up their own definition of what MTBF is. It’s usually one of the first two I listed: how long it runs, or when half have failed. No one wants to sound stupid and ask what MTBF means, so we all just pretend. This is also the moral of “The Emperor’s New Clothes.” Well, I’m here to tell you that the dude is buck naked.
That is why I ask the “MTBF of a Human” question. Because it forces a harsh realization when I give the correct answer. The answer is “At least 800 years.” The beauty is that the shared confusion in the room brings a sigh of relief as everyone realizes they weren’t alone in not knowing. “Dude, go put some clothes on. You’re kinda freaking us out, and there are kids here!”
OK, so I gave the answer. The MTBF of a human is 800 years. That’s actually very conservative. In your current lifestyle, it is probably more like 2,000 years. An 800-year MTBF is more indicative of living in some very harsh, old-world conditions. Maybe a coal-mining town in the 1700s.
The group’s surprise that a human can have an 800-year MTBF brings about a new interest in hearing some MTBF 101. So here we go.
When MTBF is used as a metric to describe a product’s reliability during its use life, there are three assumptions:
• The first is that no “infant mortality” numbers (i.e., quality failures) are included in this metric.
• The second is that no wear-out (i.e., end of life) failures are included in this metric.
• The third is failures during use life occur randomly. So in any given moment during use life, a failure is just as likely to occur as at any other moment. For a product with a 10-year life, this means that a random failure is just as likely to occur at three months of age as it is at seven years of age.
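The third assumption is the one that trips people up, so here is a small sketch of it. It compares the chance of failing in the next year for a unit that is three months old and one that is seven years old. The 50-year MTBF is a made-up value for illustration, and the helper names are mine.

```python
import math


def reliability(t: float, mtbf: float) -> float:
    """R(t) = exp(-t / MTBF) for a constant failure rate."""
    return math.exp(-t / mtbf)


def prob_fail_in_window(age: float, window: float, mtbf: float) -> float:
    """P(fail within `window` | unit has already survived to `age`)."""
    return 1.0 - reliability(age + window, mtbf) / reliability(age, mtbf)


MTBF_YEARS = 50.0  # hypothetical value, not from the article

p_young = prob_fail_in_window(0.25, 1.0, MTBF_YEARS)  # three months old
p_old = prob_fail_in_window(7.0, 1.0, MTBF_YEARS)     # seven years old

print(p_young, p_old)  # the two values agree to within floating-point error
```

With a constant failure rate, the age of the unit drops out of the calculation. That is what "random" means here: a unit that has survived seven years is no more and no less likely to fail in the coming year than one fresh out of the box.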
What does that mean for our human example? Let’s frame the question in a meaningful way. Let’s say the individual asking, “What is the MTBF of a human?” is the owner of a coal mine in an isolated 1700s town. Every person in that town works in the coal mine. The owner wants to know how often he can expect people not to show up to work because they are sick or injured, and oh yeah, the sickness or injury “had better kill them dead, if they're not showing up to work.” He doesn’t believe in sick days. He doesn’t care about children or retired people, either. They have nothing to do with the coal mine’s productivity.
The coal mine reliability engineer does a quick calculation and tells the owner that the miners can be expected to have an MTBF of 800 years.
Of course, the owner wants to know what that means in practical terms. The reliability engineer explains that he can expect that over an 800-year period, 63.2 percent of the work population will not show up to work due to death from a random illness or injury. The owner still isn’t sure what that means day to day for his operation. The engineer puts that MTBF into a reliability percentage equation to find the reliability for a one-year period, which is a more useful number for the owner.
“How many employees can we expect to not show up due to death in a one-year period?”
The answer is we can expect a reliability of 99.875 percent for the work population over a one-year period. This translates to about 13 deaths (12.5 on average) for a work population of 10,000 employees. See what I mean about the 800-year MTBF for a human being very conservative? Could you imagine a modern-day university with 10,000 students having 13 students die each year from accident or illness? I’m pretty sure someone would be looking into what is going on at that campus.
Let’s break down how we got from an 800-year MTBF to a probability that in any given year, 13 employees may die. Below is the equation we will use. It is a special case of the Weibull reliability equation, with the assumption that we are dealing with a constant failure rate and no offsets:
R(t) = e^{-t/MTBF}
t = time period for measured reliability
MTBF = mean time between failure
R(t) = reliability at time “t”
If…
t = 1 year
MTBF = 800 years
R(1) = e^{-1/800} = 0.99875
Assumptions:
• We do not include children who die (<13 years of age). Those are infant mortality. In the production world, we would consider these to be quality defects and not a characteristic of the design’s reliability.
• We don’t include retirees. In production these are items that are to be removed from service (retired). The manufacturer has predicted that wear-out failure modes are going to become dominant at some point and that the promised use reliability will no longer be upheld.
• We are not repairing systems.
• Units that fail are immediately replaced with new units that are past the infant mortality stage, so the population is a consistent number.
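The coal mine calculation can be sketched in a few lines. The variable names are mine; the inputs (an 800-year MTBF, 10,000 workers, a one-year period) come from the example above, and the whole thing holds only under the listed assumptions.

```python
import math


def reliability(t_years: float, mtbf_years: float) -> float:
    """R(t) = exp(-t / MTBF), assuming a constant failure rate."""
    return math.exp(-t_years / mtbf_years)


MTBF_YEARS = 800
WORKFORCE = 10_000

# Reliability over a single year, and the expected losses it implies.
r_one_year = reliability(1, MTBF_YEARS)
expected_deaths = WORKFORCE * (1 - r_one_year)

# Fraction of a non-replaced population lost by the time t equals the MTBF.
fraction_failed_at_mtbf = 1 - reliability(MTBF_YEARS, MTBF_YEARS)

print(round(r_one_year, 5))               # 0.99875
print(round(expected_deaths, 1))          # 12.5
print(round(fraction_failed_at_mtbf, 3))  # 0.632
```

The last line is worth a second look. At t = MTBF the exponent is always –1, so about 63.2 percent of units have failed by then no matter what the MTBF value is. It is neither "all of them" nor "half of them," which are the two usual guesses.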
I believe the big shift in understanding what MTBF means is realizing that wear-out failures do not contribute to it. When an item approaches wear-out, it is effectively removed from the MTBF population and replaced with a fresh unit. It never failed. A graphical way to look at this is the “bathtub curve,” below. It is commonly referred to for demonstration of failure rate over life for a population. You can clearly see the three life phases for the population.
There is an infant mortality phase driven by quality defects. This failure rate quickly falls as the defective units fall out of the population and are replaced with new units. We then have the useful life, where there is a constant failure rate. The height of this “failure rate” line is dictated by the MTBF. The higher the MTBF, the lower the line. Remember, failure rate is the inverse of MTBF. The third phase is “wear out,” where predictable failures driven by accumulated stress begin to dominate the population’s behavior.
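One textbook way to picture the three phases is with the Weibull hazard function, where the shape parameter sets whether the failure rate falls, stays flat, or rises. The sketch below is only an illustration: the shape values 0.5 and 3.0 and the evaluation times are assumptions I picked, not numbers from this article. Only the flat case (shape = 1) corresponds to the MTBF equation used above.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard h(t) = (shape / scale) * (t / scale) ** (shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


SCALE = 800.0  # with shape = 1 the scale equals the MTBF (800 years here)

for t in (1.0, 10.0, 40.0):
    infant = weibull_hazard(t, 0.5, SCALE)   # shape < 1: failure rate falls with age
    useful = weibull_hazard(t, 1.0, SCALE)   # shape = 1: constant rate of 1 / MTBF
    wearout = weibull_hazard(t, 3.0, SCALE)  # shape > 1: failure rate rises with age
    print(t, infant, useful, wearout)
```

The middle column prints the same value, 1/800 = 0.00125 failures per year, at every age. That flat line is the only part of the bathtub that MTBF describes. Note that for shapes other than 1, the scale parameter is no longer the mean life, so the first and last columns show the direction of each phase and nothing more.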
For the user it is recommended to remove the units at this point and replace them with new ones. Effectively, we are creating a scenario where the customers’ experience is a flat line with a small wave (i.e., quality-defect introduction of replacement units) in it that goes on forever. The wave can be reduced to almost nothing if quality practices are improved, and products are quality-screened so that defective product never leaves the factory.
Why do I dislike MTBF so much? It’s just too confusing unless you keep in mind all of those assumptions. It’s valuable for statistics because it is a population characteristic that is easily transferable between equations. But for general discussion with individuals not using it for statistics—for designers, marketers, and project managers—it creates more confusion than clarity. It is just better to discuss product performance as percent reliability, failure rate, or availability.
First published March 16, 2020, on the No MTBF blog.
Comments
Very useful and clear, thanks!
"No MTBF" blog link
Hi ... I am teaching a Reliability course in the fall. Thanks for the article; I will be able to use the content in class. I was interested in reading more about this subject on the "No MTBF" blog, but the link doesn't seem to work. Could you repost the address to this blog? Thanks! Diane
Link
Hi Diane
The link is http://nomtbf.com/2020/03/mtbf-of-a-human/