The Law of Large Numbers and Big Data

The assumption behind big data techniques

Michal Balog / Unsplash

Donald J. Wheeler
Mon, 07/21/2025 - 12:03

In statistics class we learn that we can reduce the uncertainty in our estimates by using more and more data. This effect has been called the “law of large numbers” and is one of the primary ideas behind the various big data techniques that are becoming popular today. Here we’ll look at how the law of large numbers works in some simple situations to gain insight into how it will work in more complex scenarios.


When we use a statistic to estimate some property of a process or system, we have to consider the inherent uncertainty in our estimate. And, as theory suggests, this uncertainty will shrink as the amount of data used in our computation increases.

To illustrate this relationship, consider the results of a series of drawings from a bead box containing a mix of red and white beads as shown in Figure 1. A sample of these beads is obtained using a paddle with 50 holes on one side. The number of red beads in a sample of 50 beads is recorded; then the 50 beads are returned to the bead box, and the whole box is stirred up before the next sample is drawn.


Figure 1: My bead box and sampling paddle
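For readers who want to replicate the drawings without a bead box, the experiment is easy to simulate. The sketch below (in Python) is a minimal, idealized model: it assumes a perfectly stirred box of 4,800 beads with 10% red, the census counts reported later in this column, and it ignores the mechanical sampling issues discussed below.

    import random

    # Idealized bead-box simulation: 4,800 beads, 10% red (the census counts
    # reported later in this column). A "paddle" draw takes 50 beads without
    # replacement; the beads are then returned and the box is re-stirred,
    # which here simply means drawing again from the full list.
    BOX_SIZE, RED_COUNT, SAMPLE_SIZE = 4800, 480, 50
    box = [1] * RED_COUNT + [0] * (BOX_SIZE - RED_COUNT)   # 1 = red, 0 = white

    def draw_sample(rng=random):
        """Return the number of red beads in one sample of 50."""
        return sum(rng.sample(box, SAMPLE_SIZE))

    # Ten samples, as in the first experiment described below
    reds = [draw_sample() for _ in range(10)]
    print(reds, "total red beads:", sum(reds), "out of", 10 * SAMPLE_SIZE)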

After my first 10 samples, I’d found a total of 65 red beads while looking at 500 beads. This gives a point estimate for the proportion of red beads in the bead box of

65/500 = 0.130

and the usual 90% interval estimate for the proportion of red beads in the box is:

0.130 ± 1.645 √[ 0.130 (0.870) / 500 ] = 0.130 ± 0.0247

It’s the uncertainty of ±0.0247 that we want to track as the number of samples increases. In the next 10 samples, I found 43 red beads out of 500 beads sampled. Combining the results from both experiments, we have a point estimate for the proportion of red beads in the bead box of

108/1000 = 0.108

and the usual 90% interval estimate for the proportion of red beads is:

0.108 ± 1.645 √[ 0.108 (0.892) / 1000 ] = 0.108 ± 0.0161

As we went from using 10 samples to using 20 samples in our estimate, the uncertainty dropped from ±0.0247 to ±0.0161. With increasing amounts of data, our estimates come to have lower levels of uncertainty. Figure 2 shows the results of 20 repetitions of this experiment of drawing 10 samples of 50 beads each from this bead box. The first column gives the cumulative number of red beads observed. The second column gives the cumulative number of beads sampled. The third column lists the cumulative point estimates of the proportion of red beads in the box. The last two columns list the end points for the 90% interval estimates for the cumulative proportions of red beads. Figure 3 shows these last three columns plotted against the number of the experiment.
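The cumulative computations in Figure 2 can be scripted. The short Python sketch below uses the usual normal-theory interval, p̂ ± 1.645 √[ p̂(1 − p̂)/n ], and reproduces the ±0.0247 and ±0.0161 uncertainties found above; the function name is mine.

    import math

    def interval_90(red_total, beads_total):
        """Cumulative point estimate and the usual 90% interval estimate
        for a proportion: p_hat +/- 1.645 * sqrt(p_hat * (1 - p_hat) / n)."""
        p_hat = red_total / beads_total
        half_width = 1.645 * math.sqrt(p_hat * (1.0 - p_hat) / beads_total)
        return p_hat, p_hat - half_width, p_hat + half_width

    # The first two experiments above: 65 reds in 500 beads, then a
    # cumulative total of 108 reds in 1,000 beads.
    for red, n in [(65, 500), (108, 1000)]:
        p, low, high = interval_90(red, n)
        print(f"n = {n:5d}   p_hat = {p:.4f}   90% interval: {low:.4f} to {high:.4f}")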


Figure 2: 20 bead box experiments


Figure 3: Cumulative proportions of red beads and 90% interval estimates

As we look at Figure 3, we see the point estimates converge on a value near 11% and stabilize there while the uncertainty keeps dropping. This is the picture shown in the textbooks, and it’s the source of the “law of large numbers.” Bigger and bigger data sets result in better estimates of the process parameters.

So here, after inspecting 10,000 beads and finding 1,105 red beads, we might reasonably conclude that the bead box contains about 11% red beads, and our uncertainty for this estimate would be plus or minus 0.50%. This is just what we are taught to do in our statistics classes: collect our data, compute our estimate, and state the uncertainty of that estimate. In practice we can do no more.
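Running the interval_90 sketch above on these totals, interval_90(1105, 10000), gives a point estimate of 0.1105 with a half-width of roughly 0.0052, which matches the ±0.50% quoted here.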

However, the uncertainty computations can never take into account the biases that occur in practice. In the example above, the bead box contains 4,800 beads. Our 200 drawings effectively looked at every bead in the box twice. Yet, by actual count, the box contains only 10% red beads, a value that is outside the interval estimates from Experiment 6 on.


Figure 4: Statistics converge, but not to census value.

As we collected more and more data, our point estimate did converge to a value. But as Figure 4 shows, it didn’t converge to the proportion found by a complete count of the beads in the box. The box didn’t change. There were 10% red beads in the box for every draw. Yet the result we get when sampling with a paddle isn’t the same as the complete count.

So, here we come to the first problem with the law of large numbers. The whole body of computations involved with estimation is built on certain assumptions. One of these is that we’ve drawn random samples from the system or process being studied. Random samples are samples where every item in the lot has the same chance of being included in the sample. Random sampling is simply a way of getting samples that are representative of the lot as a whole. In practice, the complexity of random sampling usually guarantees that we’ll use some sort of sampling system or sampling device. Here, we used mechanical sampling. And regardless of how careful we may be, mechanical sampling doesn’t satisfy the assumption of random sampling.

The paddle inevitably will fill as it goes into the box, and this will favor the upper layers over the lower layers. Given the difficulty of thoroughly stirring the whole box, any stratification within the box will skew the results of mechanical sampling—thus, the difference between the results of 11% and 10%.
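To see how this kind of layer effect can pull the estimates away from the census value, consider the following purely illustrative simulation. It is not a model of the actual bead box; it simply assumes the red beads are somewhat concentrated in the upper half and that the paddle fills mostly from that half.

    import random

    # Illustrative only: 4,800 beads, 10% red overall, but with the red beads
    # slightly concentrated in the upper half of the box. The simulated paddle
    # draws 40 of its 50 beads from the upper half and 10 from the lower half.
    rng = random.Random(1)
    upper = [1] * 300 + [0] * 2100    # upper half: 300 of 2,400 red (12.5%)
    lower = [1] * 180 + [0] * 2220    # lower half: 180 of 2,400 red (7.5%)

    def paddle_sample():
        return sum(rng.sample(upper, 40)) + sum(rng.sample(lower, 10))

    draws = [paddle_sample() for _ in range(200)]
    print("mechanical-sampling estimate:", sum(draws) / (200 * 50))  # near 0.115
    print("census value:                ", 480 / 4800)               # exactly 0.100

No matter how many paddle draws are added to this simulation, the cumulative estimate settles near 11.5% rather than 10%; more data only narrows the interval around the wrong value.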

Of course, in practice, where we can’t stop the experiment and find the “true value” by counting all the beads, there will be no way to even detect these biases. Thus, in practice, the first problem with the law of large numbers is that, because of the way we obtain our data, our estimates may not converge to the values that we expect them to. But we rarely will know when this happens. We may take reasonable steps to ensure that our samples are representative of the process, but we seldom have any way to confirm this assumption.

And if this problem isn’t enough to give you pause, there’s an even bigger problem with the law of large numbers.

To illustrate this second problem, I’ll use the batch weight data shown in Figure 5. There you’ll find the weights, in kilograms, of 259 successive batches produced during one week at a plant in Scotland. For purposes of this example, assume that the specifications are 850 kg to 990 kg. We’ll look at how the law of large numbers works with the capability ratio by computing a cumulative capability ratio after each group of 10 batches. These 26 capability ratios should converge to some value as the uncertainty drops with the increasing amounts of data used.

Figure 6 shows the number of batches used for each computation, the capability ratios found, and the 90% interval estimates for these capability ratios. These values are plotted in sequence in Figure 7.


Figure 5: The batch weight data


Figure 6: Number of batches used for each computation, capability ratios, and 90% interval estimates
Figure 7: Plotted sequential values from Figure 6
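A sketch of the cumulative computation is shown below. It assumes the capability ratio is Cp = (USL − LSL)/(6σ̂), with σ̂ estimated from the average moving range as mR̄/1.128; Wheeler may have used a different dispersion estimate for Figure 6, and the batch weights themselves are those listed in Figure 5.

    # Cumulative capability ratio sketch. Assumed definition (not necessarily
    # the one used for Figure 6): Cp = (USL - LSL) / (6 * sigma_hat), with
    # sigma_hat estimated from the average moving range, mR_bar / 1.128.
    USL, LSL = 990.0, 850.0

    def capability_ratio(weights):
        moving_ranges = [abs(a - b) for a, b in zip(weights[1:], weights[:-1])]
        sigma_hat = (sum(moving_ranges) / len(moving_ranges)) / 1.128
        return (USL - LSL) / (6.0 * sigma_hat)

    def cumulative_ratios(weights, step=10):
        """Capability ratio after each successive group of `step` batches,
        with one final ratio that uses all of the data."""
        cutoffs = list(range(step, len(weights) + 1, step))
        if cutoffs[-1] != len(weights):
            cutoffs.append(len(weights))
        return [capability_ratio(weights[:n]) for n in cutoffs]

    # Usage with the 259 batch weights of Figure 5 (not reproduced here):
    # ratios = cumulative_ratios(batch_weights)   # 26 cumulative ratios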

These capability ratios don’t converge to any single value. Rather they continue to drop down from one level to another. As the law of large numbers dictates, the uncertainties decrease with increasing amounts of data. However, when the quantity being estimated is changing, the reduction in uncertainty is meaningless. We get better and better estimates of something, but that something may have already changed by the time we have the estimate.

To understand the behavior of these capability ratios, we need to look at the XmR chart for these data in Figure 8. The limits shown are based on the first 60 values. This baseline was chosen to obtain a reasonable characterization of the inherent process variation. Here, we see a process that is not only unpredictable, but also one that gets worse as the week wears on.


Figure 8: XmR chart for the batch weight data
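The limits on an XmR chart come directly from the data: the X chart is centered at the average with natural process limits at X̄ ± 2.66 mR̄, and the moving range chart has an upper limit of 3.268 mR̄, where mR̄ is the average moving range. A minimal Python sketch of that computation, using the same 60-value baseline as Figure 8, follows; the batch weights are those of Figure 5.

    def xmr_limits(values, baseline=60):
        """Natural process limits for an XmR chart from the first `baseline` values:
        X chart:  X_bar +/- 2.66 * mR_bar;  mR chart upper limit: 3.268 * mR_bar."""
        base = values[:baseline]
        moving_ranges = [abs(a - b) for a, b in zip(base[1:], base[:-1])]
        x_bar = sum(base) / len(base)
        mr_bar = sum(moving_ranges) / len(moving_ranges)
        return {
            "x_center": x_bar,
            "x_lower": x_bar - 2.66 * mr_bar,
            "x_upper": x_bar + 2.66 * mr_bar,
            "mr_upper": 3.268 * mr_bar,
        }

    def signals(values, limits):
        """Indexes of points outside the natural process limits
        (the simplest detection rule)."""
        return [i for i, x in enumerate(values)
                if x < limits["x_lower"] or x > limits["x_upper"]]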

The law of large numbers explicitly assumes that all the data come from one and the same process. Here the process is changing without warning. As a result, no amount of data will ever be sufficient to provide a good estimate of any process characteristic! This isn’t due to any flaw in the computations. Rather, it’s due to the mistaken notion that the process characteristics are unchanging.

The law of large numbers

When we’re seeking to describe the properties of a fixed lot or batch by using samples, the law of large numbers assumes that the samples are representative of that lot or batch. Whenever circumstances result in samples that aren’t representative of the batch, the estimates may converge to the wrong value. And one way this happens is to have a batch that’s not homogeneous.

When seeking to describe a continuing process, the law of large numbers assumes that the future will be the same as the past. When the past data show evidence of a lack of homogeneity, the extrapolation from the past to the future is unjustified, and the estimates may not converge to any specific value. So, once again, the law of large numbers has problems when there’s a lack of homogeneity within the data.

Implications

One of the organizing principles behind big data techniques is the law of large numbers. When our samples are representative of the process being sampled, and when the process remains unchanged over time, these techniques will work. But when your data come from a system like that in Figure 8, all of the traditional statistical techniques break down.

But who would ever operate their process like Figure 8? The plant in Scotland did—but not deliberately. Unpredictable operation happens when you’re not looking. It happens without your knowledge or intent. It happens because the causes of unpredictable behavior are input variables that are overlooked, ignored, or unknown.

So what does this mean for big data techniques? The database will never contain the unknown variables, and it’s unlikely to contain the variables that have been overlooked or ignored as well. So those variables that are taking your process on walkabout are precisely those variables that will be missing from your big data model. And when major variables are missing from a model, that model will inevitably be erroneous and misleading.

This is why dumping all your data into a database and hoping that big data will provide insights into your process is little more than wishful thinking. You’ll get some model, and that model might describe something, but it’s unlikely to describe the future. When your process is operated unpredictably, no amount of data will ever allow you to reliably estimate your process characteristics.

So how many data do you need?

The law of large numbers, and by extension big data techniques, fail when applied to data generated by a process that is being operated unpredictably. This is because big data techniques will always be blind to assignable causes of exceptional variation.

Big data techniques work in the same way that experimental techniques work. They seek to identify specific relationships between input variables and response variables. They can only evaluate those relationships when you have data on both the input variables and the responses. But processes like the one in Figure 8 are subject to unknown inputs, and the only way to identify these unknown inputs is to use a process behavior chart.

A process behavior chart doesn’t search for specific relationships. Rather, it examines the overall behavior of the response variable over time. When the process behavior chart shows evidence that the response variable has changed, you have an opportunity to identify the assignable cause of that change. If you then control this assignable cause, you’ll reduce the variation in the response variable, resulting in better quality, higher productivity, and better competitive position.
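Continuing the XmR sketch from earlier, this is what that looks like in practice: limits computed from a stable baseline, and each new value checked as it arrives (a hypothetical usage, with batch_weights standing in for the Figure 5 data).

    # Hypothetical real-time use of the xmr_limits sketch above.
    limits = xmr_limits(batch_weights, baseline=60)

    for i, weight in enumerate(batch_weights[60:], start=61):
        if weight < limits["x_lower"] or weight > limits["x_upper"]:
            print(f"Batch {i}: {weight} kg is outside the natural process limits; "
                  "look for the assignable cause now, while the trail is warm.")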

Finding assignable causes and using them to improve your process doesn’t require big databases and fancy computer programs. Rather, it requires the careful analysis of small amounts of data collected and analyzed in real time. So how many data do you need to improve your process? Just enough to detect the assignable causes that affect your process. This may not be as impressive as a fancy model coming out of a big data approach, but it has been proven to be a whole lot more effective in practice.

Donald Wheeler’s complete “Understanding SPC” seminar may be streamed for free. For details, see spcpress.com; for an example, see this column in Quality Digest.

Comments

Submitted by Alfredo (not verified) on Mon, 07/21/2025 - 14:54

Great Insight!

This is a very good reminder for those who spend more time in front of a computer screen than they do on the floor. Statistical processes are great and very useful, but I love that you are reminding us that we need to apply some common sense to the numbers before we blindly believe the conclusions.

I would love for you (or someone else) to now draw the line between this and Large Language Models.

© 2025 Quality Digest. Copyright on content held by Quality Digest or by individual authors. Contact Quality Digest for reprint information.
“Quality Digest” is a trademark owned by Quality Circle Institute Inc.
