The Imaginary Theorem of Large Samples

How many data samples do you need?

Courses in statistics generally emphasize the problem of inference. In my December column, “The Four Questions of Data Analysis,” I defined this problem in the following manner:

Given a single unknown universe, and a sample drawn from that universe, how can we describe the properties of that universe?

In general, we attempt to answer this question by estimating characteristics of the universe using statistics computed from our sample.

…


Comments

Comment

The punch line here is one that I have believed in for a long time: The Process Behavior Chart is a good tool to utilize as early as possible in the analysis of a problem. This is, of course, contrary to the Six Sigma DMAIC methodology, which utilizes Process Behavior Charts late in the overall process.

Use of XmR Chart

Just a note on the prior comment: the process behavior chart is actually used throughout DMAIC. The first step is to define the project ('D'). Often, one sees a chart like the one shown in Figure 6 of the article and would certainly want to place this process step on the list of possible projects. If weighing the cost impact against the probability of solving the problem ranks the project among the best improvement opportunities, then a project would be started. SPC charts are great sources of historical data for determining that a project needs to be started. Of course, they are then used later, in the 'I' and 'C' phases, to measure the project's success.

Spread index Cp

If one succeeded in stabilizing this process, then from the process capability 6s of the individuals X chart one would compute Cp = 140/137.4 = 1.019. It is indeed amazing that the Cp estimated over time does not converge to this value but to a significantly lower level! Could we not consider Cp = 1.019 the expected (target) value, since this Cp is estimated from the process sigma, which reflects only the random process noise?
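The arithmetic in this comment can be checked with a minimal sketch. It assumes that 140 is the width of the specification range and that 137.4 is the 6-sigma process spread read from the individuals chart; the variable names are illustrative only and do not come from the article.

```python
# Figures quoted in the comment, with decimal commas written as points.
spec_width = 140.0   # assumed: width of the specification range
six_sigma = 137.4    # "process capability 6s" from the individuals X chart

# Capability ratio as used in the comment: specification width over 6-sigma spread.
cp = spec_width / six_sigma
print(round(cp, 3))
```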

Adding samples

I have always read your articles with the greatest interest. In this case, I doubt that 2 samples of size 500 each are the same as 1 sample of size 1000. In fact, if we calculate the proportion for each of the samples and the interval estimate for the population proportion, we get:
Experiment  # yellow cum  # yellow  Point estimate  Low    High   0.1 inside?
1           65            65        0.130           0.105  0.155  Fail
2           108           43        0.086           0.065  0.107  Correct
3           164           56        0.112           0.089  0.135  Correct
4           219           55        0.110           0.087  0.133  Correct
5           276           57        0.114           0.091  0.137  Correct
6           336           60        0.120           0.096  0.144  Correct
7           389           53        0.106           0.083  0.129  Correct
8           448           59        0.118           0.094  0.142  Correct
9           506           58        0.116           0.092  0.140  Correct
10          553           47        0.094           0.073  0.115  Correct
11          609           56        0.112           0.089  0.135  Correct
12          655           46        0.092           0.071  0.113  Correct
13          714           59        0.118           0.094  0.142  Correct
14          770           56        0.112           0.089  0.135  Correct
15          817           47        0.094           0.073  0.115  Correct
16          874           57        0.114           0.091  0.137  Correct
17          927           53        0.106           0.083  0.129  Correct
18          981           54        0.108           0.085  0.131  Correct
19          1040          59        0.118           0.094  0.142  Correct
20          1105          65        0.130           0.105  0.155  Fail
where 2 of the 20 intervals do not include the correct population proportion of 0.1, which completely matches what we expect from 90% confidence intervals.
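The interval estimates in the table above can be reproduced with a short sketch. It assumes that each experiment is a sample of 500 and that the commenter used the usual normal-approximation (Wald) interval with z of about 1.645 for 90% confidence; the comment does not state the method, so the function name and these choices are assumptions rather than the commenter's confirmed procedure.

```python
import math

# Yellow counts for the 20 experiments, copied from the table in the comment.
yellow = [65, 43, 56, 55, 57, 60, 53, 59, 58, 47,
          56, 46, 59, 56, 47, 57, 53, 54, 59, 65]
n = 500          # assumed sample size per experiment
p_true = 0.1     # population proportion named in the comment
z = 1.645        # approximate two-sided 90% normal quantile (assumption)

def wald_interval(count, n, z):
    """Return (point estimate, low, high) for a normal-approximation interval."""
    p = count / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, p - half_width, p + half_width

results = [wald_interval(c, n, z) for c in yellow]

# Experiments (1-based) whose interval does not contain the true proportion.
misses = [i + 1 for i, (_, lo, hi) in enumerate(results)
          if not lo <= p_true <= hi]

for i, (p, lo, hi) in enumerate(results, start=1):
    print(f"{i:2d}  {p:.3f}  {lo:.3f}  {hi:.3f}")
print("intervals missing 0.1:", misses)
```

Under these assumptions the first two rows round to the Low and High values shown in the table, and the intervals that miss 0.1 are those for experiments 1 and 20.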