Tom Pyzdek  |  12/02/2008

Ten Steps to a Crystal Ball

It’s easy to predict that statistical purists don’t like using historical-data models.

The gold standard for modeling the future in a business environment is the designed experiment. Design of experiments (DOE) is a well-developed approach to planning and executing controlled manipulations.

Somewhat less respectable are models derived from historical data. It makes sense to utilize as much of this information as possible, but caution is required. Problems you may encounter are:

Measurement error. Historical data are often recorded by untrained people, or the precision that suffices for day-to-day use of the data may be too coarse for modeling. Errors that aren’t important in the data’s original use may wreak havoc on your model-building activity.

Range restriction. Operational systems are deliberately controlled to minimize the effect of system variation on results, so system parameters are allowed to vary only within a narrow range. The response being modeled may well be unaffected by variation of the inputs within this range, but that doesn’t mean the response wouldn’t change if the inputs were varied over a larger range. The result is a model that misleads by excluding important parameters.

Failure to observe lurking variables. It’s possible that an important variable driving an input isn’t recorded because it doesn’t matter to the process operator, but it may matter to you. With historical data it’s often impossible to deduce this variable.

Collinearity. Historical data often contain many variables that measure the same underlying drivers. If these variables are included in the same model, its parameter estimates become unstable and can swing wildly with small changes in the data, possibly causing serious problems.
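
The range-restriction pitfall can be illustrated with a small simulation. This is only a sketch on made-up data: the slope of 2, the noise level, and the two input ranges are assumptions chosen for illustration, and NumPy is used in place of a statistics package. The same input drives the response in both cases, yet the tightly controlled data make it look unimportant.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of a simple least-squares line y = b0 + b1*x."""
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (intercept + slope * x)
    return 1.0 - resid.var() / y.var()

rng = np.random.default_rng(0)
n = 200
true_slope = 2.0  # hypothetical effect of the input on the response

# Same process, observed over a wide range and over a tightly controlled range.
x_wide = rng.uniform(0.0, 10.0, n)
x_narrow = rng.uniform(4.9, 5.1, n)
y_wide = true_slope * x_wide + rng.normal(0.0, 1.0, n)
y_narrow = true_slope * x_narrow + rng.normal(0.0, 1.0, n)

r2_wide = r_squared(x_wide, y_wide)
r2_narrow = r_squared(x_narrow, y_narrow)
print(f"R^2 with x varied over 0 to 10:    {r2_wide:.3f}")
print(f"R^2 with x held within 4.9 to 5.1: {r2_narrow:.3f}")
```

With the input held to a narrow band, almost all of the observed variation in the response is noise, so a model built from those records would likely drop a variable that actually matters.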


Despite these concerns, people will continue to use historical data to build predictive models. When I’m asked to help build a linear regression model from historical data, I recommend DOEs if possible. If that’s not possible, I judiciously point out the pitfalls, then suggest the following:

1. Select a continuous response variable (Y) to be modeled.

2. Choose a set of predictor variables. Choose Xs that subject-matter expertise suggests might cause a change in Y.

3. Using software such as Minitab, perform a best-subsets regression with the candidate variables.

4. Select the best-subset model using the following criteria: Cp ≤ p + 1, where p is the number of predictors (choosing the subset with the smallest Cp is another commonly used criterion); the standard error is small; and R² (or adjusted R²) is large. There may be more than one model that bears further study.

5. Create a regression model (or models) that includes all Xs in the chosen best subset or subsets. Be sure the analysis includes the variance inflation factors (VIFs), which measure how strongly each X in the model is correlated with the other Xs in the model. A large VIF indicates that the standard error of that predictor’s coefficient will be large as well.

6. If a “best subset” by the above criteria includes Xs with VIFs greater than 10, drop that subset.

7. Apply Occam’s razor, the philosophy that one should not increase, beyond what is necessary, the number of entities required to explain anything. In this case, that means using as few predictors as possible.

8. Assess the quality of the fitted model. Look at the residuals to see if there are patterns, lack of normality, or outliers. If justified, remove the cases responsible for the problems or apply a transformation to the data. Use a procedure to identify influential observations. Minitab’s DFITS metric is a good way to do this. DFITS represents roughly the number of estimated standard deviations that the fitted value changes when the ith observation is removed from the data. One way to compare DFITS is to graph the DFITS values using boxplots, then look for extreme values on the boxplot chart. Brush these values to identify the cases responsible and consider dropping them.

9. Repeat steps 3-5 until all the criteria are met. If no acceptable model results from these steps, return to step 2.

10. If a transformation was used, convert the predictions back to the original units and compare them to the actual values.
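
Steps 3 through 6 can be sketched in a few lines of code. This is a minimal illustration, not a substitute for Minitab’s best-subsets output: the data are synthetic, the variable names x1 to x4 and the helper functions fit_sse, vifs, and best_subsets are hypothetical, and NumPy stands in for the statistics package. Here x3 is deliberately built as a near copy of x1 so that the VIF screen has something to catch.

```python
from itertools import combinations
import numpy as np

def fit_sse(X, y):
    """Least-squares fit of y on X plus an intercept; returns the residual sum of squares."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def vifs(X):
    """VIF_j = 1 / (1 - R_j^2), with R_j^2 from regressing X_j on the other Xs."""
    k = X.shape[1]
    out = []
    for j in range(k):
        xj = X[:, j]
        sst = float(((xj - xj.mean()) ** 2).sum())
        sse = fit_sse(np.delete(X, j, axis=1), xj) if k > 1 else sst
        out.append(sst / sse)  # SST/SSE equals 1 / (1 - R_j^2)
    return out

def best_subsets(X, y, names, vif_limit=10.0):
    """Score every subset of predictors by Cp, S, adjusted R^2, and largest VIF."""
    n, k = X.shape
    sst = float(((y - y.mean()) ** 2).sum())
    mse_full = fit_sse(X, y) / (n - k - 1)  # error variance from the full model
    rows = []
    for p in range(1, k + 1):
        for cols in combinations(range(k), p):
            Xs = X[:, list(cols)]
            sse = fit_sse(Xs, y)
            cp = sse / mse_full - n + 2 * (p + 1)  # Mallows' Cp, p predictors + intercept
            max_vif = max(vifs(Xs))
            rows.append({
                "subset": tuple(names[c] for c in cols),
                "cp": cp,
                "s": (sse / (n - p - 1)) ** 0.5,
                "adj_r2": 1.0 - (sse / (n - p - 1)) / (sst / (n - 1)),
                "max_vif": max_vif,
                # Step 4 (Cp <= p + 1) and step 6 (no VIF above the limit)
                "keep": cp <= p + 1 + 1e-9 and max_vif <= vif_limit,
            })
    return rows

# Synthetic "historical" data (assumed for illustration only).
rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + 0.05 * rng.normal(size=n)  # nearly a copy of x1: collinear
x4 = rng.normal(size=n)              # unrelated to y
X = np.column_stack([x1, x2, x3, x4])
y = 3.0 * x1 - 2.0 * x2 + rng.normal(size=n)
names = ["x1", "x2", "x3", "x4"]

table = best_subsets(X, y, names)
for row in sorted(table, key=lambda r: r["cp"]):
    flag = "keep" if row["keep"] else "drop"
    print(f"{flag}  {row['subset']}  Cp={row['cp']:.1f}  S={row['s']:.2f}  "
          f"adjR2={row['adj_r2']:.3f}  maxVIF={row['max_vif']:.1f}")
```

Any subset that contains both x1 and x3 is dropped by the VIF screen no matter how good its Cp looks. Among the subsets that survive, step 7 would favor the one with the fewest predictors.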

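The influence check in step 8 can be sketched the same way. The code below computes DFITS (more commonly spelled DFFITS outside Minitab) from its textbook definition and then flags cases that fall outside boxplot-style fences, in the spirit of the boxplot approach described in step 8. The data are synthetic, with one bad record planted on purpose; the function name dffits and the 1.5 × IQR fence are assumptions made for illustration.

```python
import numpy as np

def dffits(X, y):
    """DFFITS for each case in a least-squares fit of y on X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    n, p = A.shape                      # p counts the intercept
    Q, _ = np.linalg.qr(A)              # leverages are the row sums of Q**2
    h = (Q ** 2).sum(axis=1)
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    e = y - A @ beta
    sse = float(e @ e)
    # Error variance re-estimated with case i left out
    s2_deleted = (sse - e ** 2 / (1.0 - h)) / (n - p - 1)
    t = e / np.sqrt(s2_deleted * (1.0 - h))  # externally studentized residuals
    return t * np.sqrt(h / (1.0 - h))

# Synthetic data: 30 well-behaved cases plus one planted bad record.
rng = np.random.default_rng(2)
x = np.linspace(0.0, 10.0, 30)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, 30)
x = np.append(x, 25.0)
y = np.append(y, 20.0)  # far from the line and far from the other x values

d = dffits(x, y)

# Boxplot-style fences: anything beyond 1.5 * IQR from the quartiles is extreme.
q1, q3 = np.percentile(d, [25, 75])
fence = 1.5 * (q3 - q1)
flagged = np.where((d < q1 - fence) | (d > q3 + fence))[0]
print("cases flagged as influential:", flagged)
print("largest |DFFITS| at case", int(np.argmax(np.abs(d))))
```

A flagged case is a candidate for review, not automatic deletion; as step 8 says, cases should be dropped only if that is justified.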

Models built with this procedure won’t be perfect. A Master Black Belt could apply a number of advanced statistical methods, such as principal components analysis, ridge regression, or partial least squares, just to name a few. Most Black Belts and Green Belts will find themselves forced to glean as much information as possible from historical data by using the modeling tool that they learned in their training: linear regression analysis. This procedure is written for this group of noble change agents. As DOE guru George E.P. Box says, “All models are wrong. Some models are useful.”

Thanks to George Runger, Ph.D., and John Kendrick for their assistance with this column.



About The Author

Tom Pyzdek

Thomas Pyzdek’s career in business process improvement spans more than 50 years. He is the author of more than 50 copyrighted works including The Six Sigma Handbook (McGraw-Hill, 2003). Through the Pyzdek Institute, he provides online certification and training in Six Sigma and Lean.