Comments
guest 2/9/2005
Elleran,
When you fit a model to your dataset, look at the residual diagnostics: it is the residuals, not the original data, that must satisfy the normality assumption. The residuals should also be independent, have constant variance, and be random (show no systematic pattern). The original dataset may be normal or non-normal. Remember that you may not want to use censored data in your model fitting.
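To make the point above concrete, here is a minimal sketch in Python (standard library only). It is not from the original post: the simulated data, the seed, and the `skewness` helper are assumptions chosen for illustration. It fits a straight line to a strongly right-skewed X and then compares the skewness of X with the skewness of the residuals.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable


def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Simulated (made-up) data: X is exponential, so clearly non-normal,
# while the noise added to Y is normally distributed.
n = 500
x = [random.expovariate(1.0) for _ in range(n)]
y = [2.0 + 3.0 * xi + random.gauss(0.0, 1.0) for xi in x]

# Simple least-squares fit of y = intercept + slope * x.
x_mean = sum(x) / n
y_mean = sum(y) / n
sxx = sum((xi - x_mean) ** 2 for xi in x)
sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
slope = sxy / sxx
intercept = y_mean - slope * x_mean

residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print(f"skewness of X:         {skewness(x):.2f}")
print(f"skewness of residuals: {skewness(residuals):.2f}")
print(f"fitted slope:          {slope:.2f}")
```

In this sketch X is heavily skewed while the residuals are roughly symmetric, which is the situation described above: a non-normal input with well-behaved residuals. Skewness is only one rough check; a normal probability plot of the residuals is the more usual tool.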
abarnett_tx 10/7/2004
When building multiple regression models, you can use non-normal terms. For example, you can use binary terms (0 or 1) to indicate a subgroup of the data. Examples may include trained versus untrained, or heat-treated versus no heat treatment. With a single binary term and no interaction term, this approach generates two parallel regression lines with different intercepts, depending on the treatment. Many other variations are possible.
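As a rough illustration of the binary-term idea, the following Python sketch (standard library only) fits productivity against hours worked plus a 0/1 "trained" indicator. The data values, the variable names, and the small `solve` and `fit` helpers are all made up for this example; in practice a statistics package would do the fitting.

```python
def solve(a, b):
    """Solve the square system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit(rows, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


# Hypothetical data: five untrained and five trained operators.
hours = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
trained = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]  # binary (0 or 1) term
productivity = [12.1, 13.9, 16.2, 17.8, 20.0, 17.0, 19.1, 20.9, 23.2, 24.8]

# Each design row is [1 (intercept), hours, trained].
rows = [[1.0, float(h), float(t)] for h, t in zip(hours, trained)]
b0, b1, b2 = fit(rows, productivity)

print(f"untrained line: productivity = {b0:.2f} + {b1:.2f} * hours")
print(f"trained line:   productivity = {b0 + b2:.2f} + {b1:.2f} * hours")
```

With this made-up data the fit gives a common slope of about 1.97 and a treatment offset of 5.0, so the two lines are parallel and differ only in their intercepts.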
When looking for additional terms such as x**2, I find that Excel gives a useful shortcut. Create a scatterplot in Excel with the response variable and a primary X variable. Then click one of the points in the graph and use the pop-up menu to select "add a trendline". In the trendline options, select "display equation" and "display R-squared". You can then try various trend types (linear, nth-order polynomial, exponential, etc.). Judge from the appearance of the plot and the R-squared value whether changing the type of regression equation noticeably improves the fit. Use caution here: the goal should not be to blindly maximize R-squared. For example, adding extra terms to go from 60% to 62% R-squared may not really help much. Does the model make common sense? For example, using the square of "hours of training" to predict productivity may not make sense, but the square of temperature may be useful to predict unwanted by-products in a chemical reaction.
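For readers working outside Excel, the same comparison can be sketched in Python (standard library only). The temperature and by-product values, the helper names, and the use of adjusted R-squared are assumptions added for illustration and are not part of the post. Adjusted R-squared is included because plain R-squared can never decrease when a term is added, which is one reason not to maximize it blindly.

```python
def solve(a, b):
    """Solve the square system a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def r_squared(rows, y):
    """Fit by least squares (normal equations) and return R-squared."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    coefs = solve(xtx, xty)
    fitted = [sum(c * v for c, v in zip(coefs, row)) for row in rows]
    y_mean = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def adjusted(r2, n, n_predictors):
    """Adjusted R-squared, which penalizes extra predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)


# Hypothetical data: coded temperature versus amount of unwanted by-product.
temperature = [1, 2, 3, 4, 5, 6, 7, 8]
byproduct = [2.1, 4.9, 10.2, 16.8, 26.1, 36.9, 50.2, 64.8]
n = len(temperature)

linear_rows = [[1.0, float(t)] for t in temperature]
quadratic_rows = [[1.0, float(t), float(t * t)] for t in temperature]

r2_lin = r_squared(linear_rows, byproduct)
r2_quad = r_squared(quadratic_rows, byproduct)
adj_lin = adjusted(r2_lin, n, 1)
adj_quad = adjusted(r2_quad, n, 2)

print(f"linear:    R-squared = {r2_lin:.4f}, adjusted = {adj_lin:.4f}")
print(f"quadratic: R-squared = {r2_quad:.4f}, adjusted = {adj_quad:.4f}")
```

Here the squared term gives a clear improvement on both measures. When an added term moves R-squared only slightly and the adjusted value barely changes or drops, that is a hint the term is not earning its place.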
forrestbreyfogle 9/30/2004
INITIAL POSTING: I've always assumed that, to include X values in a regression model (either simple or multiple, with continuous data), the data had to be normal. Can you use non-normal data in your model if it's significant, or do you have to convert it first?
RESPONSE: In regression, what you would like to do is predict the Y output as a function of one or more X inputs that might change (or be changed) within a process. One way to examine the appropriateness of the relationship you determined through regression is a residual analysis. It is in the residual analysis that you are most interested in normality; i.e., you would like to see residuals that are "well behaved" and normally distributed. However, a residual plot that is not well behaved might lead you to an appropriate transformation that better models your system; e.g., Y is a function of X**2, not Y is a function of X alone.
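The X**2 example can be sketched in Python (standard library only). The data and the `fit_line` and `residuals` helpers below are made up for illustration and are not from the response. The sketch fits Y against X, then Y against X**2, and prints both sets of residuals side by side.

```python
def fit_line(x, y):
    """Simple least-squares fit; returns (intercept, slope)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def residuals(x, y):
    """Observed minus fitted values for a straight-line fit of y on x."""
    intercept, slope = fit_line(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Hypothetical process data in which Y really depends on X**2.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.2, 4.9, 9.8, 17.1, 26.3, 36.7, 50.1, 64.9]

res_linear = residuals(x, y)          # model: Y as a function of X
x_squared = [xi ** 2 for xi in x]
res_squared = residuals(x_squared, y)  # model: Y as a function of X**2

print(" X   resid (X)   resid (X**2)")
for xi, r1, r2 in zip(x, res_linear, res_squared):
    print(f"{xi:>2}   {r1:+8.2f}   {r2:+8.2f}")
```

For this data the residuals from the fit on X are positive at both ends and negative in the middle, a curved pattern that is not "well behaved". After replacing X with X**2 the residuals shrink and the curve disappears.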
Hope this helps. Contact me if you would like to discuss further.
Forrest Breyfogle
512-918-0280
forrest@smartersolutions.com
www.smartersolutions.com