
4.2 Least-Squares Regression

1. If we feel there is a linear relationship between the two variables (the points of the scatter diagram cluster roughly in a straight line and the correlation coefficient is close to 1 or −1), how do we find the "best" line out of the infinitely many that fit this data?

2. The criterion we will use to pick the best line is the least-squares criterion. This criterion is based on finding the smallest sum of all of the squared errors obtained when the calculated linear equation is used to predict the y data values (the response variable) from each x value (the predictor variable). That is, if y is the data value (observed value) for a given value of x and $\hat{y}$ (read "y hat") is the value predicted from the equation for this x, then we want the smallest possible value of $\sum (y - \hat{y})^2$. Note that $y - \hat{y}$ is the signed vertical distance between the data point and the point on the line for a given x value. It is the difference between the observed and the predicted values. This is called the residual. (Residual = observed y − predicted y; the sketch after this item makes the computation concrete.)
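As a quick illustration of the criterion, here is a minimal Python sketch (the data values are made up purely for illustration) that computes the residuals and the sum of squared errors for a candidate line; the least-squares line is the choice of $b_0$ and $b_1$ that makes this sum as small as possible.

```python
import numpy as np

# Made-up illustrative data (e.g., hours studied vs. exam score).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 60.0, 63.0, 71.0, 74.0])

def sse(b0, b1):
    """Sum of squared errors for the candidate line y-hat = b0 + b1*x."""
    y_hat = b0 + b1 * x            # predicted values
    residuals = y - y_hat          # observed minus predicted
    return np.sum(residuals ** 2)  # the quantity the criterion minimizes

# The least-squares line for this data turns out to be b0 = 47.5, b1 = 5.5;
# any other line gives a larger sum of squared errors.
print(sse(47.5, 5.5))   # 7.5  (the minimum for this data)
print(sse(40.0, 8.0))   # 70.0 (a worse candidate line)
```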

3. Least-Squares Regression Criterion: (Page 198) The straight line that best fits a set of data points is the one having the smallest possible sum of squared errors. Thus we want to minimize $\sum (\text{residual})^2$. The straight line that best fits a set of data points according to the least-squares criterion is called the regression line.

4. To find the equation $\hat{y} = b_0 + b_1 x$ for the best-fit line using the least-squares criterion, we will use the following formulas.

$b_1 = r \cdot \dfrac{s_y}{s_x}$ is the slope of the least-squares regression line, and

$b_0 = \bar{y} - b_1 \bar{x}$ is the y-intercept of the least-squares regression line.

(Note that $\bar{x} = \dfrac{\sum x}{n}$ is the mean of the predictor variable, $\bar{y} = \dfrac{\sum y}{n}$ is the mean of the response variable, $s_x$ is the standard deviation of the predictor variable, and $s_y$ is the standard deviation of the response variable. Also note that

$r = \dfrac{\sum \left( \dfrac{x_i - \bar{x}}{s_x} \right) \left( \dfrac{y_i - \bar{y}}{s_y} \right)}{n - 1}$

is the correlation coefficient for the data.)
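A minimal sketch of these formulas in Python, reusing the made-up data from the sketch above; `numpy.polyfit` serves only as an independent check, and `ddof=1` gives the sample standard deviations that $s_x$ and $s_y$ denote here. The last line also checks the fact noted in item 5 below.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 60.0, 63.0, 71.0, 74.0])
n = len(x)

x_bar, y_bar = x.mean(), y.mean()
s_x = np.std(x, ddof=1)   # sample standard deviations (divide by n - 1)
s_y = np.std(y, ddof=1)

# Correlation coefficient, exactly as defined above.
r = np.sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)) / (n - 1)

b1 = r * s_y / s_x        # slope of the least-squares line
b0 = y_bar - b1 * x_bar   # y-intercept

print(b1, b0)                               # 5.5, 47.5 for this data
print(np.polyfit(x, y, 1))                  # independent check: [5.5, 47.5]
print(np.isclose(b0 + b1 * x_bar, y_bar))   # (x-bar, y-bar) lies on the line
```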

5. Note that the point $(\bar{x}, \bar{y})$ is always a point on this least-squares linear regression line. (This is useful when drawing this line by hand.)

7. Because $s_x$ and $s_y$ are always positive, the sign of the slope of the least-squares linear regression line is the same as the sign of the correlation coefficient r.

8. The slope ($b_1$) can be interpreted as the rate of change of the response variable, y, with respect to the predictor variable, x. Thus, when x increases by one unit, the predicted y changes by the amount $b_1$.

9. The y-intercept ($b_0$) can be interpreted as the predicted value of the response variable when the predictor variable is zero. This makes sense only if:
a. The value of 0 for the predictor variable makes sense.
b. There is an observed value of the predictor variable near 0.
(Both interpretations are illustrated in the sketch after this item.)
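To make the two interpretations concrete, a small sketch continuing the made-up fit $\hat{y} = 47.5 + 5.5x$ from above:

```python
b0, b1 = 47.5, 5.5   # intercept and slope from the made-up fit above

def predict(x):
    return b0 + b1 * x

# Slope: raising x by one unit changes the predicted y by exactly b1.
print(predict(4.0) - predict(3.0))   # 5.5, the slope b1

# Intercept: the prediction at x = 0 is b0 -- meaningful only when x = 0
# makes sense and lies near the observed x values.
print(predict(0.0))                  # 47.5, the intercept b0
```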

10. Never use the least-squares regression line to make predictions for values of the predictor variable that are much larger or much smaller than the observed values. (We don't even know whether the relationship far outside our data set would be linear, much less whether it would follow the same line.)

11. Cautions:
Only use this method when the scatter diagram looks roughly linear. (Also check the value of r, the correlation coefficient.)
Outliers are data points that lie far from the regression line relative to the other data points. They can sometimes have a significant effect on the regression analysis, as the sketch below illustrates.
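A quick sketch of the outlier caution: refitting the same made-up data with one point dragged far off the pattern changes the least-squares slope dramatically.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 60.0, 63.0, 71.0, 74.0])

y_out = y.copy()
y_out[4] = 30.0   # drag one response far below the linear pattern

print(np.polyfit(x, y, 1))       # slope 5.5 without the outlier
print(np.polyfit(x, y_out, 1))   # slope -3.3 with it: even the sign flips
```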

12. Regression Lines in a Calculator: As was noted in the last section, the regression line can be found using the calculator. Follow the steps: STAT, over to CALC, ENTER (to select #4). Then enter the x-list (say L1), a comma, and the y-list (say L2), and press ENTER. The slope and y-intercept will be given on the screen, as well as the r and r² values, provided that you have turned on the Diagnostics.


If you want to store the equation under Y1 (in the Y= key), then you need to call the Y1 function. To do this, follow the steps: VARS, over to Y-VARS, ENTER (to select Function), ENTER again (to select Y1). Thus, the screen should show: LinReg(ax+b) L1, L2, Y1, and then press ENTER. The resulting screen will look the same as it did without the Y1. But if you go to the Y= key, the linear equation will be stored in Y1. You can then graph the scatter plot and the regression line by entering ZOOM #9.
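Outside of the calculator, the same regression output (slope, intercept, r, and hence r²) can be reproduced in Python; a minimal sketch using `scipy.stats.linregress`, again with the made-up data from above:

```python
import numpy as np
from scipy.stats import linregress

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([52.0, 60.0, 63.0, 71.0, 74.0])

result = linregress(x, y)
print(result.slope)        # b1, the calculator's "a" in LinReg(ax+b)
print(result.intercept)    # b0, the calculator's "b"
print(result.rvalue)       # r
print(result.rvalue ** 2)  # r-squared
```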
