Behavior of a Regression Line

Web resource used: http://statweb.calpoly.edu/chance/applets/LRApplet.html

Objectives:

1)      Observe the effect of “influential points” versus “outliers.”

2)      Manipulate the points and observe the effect on the regression equation (its slope and y-intercept), the correlation coefficient, and the SSE.

3)      Gain an understanding of the relationship between a scatter plot and the correlation coefficient.

Created by: Felice Shore (Baltimore City Community College) for Project Synergy

This site shows 10 points plotted on the rectangular coordinate plane, along with the line of best fit. Below the graph is a data box showing the equation of the regression line, the SSE (sum of squared errors), and the correlation coefficient (r-value). You can highlight a point by clicking on it. When you do this, the data box also shows the coordinates of the selected point and its error (its vertical distance from the regression line). The site is dynamic: you can 1) click and drag any point and 2) add points by clicking in the plane.

As you move or add points, the values in the data box update immediately.

A.     Move all of the points so that they are clustered near the line but spread out along its length. Then drag one point horizontally, far away from the line. Did the regression equation and the orientation of the line change drastically? How did the correlation coefficient change?

The point that you moved horizontally far from the line is called an “influential” point. Why do you think it is called that?
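If you would like to check what the applet shows with a calculation, the sketch below mimics Part A. The ten points and the helper name fit_line are made up for illustration (they are not taken from the applet), and the fit uses the standard least-squares formulas for the slope, the y-intercept, and r.

```python
from math import sqrt

def fit_line(points):
    """Return (slope, intercept, r) for the least-squares line through points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, sxy / sqrt(sxx * syy)

# Hypothetical data: ten points clustered near the line y = x.
ys = [1.2, 1.8, 3.1, 4.0, 4.9, 6.2, 6.8, 8.1, 9.0, 9.9]
clustered = [(float(x), y) for x, y in zip(range(1, 11), ys)]

# Drag the second point horizontally from x = 2 to x = 14, keeping y = 1.8.
dragged = list(clustered)
dragged[1] = (14.0, 1.8)

slope_before, intercept_before, r_before = fit_line(clustered)
slope_after, intercept_after, r_after = fit_line(dragged)

print(f"before: y = {slope_before:.3f}x + {intercept_before:.3f}, r = {r_before:.3f}")
print(f"after:  y = {slope_after:.3f}x + {intercept_after:.3f}, r = {r_after:.3f}")
```

Compare the two printed lines with what you saw in the applet: with this made-up data, one dragged point is enough to flatten the line and pull r well below its starting value.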

B.     Move all of the points so that they are clustered near the line but spread out along its length. Now move one point far from the other points, but keep it in the direction of the line. Did the regression equation and the orientation of the line change drastically? How did the correlation coefficient change?

That point, which is far from the other points but still follows their trend, is not an influential point but an “outlier.” Why do you think that is?
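The same kind of check can be made for Part B. In this sketch the data and the helper name fit_line are again hypothetical; the last point is moved far from the others but kept close to the direction of the line.

```python
from math import sqrt

def fit_line(points):
    """Return (slope, intercept, r) for the least-squares line through points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, sxy / sqrt(sxx * syy)

# Hypothetical data: ten points clustered near the line y = x.
ys = [1.2, 1.8, 3.1, 4.0, 4.9, 6.2, 6.8, 8.1, 9.0, 9.9]
clustered = [(float(x), y) for x, y in zip(range(1, 11), ys)]

# Move the last point far from the rest, but along the direction of the line.
dragged = list(clustered)
dragged[9] = (25.0, 25.1)

slope_before, intercept_before, r_before = fit_line(clustered)
slope_after, intercept_after, r_after = fit_line(dragged)

print(f"before: y = {slope_before:.3f}x + {intercept_before:.3f}, r = {r_before:.4f}")
print(f"after:  y = {slope_after:.3f}x + {intercept_after:.3f}, r = {r_after:.4f}")
```

With this made-up data the slope barely moves and r stays very close to 1, which is the contrast with Part A that the questions above ask you to explain.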

C.     Recall from the “Least Squares” illustration that the SSE is the sum of the squared errors for all 10 points. If you click on a point to highlight it, the “selected point residual” in the data box gives the error for that point. Recall also that a regression equation is “built” to minimize the SSE. That is, the line is placed so that the SSE is as small as it can possibly be.
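As a rough check on the idea that the regression line minimizes the SSE, the sketch below computes the least-squares line for ten made-up points and then nudges its slope and y-intercept. The function names sse and least_squares and the data are assumptions chosen for illustration; the applet's own points will differ.

```python
def sse(points, slope, intercept):
    """Sum of squared vertical errors of the points from y = slope*x + intercept."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in points)

def least_squares(points):
    """Return (slope, intercept) of the least-squares regression line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: ten points scattered around a rising line.
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.3, 8.8, 10.2, 10.7]
points = [(float(x), y) for x, y in zip(range(1, 11), ys)]

best_slope, best_intercept = least_squares(points)
best_sse = sse(points, best_slope, best_intercept)
print(f"regression line: y = {best_slope:.3f}x + {best_intercept:.3f}, SSE = {best_sse:.3f}")

# Nudge the line: no nudged line should have a smaller SSE than the regression line.
for d_slope in (-0.2, 0.0, 0.2):
    for d_intercept in (-0.5, 0.0, 0.5):
        nudged = sse(points, best_slope + d_slope, best_intercept + d_intercept)
        print(f"slope {d_slope:+.1f}, intercept {d_intercept:+.1f}: SSE = {nudged:.3f}")
```

Every nudged line in the printout has an SSE at least as large as that of the regression line, and the unchanged line (both nudges equal to 0) reproduces the minimum.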

Try this: First drag all of the points so that they are spread out along the line (not necessarily on it), with no point having a large error. You should now have a relatively small SSE. Then see if you can move the points so that the line keeps nearly the same orientation while the SSE becomes much bigger.

How did you accomplish this? What do you think a large SSE indicates for a given data set? What has this exercise shown you?
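One possible way to carry out this experiment numerically, after you have tried it yourself in the applet, is sketched below. It assumes a made-up data set and a hypothetical helper stretch_residuals that moves each point vertically so that its error from the fitted line is multiplied by a constant factor. This is only one of many ways to enlarge the SSE without turning the line.

```python
from math import sqrt

def fit_line(points):
    """Return (slope, intercept, r, sse) for the least-squares line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in points)
    return slope, intercept, sxy / sqrt(sxx * syy), sse

def stretch_residuals(points, factor):
    """Move each point vertically so its error from the fitted line is scaled by factor."""
    slope, intercept, _, _ = fit_line(points)
    stretched = []
    for x, y in points:
        predicted = intercept + slope * x
        stretched.append((x, predicted + factor * (y - predicted)))
    return stretched

# Hypothetical data: ten points close to the line y = x.
ys = [1.2, 1.8, 3.1, 4.0, 4.9, 6.2, 6.8, 8.1, 9.0, 9.9]
tight = [(float(x), y) for x, y in zip(range(1, 11), ys)]
loose = stretch_residuals(tight, 4.0)

slope_tight, intercept_tight, r_tight, sse_tight = fit_line(tight)
slope_loose, intercept_loose, r_loose, sse_loose = fit_line(loose)

print(f"tight: y = {slope_tight:.3f}x + {intercept_tight:.3f}, r = {r_tight:.3f}, SSE = {sse_tight:.3f}")
print(f"loose: y = {slope_loose:.3f}x + {intercept_loose:.3f}, r = {r_loose:.3f}, SSE = {sse_loose:.3f}")
```

Because every error is multiplied by 4, the fitted line is unchanged while the SSE grows by a factor of 16 and r moves closer to 0.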

D.     Correlation Play: Notice the correlation coefficient in the data box. Recall that this is the Pearson “r-value” of the data set. Move the points around to create data sets whose r-values reflect each of the following relationships:

strong negative relationship: r-value = ____;       Sketch scatter plot



strong positive relationship: r-value = ____;       Sketch scatter plot



moderate negative relationship: r-value = ____;       Sketch scatter plot



weak positive relationship: r-value = ____;       Sketch scatter plot



very weak relationship: r-value = ____;       Sketch scatter plot
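To compare your applet results with a calculation, the sketch below computes the Pearson r-value for a few made-up data sets. The data sets, their labels, and the helper name pearson_r are illustrative assumptions; your own scatter plots will give different values.

```python
from math import sqrt

def pearson_r(points):
    """Pearson correlation coefficient of a list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sqrt(sxx * syy)

xs = range(1, 11)

# Hypothetical y-values, paired in order with x = 1, 2, ..., 10.
data_sets = {
    "strong negative": [9.8, 9.1, 7.9, 7.2, 5.8, 5.1, 4.2, 2.9, 2.1, 0.9],
    "strong positive": [1.2, 1.8, 3.1, 4.0, 4.9, 6.2, 6.8, 8.1, 9.0, 9.9],
    "moderate negative": [8, 5, 9, 4, 7, 3, 6, 5, 2, 4],
    "very weak": [5, 2, 8, 3, 7, 4, 6, 2, 8, 5],
}

r_values = {
    label: pearson_r(list(zip(xs, ys))) for label, ys in data_sets.items()
}

for label, r in r_values.items():
    print(f"{label}: r = {r:.3f}")
```

The sign of r matches the direction of the trend, and its size shrinks toward 0 as the points scatter more widely around any line.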


E.     Free play: What else might you want to explore? Remember that you can add points to this plot or move any of the points around. As you add or move points, watch carefully how the orientation of the regression line changes and how the slope and y-intercept of the equation change. Investigate, then write down what you did and what you think you learned: