Assumptions of correlation coefficient, normality, homoscedasticity

An inspection of a scatterplot can give an impression of whether two variables are related and of the direction of their relationship. But a scatterplot alone is not sufficient to determine whether there is an association between two variables. The relationship depicted in the scatterplot needs to be described quantitatively. Descriptive statistics that express the degree of relation between two variables are called **correlation coefficients**. A commonly employed correlation coefficient for scores at the interval or ratio level of measurement is the **Pearson product-moment correlation coefficient**, or **Pearson's r**.

Pearson's r is a descriptive statistic that describes the linear relationship between two variables, each measured for the same collection of individuals. An "individual" is not necessarily a person: it might be an automobile, a place, a family, a university, etc. For example, the two variables might be the height of a father and the height of his son; there, the "individual" is the (father, son) pair. Such pairs of measurements are called bivariate data. Observations of two or more variables per individual in general are called multivariate data. As with any sample of scores, the sample is drawn from a larger population of scores.
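As a minimal sketch of the statistic itself, the function below computes Pearson's r for a list of bivariate data. The father/son heights are hypothetical numbers invented for illustration, and `pearson_r` is our own helper name, not part of any standard library.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical (father, son) height pairs in cm; each pair is one "individual".
fathers = [165, 170, 172, 178, 180, 183]
sons    = [168, 172, 175, 176, 183, 185]
print(round(pearson_r(fathers, sons), 3))
```

Each index position in the two lists holds the pair of measurements for one individual, which is exactly the bivariate-data structure described above.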

The test for significance of Pearson's r assumes that a particular variable, X, and another variable, Y, form a bivariate normal distribution in the population. A bivariate normal distribution possesses the following characteristics:

· The distribution of the X scores is normally distributed in the population sampled.

· The distribution of the Y scores is normally distributed in the population sampled.

· For each X score, the distribution of Y scores in the population is normal.

· For each Y score, the distribution of X scores in the population is normal.
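One common way to construct a bivariate normal population with a chosen correlation is to mix two independent standard normals; the sketch below (all numbers are assumptions for illustration, and `corr` is our own helper) simulates such draws and checks that the sample r tracks the population correlation `rho`:

```python
import math
import random

random.seed(0)
rho = 0.6      # assumed population correlation for this sketch
n = 50_000

# X ~ N(0,1); Y = rho*X + sqrt(1 - rho^2)*E with E ~ N(0,1) independent of X.
# Then both marginals are standard normal, and for each fixed X the
# conditional distribution of Y is normal -- the characteristics listed above.
xs, ys = [], []
for _ in range(n):
    x = random.gauss(0, 1)
    e = random.gauss(0, 1)
    xs.append(x)
    ys.append(rho * x + math.sqrt(1 - rho * rho) * e)

def corr(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

print(round(corr(xs, ys), 2))  # sample r should be near rho
```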

**Assumption 1: The test of significance for the correlation coefficient r assumes that the two variables measured form a bivariate normal distribution in the population.**

Describing Scatterplots

One of the best tools for studying the association of two variables visually is the scatterplot or scatter diagram. It is especially helpful when the number of data points is large---studying a list is then virtually hopeless. A scatterplot plots two measured variables against each other, for each individual. That is, the "x" (horizontal) coordinate of a point in a scatterplot is the value of one measurement of an individual, and the "y" (vertical) coordinate of that point is the other measurement of the same individual. We call such a plot a scatterplot of "y versus x" or "y against x." Here's an example of a scatterplot:

The red square in the middle of the scatterplot is the point of averages. The point of averages is a
measure of the "center" of a scatterplot, quite analogous to the mean as a measure of the center of a list.

Scatterplots let us see the relationships among variables. Does one variable tend to be larger when another is large? Does the
relationship follow a straight line? Is the scatter in one variable the same, regardless of the value of the other variable?

Correlation and Association

Correlation is a measure of linear association: how nearly a scatterplot follows a straight line. We say that two variables are positively correlated if the scatterplot slopes upward; they are negatively correlated if the scatterplot slopes downward. The correlation coefficient for a scatterplot of Y versus X is always the same as the correlation coefficient for a scatterplot of X versus Y. Note that linear association is not the only kind of association: some variables are nonlinearly associated. For example, the average monthly rainfall in Berkeley, CA, is associated with the month of the year, but that association is nonlinear: it is a seasonal variation that runs in cycles. **Correlation does not measure nonlinear association, only linear association. The correlation coefficient is appropriate only for quantitative variables, not ordinal or categorical variables, even if their values are numerical.**

**Correlation is a measure of association, not causation.** For example, the average height of people at maturity in the US has been increasing. Similarly, there is evidence that the number of plant species is decreasing with time. These two variables have a negative correlation, but there is no (straightforward) causal connection between them.

The correlation coefficient *r* is close to 1 if the data cluster tightly
around a straight line that slopes up from left to right. The correlation coefficient is
close to -1 if the data cluster tightly around a straight line that slopes down from left
to right. If the data do not cluster around a straight line, the correlation coefficient
*r* is close to zero, even if the variables
have a strong nonlinear association. Here are some examples of scatterplots that
have specific values of the correlation coefficient *r*.
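The claims above can be checked directly: data that lie exactly on an upward-sloping line give r = 1, data on a downward-sloping line give r = -1, and r is the same whichever variable plays the role of X. A sketch with invented data (`pearson_r` is our own helper, repeated here so the block is self-contained):

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

xs   = list(range(10))
up   = [2 * v + 1 for v in xs]    # exact line sloping up    -> r = 1
down = [-3 * v + 5 for v in xs]   # exact line sloping down  -> r = -1

print(pearson_r(xs, up), pearson_r(xs, down))
print(pearson_r(up, xs) == pearson_r(xs, up))  # r(Y vs X) equals r(X vs Y)
```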

Linearity

The following scatterplot illustrates a linear relationship between the variables. The scatterplot is roughly football-shaped: the
points do not lie exactly on a line, but are scattered more-or-less evenly around one.

Some scatterplots show curved patterns. Such scatterplots are said to show *nonlinear association* between the two variables. The correlation coefficient does not reflect nonlinear relationships between variables, only linear ones. For example, even if the association is quite strong, if it is nonlinear, the correlation coefficient *r* can be small or zero:

In this plot, the scatter in Y for a given value of X is very small, so the association is strong. Even though the association is perfect, because you can predict Y exactly from X, the correlation coefficient *r* is exactly zero. This is because the association is nonlinear.

In this scatterplot, the pattern in the relationship between the variables is not a straight line---it is curved. The data are scattered more-or-less evenly around a curve: the scatter in the values of Y is about the same for different values of X, that is, in different vertical "slices" through the scatterplot. The correlation coefficient is reasonably large (0.71), because there is an overall trend in the data. However, the correlation coefficient still does not show how strongly associated the variables are, because the pattern of their relationship is curved. The correlation coefficient is not a good summary of the association of these variables.
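The perfect-but-nonlinear case described above is easy to reproduce: take X values symmetric about zero and let Y be exactly X squared. Y is perfectly predictable from X, yet r comes out exactly zero. A sketch with invented data (`pearson_r` is our own helper):

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [v * v for v in xs]   # perfect quadratic association: Y is exactly X^2

print(pearson_r(xs, ys))   # r is 0: the association is strong but nonlinear
```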

**Assumption 2: The correlation coefficient r measures only linear associations: how nearly the data fall on a straight line. It is not a good summary of the association if the scatterplot has a nonlinear (curved) pattern.**

Homoscedasticity and Heteroscedasticity

Scatterplots in which the scatter in Y is about the same in different vertical slices are called homoscedastic ("same scatter"). Data are homoscedastic if the SD in vertical slices through the scatterplot is about the same, regardless of where you take the slice. In contrast, if the vertical SD varies a great deal depending on where you take the slice through the scatterplot, the data are heteroscedastic. The SD is a measure of the scatter in a list. So far, all the plots in this section have been homoscedastic. The next scatterplot shows heteroscedasticity: the scatter in vertical slices depends on where you take the slice.
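The vertical-slice check can be sketched directly: simulate data whose noise grows with X, then compare the SD of Y in a left slice and a right slice. All parameters here are assumptions chosen to make the heteroscedasticity obvious, and `sd` is our own helper:

```python
import math
import random

random.seed(1)

def sd(values):
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

# Heteroscedastic sketch: the spread of the noise in Y grows with X.
xs = [random.uniform(0, 10) for _ in range(5000)]
ys = [x + random.gauss(0, 0.2 + 0.5 * x) for x in xs]

# SD of Y in a vertical slice near the left vs one near the right.
left  = [y for x, y in zip(xs, ys) if 0 <= x < 2]
right = [y for x, y in zip(xs, ys) if 8 <= x < 10]
print(round(sd(left), 2), round(sd(right), 2))  # right slice is far more spread
```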

The scatter in a strip near the right of the plot is much larger than
the scatter in a strip near the left of the plot. There is not much association between Y
and X, but the correlation coefficient is still 0.15. This is an artifact of the
heteroscedasticity.

**Assumption 3: The correlation coefficient r is not a good summary of association if the data are heteroscedastic.**

Outliers

A point that does not fit the overall pattern of the data, or that is many SDs from the bulk of the data, is called an outlier. A single outlier that is far from the point of
averages can have a large effect on the correlation
coefficient. Here are two extreme examples of scatterplots with a large
outlier:

In the first, the outlier makes the
correlation coefficient nearly one; without it, the correlation coefficient would be
nearly zero.

In the second, the outlier makes the correlation coefficient nearly zero;
without it, the correlation coefficient would be nearly one.

**Assumption 4: The correlation coefficient r is not a good
summary of association if the data have outliers.**
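The pull of a single outlier is easy to demonstrate numerically: start with points that show essentially no linear pattern, then append one extreme point far from the point of averages. The data are invented and `pearson_r` is our own helper:

```python
import math

def pearson_r(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Ten points with essentially no linear pattern...
xs = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
ys = [3, 1, 4, 1, 5, 5, 1, 3, 2, 4]
r_without = pearson_r(xs, ys)

# ...plus one extreme outlier far from the point of averages.
r_with = pearson_r(xs + [50], ys + [50])

print(round(r_without, 2), round(r_with, 2))  # r jumps from near 0 to near 1
```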