Micro-evolution

STATISTICS IN INTRODUCTORY BIOLOGY

This supplement is excerpted with permission from the BIOL 1010 lab manual (Bishop et al., 2012). References to BIOL 1010 and BIOL 1011 also apply to BIOL 1020 and BIOL 1021, respectively.

Bishop T, Gass G, Van Dommelen J. 2012. Appendix E: Statistics in Introductory Biology. In: Biology 1010 Laboratory Manual. Halifax (NS): Dalhousie University.

***************************************************************************************************

In virtually every published primary research article in science, the Results section will contain the results of a number of statistical tests performed on the data collected by the researchers. Scientists use statistics to demonstrate mathematically that their results (for example, that plants treated with Fertilizer A grew larger than plants treated with Fertilizer B) are meaningful. For example, a biologist might weigh the plants in the two groups and find that the average weights calculated for each group were different. The researcher would not stop there, but would then want to find out whether the difference observed between the two groups is a legitimate or significant one (Fertilizer A really does promote plant growth better than Fertilizer B), rather than just an accident of chance (that is, the plants chosen for measurement in treatment A just happened to be heavier than the plants chosen for measurement in treatment B, even though there was no real difference in weights caused by the fertilizer used). Another biologist might have used a hypothesis to generate a prediction of the frequency of a particular phenotype in the offspring of a cross between two plants. When he or she actually performs that cross by breeding the plants together, do the frequencies match what was predicted? If they don’t match exactly, are they close enough, or do the expected and the observed frequencies differ significantly?

In this Appendix, you will learn the basic statistical techniques that you may need to use in your BIOL 1010 and 1011 laboratory activities. More advanced biology classes make use of more advanced statistical techniques, but many of these techniques are based on the same concepts you will use in your labs this year.

There are three sections to this Appendix:

I. Basic descriptive statistics: mean and standard deviation

II. Statistical tests: the chi-square test

III. Standard error and 95% confidence intervals

I. BASIC DESCRIPTIVE STATISTICS: MEAN AND STANDARD DEVIATION

When a group of measurements is taken, we often want to be able to characterize that group using descriptive statistics: for example, what was the middle or average weight of a plant in that group? How much did individual plants in that group tend to differ in weight from one another?

A common measure of the middle or average value used in biology is the mean. You have likely calculated means in secondary school math: the mean is found by adding up all of the observed values, then dividing by the number of observed values. The number of observations or data points is referred to as n. The Greek letter sigma (Σ) indicates that you should sum up whatever comes immediately after the sigma. We can represent the procedure for finding the mean like this:

$$\text{mean} = \frac{\sum \text{observed values}}{n}$$

In spreadsheet programs such as Microsoft Excel or Google Docs Spreadsheets you can calculate the mean using the “=AVERAGE” formula.
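As a minimal sketch of the same calculation outside a spreadsheet, the Python lines below add up a set of observed values and divide by n. The plant weights are hypothetical values chosen only for illustration; they are not data from the lab.

```python
# Hypothetical plant weights in grams (illustrative values, not real data).
weights = [12.0, 15.0, 11.0, 14.0, 13.0]

n = len(weights)          # n: the number of observations
mean = sum(weights) / n   # sum of the observed values, divided by n

print("n =", n)
print("mean =", mean)
```

This mirrors what the spreadsheet “=AVERAGE” formula does for a column of numbers.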

The variability of the data set (how much the values tended to differ from the mean) is described using the standard deviation. Together, the mean and the standard deviation tell you about the distribution of your observations: what value they cluster around, and how narrow or wide that cluster is. The larger the differences between each observation and the mean, the larger the standard deviation. The procedure for finding the standard deviation of a sample is more complex than the procedure for finding the mean. Some values will fall below the mean (so their difference from the mean is a negative number), while some values will fall above the mean (so their difference from the mean is a positive number). The differences therefore need first to be squared so that all of them will be positive; the squared differences are then summed and divided by n – 1, and a square root is taken. Here is the formula describing this procedure:

$$\text{standard deviation} = \sqrt{\frac{\sum (\text{observed value} - \text{sample mean})^2}{n - 1}}$$

You can use the “=STDEV” formula in spreadsheet software to calculate the standard deviation. In BIOL 1010, you will need to know what the mean and standard deviation tell you about your set of data. In BIOL 1011, we will build on this knowledge: you will learn how to use n and standard deviation to calculate a related value called standard error, so that when graphing your data you can quickly assess whether the means of two groups are likely to be significantly different, as in the case of the Fertilizer A and B treatments described above.
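The sample standard deviation procedure described in this section can be sketched step by step in Python. The weights are the same kind of hypothetical values used for illustration only, and the function name is an assumption of this sketch. The result is compared with `statistics.stdev` from the Python standard library, which also divides by n – 1.

```python
import math
import statistics

# Hypothetical plant weights in grams (illustrative values, not real data).
weights = [12.0, 15.0, 11.0, 14.0, 13.0]


def sample_standard_deviation(values):
    """Square root of the summed squared differences from the mean over (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    # Squaring makes every difference positive, whether the value
    # fell below or above the mean.
    squared_differences = [(value - mean) ** 2 for value in values]
    return math.sqrt(sum(squared_differences) / (n - 1))


sd = sample_standard_deviation(weights)
print("standard deviation =", round(sd, 3))
print("statistics.stdev   =", round(statistics.stdev(weights), 3))
```

A wider spread of weights around the same mean would give a larger standard deviation.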

II. STATISTICAL TESTS: THE CHI-SQUARE TEST

In both BIOL 1010 and 1011, you will carry out and interpret a statistical test called the chi-square (χ²) test of goodness of fit. This test will allow you to test hypotheses by comparing your predictions to your observations, as in the plant cross example described above. On the next page, you will find complete instructions for carrying out and interpreting the chi-square test. This test is just one of a very wide range of statistical tests used in science, and if you take upper-year courses in biology you will likely encounter many different statistical tests. However, these tests tend to share some common features:

• The purpose of the test is to help you decide whether or not to reject some hypothesis. The hypothesis itself will differ depending on the study being performed and the statistical test being used, but at the end of the test you should be able to say whether the hypothesis should be rejected or not. Notice that we do not say that the hypothesis is “supported” or “proven”, simply that we fail to reject it.

• At the end of the mathematical operations involved in the test, you have computed what is called a test statistic. In the chi-square test, the test statistic is the χ² value that you calculate by adding up the squared differences between observed and expected values divided by the expected value; other types of tests (the Student’s t-test or the Mann-Whitney U test, for example) have their own test statistics arrived at by their own procedures.

• Each test also requires that you find the number of degrees of freedom, which is related to the number of different categories being studied. Together, the test statistic and the degrees of freedom value will allow you to interpret the results of your test.

• When the test statistic and degrees of freedom have been calculated, you use these values to consult statistical tables (on paper or in computer databases) specific to each statistical test. In your chi-square test, the degrees of freedom value tells you which row of the table to look in.
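The features listed above can be sketched for a chi-square goodness-of-fit test in Python. The offspring counts and the 3:1 predicted ratio are hypothetical. The rule that degrees of freedom equal the number of categories minus one is the usual one for a simple goodness-of-fit test and is an assumption here, not something stated in this Appendix. The value 3.841 is the standard critical χ² value for 1 degree of freedom at P = 0.05.

```python
def chi_square_statistic(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical cross: a 3:1 phenotype ratio is predicted among 100 offspring.
observed = [70, 30]
expected = [75, 25]

chi2 = chi_square_statistic(observed, expected)

# Assumed rule for a simple goodness-of-fit test: categories minus one.
degrees_of_freedom = len(observed) - 1

# Standard critical value for df = 1 at P = 0.05.
critical_value = 3.841

reject_hypothesis = chi2 > critical_value
print("chi-square =", round(chi2, 3))
print("degrees of freedom =", degrees_of_freedom)
print("reject hypothesis?", reject_hypothesis)
```

With these hypothetical counts the test statistic is smaller than the critical value, so the hypothesis is not rejected.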

BIOL 1020

Lab Assignment: Microevolution

Start by re-saving this file as follows: lab_surname_labtitle.rtf, substituting your own surname. Remember to convert to PDF after you have finished entering your answers and before submitting for grading.

Type your responses to the questions below where indicated. Remember to save your work frequently.

Data Analysis and Interpretation

1. Use your data to estimate the allele frequencies at the longhair locus for the cats in each city of your chosen pair. Use the warm-up exercises in the online content as a guide and show your work clearly. Enter your results in Table 2 (which goes with Question 8 in this document). (2 marks)

(a) City # 1 [replace with city name]

RESPONSE:

PLEASE LEAVE THE SPACE BELOW EMPTY FOR TA COMMENTS

(b) City # 2 [replace with city name]

RESPONSE:

PLEASE LEAVE THE SPACE BELOW EMPTY FOR TA COMMENTS

2. What percentage of the cat population in each city is heterozygous at the longhair locus? Use the warm-up exercises in the online content as a guide and show your work clearly. (2 marks)

(a) City # 1 [replace with city name]

RESPONSE:

PLEASE LEAVE THE SPACE BELOW EMPTY FOR TA COMMENTS

(b) City # 2 [replace with city name]

RESPONSE:

PLEASE LEAVE THE SPACE BELOW EMPTY FOR TA COMMENTS

3. Use your data to calculate the allele frequencies at the spotting locus for the cats in each city of your chosen pair. Exclude unknowns from your totals. Use the warm-up exercises in the online content as a guide and show your work clearly. (2 marks)

(a) City # 1 [replace with city name]

RESPONSE:

PLEASE LEAVE THE SPACE BELOW EMPTY FOR TA COMMENTS

(b) City # 2 [replace with city name]

RESPONSE:

PLEASE LEAVE THE SPACE BELOW EMPTY FOR TA COMMENTS

4. Use your answers from the previous question to calculate the NUMBER of cats with each genotype that would be expected in your sample if the population were in Hardy-Weinberg equilibrium with respect to the spotting locus. Use the warm-up exercises in the online content as a guide and show your work clearly. Add your genotype numbers to the appropriate ‘Expected #’ columns in Tables 1a and 1b. (2 marks)

(a) City # 1 [replace with city name]

RESPONSE:

PLEASE LEAVE THE SPACE BELOW EMPTY FOR TA COMMENTS

(b) City # 2 [replace with city name]

RESPONSE:

PLEASE LEAVE THE SPACE BELOW EMPTY FOR TA COMMENTS

5. From your data sheet, add your observed number of cats for each genotype associated with the spotting locus to the ‘Observed #’ columns in Tables 1a and 1b. Do the ‘Expected’ genotype numbers match the actual genotype numbers that you observed in your samples? Probably not. But are the differences STATISTICALLY significant? If not, then the differences are consistent with chance alone and do not represent a meaningful deviation from the equilibrium numbers. If the differences ARE statistically significant, then they are unlikely to be due to chance alone, and some other factor probably accounts for them.

We can use the chi-squared test of goodness of fit to determine whether the observed spotting genotype numbers in your data are significantly different from those expected under equilibrium conditions. In this case, we can say that the null hypothesis is that there is no difference between the observed spotting genotype numbers and those expected under Hardy-Weinberg equilibrium.

Complete Tables 1a and 1b to obtain a test statistic for each city, and answer the questions that accompany them. (5 marks)

Table 1a. Calculation of chi-squared test statistic for three genotypes in cats at shelters in _____________________________________________ [fill in the name of the first city in your pair].

Genotype (class) | Observed # (o) | Expected # (e) | (o-e) | (o-e)² | (o-e)²/e
SS | | | | |
Ss | | | | |
ss | | | | |
Total | | | | | χ² =

You will have to determine the number of degrees of freedom before you proceed. When calculating the degrees of freedom for Hardy-Weinberg, the equation is slightly different than in other genetics problems. Use the formula

df = k – r

where k= the number of classes (genotypes)

and r = the number of alleles in an individual
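A sketch of how the columns of Table 1a and the df formula fit together, assuming allele frequencies are estimated by counting alleles directly from the three genotype classes (the warm-up exercises in the online content may present this differently). The genotype counts are hypothetical and the function name is illustrative only.

```python
def hardy_weinberg_chi_square(obs_SS, obs_Ss, obs_ss):
    """Return allele frequencies, expected genotype counts, and chi-square."""
    total = obs_SS + obs_Ss + obs_ss
    # Each SS cat carries two S alleles and each Ss cat carries one.
    p = (2 * obs_SS + obs_Ss) / (2 * total)   # frequency of S
    q = 1 - p                                 # frequency of s
    # Expected NUMBERS under Hardy-Weinberg equilibrium: p^2, 2pq, q^2 times N.
    expected = [p * p * total, 2 * p * q * total, q * q * total]
    observed = [obs_SS, obs_Ss, obs_ss]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return p, q, expected, chi2


# Hypothetical sample of 50 cats (not real data).
p, q, expected, chi2 = hardy_weinberg_chi_square(10, 20, 20)

k = 3        # number of classes (genotypes)
r = 2        # number of alleles in an individual
df = k - r   # degrees of freedom, using the formula above

print("f(S) =", round(p, 3), " f(s) =", round(q, 3))
print("expected counts =", [round(e, 2) for e in expected])
print("chi-square =", round(chi2, 3), " df =", df)
```

The expected counts always add up to the sample size, which is a quick check on the arithmetic.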

(a) How many degrees of freedom are there for Table 1a?

RESPONSE:

PLEASE LEAVE THE SPACE BELOW EMPTY FOR TA COMMENTS

(b) What p-value did you obtain for the test statistic in Table 1a (refer to Table 3 near the end of this document)? Give a range if appropriate.

RESPONSE:

PLEASE LEAVE THE SPACE BELOW EMPTY FOR TA COMMENTS

(c) Should you reject or fail to reject the null hypothesis? With reference to your p-value, justify your decision.

RESPONSE:

PLEASE LEAVE THE SPACE BELOW EMPTY FOR TA COMMENTS

Table 1b. Calculation of chi-squared test statistic for three genotypes in cats at shelters in _____________________________________________ [fill in the name of the second city in your pair].

Genotype (class) | Observed # (o) | Expected # (e) | (o-e) | (o-e)² | (o-e)²/e
SS | | | | |
Ss | | | | |
ss | | | | |
Total | | | | | χ² =

(d) How many degrees of freedom are there for Table 1b?

RESPONSE:

PLEASE LEAVE THE SPACE BELOW EMPTY FOR TA COMMENTS

(e) What p-value did you obtain for the test statistic in Table 1b? Give a range if appropriate.

RESPONSE:

PLEASE LEAVE THE SPACE BELOW EMPTY FOR TA COMMENTS

(f) Should you reject or fail to reject the null hypothesis? With reference to your p-value, justify your decision.

RESPONSE:

PLEASE LEAVE THE SPACE BELOW EMPTY FOR TA COMMENTS

6. With respect to the spotting locus, is microevolution occurring in the cat population in either city in your pair? State your evidence. (1 mark)

RESPONSE:

PLEASE LEAVE THE SPACE BELOW EMPTY FOR TA COMMENTS

7. List the five assumptions of the Hardy-Weinberg principle. If microevolution is occurring with respect to the spotting locus, which of the assumptions do you think could be violated? Explain your answer. (If your data indicate that microevolution at the spotting locus is NOT occurring, pretend for a moment that it is, and answer the same question.) (1 mark)

RESPONSE:

PLEASE LEAVE THE SPACE BELOW EMPTY FOR TA COMMENTS

8. Table 2 below summarizes your data and calculations. Another test (the chi-squared test of independence) would be required to determine whether any differences between the cities might be statistically significant, but that is beyond the scope of this lab. For our purposes, we’ll consider the differences to be significant.

Propose a hypothesis (explanation) as to why there are differences in cat data between cities. You may propose a general hypothesis (i.e., one that might apply to any or all of the items in Table 2), or a hypothesis specific to a particular item in Table 2. (Hint: the cities weren’t paired at random!) (1 mark)

Table 2. Summary of observations and calculations based on data collected from photos of cats at shelters in two North American cities.

City | longhair: f (L) | longhair: f (l) | spotting: f (SS) | spotting: f (Ss) | spotting: f (ss) | HWE for spotting (yes/no)
enter name of first city | | | | | |
enter name of second city | | | | | |

RESPONSE:

PLEASE LEAVE THE SPACE BELOW EMPTY FOR TA COMMENTS

9. Name one potential drawback to the data collection procedure and speculate about its potential impact on your data. (1 mark)

RESPONSE:

PLEASE LEAVE THE SPACE BELOW EMPTY FOR TA COMMENTS

Table 3. Critical χ² Values

Degrees of Freedom | Probability (P)
 | 0.95 | 0.8 | 0.5 | 0.2 | 0.05 | 0.01 | 0.005
1 | 0.004 | 0.064 | 0.455 | 1.642 | 3.841 | 6.635 | 7.879
2 | 0.103 | 0.446 | 1.386 | 3.219 | 5.991 | 9.210 | 10.597
3 | 0.352 | 1.005 | 2.366 | 4.642 | 7.815 | 11.345 | 12.838
4 | 0.711 | 1.649 | 3.357 | 5.989 | 9.488 | 13.277 | 14.860
5 | 1.145 | 2.343 | 4.351 | 7.289 | 11.070 | 15.086 | 16.750
6 | 1.635 | 3.070 | 5.348 | 8.558 | 12.592 | 16.812 | 18.548
7 | 2.167 | 3.822 | 6.346 | 9.803 | 14.067 | 18.475 | 20.278
8 | 2.733 | 4.594 | 7.344 | 11.030 | 15.507 | 20.090 | 21.955
 | Non significant | | | | Significant | |

Using the table of Critical χ² Values

1. Locate the row containing the appropriate degrees of freedom.

2. Find where your chi-squared test statistic fits within the range of numbers in the row (it may fall outside of the range; i.e. to the left or right ends of the scale).

3. Note the probability values (p-values) corresponding to your test statistic and determine which p-values your test statistic lies between, or whether the p-value is off the scale.

4. According to statistical convention, a p-value of less than 0.05 (p < 0.05) means that, if the null hypothesis were true, a difference at least as large as the one between what you observed and what you expected would arise by chance less than 5% of the time. The difference between the actual and expected values is therefore considered to be due to some factor other than chance, so you can reject your null hypothesis that the difference is due to chance alone.

5. If the p-value is greater than or equal to 0.05 (p ≥ 0.05), a difference this large would arise by chance at least 5% of the time, so you do not reject the null hypothesis that the difference between what you expected and what you observed is due to chance alone.
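The lookup steps above can be sketched in Python as a way to check a p-value range found by hand. The critical values are copied from the first three rows of Table 3; the function and variable names are illustrative only, and the example test statistics are hypothetical.

```python
# Probability column headings and the first three rows of Table 3.
PROBABILITIES = [0.95, 0.8, 0.5, 0.2, 0.05, 0.01, 0.005]
CRITICAL_VALUES = {
    1: [0.004, 0.064, 0.455, 1.642, 3.841, 6.635, 7.879],
    2: [0.103, 0.446, 1.386, 3.219, 5.991, 9.210, 10.597],
    3: [0.352, 1.005, 2.366, 4.642, 7.815, 11.345, 12.838],
}


def p_value_range(chi2, df):
    """Return (lower, upper) bounds on the p-value for a test statistic."""
    row = CRITICAL_VALUES[df]   # step 1: the row for these degrees of freedom
    # Step 2: count how many critical values the statistic exceeds.
    position = sum(1 for critical in row if chi2 > critical)
    # Step 3: read off the neighbouring probabilities, or go off the scale.
    if position == 0:
        return (PROBABILITIES[0], 1.0)    # off the left end of the scale
    if position == len(row):
        return (0.0, PROBABILITIES[-1])   # off the right end of the scale
    return (PROBABILITIES[position], PROBABILITIES[position - 1])


def reject_null(chi2, df):
    """Steps 4 and 5: reject only when the whole range lies at or below 0.05."""
    lower, upper = p_value_range(chi2, df)
    return upper <= 0.05


print(p_value_range(1.389, 1), reject_null(1.389, 1))
print(p_value_range(4.2, 1), reject_null(4.2, 1))
```

A test statistic exactly equal to the 0.05 critical value is treated as p = 0.05 and therefore not rejected, which matches step 5.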

Lab Assignment Survey Questions

We’re interested in your feedback! Please visit the Lab Assignments Survey, via the ‘Proctor and Other Surveys’ page in the Course Menu of the class site, to enter your responses to the questions below. This survey is anonymous.

1. Approximately how long did it take you to complete this assignment?

• less than an hour

• 1-2 hours

• 2-3 hours

• 3-4 hours

• more than 4 hours

2. Was this a fair amount of time, considering the particulars of the assignment?

• Yes, it was a fair amount of time.

• No, the assignment could have been more comprehensive.

• No, it took too much time.

3. How would you rate the level of difficulty of the assignment?

• Easy

• Challenging, but manageable

• Too challenging

4. How would you rate the learning value of the assignment?

• The assignment helped my learning.

• The assignment did not help my learning.

5. Do you have any additional comments or feedback about this assignment?

Start by re-saving this file as follows: lab_surname_microevolutiondata.xlsx, substituting your own surname. Remember to convert to (or save as) PDF before submitting.

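Table 4. Data sheet for recording selected phenotypes of cats in shelters in __________________________ [replace the blank with the first city of your pair].

cat name | Tina T | Oximo | | | | | | | | | | | | | | | | | | | | |
phenotype | E.g. 1 | E.g. 2 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | Total
short hair (L_) | 1 | 0 | | | | | | | | | | | | | | | | | | | | |
long hair (ll) | 0 | 1 | | | | | | | | | | | | | | | | | | | | |
100% white (W_) | 0 | 1 | | | | | | | | | | | | | | | | | | | | |
<100% white (ww) | 1 | 0 | | | | | | | | | | | | | | | | | | | | |
>50% white spotting (SS) | 1 | 0 | | | | | | | | | | | | | | | | | | | | |
<50% white spotting (Ss) | 0 | 0 | | | | | | | | | | | | | | | | | | | | |
0% white (ss) | 0 | 0 | | | | | | | | | | | | | | | | | | | | |
unknown at spotting locus | 0 | 1 | | | | | | | | | | | | | | | | | | | | |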

Table 5. Data sheet for recording selected phenotypes of cats in shelters in __________________________ [replace the blank with the second city of your pair].