Using Gene Mapping to Introduce the Chi-square Test of Independence
The Next Generation Science Standards (NGSS), the AP Biology Course Description, and Vision and Change all call for life science teachers to incorporate into their curricula as many opportunities as possible for students to practice quantitative reasoning (i.e., quantitative analysis or mathematical thinking). Though often loosely referred to as numeracy, quantitative reasoning has been defined to include not only counting and measuring, but also the act of comparing quantities, determining differences between quantities, and analyzing those differences. The NGSS extends this framework to include using mathematics to make quantitative predictions. We can take this reasoning approach one step further by having students practice quantitative reasoning within context (also called QR-C). QR-C is defined in the literature by Robert Mayes and James Myers as “mathematics and statistics applied in real-life, authentic situations that impact an individual’s life as a constructive, concerned, and reflective citizen.” Indeed, a primary and critical role of the science teacher, and in our case here, the biology teacher, is to provide students with as many opportunities as possible to practice QR-C in a scientific context.
The BioInteractive activity “Mapping Genes to Traits in Dogs Using SNPs” gives students the opportunity to practice their numeracy skills, but with the option of going a step further into quantitative reasoning within context. For those readers who are unfamiliar with the “Mapping Genes” activity, students can first watch an engaging 29-minute lecture in which Dr. Elinor Karlsson of the Broad Institute in Cambridge, MA, discusses genome-wide association studies (GWAS). Students then analyze actual sequence data from DNA isolated from dog saliva, which was obtained and analyzed by Dr. Karlsson and colleagues at the Broad.
In brief, the analysis of the data in the “Mapping Genes” activity has students count the number of single-nucleotide polymorphisms (SNPs) found in dog chromosome regions that may be associated with some basic traits in dogs, like curly and straight hair. Students then graph their count data to visualize the magnitudes of the SNP differences and make evidence-based claims about whether certain traits are likely associated with certain SNPs.
As an extension, students can dig a bit deeper into data analysis and perform several chi-square tests. Students use the chi-square test to statistically test the large differences shown in their graphs as they look for associations (called “correlations” in the educator materials for the activity) between the alleles of certain SNPs and dog phenotypes. Students are shown a shortcut for how to calculate their expected values, but the activity does not reveal to them that they are actually doing a version of the chi-square test called the chi-square test of independence, rather than the more-familiar chi-square test of goodness of fit.
Most high school and college biology teachers already have their students learn and use the chi-square goodness-of-fit test to analyze genetics and Hardy-Weinberg data. I like to use this part of the “Mapping Genes” activity to introduce my students to the test of independence and the logical fun of testing whether combinations of count data are occurring independently or are associated. Teaching the logic that underlies some of our more common statistical tests, like chi-square, removes some of the mystery of why and when we use different tests to analyze data. Revealing this logic also brings students into the realm of QR-C.
In the “Mapping Genes” activity, calculating expected values is straightforward because the number of dogs with each of two traits is kept equal. For example, in the “Dog Coat Texture” section of the student handout, students are provided a sample of 10 dogs, five of which have a curly coat and five of which have a straight coat (shown in the curly-straight “SNP Cards”). The GWAS has identified six loci on chromosome 27 that have SNPs that may or may not be associated with curly and straight coats. One SNP is at locus Chr27 5545082, where some dogs have an A nucleotide and others have a G (see Table 1).
| Allele | Curly Coat | Straight Coat | Difference |
|--------|------------|---------------|------------|
| A      | 4          | 8             | 4          |
| G      | 6          | 2             | 4          |

Total number of differences: 8
Table 1. A table from the “Educator Materials” (see the “Distribute the coat-texture cards” section) of one of the many SNPs students analyze in the “Mapping Genes” activity.
For all calculations of expected values, students are told that the expected value is equal to the total number of times an allele occurs at that locus divided by two. In other words, if a SNP is not associated with a trait, it should be expected to show up in equal numbers in the dogs with each trait. Thus, in the example in Table 1, out of 12 A nucleotides, we would expect six to show up in curly-coat dogs and six to show up in straight-coat dogs. At the same time, out of eight G nucleotides, we would expect four to show up in curly-coat dogs and four to show up in straight-coat dogs. Students use this information to fill out a table similar to Table 2.
| Allele | Curly | Straight | Expected |
|--------|-------|----------|----------|
| A      | 4     | 8        | 6        |
| G      | 6     | 2        | 4        |
Table 2. Allele counts for locus Chr27 5545082, where some dogs have an A nucleotide and others have a G, with observed and expected values.
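The halving shortcut described above can be sketched in a few lines of Python. This is only an illustration: the dictionary layout and the function name `expected_by_halving` are assumptions made for this example and do not come from the activity materials.

```python
# Observed allele counts from Table 1 (locus Chr27 5545082).
observed = {
    "A": {"curly": 4, "straight": 8},
    "G": {"curly": 6, "straight": 2},
}


def expected_by_halving(counts):
    """Shortcut from the activity: expected value = allele total / 2.

    This only works when both coat types contribute the same number
    of alleles, as they do in the 5-curly / 5-straight sample.
    """
    return {allele: sum(by_coat.values()) / 2 for allele, by_coat in counts.items()}


expected = expected_by_halving(observed)
print(expected)  # {'A': 6.0, 'G': 4.0}
```

The result matches the "Expected" column of Table 2: six for the A allele and four for the G allele in each coat type.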
Students then use the observed and expected values to perform the chi-square test as shown in Table 3.
| Category               | Observed (O) | Expected (E) | $(O-E)^2/E$ |
|------------------------|--------------|--------------|-------------|
| Curly with A allele    | 4            | 6            | 0.67        |
| Curly with G allele    | 6            | 4            | 1           |
| Straight with A allele | 8            | 6            | 0.67        |
| Straight with G allele | 2            | 4            | 1           |

$\chi^2 = \sum (O-E)^2/E = 3.34$

Table 3. Chi-square calculations for locus Chr27 5545082. If you do not round your calculations until the end, $\chi^2$ will be 3.33 instead, which is very similar.
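For readers who want to check the arithmetic in Table 3, here is a minimal Python sketch. The variable and function names are illustrative assumptions; the numbers are the observed and expected values from the table. It also shows where the 3.33 versus 3.34 difference comes from.

```python
# (observed, expected) pairs for the four categories in Table 3.
cells = {
    "curly, A": (4, 6),
    "curly, G": (6, 4),
    "straight, A": (8, 6),
    "straight, G": (2, 4),
}


def chi_square(pairs):
    """Sum (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in pairs)


# Rounding only at the end gives 3.33.
unrounded = chi_square(cells.values())
print(round(unrounded, 2))  # 3.33

# Rounding each term to two decimals first, as in Table 3, gives 3.34.
rounded_terms = sum(round((o - e) ** 2 / e, 2) for o, e in cells.values())
print(round(rounded_terms, 2))  # 3.34
```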
From the chi-square critical values table below (Table 4), students will find that with one degree of freedom (df), the probability (p) of obtaining a chi-square ($\chi^2$) value of 3.34 or larger is between 0.10 and 0.05 (gray highlighted squares) if the null hypothesis ($H_0$) is true. Here, the null hypothesis is that the allele at this locus and coat texture are independent, so any difference in allele counts between the curly- and straight-coat dogs is due to chance alone. Because p is greater than 0.05, the result does not meet the threshold for rejecting the null hypothesis. Thus, students can conclude that, in this small sample, there is not a statistically significant association between having an A or G allele at the Chr27 5545082 locus in dogs and having curly or straight hair.
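Instead of reading a critical values table, the p-value can be computed directly. The sketch below covers only the one-degree-of-freedom case that applies here (two alleles by two coat types), where the chi-square statistic is the square of a standard normal variable. The function name is an assumption made for this example.

```python
import math


def chi_square_p_value_df1(chi2):
    """Upper-tail probability for a chi-square statistic with df = 1.

    With one degree of freedom, P(X >= chi2) = erfc(sqrt(chi2 / 2)).
    This shortcut does NOT apply to other degrees of freedom.
    """
    return math.erfc(math.sqrt(chi2 / 2))


# Unrounded chi-square value for locus Chr27 5545082 (10/3, about 3.33).
p = chi_square_p_value_df1(10 / 3)
print(0.05 < p < 0.10)  # True
```

The computed p falls between 0.05 and 0.10, which agrees with the table lookup described above.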
However, if these observed ratios are maintained with a larger sample of dogs, the association is likely to become significant. Another important consideration is that small sample sizes have the potential to either over- or underestimate the strength of an association. A good guideline when collecting data for a chi-square test is to try to ensure that no expected values are less than 1.0 and that no more than 20% of expected values are less than 5.0. (Also note that here we are using p = 0.05 as the significance level, but this value can vary by the question being asked or the field of study.)
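The sample-size guideline above is easy to turn into a quick check. The following Python sketch is one possible way to write it; the function name and the shape of the returned report are assumptions for illustration. Applied to the expected values in Table 3, it suggests that the 10-dog teaching sample is itself on the small side.

```python
def check_expected_counts(expected):
    """Apply the guideline: no expected value below 1.0, and no more
    than 20% of expected values below 5.0."""
    any_below_one = any(e < 1.0 for e in expected)
    share_below_five = sum(e < 5.0 for e in expected) / len(expected)
    return {
        "any_below_1": any_below_one,
        "share_below_5": share_below_five,
        "meets_guideline": (not any_below_one) and share_below_five <= 0.20,
    }


# Expected values from Table 3: curly A, curly G, straight A, straight G.
report = check_expected_counts([6, 4, 6, 4])
print(report)
# meets_guideline is False: two of the four expected values (50%) are below 5.
```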
Using equal numbers of dogs with each trait is effective scaffolding for students as they calculate their expected values. However, students are likely to encounter other GWAS data, and other data in general, that require the test of independence but in which the two trait categories do not contain equal numbers of individuals. In these situations, it’s important for students to understand the logic behind calculating the expected values.
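When the groups are unequal, the standard approach for a test of independence is to compute each expected value as (row total × column total) / grand total. The Python sketch below illustrates that general formula; it may be presented differently in the Student Guide. The function name is an assumption, and the unbalanced counts are hypothetical numbers invented for this example, not data from the activity.

```python
def expected_counts(table):
    """Expected counts under independence for a table of observed counts.

    expected[i][j] = row_total[i] * column_total[j] / grand_total
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]


# Table 1 counts: rows are alleles A and G, columns are curly and straight.
balanced = [[4, 8], [6, 2]]
print(expected_counts(balanced))  # [[6.0, 6.0], [4.0, 4.0]]

# HYPOTHETICAL unbalanced sample (not from the activity):
# 6 alleles from curly-coat dogs and 14 from straight-coat dogs.
unbalanced = [[3, 9], [3, 5]]
print(expected_counts(unbalanced))
```

With equal groups, the general formula reduces to the "divide by two" shortcut, which is why both approaches give 6 and 4 for Table 1. With unequal groups, the expected values are no longer the same in each column, so the shortcut would give the wrong answer.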
Genome-wide association studies are allowing scientists to quickly and efficiently find small but potentially important genetic differences between individuals both within and among species. Ultimately, for human biology, GWAS data can reveal, and are already revealing, previously undetectable genetic connections to both common and rare diseases. The BioInteractive “Mapping Genes” activity provides students with an intriguing and fun opportunity to understand this process at a basic level and opens a door for students to see what is becoming possible in the field of bioinformatics.
Students can learn the logic required for the chi-square test of independence by reading the "Student Guide to the Chi-square Test of Independence."
Paul Strode teaches at Fairview High School in Boulder, Colorado. He’s been teaching since 1991 and has been teaching descriptive and inferential statistics in his high school biology classes since finishing graduate school in 2004. In 2014, he and fellow biology teacher Ann Brokaw co-authored the guide “Using BioInteractive Resources to Teach Mathematics and Statistics in Biology” as a resource for teachers to bring statistics into their own biology courses. He’s married to a high school language arts teacher, with whom he has a teenage daughter. He runs, bikes, swims, publishes in The American Biology Teacher, and blogs as Mr. Dr. Science Teacher.