A blog about statistics. How great is this?! If it’s a blog, it has to be short. My wife, however, would say that even a blog about statistics is still going to be way too long.
In physiology education, we usually want to compare the impact of something—a new instructional paradigm, say—between different groups: for example, a group that gets a traditional approach and a group that gets a new approach. Depending on the number of groups we want to compare, there are different ways to design the experiment and to analyze the data.
Two Samples: to Pair or Not to Pair?
Suppose you want to see if formative assessments over an entire semester impact learning. Clearly, your students can either have formative assessments or not. So you randomly assign your 12 students to be in one group or the other. You teach your course, give the 6 students formative assessments, and then grade your 65-point final. The question is, did formative assessments (given to the students in Group 1) impact their grade on the final? These are the grades:
These groups are independent of each other: the observations in one group are unrelated to the observations in the other group. So we want an unpaired 2-sample test. One option is a 2-sample t test. Here, the grades in the 2 groups are similar (P = 0.54): in this fictitious experiment, formative assessments did not impact grades.
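An unpaired 2-sample t test like this one is a one-liner in most statistics software. Here is a minimal sketch using Python's scipy; the grades below are hypothetical stand-ins (the original data are not reproduced here), so the P value will not match the one quoted above:

```python
from scipy import stats

# Hypothetical final-exam grades (out of 65), for illustration only.
group1 = [60, 55, 58, 52, 61, 57]  # got formative assessments
group2 = [56, 54, 59, 51, 60, 53]  # no formative assessments

# Unpaired (independent-samples) 2-sample t test
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"t = {t_stat:.2f}, P = {p_value:.2f}")
```

If the resulting P is large, the data are consistent with the two groups having the same underlying mean grade.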
What happens if the observations in one group are related to the observations in the other group? This could happen if you gave formative assessments to each student (Treatment 1) for half of your course and then gave an exam. During the other half of your course, each student got no formative assessments (Treatment 2). For each student you randomly assign the order of the treatments, so that half of the students get Treatment 1 first and the other half get Treatment 2 first.
In this situation each subject acts as her own control—this makes the comparison of the treatments more precise—and we want a paired 2-sample test. These are the data:
| Subject | Treatment 1 | Treatment 2 | Difference |
| --- | --- | --- | --- |
Here, the grades after each treatment are similar (P = 0.62): in this fictitious experiment, formative assessments did not impact grades.
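A paired test works on the within-student differences rather than on the two columns separately. A minimal sketch, again with hypothetical grades standing in for the actual data:

```python
from scipy import stats

# Hypothetical paired grades: position i in each list is the same student.
treatment1 = [58, 54, 61, 50, 57, 55]  # with formative assessments
treatment2 = [56, 55, 59, 49, 58, 53]  # without formative assessments

# Paired 2-sample t test: equivalent to a 1-sample t test
# on the differences treatment1[i] - treatment2[i]
t_stat, p_value = stats.ttest_rel(treatment1, treatment2)
print(f"t = {t_stat:.2f}, P = {p_value:.2f}")
```

Because each student serves as her own control, the student-to-student variation drops out of the comparison, which is exactly why pairing makes the test more precise.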
When You Have Three or More Samples
Let’s pretend we want to think about the amount of fat donuts absorb when they are cooked. These numbers represent the amount of fat absorbed when 6 batches of donuts are cooked in 4 kinds of fat.
If you are watching your diet, the lower the number, the better. There is good news and bad news about this example. The good news is that there are 24 donuts in a single batch. The bad news is that 100 has been subtracted from the actual amounts in order to simplify the numbers.
The first question: why not just use a 2-sample (unpaired) test to compare the amount of fat absorbed? There are two answers. First, if we compare just 2 groups at a time, we fail to use the information about variation within the two remaining groups. Second, if we compare just 2 groups at a time, we can make a total of 6 comparisons (1–2, 1–3, 1–4, 2–3, 2–4, 3–4). And if we do that, the chance we find at least one of the 6 comparisons to be statistically meaningful—when in fact all 4 groups are statistically equivalent—is about 1 in 4 (26%). The more comparisons we make, the greater the chance that we find a comparison to be statistically meaningful simply because we are making more comparisons.
What’s the solution? Use a procedure that initially compares all 4 groups at the same time. One option is analysis of variance. In analysis of variance, if the variation between groups is sufficiently larger than the variation within groups, that result would be unusual if the group means were truly equal. Here, by analysis of variance, the amount of fat absorbed differs among the 4 fat types (P = 0.007). You can then use other techniques to identify just which groups differ.
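A one-way analysis of variance is also a one-liner in scipy. The values below are the fat-absorption figures from the classic Snedecor and Cochran donut example (with 100 already subtracted, as described above); if my transcription of those published values is right, the P value lands near the 0.007 quoted above:

```python
from scipy import stats

# Fat absorbed by 6 batches of donuts in each of 4 fats
# (Snedecor and Cochran's example; actual amounts minus 100)
fat1 = [64, 72, 68, 77, 56, 95]
fat2 = [78, 91, 97, 82, 85, 77]
fat3 = [75, 93, 78, 71, 63, 76]
fat4 = [55, 66, 49, 64, 70, 68]

# One-way analysis of variance across all 4 groups at once
f_stat, p_value = stats.f_oneway(fat1, fat2, fat3, fat4)
print(f"F = {f_stat:.2f}, P = {p_value:.3f}")  # P about 0.007
```

Note that the F statistic is literally the ratio the text describes: between-group variation divided by within-group variation.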
The Big Picture
No matter how many groups you want to compare, the idea is the same: you want to design the experiment to account for—as best you can—extraneous sources of variation (like individual differences) that can impact the thing you want to measure, and you want to use all the information you collected when you compare the groups.
- Curran-Everett D. Multiple comparisons: philosophies and illustrations. Am J Physiol Regul Integr Comp Physiol 279: R1–R8, 2000.
- Curran-Everett D. Explorations in statistics: hypothesis tests and P. Adv Physiol Educ 33: 81–86, 2009.
- Curran-Everett D. Explorations in statistics: permutation methods. Adv Physiol Educ 36: 181–187, 2012.
- Snedecor GW, Cochran WG. Statistical Methods (7th edition). Ames, IA: Iowa State Univ. Press, 1980, p 83–106, 215–237.
Doug Everett (Curran-Everett for publications) graduated from Cornell University (BA, animal behavior), Duke University (MS, physical therapy) and the State University of New York at Buffalo (PhD, physiology). He is now Professor and Head of the Division of Biostatistics and Bioinformatics at National Jewish Health in Denver, CO. In 2011, Doug was accredited as a Professional Statistician by the American Statistical Association; he considers this quite an accomplishment for a basic cardiorespiratory physiologist. Doug has written invited reviews on statistics for the Journal of Applied Physiology and the American Journal of Physiology; with Dale Benos he has written guidelines for reporting statistics; and he has written educational papers on statistics for Advances in Physiology Education. Doug and his wife Char Sorensen officiate for USA Swimming and US Paralympic Swimming. After 32 years in 6th-grade classrooms, Char is now on her Forever Summer schedule: she retired in May 2009.