Imagine we had a question: “Do men and women differ on X?”
No matter what “X” is (height, empathy, knowledge of 13th-century Spanish history, or anything else), we know that any given man will be different from any given woman, but what we don’t know is how men “on average” differ from women “on average.” That is, when we asked our initial question, we probably wanted to know how the mean for men compared with the mean for women. But we will never know the actual mean for men or the actual mean for women, because that would involve measuring more than 7 billion people! So we need to somehow get a sample of men and a sample of women, compare them, and draw a conclusion from that.
Let’s say we get a sample of 100 men and 100 women, and we ask them about Spanish history. In our sample, women average 68% and men average 63%. That is the result for our sample and it is a rock-solid result. But, remember, we aren’t particularly interested in our sample – we are interested in “men vs. women,” not “men we happened to look at” vs. “women we happened to look at.” We want to use our sample to infer something about the larger population (and that is what puts the infer in inferential statistics).
Making these inferences comes with a serious challenge: Any difference we see in our samples could be due to chance! Sure, our group of men differs from our group of women, but that doesn’t tell us much in itself, because if we picked two groups of men at random, they would also differ. This is a serious problem: Given that any two samples will obviously differ from each other on virtually everything we try to measure (if we can measure in enough detail), how can we use samples to draw conclusions?
All is not lost, though, as a little intuition will tell us. Differences found due to random chance are likely to be small, and are likely to be of a very different size if we do the same test again. If we could repeat our test again and again (with new samples), it would help us make better inferences: If we got samples of 100 men and 100 women 20 times, and every time we found women scoring 5 points higher than men, we would be much more confident in our finding. While replication isn’t usually practical, we can use one sample to guess what would happen if we replicated. And, our intuition can help us here as well: If we find a small difference between groups after measuring only a small number of people, that is more likely to be due to random chance than if we find a big difference between groups after measuring a lot of people. Breaking that down: 1) Large differences are less likely to be due to chance than small differences, and 2) the bigger the size of the sample, the more point 1 is true.
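To make points 1 and 2 concrete, here is a minimal simulation sketch. It draws two groups from the same imaginary population, so any difference between them is pure chance, and it checks how big those chance differences tend to be at different sample sizes. The population mean of 65, the standard deviation of 15, and the bell-shaped distribution are assumptions made up for illustration; they do not come from any real data.

```python
import random
import statistics


def chance_differences(n, trials=500, mean=65.0, sd=15.0, seed=0):
    """Return |mean(A) - mean(B)| for many pairs of groups of size n.

    Both groups come from the SAME made-up population (assumed normal,
    mean 65, sd 15), so every difference here is due to chance alone.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(trials):
        group_a = [rng.gauss(mean, sd) for _ in range(n)]
        group_b = [rng.gauss(mean, sd) for _ in range(n)]
        diffs.append(abs(statistics.fmean(group_a) - statistics.fmean(group_b)))
    return diffs


def share_at_least(diffs, threshold):
    """Fraction of chance differences that are at least `threshold` points."""
    return sum(d >= threshold for d in diffs) / len(diffs)


for n in (10, 100, 1000):
    diffs = chance_differences(n)
    print(
        f"n={n:5d}  typical chance difference: {statistics.fmean(diffs):5.2f}  "
        f"share of 5+ point differences: {share_at_least(diffs, 5.0):.3f}"
    )
```

Under these assumptions, a 5-point gap shows up by chance fairly often when each group has 10 people and only rarely when each group has 100, which is the intuition above expressed in numbers.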
If we could get a good mathematical handle on the “less likely” vs. “more likely” part of those claims, we could start to use our samples to make really good guesses about how replicable our results are. That is, we could use our single sample to reliably predict what would happen if we replicated our study a bunch of times. We already agreed above that if the result replicated over and over again, then we would be confident in drawing conclusions about the larger population. And now we see that, with the right math, we can use a single sample to draw conclusions about what would happen if we had many samples. Putting the last two sentences together: If we can get some math behind us, we can use our single sample to make reliable inferences about the larger population.
Thus, no matter what inferential statistic we use, the question is always something like: “This difference we found in our sample, what is the probability we would find a difference at least that large just by chance?” When it is unlikely that our observed difference is due to chance, we feel confident it is real.
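One way to put a number on that question is a permutation test, sketched below. The text above does not name a particular statistic, so treat this as just one illustration of the idea: we shuffle the “man” and “woman” labels many times and count how often the shuffled groups differ by at least as much as the real ones did. The function name `permutation_p_value` and the scores are hypothetical; the scores are randomly generated stand-ins for a sample of 100 women and 100 men, not real quiz results.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, shuffles=5000, seed=0):
    """Estimate how often chance alone gives a difference at least as large
    as the one observed between group_a and group_b."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_large = 0
    for _ in range(shuffles):
        # Shuffling wipes out any real link between label and score.
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_large += 1
    return at_least_as_large / shuffles


# Made-up scores for illustration only (assumed normal, sd 15).
data_rng = random.Random(1)
women = [data_rng.gauss(68, 15) for _ in range(100)]
men = [data_rng.gauss(63, 15) for _ in range(100)]

p = permutation_p_value(women, men)
print(f"Observed difference: {statistics.fmean(women) - statistics.fmean(men):.2f}")
print(f"Estimated probability of a difference that large by chance: {p:.3f}")
```

A small value means shuffled labels rarely produce a gap as big as the one we saw, which is the situation in which we feel confident the difference is real.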