Tips and Statistics

Beware of rote significance testing.

Posted Nov 09, 2020

Klare Daten [clear data] — Professor Wulf-Uwe Meyer, University of Bielefeld, Germany, ca. 1980, upon surveying experimental findings before running inferential statistics, which he then, of course, did.

Warren et al. (2020) show that customers leave smaller tips when tips are requested before instead of after the service has been provided. A feeling of (not) being manipulated is the critical intervening variable. Customers, it seems, have a stronger preference for rewarding good service than for incentivizing it. They rather pay it, as it were, backward.

This is an interesting result, both in terms of its general message regarding human motivation and in terms of its practical implications for customers and the businesses that serve them. Warren et al. report both data from the field and data from vignette studies, where the latter ask respondents to report what they would do or how they would feel in a particular situation.

My concern here is the statistical treatment of the field data. The stats are not wrong; rather, they offer a crisp example of how significance testing is overused. Here is what happened: there were 7,523 purchases of juices and smoothies, 4,704 of which involved requests for pre-service tips, while the rest invited tipping after service. The average tip amounts were $0.90 and $1.58, respectively, in the pre- and post-service locations. A t-test with 7,521 degrees of freedom produced a test statistic of 15.97 and a p-value of—no surprise here—less than .001.

Why would one perform a significance test for such a large data set? Large data sets are overpowered; they yield significant results for tiny differences.
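To make the point concrete, here is a minimal sketch in Python. Holding a tiny standardized effect fixed and letting the sample grow, the test statistic climbs with the square root of n until significance is guaranteed. It uses the same d = t / sqrt(n) relation that the post applies to the tipping data, and the normal approximation for the two-sided p-value, which is excellent at these sample sizes:

```python
from math import sqrt, erfc

d = 0.05  # a fixed, tiny standardized effect (Cohen's d)

# For a fixed d, t grows with sqrt(n), so significance is a matter of sample size
for n in (100, 1_000, 10_000, 100_000):
    t = d * sqrt(n)          # from the relation d = t / sqrt(n)
    p = erfc(t / sqrt(2))    # two-sided p-value, normal approximation
    print(f"n = {n:>7,}  t = {t:6.2f}  p = {p:.4f}  significant: {p < .05}")
```

The same negligible effect is nowhere near significant at n = 1,000 and overwhelmingly significant at n = 100,000.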

The standardized effect size in this study can be recovered from the sample size, the test statistic, and the difference between the means [delta], such that Cohen's d = delta / (delta × sqrt(n) / t), which simplifies to t / sqrt(n) = .184. A d of .2 is conventionally regarded as a small effect, so this seems close enough. The eyebrow furrower is the realization that, assuming the same variance, a delta of .1 (e.g., average tips of $1.48 and $1.58, respectively, for pre- and post-service transactions) would still yield a significant t of 2.35, p = .019, with a diminutive d value of .027.
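These numbers can be reproduced in a few lines of Python. The figures are the ones reported above; the variable names are mine, and the p-value uses the normal approximation to the t distribution, which is essentially exact at 7,521 degrees of freedom:

```python
from math import sqrt, erfc

n = 7523             # total purchases across both locations
t = 15.97            # reported t statistic (df = 7521)
delta = 1.58 - 0.90  # difference between mean tips, $1.58 vs. $0.90

# The formula d = delta / (delta * sqrt(n) / t) simplifies to t / sqrt(n)
d = t / sqrt(n)
print(f"d = {d:.3f}")  # ≈ 0.184

# Keep the implied standard error, but shrink the difference to 10 cents
se = delta / t                     # standard error implied by the reported t
t_small = 0.10 / se
p_small = erfc(t_small / sqrt(2))  # two-sided p, normal approximation
d_small = t_small / sqrt(n)
print(f"t = {t_small:.2f}, p = {p_small:.3f}, d = {d_small:.3f}")  # ≈ 2.35, 0.019, 0.027
```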

We should not ask the researchers to collect less data or to throw out the data they have. Instead, they should report the standardized effect size and offer an opinion if it is large enough for us to care. They should, in other words, make a judgment call.

Some readers might reasonably disagree, but I maintain that with very large samples, significance is beside the point, although non-significance might be interesting. The informative value of significance and non-significance is asymmetrical. When samples are very large, significance is likely a priori; not finding it is diagnostic. When samples are small, significance is unlikely a priori, and finding it is diagnostic, prompting a closer look and perhaps justifying the trouble of replication (Krueger & Heck, 2017).

As a rule of thumb, ask whether significance testing is necessary or appropriate when effects or samples are either very large or very small. Consider an example of a large effect. You sample a big pot of legumes and find that of the first 60, 56 are peas and 4 are lentils. Would you conduct a significance test of the null hypothesis that the probability of drawing a pea is .5?

An experienced data friend would not do such a test, arguing that the outcome is obvious. A less experienced sampler might do a chi-squared test, only to find a huge test statistic of 45 with an infinitesimal p-value (10 zeros after the decimal point). If 38 peas had been drawn (and 22 lentils), there'd be a little tension and perhaps a drum roll. Now, chi-squared is 4.27, and p is .039.
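Both scenarios can be checked with a short goodness-of-fit calculation. The helper function is mine; for one degree of freedom, the chi-squared survival function reduces to erfc(sqrt(x/2)), so no statistics library is needed:

```python
from math import sqrt, erfc

def gof_chi2(peas, n=60, null_p=0.5):
    """Chi-squared goodness-of-fit test against a 50/50 null, df = 1."""
    expected = n * null_p
    chi2 = ((peas - expected) ** 2 / expected
            + ((n - peas) - expected) ** 2 / expected)
    p_value = erfc(sqrt(chi2 / 2))  # survival function of chi-squared, df = 1
    return chi2, p_value

print(gof_chi2(56))  # chi2 ≈ 45.07, p ≈ 1.9e-11
print(gof_chi2(38))  # chi2 ≈ 4.27,  p ≈ .039
```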

Yet, as one sampler once told me, they would do a test even if they had sampled 60 peas and nothing else. Why? Because editors and reviewers would demand it. Now that's sad. Perhaps what we have here is a collective illusion, such that editors and reviewers demand significance tests because they falsely believe that others truly want to see them. This would be a case of pluralistic ignorance (Taylor, 1982).


Krueger, J. I., & Heck, P. R. (2017). The heuristic value of p in inductive statistical inference. Frontiers in Psychology: Educational Psychology.

Taylor, D. G. (1982). Pluralistic ignorance and the spiral of silence: A formal analysis. Public Opinion Quarterly, 46, 311-335. 

Warren, N., Hanson, S., & Yuan, H. (2020). Feeling manipulated: How tip request sequence impacts customers and service providers? Journal of Service Research. DOI: 10.1177/1094670519900553