The Guardian reports on a recent study that “claims to find a genetic link between creativity and mental illness.” 86,000 Icelanders were used to identify genetic variants that were “more common” in schizophrenia and bipolar disorder, and these variants were then shown to be more common in certain “creative” professions.
Describing genetic variants as “more common” in schizophrenia is a landmine. For a while, people thought that a handful of genetic mutations might cause schizophrenia and bipolar disorder, and that a careful study of families with these illnesses might identify those genes, much as a similar research program identified the key genes in Huntington’s disease and Duchenne muscular dystrophy. This hunch was wrong. First, it was discovered, primarily from twin studies, that a large fraction of the risk (perhaps 40%) of common psychiatric illnesses was not genetic but “environmental”. Second, primarily from large genome-wide association studies (GWAS), it was discovered that the fraction of the risk that was truly genetic was conferred through the collective action of many, rather than a few, genes.
DNA is a very long string of letters, and each letter can differ between individuals. Most of them are the same, or else we would not all be human. But about one out of every three hundred of these letters differs between you and me, and about 3% of that 1/300 are “functional” (i.e. they most likely have some meaningful impact on some biological function). The nice thing about genetics is that these variations are beautifully catalogued in a library called the International HapMap Project, and whenever we have an individual’s biological sample, we can map it onto the “consensus sequence” and detect which of these letters deviate from the “average” human string. Detecting a particular variation of a letter is as easy as searching for an address using Google Maps.
The bigger problem is that there are too many of these possible letter changes: hundreds of thousands, depending on the methods used to detect them. The newer deep sequencing methods can detect every single one of them. If you check lots and lots of these letters, some of them will pop out as statistically significantly associated with disease by pure chance. Problems in which the number of variables (P, the letters) is much larger than the sample size (N, the number of people in the study) are called P>>N problems, and Professor Rob Tibshirani has written lots of nice articles on them. Certain statistical methods, such as shrinkage and false discovery rate control, have been specifically designed to deal with these problems.
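To make the "pure chance" point concrete, here is a minimal sketch that simulates 100,000 letters with no real association at all. Everything in it is an assumption for illustration: the number of variants, the 0.05 cutoff, and the uniform p-values (which is what you get when nothing is truly associated). It compares a naive cutoff against a Bonferroni correction and a hand-written Benjamini-Hochberg false discovery rate procedure.

```python
import random

random.seed(0)

N_VARIANTS = 100_000  # hypothetical number of letters tested
ALPHA = 0.05          # conventional significance cutoff

# When no variant is truly associated, p-values are uniform on [0, 1].
p_values = [random.random() for _ in range(N_VARIANTS)]

# Naive approach: call anything at or below the cutoff a "hit".
naive_hits = sum(p <= ALPHA for p in p_values)

# Bonferroni: divide the cutoff by the number of tests.
bonferroni_hits = sum(p <= ALPHA / N_VARIANTS for p in p_values)


def benjamini_hochberg(p_values, q=ALPHA):
    """Return how many tests the Benjamini-Hochberg procedure rejects."""
    m = len(p_values)
    n_reject = 0
    for rank, p in enumerate(sorted(p_values), start=1):
        # Keep the largest rank whose p-value sits under the BH line.
        if p <= q * rank / m:
            n_reject = rank
    return n_reject


bh_hits = benjamini_hochberg(p_values)

print(f"naive hits:      {naive_hits}")
print(f"Bonferroni hits: {bonferroni_hits}")
print(f"BH (FDR) hits:   {bh_hits}")
```

The naive count should land somewhere near 5% of the tests, every one of them a false positive, while the corrected procedures reject few or none.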
What if you want to not just check for disease associations one letter at a time, but also to use multiple letters together? Polygenic scoring is an easy way to add, literally, multiple variations together. Let’s say we have three variations (1, 2 & 3), each tested against the presence of schizophrenia. Two of them have p-values of 0.01 and 0.02. The third has a p-value of 0.06. You draw an arbitrary threshold of p=0.05, so the first two are included in your model. If you have both variations 1 and 2, your “polygenic score” is 2. If you have only variation 1, you get a score of 1.
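The toy example above can be written out in a few lines. This is a sketch of the simple counting rule as described here, not of the study's actual scoring pipeline; the function name and the variant labels are made up for illustration.

```python
def polygenic_score(carried_variants, p_values, threshold=0.05):
    """Count the carried variants whose p-value falls below the threshold."""
    included = {variant for variant, p in p_values.items() if p < threshold}
    return len(included & set(carried_variants))


# The three hypothetical variations from the text and their p-values.
p_values = {"variation_1": 0.01, "variation_2": 0.02, "variation_3": 0.06}

print(polygenic_score({"variation_1", "variation_2"}, p_values))  # 2
print(polygenic_score({"variation_1"}, p_values))                 # 1
print(polygenic_score({"variation_1", "variation_3"}, p_values))  # 1
```

Variation 3 never contributes, because its p-value of 0.06 misses the arbitrary 0.05 threshold.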
You see how this works: every variation is weighted about the same, and possible nonlinear interactions between them are ignored (i.e. what if I’m sort of at risk with either 1 or 2, but REALLY at risk if I have both 1 and 2?). But you’ve got to start somewhere. These scores were indeed correlated with disease status, but the size of this correlation was unimpressive: “variance explained was 5.5% for schizophrenia and 1.2% for bipolar … odds ratios computed using these scores were 2.22 for schizophrenia and 1.46 for bipolar disorder.” About 1% of the population has schizophrenia. With a polygenic score computed using >100,000 letters, I can tell you a particular individual has a risk of 2% instead of 1%. That was a lot of work for not a lot of information gained.
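For the curious, here is roughly where the "2% instead of 1%" comes from. The sketch assumes the quoted odds ratio of 2.22 can be applied directly to one high-scoring individual against a 1% baseline risk, which glosses over how the study actually defined its comparison groups; the function name is mine.

```python
def risk_from_odds_ratio(baseline_risk, odds_ratio):
    """Convert a baseline risk and an odds ratio into an absolute risk."""
    baseline_odds = baseline_risk / (1 - baseline_risk)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1 + new_odds)


# 1% population risk of schizophrenia, odds ratio of 2.22 from the quote.
risk = risk_from_odds_ratio(0.01, 2.22)
print(f"risk with a high score: {risk:.1%}")
```

Because the baseline risk is so small, the odds ratio and the risk ratio are nearly the same thing here, so the risk lands at a little over 2%.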
The authors subsequently showed that this score, despite not being very well correlated with disease status, was also correlated with holding a job in a creative profession, “with the schizophrenia and bipolar disorder scores explaining a maximum of 0.24% and 0.26% of the variance of creativity” (emphasis mine). An unimaginably small effect, but since the sample size was large, it was technically statistically significant.
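A back-of-the-envelope check shows how a tiny effect becomes "significant" with enough people. The sketch below assumes the 0.24% figure can be read as a squared correlation and, purely for illustration, plugs in the headline 86,000 as the sample size (the creativity analysis itself may have used a different number). The p-value uses a normal approximation.

```python
import math


def correlation_test(variance_explained, n):
    """Return (r, t, p) for a correlation explaining the given variance.

    p is a two-sided p-value from a normal approximation to the t statistic.
    """
    r = math.sqrt(variance_explained)
    t = r * math.sqrt((n - 2) / (1 - r * r))
    p = math.erfc(abs(t) / math.sqrt(2))
    return r, t, p


# 0.24% of variance explained, with an assumed n of 86,000 ...
r_big, t_big, p_big = correlation_test(0.0024, 86_000)
# ... versus the very same effect size in a study of 1,000 people.
r_small, t_small, p_small = correlation_test(0.0024, 1_000)

print(f"n=86,000: r={r_big:.3f}, t={t_big:.1f}, p={p_big:.1e}")
print(f"n=1,000:  r={r_small:.3f}, t={t_small:.1f}, p={p_small:.2f}")
```

Same correlation of about 0.05 in both cases; only the sample size changes whether it clears the significance bar.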
The lesson from all this: simplistic additive scoring of a large number of variants does not work. More advanced algorithms are needed, which is a fairly obvious, if snarky, point. Nevertheless, a demonstration that we can achieve some amount of predictive accuracy, however minute, for a complex behavior (picking a job), using only genetic data, is perhaps worthy of some celebration.
A couple of more technical points (general readers can skip this paragraph). First, prior information, especially from environmental and clinically informed sources (clinical history, mental status exam, etc.), should be built into the model: a shockingly obvious and unconscionably ignored idea in psychiatric genetics. (In fact, such grants might get a low score because reviewers think of clinical characteristics as “confounders”… if only study sections were more pragmatically Bayesian, the world would be a better place.) Second, common machine learning metrics of performance (ROC analysis, confusion matrix, etc.) are still not commonly used, though this is changing. The gold standard of predictive performance, out-of-sample cross-validation, is rarely used either, probably because it would put the feeble effect sizes of using genetics alone right in your face (and indeed it has: see a number of studies from the Psychiatric Genomics Consortium). Nevertheless, I’m gonna go ahead and quote Russell here: “even if the open windows of science at first make us shiver…in the end the fresh air brings vigour, and the great spaces have a splendour of their own.” Psychiatric genetics (and Precision Medicine at large) is a classic, difficult P>>N problem, and a great space, and we will have lots of invigorating work to do.
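For readers who did not skip that paragraph, here is a small sketch of why out-of-sample cross-validation matters so much in a P>>N setting. The data are simulated pure noise (genotypes and case status are generated independently), and every number in it (200 people, 1,000 variants, a 30% carrier frequency, the top 20 variants, 5 folds) is an assumption chosen for illustration. The "model" picks the variants that look most associated in the training people, builds a signed count score, and is then judged by ROC AUC on both the training people and the held-out people.

```python
import random

random.seed(1)
N_PEOPLE, N_VARIANTS, TOP_K, N_FOLDS = 200, 1000, 20, 5

# Pure noise: genotypes and case status are generated independently.
genotypes = [[int(random.random() < 0.3) for _ in range(N_VARIANTS)]
             for _ in range(N_PEOPLE)]
labels = [int(random.random() < 0.5) for _ in range(N_PEOPLE)]


def fit(rows):
    """Pick the TOP_K variants that look most associated within `rows`.

    Returns (variant index, sign) pairs; sign is +1 when carriers are more
    common among cases in the training rows and -1 otherwise.
    """
    cases = [i for i in rows if labels[i] == 1]
    controls = [i for i in rows if labels[i] == 0]
    diffs = []
    for j in range(N_VARIANTS):
        freq_cases = sum(genotypes[i][j] for i in cases) / len(cases)
        freq_controls = sum(genotypes[i][j] for i in controls) / len(controls)
        diffs.append((freq_cases - freq_controls, j))
    diffs.sort(key=lambda d: abs(d[0]), reverse=True)
    return [(j, 1 if diff > 0 else -1) for diff, j in diffs[:TOP_K]]


def score(model, i):
    """Signed count of the selected variants carried by person i."""
    return sum(sign * genotypes[i][j] for j, sign in model)


def auc(model, rows):
    """Chance that a random case outscores a random control (ties = half)."""
    cases = [score(model, i) for i in rows if labels[i] == 1]
    controls = [score(model, i) for i in rows if labels[i] == 0]
    wins = sum((c > k) + 0.5 * (c == k) for c in cases for k in controls)
    return wins / (len(cases) * len(controls))


# K-fold cross-validation: fit on four fifths, evaluate on the held-out fifth.
folds = [list(range(f, N_PEOPLE, N_FOLDS)) for f in range(N_FOLDS)]
in_sample, out_of_sample = [], []
for held_out in folds:
    held_out_set = set(held_out)
    train = [i for i in range(N_PEOPLE) if i not in held_out_set]
    model = fit(train)
    in_sample.append(auc(model, train))
    out_of_sample.append(auc(model, held_out))

in_sample_auc = sum(in_sample) / N_FOLDS
out_of_sample_auc = sum(out_of_sample) / N_FOLDS
print(f"in-sample AUC:     {in_sample_auc:.2f}")
print(f"out-of-sample AUC: {out_of_sample_auc:.2f}")
```

Since there is no signal to find, the held-out AUC should hover around 0.5 (a coin flip), while the in-sample AUC looks far better than it has any right to. That gap is exactly what skipping cross-validation hides.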