Who Will Use Open Data?

The motivation, and lack thereof, for sharing data in basic science.

Posted Aug 10, 2017

Recently I’ve been asking myself a very simple question about open data: Who is it for?

Thinking about this question has clarified two issues for me. (Well, perhaps clarified is a bit strong — let’s settle for “made less murky”). First, why anyone would spend time making their data open — who has the motivation to do so, and who does not. Second, a vision of future science based on open data.

The first issue is simple. Anyone gathering experimental data is faced with the basic motivational question: Why should I make my data publicly available?

For some fields of research, there is a clear answer. Clinical trials of new drugs or new treatments seek to test whether their newfangled thing is both safe and effective. How sure can we be that the data gathered from the trial show the drug is both safe and effective? This is an issue with potentially far-reaching consequences, as it could change healthcare for all of us. So it is in the interests of multiple bodies, both public and private, to check that data. And so it is in the interests of multiple bodies to push for the public release of that data; and to reward the release of that data; and to support the people needed to check that data.

Cohort studies have the same public interest motivation for sharing data. Cohort studies do things like compare smokers and non-smokers on the rate of lung cancer. They take different groups of people and see if the health of these groups depends on the things that differ between the groups — like smoking, diet, or childhood teddy bears with a gender-neutral or gender-specific name. Cohort studies influence both health and education policies. Just like clinical trials, cohort studies can potentially affect us all. There is then a public interest argument for putting their data through tough scrutiny. And so it is in the interests of multiple bodies to push for the public release of that data.

For some fields, like astronomy, data-sharing is a norm, or is becoming one. This is down to the sheer scale and expense of data-collection. When your life’s work depends, like that of so many of your colleagues, on the telemetry data from a 20-year-old satellite orbiting Saturn, it’s just obvious that the data should be shared for the benefit of all astronomers. After all, you personally are not about to launch a Saturn-bound probe from your apartment balcony (but if you are, please do let us know in the comments). The part of neuroscience that images the human brain using bloody massive electromagnets (fMRI, to its friends) has also embraced data-sharing. It has the same motivating force: each bit of fMRI kit is phenomenally expensive, the data-sets are huge, and the analysis pipeline is a collaborative effort of many smart people. So to some in that field it is just obvious that data-sharing is the way to go.

Data-sharing in clinical trials, cohort studies, astronomy, and their ilk is far from perfect. Just because it should be done doesn’t mean it is being done. But they’re light-years ahead of most of “basic” science, of the exploratory work of finding stuff out. For much of basic science there is no clear answer to the question “Who is it for?” Who will use it?

My field — “basic” neuroscience — is one of them. It is one of the laggards in the open science movement, with issues that typify the barriers to data sharing. It typifies the idea that open data is not for experimentalists. That’s because of the primacy of new data.

Data primacy, the cult of data, has many roots. One is the seemingly obvious fact that experimental labs exist to do experiments. Their raison d’etre is to obtain new data (or build tools to make that possible). Their funding depends on getting new data.

Another root of the cult of data is that good quality neuroscience data are phenomenally difficult to obtain. Getting that data needs cunning, perseverance, and skill. “Patch-clamping” is an excellent example. This technique allows us to record all the activity of a single neuron (every flicker of voltage) by gently attaching a glass pipette to the surface of a neuron’s membrane. That neuron’s diameter is 100 times smaller than the width of a human hair. So when I say “gently attach,” I mean painstakingly lowering a pipette using microdrives until there is a whisper of contact with a neuron, and gently, oh so gently, suctioning the membrane’s edge into the end of the pipette, creating a seal. Such hard-won data, data obtained personally through skill and sheer bloody-mindedness, not through some gigantic machine, is protected fiercely.

The most pernicious root is that labs, and their members, do not need to share data. They need papers. As I’ve previously rehearsed, papers are the currency of basic science. Want promotion? Need papers. Want tenure? Need papers. Want grants? Need papers. Want a research post? Need papers.

And what does data buy you? Papers. So the basic equation of scientific employment is Data = Papers = Success in career (= food for me and my family and a roof over our heads).

So even if experimentalists acknowledge the concept of data-sharing as “a good thing,” a thing they want to do (and there are many who think this), their motivation to do it personally is very low. Their peers have the same view of the world as they do: new data is king; data is atrociously hard to get; data is the root of our survival. Data is for me. Who else is it for?

Not knowing who the data is for removes one of the primary motivators for open data in many basic science fields.

What we need are data labs. Data labs are who the data is for. Or, rather, will be for. Teams of data scientists who are subject specialists and so are able to analyse existing data to answer their own scientific questions.

A scant few data labs exist. Like mine, and a handful of others (some more successful than mine, I should stress). I do neuroscience, I answer scientific questions using neuroscience data, but I don’t do experiments. We collaborate with experimental labs, working with them to answer deep scientific questions, ours and theirs. Best of all, we use what little current data is freely available. We use it to develop and test new ways of analysing neuroscience data. Amongst other things, we use it to answer fundamental questions about how populations of neurons compute.

There are reasons to believe that data labs will be an essential part of the fabric of basic science. Perhaps in neuroscience most of all, with its extraordinary data complexity yet lack of guiding theories to interpret that data. The volume of data is only growing: in neuroscience, the number of simultaneously recorded neurons is growing exponentially with time. Funders of science are increasingly pushing for open data, as are some scientific journals; and if they fund the data release, they ought to think about funding people to do something with it. Perhaps the number one reason to believe that data labs are coming is that they will provide a greater return on money invested in research. By re-using existing data to both replicate previous work and answer new scientific questions, data labs can add value to any gathered experimental data. And who doesn’t want to see their taxes being used efficiently?

Sharing data takes effort, sometimes extreme effort. Any substantial effort needs motivation. And the motivation for experimentalists to share their data is missing a key element: knowing who they are doing it for.

That missing element is data labs. Without supporting people to use the data, we have created a system that churns out endless new data, that rewards the production of new data, whether we need it or not. So here is a dream of future science, of a scientific landscape populated by data labs, using vast repositories of open data to tackle deep questions of how brains do what they do.

Now all we need is the funding…

This is a write-up of a keynote lecture at the JISC Research Data Network Meeting, June 27th 2017, University of York, UK.

Cross-posted from The Spike