Closed Data… Excuses, Excuses

If you are a fan of data sharing, open data, open science, and generally openness in research, you’ve heard them all: excuses for keeping data out of the public domain. If you are NOT a fan of openness, you should be. For both groups (the fans and the haters), I’ve decided to construct a “Frankenstein monster” blog post composed of other people’s suggestions for how to deal with the excuses.

Yes, I know. Frankenstein was the doctor, not the monster. From Flickr by Chop Shop Garage.

I have drawn some comebacks from Christopher Gutteridge, University of Southampton, and Alexander Dutton, University of Oxford. They created an open Google Doc of excuses for closing off data and appropriate responses, and generously provided access to the document under a CC-BY license. I also reference the UK Data Archive’s list of barriers and solutions to data sharing, available via the Digital Curation Centre’s PDF, “Research Data Management for Librarians” (pages 14-15).

People will contact me to ask about stuff

Christopher and Alex (C&A) say: “This is usually an objection of people who feel overworked and that [data sharing] isn’t part of their job…” I would add to this that science is all about learning from each other – if a researcher is opposed to the idea of discussing their datasets, collaborating with others, and generally being a good science citizen, then they should be outed by their community as a poor participant.

People will misinterpret the data

C&A suggest this: “Document how it should be interpreted. Be prepared to help and correct such people; those that misinterpret it by accident will be grateful for the help.” From the UK Data Archive: “Producing good documentation and providing contextual information for your research project should enable other researchers to correctly use and understand your data.”

It’s worth mentioning, however, a second point C&A make: “Publishing may actually be useful to counter willful misrepresentation (e.g. of data acquired through Freedom of Information legislation), as one can quickly point to the real data on the web to refute the wrong interpretation.”

My data is not very interesting

C&A: “Let others judge how interesting or useful it is — even niche datasets have people that care about them.” I’d also add that it’s impossible to predict whether your dataset will have value for future research. Consider the many datasets collected before “climate change” was a research topic, which have now become invaluable for documenting and understanding the phenomenon. From the UK Data Archive: “Who would have thought that amateur gardener’s diaries would one day provide essential data for climate change research?”

I might want to use it in a research paper

Anyone who’s discussed data sharing with a researcher is familiar with this excuse. The operative word here is might. How many papers have we all considered writing, only to have them shift to the back burner due to other obligations? That said, this is a real concern.

C&A suggest the embargo route: “One option is to have an automatic or optional embargo; require people to archive their data at the time of creation but it becomes public after X months. You could even give the option to renew the embargo so only things that are no longer cared about become published, but nothing is lost and eventually everything can become open.” Researchers like to have a say in the use of their datasets, but I would caution that any restrictions should default to sharing. That is, after X months the data are automatically made open by the repository.

I would also add that, as the original collector of the data, you are at a huge advantage compared to others who might want to use your dataset. You have knowledge about your system, the conditions during collection, the nuances of your methods, et cetera, that could never be fully described in even the best metadata.

I’m not sure I own the data

No doubt, there are a lot of stakeholders involved in data collection: the collector, the PI (if different), the funder, the institution, the publisher, … C&A have the following suggestions:

  • Sometimes it’s as easy as just finding out who does own the data
  • Sometimes nobody knows who owns the data. This often seems to occur when someone has moved into a post and isn’t aware that they are now the data owner.
  • Going up the management chain can help. If you can find someone who clearly has management over the area the dataset belongs to they can either assign an owner or give permission.
  • Get someone very senior to appoint someone who can make decisions about apparently “orphaned” data.

My data is too complicated

C&A: “Don’t be too smug. If it turns out it’s not that complicated, it could harm your professional [standing].” I would add that if it’s too complicated to share, then it’s too complicated to reproduce, which means it’s arguably not real scientific progress. This can be solved by more documentation.

My data is embarrassingly bad

C&A: “Many eyes will help you improve your data (e.g. spot inaccuracies)… people will accept your data for what it is.” I agree. All researchers have been on the back end of making the sausage. We know it’s not pretty most of the time, and we can accept that. Plus, it will help you strive to be better at managing and organizing data during your next collection phase.

It’s not a priority and I’m busy

Good news! Funders are making it your priority! New sharing mandates in the OSTP memorandum state that any research conducted with federal funds must be accessible. You can expect these sharing mandates to drift down to you, the researcher, in the very near future (6-12 months).

The Who’s Who of Publishing Research

This week’s blog post is a bit more of a sociology-of-science topic… perhaps only marginally related to the usual content surrounding data, but still worth consideration. I recently heard a talk by Laura Czerniewicz, from the University of Cape Town’s Centre for Educational Technology. She was among the speakers during the Context session at Beyond the PDF2, and she asked the following questions about research and science:

Whose interests are being served? Who participates? Who is enabled? Who is constrained?

She brought up points I had never really considered, related to the distribution of wealth and how that affects scientific outputs. First, she examined who actually produces the bulk of knowledge. Based on an editorial in Science in 2008, she reported that US academics produce about 30% of the articles published in international peer-reviewed journals, while developing countries (China, India, Brazil) produce another 20%. Sub-Saharan Africa? A mere 1%.

She then explored what factors are shaping knowledge production and dissemination. She cited infrastructure (i.e., high speed internet, electricity, water, etc.), funding, culture, and reward systems. For example, South Africa produces more articles than other countries on the continent, perhaps because the government gives universities $13,000 for every article published in a “reputable journal”, and 21 of 23 universities surveyed give a cut of that directly to the authors.

Next, she asked “Who’s doing the publishing? What research are they publishing?” She put up some convincing graphics showing the number of articles published by authors from various countries, with the US and Western Europe leading the pack sixfold. I couldn’t hunt down the original publication, so take this rough statistic with a grain of salt. What about book publishing? The Atlantic Wire published a great chart back in October (based on an original article in Digital Book World) that scaled each country’s size based on the value of its domestic publishing market:

Scaled map of the world based on book publishing. From Digital Book World via Atlantic Wire.

When asking whose interests are served by international journals, she focused on a commentary by R. Horton, titled “Medical journals: Evidence of bias against the diseases of poverty” (The Lancet 361, 1 March 2003 – behind paywall). Granted, it’s a bit out of date, but it still has interesting points to consider. Horton reported that the top five medical journals have little or no editorial-board representation from countries with low Human Development Indices. Horton then postulates that this might be the cause of the so-called 10/90 gap – where 90% of research funding is allocated to diseases that affect only 10% of the world’s population. Although Horton does not go so far as to blame the commercial nature of publishing, he points out that editorial boards for journals must consider their readership and cater to those who can afford subscription fees.

I wonder how this commentary holds up, 10 years later. I would like to think that we’ve made a lot of progress towards better representation of research affecting humans that live in poverty. I’m not sure, however, we’ve done better with access to published research. I’ll leave you with something Laura said during her talk (paraphrased): “If half of the world is left out of knowledge exchange and dissemination, science will suffer.”

Check out Laura Czerniewicz’s Blog for more on this. She’s also got a Twitter feed.