Feedback Wanted: Publishers & Data Access

This post is co-authored with Jennifer Lin, PLOS

Short Version: We need your help!

We have generated a set of recommendations for publishers to help increase access to data in partnership with libraries, funders, information technologists, and other stakeholders. Please read and comment on the report (Google Doc), and help us to identify concrete action items for each of the recommendations here (EtherPad).

Background and Impetus

The recent governmental policies addressing access to research data from publicly funded research across the US, UK, and EU reflect the growing need for us to revisit the way that research outputs are handled. These recent policies have implications for many different stakeholders (institutions, funders, researchers) who will need to consider the best mechanisms for preserving and providing access to the outputs of government-funded research.

The infrastructure for providing access to data is largely still being architected and built. In this context, PLOS and the UC Curation Center hosted a set of leaders in data stewardship issues for an evening of brainstorming to re-envision data access and academic publishing. A diverse group of individuals from institutions, repositories, and infrastructure development collectively explored the question:

What should publishers do to promote the work of libraries and IRs in advancing data access and availability?

We collected the themes and suggestions from that evening in a report: The Role of Publishers in Access to Data. The report contains a collective call to action from this group for publishers to participate as informed stakeholders in building the new data ecosystem. It also enumerates a list of high-level recommendations for how to effect social and technical change as critical actors in the research ecosystem.

We welcome the community to comment on this report. Furthermore, the high-level recommendations need concrete details for implementation. How will they be realized? What specific policies and technologies are required for this? We have created an open forum for the community to contribute their ideas. We will then incorporate the catalog of listings into a final report for publication. Please participate in this collective discussion with your thoughts and feedback by April 24, 2014.

We need suggestions! Feedback! Comments! From Flickr by Hash Milhan

We need suggestions! Feedback! Comments! From Flickr by Hash Milhan

 

Mountain Observatories in Reno

A few months ago, I blogged about my experiences at the NSF Large Facilities Workshop. “Large Facilities” encompass things like NEON (National Ecological Observatory Network), IRIS PASSCAL Instrument Center (Incorporated Research Institutions for Seismology Program for Array Seismic Studies of the Continental Lithosphere), and the NRAO (National Radio Astronomy Observatory). I found the event itself to be an eye-opening experience: much to my surprise, there was some resistance to data sharing in this community. I had always assumed that large, government-funded projects had strict data sharing requirements, but this is not the case. I had stimulating arguments with Large Facilities managers who considered their data too big and complex to share, and (more worrisome), that their researchers would be very resistant to opening up the data they generated at these large facilities.

Why all this talk about large facilities? Because I’m getting the chance to make my arguments again, to a group with overlapping interests to that of the Large Facilities community. I’m very excited to be speaking at Mountain Observatories: A Global Fair and Workshop  this July in Reno, Nevada. Here’s a description from the organizers:

The event is focused on observation sites, networks, and systems that provide data on mountain regions as coupled human-natural systems. So the meeting is expected to bring together biophysical as well as socio-economic researchers to discuss how we can create a more comprehensive and quantitative mountain observing network using the sites, initiatives, and systems already established in various regions of the world.

I must admit, I’m ridiculously excited to geek out with this community. I’ll get to hear about the GLORIA Project (GLObal Robotic-telescopes Intelligent Array), something called “Mountain Ethnobotany“, and “Climate Change Adaptation Governance”. See a full list of the proposed sessions here. The conference is geared towards researchers and managers, which means I’ll have the opportunity to hear about data sharing proclivities straight from their mouths. The roster of speakers joining me include a hydroclimatologist (Mike Dettinger, USGS) and a researcher focused on socio-cultural systems (Courtney Flint, Utah State University), plus representatives from the NSF, a sensor networks company, and others. The conference should be a great one – abstract submission deadline was just extended, so there’s still time to join me and nerd out about science!

Reno! From Flickr by Ravensmagiclantern

Reno! From Flickr by Ravensmagiclantern

Lit Review: #PLOSFail and Data Sharing Drama

Turn and face the strange, researchers. From pipedreamsfromtheshire.wordpress.com

Turn and face the strange, researchers. From pipedreamsfromtheshire.wordpress.com

I know what you’re thinking– how can yet another post on the #PLOSfail hoopla say anything new? Fear not. I say nothing particularly new here, but I do offer a three-weeks-out lit review of the hoopla, in hopes of finding a pattern in the noise. For those new to the #PLOSFail drama, the short version is this: PLOS enacted a mandatory data sharing policy. Researchers flipped out. See the sources at the end of this post for more background.

 Arguments made against data sharing

1) My data is my lifeblood. I won’t just give it away.

Terry McGlynn, a biologist writing at Small Pond Science argues that “Regardless of the trajectory of open science, the fact remains that, at the moment, we are conducting research in a culture of data ownership.” Putting the ownership issue aside for now, let’s focus on the crux of this McGlynn’s argument: he contends that data sharing results in turning a private resource (data) into a community resource. This is especially burdensome for small labs (like his) since each data point takes relatively more effort to produce. If this resource is available to anyone, the benefits to the former owner are greatly reduced since they are now shared with the broader community.

Although these are valid concerns, they are not in the best interest of science. I argue that what we are really talking about here is the incentive problem (see more in the section below). That is, publications are valued in performance evaluation of academics, while data are not. Everyone can agree that data is indispensable to scientific advancement, so why hasn’t the incentive structure caught up yet? If McGlynn were able to offset the loss of benefits caused to data sharing by getting mad props for making their data available and useful, this issue would be less problematic. Jeff Leek, a biostatistician blogging at Simply Statistics, makes a great point with regard to this: to paraphrase him, the culture of credit hasn’t caught up with the culture of science. There is no appropriate form of credit for data generators – it’s either citation (seems chintzy) or authorship (not always appropriate). Solution: improve incentives for data sharing. Find a way to appropriately credit data producers.

2) My datasets are special, unique snowflakes. You can’t understand/use them.

Let’s examine what McGlynn says about this with regard to researchers re-using his data: “…anybody working on these questions wouldn’t want the raw data anyway, and there’s no way these particular data would be useful in anybody’s meta analysis. It’d be a huge waste of my time.”

Rather than try to come up with a new, witty way to answer to this argument, I’ll shamelessly quote from MacManes Lab blog post, Corner cases and the PLOS data policy:

 There are other objections – one type is the ‘my raw data are so damn special that nobody can over make sense of them’, while another is ‘I use special software and stuff, so they are probably not useful to anybody else’. I call BS on both of these arguments. Maybe you have the world’s most complicated data, but why not release them and not worry about whether or not people find them useful – that is not your concern (though it should be).

I couldn’t have said it better. The snowflake refrain from researchers is not new. I’ve heard it time and again when talking to them about data archiving. There is certainly truth to this argument: most (all?) datasets are unique. Why else would we be collecting data? This doesn’t make them useless to others, especially if we are sharing data to promote reproducibility of reported results.

DrugMonkey, an anonymous blogger and biomedical researcher, took this “my data are unique” argument to paranoia level. In their post, PLoS is letting the inmates run the asylum and it will kill them, s/he contends that researchers will somehow be forced to use all the same methods to facilitate data reuse. “…diversity in data handling results, inevitably, in attempts for data orthodoxy. So we burn a lot of time and effort fighting over that. So we’ll have PLoS [sic] inserting itself in the role of how experiments are to be conducted and interpreted!”

I imagine DrugMonkey pictures future scientists in grey overalls, trudging to a factory to do “science”. This is just ridiculous. The idiosyncrasies of how individual researchers handle their data will always be part of the challenge of reproducibility and data curation. But I have never (ever) heard of anyone suggesting that all researchers in a given field should be doing science in the exact same way. There are certainly best practices for handling datasets. If everyone followed these to the best of their ability, we would have an easier time reusing data. But no one is punching a time card at the factory.

 3) Data sharing is hard | time-consuming | new-fangled.

This should probably be #1 in the list of arguments from researchers. Even those that cite other reasons for not sharing their data, this is probably at the root of the hoarding. Full disclosure – only a small portion of the datasets I have generated as a researcher are available to the public. The only explanation is it’s time-consuming and I have other things on my plate. So I hear you, researchers. That said, the time has come to start sharing.

DrugMonkey says that the PLOS data policy requires much additional data curation which will take time. “The first problem with this new policy is that it suggests that everyone should radically change the way they do science, at great cost of personnel time…” McGlynn states this point succinctly: “Why am I sour on required data archiving? Well, for starters, it is more work for me… To get these numbers into a downloadable and understandable condition would be, frankly, an annoying pain in the ass.”

Fair enough. But I argue here (along with others others) that making data available is not an optional side note of research: it is research. In the comments of David Crotty’s post at The Scholarly Kitchen, “PLOS’ bold data policy“, there was a comment that I loved. The commenter, Mike Taylor, said this:

 …data curation is research. I’d argue that a researcher who doesn’t make available the data necessary to reproduce his conclusions isn’t getting his job done. Complaining about having to spend time on preparing the data for others to use is like complaining about having to spend time writing the paper, or indeed running experiments.

When I read that comment, I might have fist pumped a little. Of course, we still have that pesky incentive issue to work out… As Crotty puts it, “Perhaps the biggest practical problem with [data sharing] is that it puts an additional time and effort burden on already time-short, over-burdened researchers. Researchers will almost always follow the path of least resistance, and not do anything that takes them away from their research if it can be avoided.” Sigh.

What about that “new-fangled” bit? Well, researchers often complain that data management and curation requires skills that are not taught. I 100% agree with this statement – see my paper on the lack of data management education for even undergrads. But as my ex-cop dad likes to say, “ignorance of the law is not a defense”. In continuation of my shameless quoting from others, here’s what Ted Hart (Staff Scientist at NEON) has to say in his post, “Just Get Over Yourself and Share Your Data“:

Sharing is hard. but not an intractable problem… Is the alternative is that everyone just does everything in secret with myriad idiosyncrasies ferociously milking least publishable units from a data set? That just seems like a recipe for science moving slowly and in the dark. …I think we just need to own up to the fact being a scientist these days requires new skills, and it always have. You didn’t have to know how to do PCR prior to 1983, but now you do. In the 21st century to do science better, we need more than spreadsheets with a few rows, we need to implement best practices for data management.

More fist pumping! No, things won’t change overnight. Leek at Simply Statistics rightly stated that the transition to open data will be rough for two reasons: (1) there is no education on data handling, and (2) the is a disconnect between the incentives for individual researchers and the actions that will benefit science as a whole. Sigh. Back to that incentive issue again.

Highlights & Takeaways

At risk of making this blog post way too long, I want to showcase a few highlights and takeaways from my deep dive into the #PLOSfail blogging world.

1) The Incentives Problem

We have a big incentives problem, which was probably obvious from my repeated mentions of it above. What’s good for researchers’ careers is not conducive to data sharing. If we expect behavior to change, we need to work on giving appropriate credit where it’s due.

Biologist Björn Brembs puts it well in his post, What is the Difference Between Text, Data, and Code?“…it is unrealistic to expect tenure committees and grant evaluators to assess software and data contributions before anybody even is contributing and sharing data or code.” Yes, there is a bit of a chicken-and-egg situation. We need movement on both sides to get somewhere. Share the data, and they will start to recognize it.

2) Empiricism Versus Theory

There is a second plot line to the data sharing rants: empiricists versus theoreticians. See ecologist Timothée Poisot‘s blog, “Of the value of datasets and methods in open science” for a more extensive review of this issue as it relates to data sharing. Of course, this tension is not a new debate in science. But terms like “data vultures” get thrown about, and feelings get hurt. Due to the nature of their work, most theoreticians’ “data” is equations, methods, and code that are shared via publication. Meanwhile, empiricists generate data and can hoard it until they see fit to share it, only offering a glimpse of the entire suite of their research outputs. To paraphrase Hart again: science is equal parts data and analysis/methods. We need both, so let’s stop fighting and encourage open science all around.

3) Data Ownership Issues

There are lots of potential data owners: the funders who paid for the work, the institution where the research was performed, the researcher who collected the data, the principle investigator of the lab where the researcher works, etc. etc. The complications around data ownership make this a tricky subject to work out. Zen Faulkes, a neurobiologist at University of Texas, blogged about who owns data, in particular, his data. He did a little research and found what many (most?) researchers at universities might find: “I do not own research data I generate. Neither do the funding agencies. The University of Texas system Board of Regents own research data I generate.” Faulkes goes on to state that the regents probably don’t care what he does with his data unless/until they can make money off of it… very true. To make things more complicated, Crotty over at Scholarly Kitchen reminded us that “under US law (the Bayh-Dole Act), the intellectual property (IP) generated as the result of federal research funds belongs to the researcher and their institution.” What does that even mean?!

To me, the issue is not about who owns the data outright. Instead, it’s about my role as an open science “waccaloon” who is interested in what’s best for the scientific process. To that extent, I am going to borrow from Hart again. Hart makes a comparison between having data and having a pet: in Boulder CO, there are no pet “owners” – only pet “guardians”. We can think of our data in this same way: we don’t own it; we simply care for it, love it, and are intellectually (and sometimes emotionally!) invested in it.

4) PLOS is Part of a Much Bigger Movement

Open science mandates are already here. The OSTP memo released last year is a huge leap forward in this direction – it requires that federally funded research outputs (including data) be made available to the public. Crotty draws a link between OSTP and PLOS policies in his blog: “Once this policy goes into effect, PLOS’ requirements would seem to be an afterthought for authors funded in this manner. The problem is that the OSTP policy seems nowhere near being implemented.”

That last part is most definitely true. One way to work on implementing this policy? Get the journals involved. The current incentive structure is not well-suited for ensuring compliance with OSTP, but journals have a role as gatekeepers to the traditional incentives. Crotty states it this way:

PLOS has never been a risk averse organization, and this policy would seem to fit well with their ethos of championing access and openness as keys to scientific progress. Even if one suspects this policy is premature and too blunt an instrument, one still has to respect PLOS for remaining true to their stated goals.

So I say kudos to PLOS!

In Conclusion…

I’ll end with a quote from MacManes Lab blog post:

How about this, make an honest effort to make the data accessible and useful to others, and chances are you’re probably good to go.

Final fist pump.

Sources

  1. Timothée Poisot, Ecologist. Of the value of datasets and methods in open science.
  2. Terry McGlynn, Biologist. I own my data until I don’t. Blog at Small Pond Science @hormiga
  3. David Crotty, publisher & former researcher. PLOS’ bold data policy Blog at The Scholarly Kitchen @scholarlykitchn
  4. Edmund Hart, Staff Scientist at NEONJust Get Over Yourself and Share Your Data. @DistribEcology
  5. MacManes Lab, genomics. Corner cases and the PLOS data policy.
  6. DrugMonkey, biomedical research. PLoS is letting the inmates run the asylum and it will kill them. @DrugMonkey
  7. Zen Faulkes, Neurobiologist. Who owns data. Blog at NeuroDojo @DoctorZen
  8. Björn Brembs, biologist. What is the Difference Between Text, Data, and Code? @brembs
  9. Jeff Leek, biostatistician. PLoS One, I have an idea for what to do with all your profits: buy hard drives Blog at Simply Statistics. @leekgroup

Twitter feed for #PLOSfail

From PLOS