DCXL, Policies & NSF Data Management Plans

Unless you live in a cave, you are probably aware that NSF now requires researchers to submit a two-page supplement, titled “Data Management Plan”, with all proposals. To paraphrase from the Grant Proposal Guide, investigators are told they need to discuss:

  1. Types of data
  2. Standards they will use for data and metadata
  3. Policies for access and sharing
  4. Policies and provisions for re-use
  5. Plans for archiving, preserving, and providing access to data

Points #3 and #4 were discussed quite a bit last week at the Data Governance Workshop I attended, and I was concerned about how scientists would be able to find and comprehend these policies. If a room full of librarians, funders, publishers, and experts can’t figure out which policies might apply to scientific data, I wondered, do scientists have any hope of understanding data governance? I think they do, so long as some of the proposed products that will result from the workshop come to fruition.

So where might the Excel add-in we are developing fit into this scheme? The first version of the add-in will likely not have much utility for data governance issues, like setting policies, establishing access rights, and restricting data availability.  We do, however, envision that this add-in might provide a framework for future developers to implement tools to facilitate good data governance practices.  This might be in the form of a link to an archive’s policy, metadata with provisions for access and use, or other methods.

I like to think that because this add-in is intended to be open-source, it will become a useful tool on which savvy developers can build capabilities for governance, collaboration, links to social networking tools, and so on.

[Image: a roadie wearing a “No Backstage Passes” tank top]

If your data rocks, be sure you are involved in deciding who gets to access it. Photo from theselvedgeyard.wordpress.com

The next post will discuss the current state of Data Management Plans as they are discussed in NSF Review Panels.

Intellectual Property, Copyright, & Other Dry Topics

Recently, I found myself wondering: what the heck is data governance? I was asked to participate in a workshop on Data Governance, supported by DataONE and led by MacKenzie Smith of Creative Commons and Trisha Cruse of UC3. I promptly replied “yes!”, pretending to understand the phrase, and then hurried back to my computer and Googled it.

Data governance is one of those phrases where you can define all of the words involved, but are unclear what they represent when strung together.  No need for you to start Googling – after participating in the data governance workshop in DC for the last couple of days, I can happily report all that I learned and save you the effort.

First, let’s define data governance (based on Wikipedia’s entry): it’s the set of policies surrounding data, including data risk management, the assignment of roles and responsibilities for data, and, more generally, the formal management of data assets throughout the research cycle. Data governance issues include things like:

  • data sharing licenses
  • providing credit for data (see my post about data citation here)
  • managing persistent identifiers (like those available via EZID)
  • documenting data provenance 
  • sharing metadata to enable discovery
  • establishing registries for standards and ontologies

Many scientists might think this is a rather dry set of topics (whether they are correct is a matter of opinion!). Most scientists aren’t concerned about the policies surrounding data, and they have very little incentive to care. We have all signed copyright agreements when publishing in journals, and patent agreements for our institutions (like this one for the UC system). But how many of us have read those documents? We have agreed to the terms and conditions of accepting funding, using institutional resources, publishing in journals, and engaging in collaborative research; but how many of us know what we have agreed to do with our data? My guess? Very close to zero.

The important point here is that we SHOULD care. In my conversations with scientists, I have discovered that most of them, if willing to share at all, would like to place restrictions on access and use of their data.  We need to be involved in those data governance discussions if we want to set the terms of our data sharing.

The data governance meeting was attended by 30 folks representing a wide range of perspectives. There were publishers, librarians, funders, scientists, data managers and a lawyer to offer up their ideas about how best to tackle the issues surrounding digital data. Examples of issues that surfaced:

  • Who owns the data?
  • Who is legally allowed to set the policies for data access and use?
  • How are data affected by copyright law?
  • How should we handle data that are used for meta-analysis, and are therefore subject to many different policies?
  • What is the implicit policy if none is specified?
  • How should we educate the community of stakeholders about data governance?

We certainly didn’t solve all of the problems associated with data governance, but we made good headway on starting the conversation and encouraging further work in this area.  I will expand on some of these topics in the next blog entry, so stay tuned! For a preview, check out this Storify record of the Twitter feed from the meeting.


Whether or not you think this graffiti speaks the truth, you should be part of the discussion. From Flickr by 917press

The Skinny on Data Publication

The concept of data publication is rather simple in theory: rather than relying on journal articles alone for scholarly communication, let’s publish data sets as “first class citizens” (hat tip to the DataCite group). Data sets have inherent value that makes them standalone scholarly objects: they are more likely to be discovered by researchers in other domains, working on other questions, if they are not tied to a specific journal and all of the baggage that entails.

Consider this example (taken from personal experience). If you are a biologist interested in studying clam population connectivity, how likely are you to find the (extremely relevant) data related to clam shell chemistry that are associated with paleo-oceanography journals? It took me several months before I discovered them during my PhD. If those datasets had been published in a repository, however, I would have located them much more quickly with a few well-chosen keywords and a quick web search.

Who would be against this idea, you ask?  It turns out data publication is similar to data management: no one is against the concept per se, but they are against all of the work, angst, and effort involved in making it a reality.  There is also considerable debate about how we should proceed to make data publication the norm in scientific communication.

A summary of what's wrong with the current system, from a PhD Comics cartoon: http://www.phdcomics.com/comics/archive.php?comicid=1200

I had a lovely dinner last week with some colleagues in town for the AGU meeting, where a passionate debate ensued about data publication. One of the scientists made the (quite valid) argument that data publication is a terrible phrase because the word “publication” implies that we are beholden to the current broken system of journal publication. The word itself has too much baggage. The opposing counsel suggested that bureaucrats, funders, and institutions are familiar with the word publication, and that this familiarity will help ensure the success of the data publication goals, regardless of whether we break the mold in the process. We agreed to brainstorm potential metaphors for the concept of data publication that might result in a better phrase to describe the idea. Any suggestions?

This is relevant to the DCXL project, since we consider this Excel add-in a stepping stone towards data publication (whatever we end up calling it). By allowing scientists to link directly with archives and upload their data, we are promoting data as a unique scholarly object. Through services like EZID, you can even get a DOI for your dataset. These are all good advances towards promoting data as a first-class object.

For more on the current debate that is raging about scholarly communication via journal publication, check out these two recent excellent pieces:

And for a giggle, watch the awesome cartoon called Scientist Meets Publisher from the blog Ceptional.