Open Up

Open Access Week came and went last week, and I marked the event on the blog with a post on Open Access.  But the Open movement goes far beyond just Open Access: there are lots of different flavors of open, with a select few explored in this post.

Watch out for loose data and stray knowledge, folks. From Flickr by osiatynska

First let’s start with Open Notebook Science. This concept throws out the idea that you should be a hoarder, not telling others of your results until the Big Reveal in the form of a publication.  Instead, you keep your lab notebook (you do have one, right?) out in a public place, for anyone to peruse.  Most often ONS takes the form of a blog or a wiki.  The researcher updates their notebook daily, weekly, or whatever is most appropriate. There are links to data, code, relevant publications, or other content that helps readers, and the researcher themselves, understand the research workflow.

The most obvious reason for doing Open Notebook Science is that you can get feedback while you are still working on your research. If you are having problems or are stuck, the community might be able to help you. Another potential benefit is more opportunity for collaboration with others working on similar or related projects. Of course, the altruistic reason for keeping an open notebook is to contribute to the reproducibility and credibility of your research. For more information, check out Carl Boettiger’s great site, which tells you more about ONS and contains his own notebook.

Open Science is basically the same concept as open notebook science: you make sure anyone who wants information on your work, your data, or your process can find it easily. You may or may not keep a lab notebook online, however.

Open Source refers to software (it actually refers to lots of stuff, but I’m only going to talk about software here). From Wikipedia:

Open-source software is software whose source code is published and made available to the public, enabling anyone to copy, modify and redistribute the source code without paying royalties or fees.

An important component of the open source software model is the community.  Developers and individuals can rally around the code, making it better and working as a group to improve the software.

Open-source code can evolve through community cooperation. These communities are composed of individual programmers as well as very large companies.

The statistical program R is a great example of open source software with an active, strong community.

Open Data is the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. Data that is truly open should be released into the public domain (e.g., with a CC0 public domain dedication). If you use the ONEShare repository via DataUp, your data will be open data.

And finally, Open Knowledge encompasses all of these concepts. It’s described as a set of principles and methodologies related to the production and distribution of “knowledge works” in an open manner. In this definition, knowledge can include data, content and general information. To learn more about the OK movement, check out the materials and resources on the Open Knowledge Foundation website.

Three Cheers for Open Access!


Open Access has two flavors (green and gold), but the concept of “open” has many more: open science, open source, open knowledge… From Flickr by aquarian librarian.

If you weren’t aware, this week is Open Access Week.  The word open gets used quite a bit these days… like open notebooks, open science, open source, open content, open access, open data, open government, open repositories, and open knowledge. If you are not sure what all the hoopla is about, read on. Note: I will talk about “Open Stuff” for the next couple of blog posts, so stay tuned.

Let’s start with the honorary “Open” for the week: open access.  This phrase is used to describe “content”, a rather ambiguous term that can mean just about anything digital: pictures, data, articles, blog posts, etc. Open access content has three basic characteristics:

  1. Digital
  2. Free
  3. Online

That means there are no price or permission barriers, the full content is available, and it is made available immediately. Although Open Access content can be anything that fits the criteria above, the phrase is most frequently used to describe academic journals.

It might surprise some to learn that Open Access as a way of publishing is compatible with copyright, peer review, revenue, quality, indexing, and prestige.  If the journal is open access, there may be little or none of the following: printing, price negotiations for institutional subscriptions, site licenses, user authentication, access blocking.

Open access comes in two basic flavors: Green and Gold.

  • Gold OA refers to peer reviewed, open access journals. Examples include PLOS and Ecosphere. Sometimes authors are charged, but often fees are waived if their home institution has a subscription to the journal.
  • Green OA refers to open access repositories, traditionally with no peer review, that are institutional or discipline-specific. Examples include PubMed Central and MIT’s DSpace repository. Basically, these repositories are able to house articles for authors that may or may not be published in OA journals. This is known as “post-print archiving”, and is explicitly allowed by about 60% of journals and allowed by almost all others upon request.

Researchers: in case you weren’t paying attention, you can post-print archive ALL of your publications with an OA archive, which makes them open access! Why aren’t all researchers doing this already? Probably because they either don’t know they can (I didn’t until recently), or they don’t know what repositories are available to them to do this. If it’s the latter, here are a few suggestions:

  • Talk to your friendly institutional librarian.  They know all kinds of things (read my blog post about libraries being under-utilized), including whether your institution has a relationship with any repositories.
  • Check out OpenDOAR, the OA repositories list. It’s a complete list of “Green OA” repositories with over 2000 listings.

In honor of Open Access week, I challenge you to make at least one of your previously published articles open access. Go forth and open!

You might be wondering… why are some people against open access? What are the downsides?

Not surprisingly, most of the folks who aren’t big fans of OA are traditional scholarly publishers.  They contend that publishers play an important gatekeeper role, keeping out the riffraff articles that will drag down the journal’s reputation.  Traditional journals also have a strong record of facilitating peer review, editing articles, and indexing them with various services.  The Association of American Publishers (AAP) is leading the charge against OA requirements for publicly funded research, and in 2011 they helped sponsor a bill put before Congress called the Research Works Act.  Wikipedia sums up the bill nicely:

The bill contains provisions to prohibit open-access mandates for federally funded research and effectively revert the NIH’s Public Access Policy that requires taxpayer-funded research to be freely accessible online.

Needless to say, this bill would have severely restricted all kinds of open access progress that has been made in the last decade.  In response to this bill, an online petition was initiated called The Cost of Knowledge, which focused on the business practices of the academic publisher Elsevier.  It was signed by more than 10,000 scholars, who were calling for lower prices for journals and promotion of increased open access to information. The bill did not pass, and hopefully it or some new version of it never will.

Provenance at the #MSeScience Workshop

Last week some pretty fabulous speakers congregated in Chicago for the Microsoft eScience Workshop, which was scheduled to coincide with the 2012 IEEE eScience Workshop (IEEE stands for Institute of Electrical and Electronics Engineers, but their conference has evolved into a general tech conference).  Having never attended either conference, I wasn’t sure what to expect.  I was invited to participate because of DataUp; I took part in the “DemoFest” and led a panel on data curation that included an overview of the DataUp tool and DataONE.

German artist Gerhard Richter’s piece “Woman Descending the Stairs” is on display at the Art Institute of Chicago. Its provenance? Gift of the Lannan Foundation in 1997.

I was pleasantly surprised by the workshop’s breadth and depth of topics.  My favorite session by far, however, was titled “Publishing and eScience”, co-chaired by Mark Abbott (dean at the College of Earth, Ocean, and Atmospheric Sciences at Oregon State) and Jeff Dozier (faculty at the UCSB Bren School for Environmental Science and Management). Abbott and Dozier were joined by Jim Frew (also of UCSB) and Shuichi Iwata (Emeritus Professor of the University of Tokyo).  The topic du jour was how to maintain dataset provenance, especially for those datasets that are used for publishing results.

If the word “provenance” is throwing you for a loop, you aren’t alone. Many researchers aren’t familiar with this term as it relates to research.  It’s more commonly used in, say, the art or museum world.  From Wikipedia:

…from the French provenir, “to come from”, refers to the chronology of the ownership or location of a historical object.

In his talk on “When Provenance Gets Real”, Frew exploited our familiarity with provenance as an art term by describing the 2009 story of a painting being attributed to Leonardo da Vinci based on discovering his fingerprint on the canvas. (Read a summary in the Park West Gallery blog or from CNN). The painting was originally bought for $19,000 in 2007; based on its clarified provenance, however, it is worth around $160 million.  It is hard to estimate what a well-documented dataset with excellent provenance is worth; we should always operate under the assumption, however, that future users of our data might be able to do spectacularly important things. I like the fact that, in this scenario, I can be the Leonardo da Vinci of data.

Provenance is something I’ve blogged about before (see my two posts on workflows: informal and formal).  It’s a topic near and dear to my heart since I believe that documenting and archiving provenance will be the next major frontier for scientific research and advancement.  The discussion during the workshop session ran the gamut from informal to formal; one particularly fabulous moment was when Jim Frew projected a scripted workflow (UNIX, no less!) to demonstrate what provenance looks like in the real world. Frew went on to suggest that provenance for digital resources is the foundation for other important scientific concepts, like authenticity, trust, and reproducibility. Hear, hear!

I did a rough Storify with tweets from the workshop. Check it out: Storify for Microsoft eScience Workshop. You can also check out videos of the workshop presentations on the Microsoft Website.

Data to Receive Recognition from NSF

NSF just gave data an upgrade to First Class. From Flickr by acme

This week, the National Science Foundation announced changes to its Grant Proposal Guide (Full GPG for January 2013 here).  Included in the list of changes is something that has me pretty jazzed about the future of research data.

The biosketch section of the proposal is a place where the proposal’s researchers describe their background and why they are qualified to do the work proposed.  Included in the biosketch is a list of “relevant publications”. The change for 2013 is this: the wording has been updated to “relevant products”.

Chapter II.C.2.f(i)(c), Biographical Sketch(es), has been revised to rename the “Publications” section to “Products” and amend terminology and instructions accordingly. This change makes clear that products may include, but are not limited to, publications, data sets, software, patents, and copyrights.

WooHoo!! That’s pretty great news for those of us trying to get data the recognition it deserves.  This goes along with my idea of adding a “Data” section to researchers’ Curriculum Vitae and having data recognized in the Tenure and Promotion process at institutions. We are a bit closer to treating datasets as first-class products of research.

Researchers, take heed: gone are the days when you can leave your data on zip drives in your filing cabinet (*cough* data from my undergrad project *cough*). The NSF is incentivizing data sharing by recognizing its importance as a research output; hopefully institutions and funders will follow suit. How can you get ready? Make your data public, create a citation with a unique, persistent identifier, and start watching the credit for your work roll in.