Data Policies & Other Things

Last Friday I attended a seminar at UC Berkeley’s iSchool given by MacKenzie Smith, a terrific presenter and colleague who is affiliated with Creative Commons (among other prestigious organizations).  MacKenzie was talking about data governance, an issue I covered a few months back for the DCXL blog.  However, on Friday MacKenzie brought up a few things that I think warrant another post.

First, let’s define data governance for those who aren’t familiar with the concept. Based on Wikipedia’s entry, it’s the policies surrounding data, including data risk management, the assignment of roles and responsibilities for data, and, more generally, the formal management of data assets throughout the research cycle.  Now on to the new things:


Data policies are some combination of scary and confusing. Similar to Thing from The Addams Family. From monstermoviemusic.blogspot.com

Thing 1: Facts cannot be copyrighted. This makes sense for things like, say, simple math: I can’t claim “2+2=4” © 2011 Carly Strasser, because known facts can’t be copyrighted.  So what about data? One might argue that data are facts (assuming you are doing science correctly). That would mean you don’t own the copyright to your data. Eeek! Scary thought, I know. You might be saved by the fact that a unique arrangement or collection of facts can be copyrighted. Huh.  Data in a database? Can’t be copyrighted. The database itself? Can be copyrighted. This makes intellectual property around data quite messy.

Thing 2: Did you know that “attribution” can be legally imposed? The remedy for a lack of attribution where warranted is a lawsuit, and Creative Commons licenses are built on this fact.  This is not true, however, of citation.  Citation is a “scholarly norm” with no underlying legal force.

Thing 3: Creative Commons is now working on a CC 4.0 license. Some of the goals of this new version are internationalization, interoperability, and improved support for data, science, and education. They want input from scientists, librarians, administrators, and anyone else who might have an opinion about intellectual property, open science, and governance in general.

Thing 4: The Open Knowledge Foundation is working on concepts related to governance with a global perspective.  They have a range of projects in the works for improving the sharing of knowledge, data, and content.

Thing 5: While waiting for a consensus on how to properly govern digital data and other digital content, many data providers are dealing with governance by constructing data usage agreements.  These are contracts created by lawyers for a specific data provider (e.g., an online database).  The problem with data usage agreements is that they are all different.  This means that if you want to use data from a source that requires you to agree to their terms, you have three options:

  1. Carefully read the terms before agreeing (and who does that?)
  2. Click that you agree without reading and hope you don’t accidentally break any rules
  3. Find the data that you need from another source that doesn’t have terms and conditions for data usage.

Item three points to one of the serious downsides of data usage agreements: researchers may avoid using data if they don’t understand the terms of use.  Furthermore, the terms only apply to the party that agreed to the contract (i.e., checked the box).  If that party (potentially illegally) shares those data with someone else, that someone else is not bound by the terms.

Thing 6: What about international collaborations? As you might imagine, this adds yet another layer of complication. As a scientist, you are expected to look into any data policies that may apply to your collaborators. From the NSF DMP FAQ (hello, alphabet soup!):

16. If I participate in a collaborative international research project, do I need to be concerned with data management policies established by institutions outside the United States?

Yes. There may be cases where data management plans are affected by formal data protocols established by large international research consortia or set forth in formal science and technology agreements signed by the United States Government and foreign counterparts. Be sure to discuss this issue with your sponsored projects office (or equivalent) and your international research partner when first planning your collaboration.

Hmm. It looks like the waters are very muddy right now, and until they clear, researchers should watch their step.

Data Literacy Instruction: Training the Next Generation of Researchers

This post was contributed by Lisa Federer, Health and Life Sciences Librarian at UCLA Louise M. Darling Biomedical Library

In my previous life as an English professor, every semester I looked forward to the information literacy instruction that our librarian did for my classes.  I always learned something new, and, even better, my students no longer tried to cite Wikipedia as a source in their research papers.  Now that I’m a health and life sciences librarian, the tables are turned, and I’m the one responsible for making sure that my patrons are equipped to locate and use the information they need.  When it comes to the people I work with in the sciences, often the information they need is not an article or a book, but a dataset.  As a result, I am one of many librarians starting to think about best practices for providing data literacy instruction.

According to the National Forum on Information Literacy, information literacy is “the ability to know when there is a need for information, to be able to identify, locate, evaluate, and effectively use that information for the issue or problem at hand.”  The American Library Association has outlined a list of Information Literacy Competency Standards for Higher Education.  So far, a similar list of competencies for data literacy instruction has not been defined, but the general concepts are the same – researchers need to know how to locate data, evaluate it, and use it.  More importantly, as data creators themselves, they need to know how to make their datasets available and useful not just to their own research group, but to others.

Fortunately, a number of groups around the country are working on developing data literacy curricula.  Teams from Purdue University, Stanford University, the University of Minnesota, and the University of Oregon have received a grant from the Institute of Museum and Library Services (IMLS) to “develop a training program in data information literacy for graduate students who will become the next generation of scientists.”  Results and resources will eventually be available on their project website.  Also working under the auspices of an IMLS grant, a team from the University of Massachusetts Medical School and Worcester Polytechnic Institute has developed a set of seven curricular modules for teaching data literacy.  Their curriculum centers on teaching researchers what they would need to know to complete a data management plan as required by the National Science Foundation (NSF) and several other major grant funders.

All of the work that these other institutions have done is a fantastic start, but at my institution, the researchers and students are very busy, and not likely to commit to a seven-session data literacy program.  Nonetheless, it’s still important that they learn how to manage, preserve, and share their data, not only because many funders now require it, but also because it’s the right thing to do as a member of the scientific community.  Thus, my challenge has been to design a one-off session that would be applicable across a variety of scientific (and perhaps even social science) fields.  In order to do so, I’ve started with my own list of core competencies for data literacy instruction, including:

  • understanding the “data life cycle” and the importance of sharing and preservation across the entire life cycle, especially for rare or unique datasets
  • knowing how to write a data management plan that will fulfill the requirements of funders like NSF
  • making appropriate choices about file forms and formats (such as by choosing open rather than proprietary standards)
  • keeping data organized and discoverable using file naming standards and appropriate metadata schemas (see the short sketch following this list)
  • planning for long-term, secure storage of data
  • promoting sharing by publishing datasets and assigning persistent identifiers like DOIs
  • awareness of data as scholarly output that should be considered in the context of promotion and tenure
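To make the file-naming item concrete, here is a minimal sketch of what a naming convention can look like and how it can be checked automatically. The convention itself (project_site_YYYYMMDD_vN) is invented for illustration; real conventions vary by lab and discipline:

```python
import re

# Hypothetical convention: project_site_YYYYMMDD_vN.csv (or .txt).
# This pattern is illustrative only; adapt it to your own lab's needs.
PATTERN = re.compile(r"^[a-z0-9]+_[a-z0-9]+_\d{8}_v\d+\.(csv|txt)$")

for name in ["streamflow_sitea1_20120315_v2.csv",
             "final data (new) March.xlsx"]:
    status = "ok" if PATTERN.match(name) else "nonconforming"
    print(name, "->", status)
```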

Does this list cover everything a researcher would need to know to effectively manage their data?  Almost certainly not, but as with any single session, my goal is to introduce learners to the major issues and let them know that the library has the expertise to assist them with the more complicated issues that will inevitably arise.  Supporting the data needs of researchers is a daunting task, but librarians already have much of the knowledge and skills to provide this assistance – we simply need to adapt our knowledge of information structures and best practices to this burgeoning area.

As research becomes increasingly data-driven, libraries will be doing a great service to individuals and the research community as a whole by helping to create researchers who are good data stewards.  Like my formerly Wikipedia-dependent students, many of our researchers are still taking shortcuts when it comes to handling their data because they simply don’t know any better.  It’s up to librarians and other information professionals to ensure that the valuable research that is going on at our institutions remains available for future generations of researchers.

Data Publishing and the Coproduction of Quality

This post is authored by Eric Kansa

There is a great deal of interest in the sciences and humanities around how to manage “data.” By “data,” I’m referring to content that has some formal and logical structure needed to meet the requirements of software processing. Of course, distinctions between structured versus unstructured data represent more of a continuum or spectrum than a sharp line. What sets data apart from texts, however, is that data are usually intended for transactional applications (with queries and visualizations) rather than narrative ones.

The uses of data versus texts make a big difference in how we perceive “quality.” If there is a typo in a text, it usually does not break the entire work. Human readers are pretty forgiving with respect to those sorts of errors, since humans interpret texts via pattern recognition heavily aided by background knowledge and expectations. Small deviations from a reader’s expectations about what should be in a text can be glossed over or even missed entirely. If noticed, many errors annoy rather than confuse. This inherently forgiving nature of text makes editing and copy-editing attention-demanding tasks. One has to struggle to see what is actually written on a page rather than getting the general gist of a written text.

Scholars are familiar with editorial workflows that transform manuscripts into completed publications. Researchers submit text files to journal editors, who then circulate manuscripts for review. When a paper is accepted, a researcher works with a journal editor through multiple revisions (many suggested by peer-review evaluations) before the manuscript is ready for publication. Email, versioning, and edit-tracking help coordinate the work. The final product is a work of collaborative “coproduction” between authors, editors, reviewers, and type-setters.

What does this have to do with data?

Human beings typically don’t read data. We use data mediated through software. The transactional nature of data introduces a different set of issues impacting the quality and usability of data. Whereas small errors in a text often go unnoticed, such errors can have dramatic impacts on the use and interpretation of a dataset. For instance, a misplaced decimal point in a numeric field can cause problems for even basic statistical calculations. Such errors can also break visualizations.
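To see how little it takes, consider a toy example (the numbers are invented; this is not from any real dataset):

```python
import statistics

# Five hypothetical measurements, in millimeters:
lengths = [12.4, 11.9, 13.1, 12.7, 12.2]
print(statistics.mean(lengths))       # 12.46

# The same data with one misplaced decimal point (127 instead of 12.7):
lengths_typo = [12.4, 11.9, 13.1, 127, 12.2]
print(statistics.mean(lengths_typo))  # 35.32 -- a single typo dominates the result
```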

These issues don’t just impact single datasets; they can also wreak havoc in settings where multiple individual datasets need to be joined together. I work mainly on archaeological data dissemination. Archaeology is an inherently multidisciplinary practice, involving inputs from different specialists in the natural sciences (especially zoology, botany, human osteology, and geomorphology), the social sciences, and the humanities. Meaningful integration of these diverse sources of structured data represents a great information challenge for archaeology. Archaeology also creates vast quantities of other digital documentation. A single field project may result in tens of thousands of digital photos documenting everything from excavation contexts to recovered artifacts. Errors and inconsistencies in identifiers can create great problems in joining together disparate datasets, even from a single archaeological project.
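Here is a hypothetical illustration (invented identifiers, and a pandas sketch of my own rather than anything from an actual project) of how inconsistent identifiers silently break a join:

```python
import pandas as pd

# Two specialists' tables that should join on a shared context identifier,
# recorded inconsistently ("Locus-101" vs. "locus 101"):
contexts = pd.DataFrame({"context_id": ["Locus-101", "Locus-102"],
                         "phase": ["Bronze Age", "Iron Age"]})
bones = pd.DataFrame({"context_id": ["locus 101", "Locus-102"],
                      "taxon": ["Ovis aries", "Bos taurus"]})

# A naive join silently drops the mismatched record:
print(bones.merge(contexts, on="context_id"))  # only Locus-102 survives

# Normalizing the identifiers before joining recovers the lost row:
def norm(s):
    return s.str.lower().str.replace(r"[\s-]+", "-", regex=True)

bones["context_id"] = norm(bones["context_id"])
contexts["context_id"] = norm(contexts["context_id"])
print(bones.merge(contexts, on="context_id"))  # both rows now join
```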

It is a tremendous challenge to relate all of these different datasets and media files together in a usable manner. The challenge is further compounded because archaeology, like many small sciences, typically lacks widely used recording terminologies and standards. Each archaeological dataset is custom crafted by researchers to address a particular suite of research interests and needs. This means that workflows and supporting software to find and fix data problems need to be pretty generalized.

Fortunately, archaeology is not alone in needing tools to promote data quality. Google Refine helps meet these needs. Google Refine leverages the transactional nature of data to summarize and filter datasets in ways that make many common errors apparent. Once errors are discovered, Google Refine has powerful editing tools to fix problems. Users can also undo edits to roll-back fixes and return a dataset to an earlier state.
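As a rough analogue (a pandas sketch of my own, not how Google Refine is implemented), the core idea behind a Refine-style text facet is simply to tally every distinct value in a column so that typos and near-duplicates stand out:

```python
import pandas as pd

df = pd.DataFrame({"material": ["ceramic", "Ceramic", "ceramic ",
                                "lithic", "cermic", "lithic"]})

# The tally exposes four spellings of what is almost certainly one category:
print(df["material"].value_counts())

# A bulk edit then collapses them, much like Refine's cluster-and-edit step:
df["material"] = (df["material"].str.strip().str.lower()
                                .replace({"cermic": "ceramic"}))
print(df["material"].value_counts())  # ceramic: 4, lithic: 2
```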

With funding from the Alfred P. Sloan Foundation, we’re working to integrate Google Refine in a collaborative workflow called “Data Refine”. Again, the transactional nature of data helps shape this workflow. Because use of data is heavily mediated by software, datasets can be seen as an integral part of software. This thinking motivated us to experiment with using software debugging and issue tracking tools to help organize collaborative work on editing data. Debugging and issue tracking tools are widely used and established ways of improving software quality. They can play a similar role in the “debugging” of data.
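Purely as illustration (the field names below are invented, not taken from any particular tracker), a “data bug” can carry the same anatomy as a software bug report: where the problem is, what looks wrong, and who owns the fix:

```python
# Hypothetical structure of a data issue, tracked like a software bug:
data_issue = {
    "dataset": "zooarchaeology_2011_v3",   # invented dataset name
    "column": "weight_g",
    "row_id": 1042,
    "summary": "value 3500 is likely a misplaced decimal (expected ~3.5)",
    "status": "open",
    "assigned_to": "data_editor",
}
print(data_issue["summary"])
```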

We integrated Google Refine and the PHP-based Mantis issue tracker to support collaboration in improving data quality. In this approach, contributing researchers and data editors collaborate in the coproduction of higher quality, more intelligible and usable datasets. These workflows try to address both supply and demand needs in scholarship. Researchers face strong and well known career pressures. Tenure may be worth $2 million or more over the course of a career, and its alternative can mean complete ejection from a discipline.  A model of editorially supervised “data sharing as publication” can help better align the community’s interest in data dissemination with the realities of individual incentives. On the demand side, datasets must have sufficient quality and documentation. To give context, data often need to be related and linked with shared concepts and with other datasets available on the Web (as in the case of “Linked Open Data” scenarios).

All of these processes require effort. New skills, professional roles, and scholarly communication channels need to be created to meet the specific requirements of meaningful data sharing. Tools and workflows as discussed here can help make this effort a bit more efficient and better suited to how data are used in research.

Communication Breakdown: Nerds, Geeks, and Dweebs

Last week the DCXL crew worked on finishing up the metadata schema that we will implement in the DCXL project.  WAIT! Keep reading!  I know the phrase “metadata schema” doesn’t necessarily excite folks – especially science folks.  I have a theory for why this might be, and it can be boiled down to a systemic problem I’ve encountered ever since becoming deeply entrenched in all things related to data stewardship: communication breakdown.

I began working with the DataONE group in 2010, and I was quickly overwhelmed by the rather steep learning curve I encountered related to data topics.  There was a whole vocabulary set I had to learn, an entire ecosphere of software and hardware, and a hugely complex web of computer science-y, database-y, programming-y concepts to unpack.  I persevered because the topics were interesting to me, but I often found myself spending time on websites that were indecipherable to the average intelligent person, reading 50-page “quick start guides”, or getting entangled in a rabbit hole of Wikipedia entries for new concepts related to data.


Fredo Corleone was smart. Not stupid like everybody says. Nerds, Geeks, and Dweebs are all smart – just in different ways. From godfather.wikia.com

I love learning, so I am not one to complain about spending time exploring new concepts. However, I would argue that my difficulties represent a much bigger issue plaguing advances in data stewardship: communication issues.  It’s actually quite obvious why these communication problems exist.  There are a lot of smart people involved in data, all of whom have very divergent backgrounds.  I suggest that the smart people can be broken down into three camps: the nerds, the geeks, and the dweebs.  These stereotypes should not be considered insults; rather, they are an easy way to refer to scientists, librarians, and computer types. Check out the full Venn diagram of nerds here.

The Nerds. This is the group to which I belong.  We are specially trained in a field and have in-depth knowledge of our pet projects, but general education about computers, digital data, and data preservation is not part of our training.  Certainly that might change in the near future, but in general we avoid the command line like the plague, prefer user-friendly GUIs, and resist learning new software, tools, etc. that might take time away from our pet projects.

The Geeks. Also known as computer folks.  These folks might be developers, computer scientists, information technology specialists, database managers, etc.  They are uber-smart, but their uber-smart brains do not work like mine.  From what I can tell, geeks can explain things to me in one of two ways:

  1. “To turn your computing machine on, you need to first plug it in. Then push the big button.”
  2. “First go to bluberdyblabla and enter c>*#&$) at the prompt. Make sure the juberdystuff is installed in the right directory, though. Otherwise you need to enter #($&%@> first and check the shumptybla before proceeding.”

In all fairness, (1) occurs far less often than (2).  But often you get (1) after trying to get clarification on (2).  How to remedy this? First, geeks should realize that our brains don’t think in terms of directories and command line prompts. We are more comfortable with folders we can color code and GUIs that allow us to use the mouse for making things happen.  That said, we aren’t completely clueless. Just remember that our vocabularies are often quite different from yours.  Often I’ve found myself writing down terms in a meeting so I can look them up later.  Words like “elements” and “terminal” are not unfamiliar in and of themselves; however, the contexts in which they are used are completely new to me.  And that doesn’t even count the truly unfamiliar words and acronyms, like API, GitHub, Python, and XML.

The Dweebs.  Also known as librarians.  These folks are more often being called “information professionals”, but the gist is the same – they are all about understanding how to deal with information in all its forms.  There’s certainly a bit of crossover with the computer types, especially when it comes to data.  However, librarian types are fundamentally different in that they are often concerned with information generated by other people: put simply, they want to help, or at least interact with, data producers.  There are certainly a host of terms that are used more often by librarian types: “indexing” and “curation” come to mind.  Check out the DCXL post on libraries from January.

Many of the projects in which I am currently involved require all three of these groups: nerds, geeks, and dweebs.  I watch each group struggle to communicate their points to the others, and too often decide that it’s not worth the effort.  How can we solve this communication impasse? I have a few ideas:

  • Nerds: open your minds to the possibility that computer types and librarian types might know better ways of doing what you are doing.  Tap the resources that these groups have to offer. Stop being scared of the unknown. You love learning or you wouldn’t be a scientist; devote some of that love to improving your computer savvy.
  • Geeks: dumb it down, but not too much. Recognize that scientists and librarians are smart, but potentially in very different ways than you.  Also, please recognize that change will be incremental; we will not universally adopt whatever you think is the best possible set of tools or strategies, no matter how “totally stupid” our current workflow seems.
  • Dweebs: spend some time getting to know the disciplines you want to help. Toot your own horn: you know A LOT of stuff that nerds and geeks don’t, and you are all so darn shy! Make sure both geeks and nerds know of your capacity to help, and your ability to lend important information to the discussion.

And now a special message to nerds (please see the comment string below about this message and its potential misinterpretation).  I plead with you to stop reinventing the wheel.  As scientists have begun thinking about their digital data, I’ve seen a scary trend of them taking the initiative to invent standards, start databases, or create software.  It’s frustrating to see, since there is a whole set of folks out there who have been working on databases, standards, vocabularies, and software: librarians and computer types.  Consult with them rather than starting from scratch.

In the case of dweebs, nerds, and geeks, the whole working together is much, much better than the sum of our parts.

Popular Demand for Public Data

Scanned image of a 1940 Census Schedule (from http://1940census.archives.gov)

The National Archives and Records Administration digitized 3.9 million schedules from the 1940 U.S. census

When talking about data publication, many of us get caught up in protracted conversations aimed at carefully anticipating and building solutions for every possible permutation and use case. Last week’s release of U.S. census data, in its raw, un-indexed form, however, supports the idea that we don’t have to have all the answers to move forward.

Genealogists, statisticians and legions of casual web surfers have been buzzing about last week’s release of the complete, un-redacted collection of scanned 1940 U.S. census data schedules. Though census records are routinely made available to the public after a 72-year privacy embargo, this most recent release marks the first time that the census data set has been made available in such a widely accessible way: by publishing the schedules online.

In the first three hours that the data was available, 22.5 million hits crippled the 1940census.archives.gov servers. The following day, nearly three times that number of requests continued to hammer the servers as curious researchers scoured the census data looking for relatives of missing soldiers; hoping to find out a little bit more about their own family members; or trying to piece together a picture of life in post-Great Depression, pre-WWII America.

For the time being, scouring the data is a somewhat laborious task of narrowing in on the census schedules for a particular district, then performing a quick visual scan for people’s names. The 3.9 million scanned images that make up the data set are not, in other words, fully indexed — in fact, only a single field (the Enumeration District number field) is searchable. Encoding that field alone took six full-time archivists three months.

The task of encoding the remaining 5.3 billion fields is being taken up by an army of volunteers. Some major genealogy websites (such as Ancestry.com and MyHeritage.com) hope the crowd-sourced effort will result in a fully indexed, fully searchable database by the end of the year.

Release day for the census has been described as “the Super Bowl for genealogists.” This excitement about data, and participation by the public in transforming the data set into a more useable, indexed form are encouraging indications that those of us interested in how best to facilitate even more sharing and publishing of data online are doing work that has enormous, widely-appreciated value. The crowd-sourced volunteer effort also reminds us that we don’t necessarily have to have all the answers when thinking about publishing data. In some cases, functionality that seems absolutely essential (such as the ability to search through the data set) is work that can (and will) be taken up by others.

So, how about your data set(s)? Who are the professional and armchair domain enthusiasts that will line up to download your data? What are some of the functionality roadblocks that are preventing you from publishing your data, and how might a third party (or a crowd sourced effort) work as a solution? (Feel free to answer in the comments section below.)

Data Citation Redux

I know what faithful DCXL readers are thinking: didn’t you already post about data citation? (For the unfaithful among you, check out this post from last November). Yes, I did. But I’ve been inspired to post yet again because I just attended an amazing workshop about all things data citation related.

The workshop was hosted by the NCAR Library (NCAR stands for National Center for Atmospheric Research) and took place in Boulder on Thursday and Friday of last week.  Workshop organizers expected about 30 attendees; more than 70 showed up to learn more about data citation.  Hats off to the organizers – there were healthy discussions among attendees and interesting presentations by great speakers.

One of the presentations that struck me most was by Dr. Tim Killeen, Assistant Director for the Geosciences Directorate at NSF.  His talk (available on the workshop website) discussed the motivation for data citation, and what policies have begun to emerge.  Near the end of a rather long string of reports about data citation, data sharing, and data management, Killeen said, “There is a drumbeat into Washington about this.”


If Led Zeppelin drummer John Bonham were still alive, he would be leading the data charge into DC. Bonham was voted by Rolling Stone readers as the best drummer of all time. Photo from drummerworld.com

This phrase stuck with me long after I flew home because it juxtaposed two things I hadn’t considered as being related: Washington DC and data policy.  Yes, I understand that NSF is located in Washington, and that very recently the White House announced some exciting Big Data funding and initiatives. But Washington DC as a whole – Congress, lobbyists, lawyers, judges, etc. – would notice a drumbeat about data? I must say, I got pretty excited about the idea.

What are these reports cited by Killeen?  In chronological order:

The NSB report on long-lived digital data had yet another great phrase that stuck with me:

Long-lived digital data collections are powerful catalysts for progress and for democratization of science and education

Wow. I really love the idea of democratized data.  It warms the cockles, doesn’t it?  With regard to DCXL, the link is obvious: one of the features we are developing is the generation of a data citation for your Excel dataset.
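As a rough sketch of what such a feature might produce (my own illustration, not the actual DCXL implementation; the DOI below uses the DataCite test prefix and is not a real identifier):

```python
# Build a DataCite-style citation string from dataset metadata.
def data_citation(creators, year, title, publisher, doi):
    return "{} ({}): {}. {}. doi:{}".format(
        "; ".join(creators), year, title, publisher, doi)

print(data_citation(["Strasser, C."], 2012,
                    "Example stream chemistry measurements",
                    "California Digital Library", "10.5072/FK2EXAMPLE"))
```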

The Future of Metrics in Science

Ask any researcher what they need for tenure, and the answer is virtually the same across institutions and disciplines: publications.  The “publish or perish” model has reigned supreme for generations of scientists, despite its rather annoying disregard for the quality (as opposed to quantity) of those publications, the collaborations a researcher has established, or even the novelty or difficulty of a particular research project.  This archaic measure of impact tends to rely on measures like a scientist’s number of citations and the impact factor of the journals in which they publish.

With the upswing in blogs, Twitter feeds, and academic social sites like Mendeley, Zotero, and (my favorite) CiteULike, some folks are working on developing a new model for measuring one’s impact on science.  Jason Priem, a graduate student at UNC’s School of Information and Library Science, coined the term “altmetrics” rather recently, and the idea has spread like wildfire.

altmetrics is the creation and study of new metrics based on the Social Web for analyzing, and informing scholarship.

The concept is simple: instead of using traditional metrics for measuring impact (citation counts, journal impact factors), Priem and his colleagues want to take into account more modern measures of impact like number of bookmarks, shares, or re-tweets.  In addition, altmetrics seeks to consider not only publications, but associated data or code downloads.
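To make that concrete, here is a toy sketch. The event types are the kinds of inputs altmetrics considers, but the counts and the weights are invented for illustration; no one has settled on a real weighting scheme:

```python
# Invented counts for one paper and its associated dataset:
events  = {"citations": 12, "bookmarks": 48, "shares": 31,
           "retweets": 95, "data_downloads": 210}

# Arbitrary illustrative weights (not a published formula):
weights = {"citations": 1.0, "bookmarks": 0.25, "shares": 0.25,
           "retweets": 0.1, "data_downloads": 0.5}

score = sum(count * weights[kind] for kind, count in events.items())
print("toy impact score:", score)  # 146.25
```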


The original alternatives: The Sex Pistols. From Arroz Do Ceu (limpa-vias.blogspot.com). Read more about the beginnings of alternative rock in Dave Thompson’s book “Alternative Rock”.

Old-school scientists and Luddites might balk at the idea of measuring a scientist’s impact on the community by the number of re-tweets their article received, or by the number of downloads of their dataset.  This reaction can be attributed to several causes, one of which may be an irrational fear of change.  But the reality is that the landscape of science is changing dramatically, and the trend towards social media as a scientific tool is only likely to continue.  See my blog post on why scientists should tweet for more information on the benefits of embracing one of the aspects of this trend.

Need another reason to get onboard? Funders see the value in altmetrics.  Priem, along with his co-PI (and my DataONE colleague) Heather Piwowar, just received $125K from the Sloan Foundation to expand their Total Impact project.  Check out the Total Impact website for more information, or read the UNC SILS news story about the grant.

The DCXL project feeds right into the concept of altmetrics.  By providing citations for datasets that are housed in data centers, the impact of a scientist’s data can be easily incorporated into measures of their overall scholarly impact.

Trending: Big Data

Last week, the White House Office of Science and Technology Policy hosted a “Big Data” R&D event, which was broadcast live on the internet (recording available here, press release available as a pdf).  GeekWire did a great piece on the event that provides context.  Wondering what “Big Data” means? Keep reading.


“Howdy Folks!” Big Tex from the State Fair of Texas thinks Big Data is the best kind of data. From Flickr by StevenM_61. For more info on Big Tex, check out http://en.wikipedia.org/wiki/Big_Tex

Big Data is a phrase being used to describe the huge volume of data produced by modern technological infrastructure; some examples include social media and remote sensing instruments. Facebook, Twitter, and other social media are producing huge amounts of data that can be analyzed to understand trends on the Internet.  Satellites and other scientific instruments are producing constant streams of data that can be used to assess the state of the environment and understand patterns in the global ecosphere.  In general, Big Data is just what it sounds like: a sometimes overwhelming amount of information, flooding scientists, statisticians, economists, and analysts with an ever-increasing pile of fodder for understanding the world.

Big Data is often used alongside the “Data Deluge”, which is a phrase used to describe the onslaught of data from multiple sources, all waiting to be collated and analyzed.  The phrase brings about images of being overwhelmed by data: check out The Economist‘s graphic that represents the concept.  From Wikipedia:

…datasets are growing so large that they become awkward to work with using on-hand database management tools. Difficulties include capture, storage, search, sharing, analytics, and visualizing.

Despite the challenges of Big Data, folks are hungry for big data sets to analyze.  Just this week, the 1940 US Census data was released; there was so much interest in downloading and analyzing the data that the servers crashed.  You only need to follow the Twitter hashtag #bigdata to see that it’s a very hot topic right now. Of course, Big Data should not be viewed as a bad thing.  There is no such thing as too much information; it’s simply a matter of finding the best tools for handling all of those data.

Big Data goes hand-in-hand with Big Science, which is a term first coined back in 1961 by Alvin Weinberg, then the director of the Oak Ridge National Laboratory.  Weinberg used “Big Science” to describe large, complex scientific endeavors in which society makes big investments in science, often via government funding.  Examples include the US space program, the Sloan Digital Sky Survey, and the National Ecological Observatory Network.  These projects produce mountains of data, sometimes continuously 24 hours a day, 7 days a week.  Therein lies the challenge and awesomeness of Big Data.

What does all of this mean for small datasets, like those managed and organized in Excel?  The individual scientist with their unique, smaller scale dataset has a big role in the era of Big Data.  New analytics tools for meta-analysis offer a way for individuals to participate in Big Science, but we have to be willing to make our data standardized, useable, and available.  The DCXL add-in will facilitate all three of these goals.

In the past, meta-analysis of small data sets meant digging through old papers, copying data out of tables or reconstructing data from graphs.  Wondering about the gland equivalent of phenols from castoreum? Dig through this paper and reconstruct the data table in Excel.  Would you like to combine that data set with data on average amounts of neutral compounds found in one beaver castor sac? That’s another paper to download and more data to reconstruct.  By making small datasets available publicly (with links to the datasets embedded in the paper), and adhering to discipline-wide standards, meta-analysis will be much easier and small datasets can be incorporated into the landscape of Big Science.  In essence, the whole is greater than the sum of the parts.
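A short sketch shows the payoff of shared standards (the compounds are plausible, but the numbers and column names are invented): when two labs publish their measurements with common columns and units, pooling them for meta-analysis is nearly effortless:

```python
import pandas as pd

# Two hypothetical labs reporting castoreum compounds in the same schema:
lab_a = pd.DataFrame({"compound": ["phenol", "catechol"],
                      "mg_per_sac": [4.2, 1.1]})
lab_b = pd.DataFrame({"compound": ["phenol", "guaiacol"],
                      "mg_per_sac": [3.8, 0.7]})

pooled = pd.concat([lab_a, lab_b], ignore_index=True)
print(pooled.groupby("compound")["mg_per_sac"].mean())
```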

Think you can take on the Data Deluge? NSF’s funding call for big data proposals is available here.

DataShare: A Plan to Increase Scientific Data Sharing

This post was co-authored by Dr. Michael Weiner, CIND director at UCSF

The DataShare project is a collaboration between the University of California San Francisco’s Clinical and Translational Science Institute, the UCSF Library, and the UC Curation Center (UC3) at the California Digital Library.  The goal of the DataShare project is to achieve widespread voluntary sharing of scientific data at the time of publication.  This will be achieved by creating a data sharing website that could be used by all UCSF investigators, and ultimately by others in the UC system and other institutions.  Currently, data sharing is mostly done by large, well-funded, multi-investigator projects.  There would be great benefit if much more raw data were widely shared, especially data from individual investigators.


Imagine the possible scientific advances if we pooled our data the way that “We Are the World” pooled celebrity voices. From live.drjays.com

This project is the brainchild of Michael Weiner, M.D., the director of the Center for Imaging of Neurodegenerative Diseases.  Weiner’s experience as the Principal Investigator of the Alzheimer’s Disease Neuroimaging Initiative (ADNI) led him to conclude that widespread data sharing can be achieved now, with great scientific and economic benefits.  All ADNI raw data is immediately shared (at UCLA/LONI/ADNI) with all scientists in the world without embargo. The project has been very successful, with more than 300 publications and many more submitted.  This success demonstrates the feasibility and benefits of sharing data.

Individual initiatives:

The laboratory at the Center for Imaging of Neurodegenerative Diseases began to share data at the time of publication in 2011. This included both the raw data and a description of how the raw data were processed and analyzed, leading to the findings in the publication.  For the DataShare project, the following expansions to data sharing are planned:

  1. ADNI scientists will be encouraged to share the raw data of their ADNI papers, and other papers from their laboratories
  2. Other faculty in the Department of Radiology at UCSF and our collaborators in Neurology and Psychiatry at UCSF will be encouraged to share their raw data
  3. Chancellor, Deans, and Department Chairs at UCSF will be urged to make more widespread voluntary sharing of scientific data a UCSF priority/policy; this may include providing storage space for shared data and/or development of policies which would reward data sharing in the hiring and promotion process
  4. The example UCSF sets may then encourage the entire University of California system to implement similar changes
  5. Other collaborators and colleagues in other universities around the world will then be encouraged to adopt similar policies
  6. A “data sharing impact factor” will be developed and tested which will allow scientists to cite others’ data that they use and provide metrics for how others are using their data.

Institutional initiatives:

The project seeks to encourage involvement by the National Institutes of Health (NIH), the National Science Foundation (NSF), and the National Library of Medicine (NLM), to promote and facilitate sharing of scientific data. This will be accomplished via five tasks:

  1. Encourage NIH and NSF to emphasize and expand their existing policies concerning data sharing and notify the scientific community of this greater emphasis
  2. Promote the establishment of a small group of committed individuals who can help formulate policy for NIH in this area, including a policy framework that favors open availability of scientific data.
  3. Establish technical mechanisms for data sharing, such as a national system for storage of all raw scientific data (e.g., a national data repository or data bank).  This repository may be created by NLM, or be housed at universities, foundations, or private companies (e.g., Dataverse).
  4. Work to develop incentives for scientists and institutions to share their raw data. This may include
    1. Requesting reports in non-competitive reviews, competitive reviews, and/or new applications
    2. Instructing the reviewers to consider data sharing in assessing priority scores in grant reviews
    3. Acknowledgment in publications
    4. Providing affordable access to infrastructure (i.e., software and media) that facilitates data sharing
    5. Encouraging NIH to provide funding for small grants aimed to promote and take advantage of shared data.  Examples include projects that utilize data mining or cloud computing.

The potential gains from widespread sharing of raw scientific data greatly outweigh the relatively small costs involved in developing the necessary infrastructure. Industries likely to benefit from increased accessibility of large amounts of raw data include the pharmaceutical and health care industry, chemistry, technology, engineering, etc. We also expect new technologies and new companies to develop to take advantage of newly available data.  Furthermore, there will be substantial societal benefits gained by widespread sharing of scientific data, primarily due to the ability to link data sets and repurpose data for making unforeseen discoveries.