Your Time is Gonna Come

You know what they say:  Timing is everything.  Time enters into the data management and stewardship equation at several points and warrants discussion here.  Why timeliness? Last week at the University of North Texas Open Access Symposium, several great speakers touched on the timeliness of data management, organization, and sharing.  It left me wondering whether there is any agreement about the timing of data-related activities, so here I’ve posted my opinions about time at a few points in the data life cycle.  Feel free to comment on this post with your own opinions.

1. When should you start thinking about data management?  The best answer to this question is as soon as possible.  The sooner you plan, the less likely you are to be surprised by issues like metadata standards or funder requirements (see my previous DCXL post about things you will wish you had thought about documenting).  The NSF mandate for data management plans is a great motivator for thinking sooner rather than later, but let’s face facts: the DMP requirement is only two pages, and you can create one that might pass muster without really thinking too carefully about your data.  I encourage everyone to go well beyond funder requirements and thoughtfully plan out your approach to data stewardship.  Spend plenty of time doing this, and return to your plan often during your project to update it.

dark side of the rainbow image

If you have never watched the Wizard of Oz while listening to Pink Floyd’s Dark Side of the Moon album, you should. Of course, timing is everything: start the album on the third roar of the MGM lion. Image from horrorhomework.com

2. When should you start archiving your data? By archiving, I do not mean backing up your data (the answer to that one is constantly).  I am referring to the act of putting your data into a repository for long-term (20+ years) storage. This is a more complicated question of timeliness. Issues that should be considered include:

  • Is your data collection ongoing? Continuously updated sensor or instrument data should be archived from the moment collection begins.
  • Is your dataset likely to undergo a lot of versions? You might wait to begin archiving until you get close to your final version.
  • Are others likely to want access to your data soon?  Especially colleagues or co-authors? If the answer is yes, begin archiving early so that you are all using the same datasets for analysis.

3. When should you make your data publicly accessible?  My favorite answer to this question is also as soon as possible.  But this might mean different things for different scientists.  For instance, making your data available in near-real time, either on a website or in a repository that supports versioning, allows others to use it, comment on it, and collaborate with you while you are still working on the project.  This approach has its benefits, but also tends to scare off some scientists who are worried about being scooped.  So if you aren’t an open data kind of person, you should make your data publicly available at the time of publication.  Some journals are already requiring this, and more are likely to follow.

There are some who would still balk at making data available at publication: What if I want to publish more papers with this dataset in the future?  In that case, have an honest conversation with yourself.  What do you mean by “future”?  Are you really likely to follow through on those future projects that might use the dataset?  If the answer is no, you should make the data available to enhance your chances for collaboration. If the answer is yes, give yourself a little bit of temporal padding, but not too much.  Think about enforcing a deadline of two years, at which point you make the data available whether you have finished those dream projects or not.  Alternatively, find out if your favorite data repository will enforce your deadline for you: you may be able to provide them with a release date for your data, whether or not they hear from you first.

Data Diversity is Okay

At risk of sounding like a motivational speaker, this is such an exciting time to be involved in science and research.  We are swimming in data and information (yay!), there are exciting software tools available for researchers, librarians, and lay people alike, and the possibilities for discovery seem endless.  Of course, all of this change can be a bit daunting.  How do you handle the data deluge? What software is likely to be around for a while? How do you manage your time effectively in the face of so much technology?

Growing Pains

Just like Kirk Cameron’s choice of hair style, academics and their librarians are going through some growing pains. From www.1051jackfm.com

Like many other groups, academic libraries are undergoing some growing pains in the face of the information age. This may be attributed to drastic budget cuts, rising costs for journal subscriptions, and the diminishing role that physical collections play due to the increasing digitization of information.  Researchers are quite content to sit at their laptops and download PDFs from their favorite journals rather than wander the stacks of their local library; they would rather scour the internet with Google for obscure references than ask their friendly subject librarian for help in the hunt.

Despite the challenges above, I firmly believe that this is such an exciting time to be working at the interface of libraries, science, and technology.  Many librarians agree with me, including those at UCLA.  Lisa Federer and Jen Weintraub recently put on a great panel at the UCLA library focused on data curation.  I was invited to participate and agreed, which turned out to be an excellent decision.

The panel was called “Data Curation in Action”, and featured four panelists: Chris Johanson, UCLA professor of classics and digital humanities; Tamar Kremer-Sadlik, director of research at the UCLA Center for Everyday Lives of Families (CELF); Paul Conner, the digital laboratory director of CELF; and myself, representing some mix of science researchers and librarians.

Without droning on about how great the panel was, and how interesting the questions from the audience were, and how wonderful my discussions were with attendees after the panel, I wanted to mention the major thing that I took away: there is so much diverse data being generated by so many different kinds of projects and researchers.  Did I mention that this is an exciting time in the world of information?

Take Tamar and Paul: their project involves following families every day for hours on end, recording video, documenting interactions and locations of family members, taking digital photographs, conducting interviews, and measuring cortisol levels (an indicator for stress).  You should read that sentence again, because that is an enormous diversity of data types, not to mention the volume. Interviews and video are transcribed, quantitative observations are recorded in databases, and there is an intense coding system for labeling images, videos, and audio files.

Now for Chris, who has the ability to say “I am a professor of classics” at dinner parties (I’m jealous).  Chris doesn’t sit around reading old texts and talking about marble statues. Instead, he is trying to reconstruct “ephemeral activities in the ancient world”, such as attending a funeral or going to the market. He does this using a complex combination of Google Earth, digitized ancient maps, pictures, historical records, and data from excavations of ancient civilizations.  He stole the show at the panel when he demonstrated how researchers are beginning to create virtual worlds in which a visitor can wander around the landscape, just like in a modern-day 3D video game.

This is really just a blog post about how much I love my job. I can’t imagine anything more interesting than trying to solve problems and provide assistance for researchers such as Tamar, Paul and Chris.

In case you are not one of the 35 million who have watched it, OK Go has a wonderful video about getting through the tough times associated with the dawning information age (at least that’s my rather nerdy interpretation of this song):

 

Trailblazers in Demography

Last week I had the great pleasure of visiting Rostock, Germany.  If your geography lessons were a long time ago, you are probably wondering “where’s Rostock?” I sure did… Rostock is located very close to the Baltic Sea, in northeast Germany.  It’s a lovely little town with bumpy streets, lots of sausage, and great public transportation.  I was there, however, to visit the prestigious Max Planck Institute for Demographic Research (MPIDR).

Demography is the study of populations, especially their birth rates, death rates, and growth rates.  For humans, this data might be used for, say, calculating premiums for life insurance.  For other organisms, these types of data are useful for studying population declines, increases, and changes.  Such areas of study are especially important for endangered populations, invasive species, and commercially important plants and animals.

baby rhino

Sharing demography data saves adorable endangered species. From Flickr by haiwan42

I was invited to MPIDR because there is a group of scientists interested in creating a repository for non-human demography data.  Luckily, they aren’t starting from scratch.  They have a few existing collections of disparate data sets, some more refined and public-facing than others; their vision is to merge these datasets and create a useful, integrated database chock full of demographic data.  Although the group has significant challenges ahead (metadata standards, security, data governance policies, long term sustainability), their enthusiasm for the project will go a long way towards making it a reality.

The reason I am blogging about this meeting is that, for me, the group’s goals represent something much bigger than a demography database.  In the past two years, I have been exposed to a remarkable range of attitudes towards data sharing (check out blog posts about it here, here, here, and here).  Many of the scientists with whom I spoke needed convincing to share their datasets.  But even in this short period of time that I have been involved in issues surrounding data, I have seen a shift towards the other end of the range.  The Rostock group is one great example of scientists who are getting it.

More and more scientists are joining the open data movement, and a few of them are even working to convert others to the cause.  The group that met in Rostock could put their heads down, continue to work on their separate projects, and perhaps share data occasionally with a select few vetted colleagues whom they trust and know well.  But they are choosing instead to venture into the wilderness of scientific data sharing.  Let them be an inspiration to data hoarders everywhere.

It is our intention that the DCXL project will result in an add-in and web application that will facilitate all of the good things the Rostock group is trying to promote in the demography community.  Demographers use Microsoft Excel, in combination with Microsoft Access, to organize and manage their large datasets.  Perhaps in the future our open-source add-in and web application will be linked up with the demography database; open source software, open data, and open minds make this possible.

QSE3, IGERT, OA and DCXL

A few months back I received an invite to visit the University of Florida in sunny Gainesville.  The invite was from organizers of an annual symposium for the Quantitative Spatial Ecology, Evolution and Environment (QSE3) Integrative Graduate Education and Research Traineeship (IGERT) program.  Phew! That was a lot of typing for the first two acronyms in my blog post’s title.  The third acronym  (OA) stands for Open Access, and the fourth acronym should be familiar.

I presented a session on data management and sharing for scientists, and afterward we had a round table discussion focused on OA.  There were about 25 graduate students affiliated with the QSE3 IGERT program, a few of their faculty advisors, and some guests (including myself) involved in the discussion.  In 90 minutes we covered the gamut of current publishing models, incentive structures for scientists, LaTeX advantages and disadvantages, and data sharing.  The discussion was both interesting and energetic in a way that I don’t often encounter with scientists who are “more established”.  Some of the themes that emerged from our discussion warrant a blog post.

First, we discussed how data sharing is an obvious scientific obligation in theory, but when it comes to their own data, most scientists get a bit more cagey.  This might be with good reason – many of the students in the discussion were still writing up their results in thesis form, never mind in journal-ready form.  Throwing your data out into the ether without restrictions might result in some speedy scientist scooping you while you are dotting i’s and crossing t’s in your thesis draft.  For grad students and scientists in general, embargo periods seem to be a good response to most of this apprehension. We agreed as a group, however, that such embargoes should be temporary and should be phased out over time as cultural norms shift.

The current publishing model needs to change, but there was disagreement about how this change should manifest. For instance, one (very computer-savvy) student who uses R, LaTeX and Sweave asked “Why do we need publishers? Why can’t we just put the formatted text and code online?”  This is an obvious solution for someone well-versed in the world of document preparation in the vein of LaTeX: you get fully formatted, high-quality publications simply by compiling documents. Many in attendance argued against this, however, because LaTeX use is not widespread, and most articles need a heavy amount of formatting before publication.  Of course, that is work the already overburdened scientist would have to do themselves if they published their own work, which is not likely to become the norm any time soon.

empty library

No journals means empty library shelves. Perhaps the newly freed up space could be used to store curmudgeonly professors resistant to change.

Let’s pretend that we have overhauled both scientists and the publishing system.  In this scenario, scientists use free open-source tools like LaTeX and Sweave to generate beautiful documents.  They document their workflows and create Python scripts that run on the command line for reproducible results.  Given this scenario, one of the students in the discussion asked “How do you decide what to read?” His argument was that the current journal system provides some structure for scientists to home in on interesting publications and determine their quality based (at least partly) on the journal in which the article appears.
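To make that hypothetical workflow a bit more concrete, here is a minimal sketch of the kind of command-line Python script described above. Everything in it is invented for illustration (the input file, the column name, the statistics computed); the point is simply that anyone with the data file and the script can re-run the identical analysis.

```python
#!/usr/bin/env python
"""Sketch of a reproducible command-line analysis script.

The input file, column name, and summary statistics are hypothetical;
the point is that anyone with the script and the data file can re-run
the exact same analysis from the command line.
"""
import argparse
import csv
import statistics


def summarize(csv_path, column):
    """Read one numeric column from a CSV file and return summary statistics."""
    with open(csv_path, newline="") as handle:
        values = [float(row[column]) for row in csv.DictReader(handle)]
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
    }


def main():
    parser = argparse.ArgumentParser(description="Summarize one column of a CSV file.")
    parser.add_argument("csv_path", help="path to the input data file")
    parser.add_argument("column", help="name of the numeric column to summarize")
    args = parser.parse_args()

    for name, value in summarize(args.csv_path, args.column).items():
        print(f"{name}: {value}")


if __name__ == "__main__":
    main()
```

Running something like “python summarize.py measurements.csv body_mass” (placeholder file and column names) produces the same summary for anyone who has the data, which is the heart of the reproducibility argument.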

One of the other grad students had an interesting response to that question: use tags and keywords, create better search engines for academia, and provide capabilities for real-time peer review of articles, data, and publication quality.  In essence, his argument was that there’s no such thing as too much information – you just need a better filter.

One of the final questions of the discussion came from the notable scientist Craig Osenberg. It was in reference to the shift in science towards “big data”, including remote sensing, text mining, and observatory datasets. To paraphrase: Is anyone worrying about the small datasets? They are the most unique, the hardest to document, and arguably the most important.

My answer was a resounding YES! Enter the DCXL project.  We are focusing on providing support for the scientists who don’t have data managers, IT staff, and existing data repository accounts that facilitate data management and sharing.  One of the main goals of the DCXL project is to help “the little guy”.  These are often scientists working on relatively small datasets that can be contained in Excel files.

In summary, the very smart group of students at UF came to the same conclusion that many of us in the data world have: there needs to be a fundamental shift in the way that science is incentivized, and this is likely to take a while.  Of course, given how early these students are in their careers and how interested and intelligent they are, they are likely to be a part of that change.

Special thanks go to Emilio Bruna (@brunalab), who not only scored me the invite to UF but also hosted me for a lovely dinner during my visit (albeit NOT at the Tasty Budda…)

EZID: now even easier to manage identifiers

EZID, the easy long-term identifier service, just got a new look. EZID lets you create and maintain ARKs and DataCite Digital Object Identifiers (DOIs), and now it’s even easier to use:

  • One stop for EZID and all EZID information, including webinars, FAQs, and more.

    • A clean, bright new look.
    • No more hunting across two locations for the materials and information you need.
  • NEW Manage IDs functions:
    • View all identifiers created by the logged-in account;
    • View your 10 most recent interactions, based on the account rather than the session;
    • See the scope of your identifier work without any API programming.
  • NEW in the UI: Reserve an Identifier (see the API sketch after this list):
    • Create identifiers early in the research cycle;
    • Choose whether or not you want to make your identifier public; reserve it if you don’t;
    • On the Manage screen, view the identifier’s status (public, reserved, unavailable/just testing).
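Reserving an identifier can also be scripted for those who are comfortable going beyond the user interface. Below is a rough sketch against EZID’s REST API; the minting endpoint, the plain-text ANVL request format, and the “_status: reserved” element should all be double-checked against the current EZID API documentation, and the shoulder, target URL, account name, and password shown are placeholders.

```python
"""Rough sketch of reserving an identifier through the EZID REST API.

Assumptions to verify against the EZID API documentation: the minting
endpoint (POST /shoulder/<shoulder>), the plain-text ANVL request body,
and the "_status: reserved" metadata element.  The shoulder, target URL,
and credentials below are placeholders.
"""
import requests

EZID_BASE = "https://ezid.cdlib.org"
SHOULDER = "ark:/99999/fk4"  # placeholder test shoulder

# ANVL metadata: one "element: value" pair per line.
metadata = "\n".join([
    "_status: reserved",                        # keep the identifier non-public for now
    "_target: https://example.org/my-dataset",  # placeholder landing page
    "erc.who: Example Researcher",
    "erc.what: Example dataset",
    "erc.when: 2012",
])

response = requests.post(
    f"{EZID_BASE}/shoulder/{SHOULDER}",
    data=metadata.encode("utf-8"),
    headers={"Content-Type": "text/plain; charset=UTF-8"},
    auth=("my_account", "my_password"),         # placeholder credentials
)

# EZID replies with a plain-text line such as "success: ark:/99999/fk4..."
print(response.status_code, response.text.strip())
```

If the request succeeds, the response body names the newly reserved identifier, which can later be made public once you are ready to share it.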

In the coming months, we will also be introducing these EZID user interface enhancements:

  • Enhanced support for DataCite metadata in the UI;
  • Reporting support for institution-level clients.

So, stay tuned: EZID just gets better and better!