The Science of the DeepSea Challenge

Recently the film director and National Geographic explorer-in-residence James Cameron descended to the deepest spot on Earth: the Challenger Deep in the Mariana Trench.  He partnered with many sponsors, including National Geographic and Rolex, to make this amazing trip happen.  A lot of folks outside the scientific community might not realize this, but until this week, there had been only one successful descent to this trench by a human-occupied vehicle (that’s a submarine for you non-oceanographers).  You can read more about that 1960 exploration here and here.

I could go on about how astounding it is that we know more about the moon than the bottom of the ocean, or discuss the seemingly intolerable physical conditions found at those depths, most prominently the extremely high pressure.  However, what I immediately thought when reading the first few articles about this expedition was: where are the scientists?

Before Cameron, Swiss oceanographer Jacques Piccard and U.S. Navy lieutenant Don Walsh descended in the bathyscaphe Trieste to the virgin waters of the deep. From www.history.navy.mil/photos/sh-usn/usnsh-t/trste-b

After combing through many news stories, several National Geographic sites including the site for the expedition, and a few press releases, I discovered (to my relief) that there are plenty of scientists involved.  The team that’s working with Cameron includes scientists from Scripps Institution of Oceanography (the primary scientific partner and long-time collaborator with Cameron),  Jet Propulsion Lab, University of Hawaii, and University of Guam.

While I firmly believe that the success of this expedition will be a HUGE accomplishment for science in the United States, I wonder if we are sending the wrong message to aspiring scientists and youngsters in general.  We are celebrating the celebrity film director involved in the project instead of the huge team of well-educated, interesting, and devoted scientists who are also responsible for this spectacular feat (I found fewer than five names of scientists in my internet hunt).  Certainly Cameron deserves the bulk of the credit for enabling this descent, but I would like there to be a bit more emphasis on the scientists as well.

Better yet, how about emphasis on the science in general?  It’s too early for them to release any footage from the journey down; however, I’m interested in how the samples will be (or were) collected, how they will be stored, what analyses will be done, whether there are experiments planned, and how the resulting scientific advances will be made just as public as Cameron’s trip was.  The expedition site has plenty of information about the biology and geology of the trench, but it’s just background: there appears to be nothing about scientific methods or plans to ensure that this project will yield the maximum scientific advancement.

How does all of this relate to data and DCXL? I suppose this post falls in the category of "data is important."  The general public and many scientists hear the word "data" and glaze over.  Data isn’t inherently interesting as a concept (except to a sick few of us).  It needs just as much bolstering from big names and fancy websites as the deep sea does.  After all, isn’t data exactly what this entire trip is about?  Collecting data on the most remote corners of our planet? Making sure we document what we find so others can learn from it?


The Digital Dark Age, Part 3

A whole new interpretation of “Disk Drive”. From Flickr by litlnemo

In my last two blog posts (here and here), I have covered what the phrase “Digital Dark Age” means, and how it might affect your scientific data.  Here, I will provide some basic tips on how to avoid losing your data in the event that the Digital Dark Age becomes a reality.  A lot of what I cover in this blog post can also be found in the DataONE primer on data management or my previous DCXL post, Data Management 101.

1. Use non-proprietary software. Keep in mind that even the most ubiquitous software programs are ephemeral.  Keep archival copies of your data and other files in open file formats that can be read by multiple programs.  This means using formats like .txt for documents, .csv for spreadsheets, and .mp3 for audio.  Check out Virginia Tech’s complete list of recommended file formats for more information.
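
As a concrete illustration, here is a minimal sketch of exporting a spreadsheet to an open format.  It assumes Python with the pandas library installed (plus an Excel engine such as openpyxl), and the file name is a placeholder:

    # A minimal sketch: export the first sheet of an Excel workbook to CSV,
    # an open, plain-text format readable by many programs.
    # "results.xlsx" is a placeholder for your own file.
    import pandas as pd

    df = pd.read_excel("results.xlsx")     # read the proprietary format
    df.to_csv("results.csv", index=False)  # write an open-format copy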

2. Back up often and in multiple locations. This makes good sense for both short-term and long-term data preservation.  It’s a similar concept to storing your old-school photo negatives in a different location from your actual printed photos, just in case of fire/flood/apocalypse.  Read more about backing up in my blog post on the subject.

3. Transfer your backups to new media types every 3-4 years.  All of the data from my undergraduate research project are safely stored on zip disks, despite the fact that I have no zip drive to read these disks.  I am not alone: there are myriad stories about lost data stored on outdated media types.  Here is an example from the Council on Library and Information Resources:

10-20% of data from the Viking Mars mission that was recorded on magnetic tapes have significant errors, because, as Jet Propulsion Laboratory technicians now realize, the magnetic tape on which they are stored is “a disaster for an archival storage medium.”

To avoid becoming a statistic, create backups on the latest and greatest media type every few years.

4. Document, document, document.  Let’s imagine that you save your data in a non-proprietary format, make plenty of backups, and transfer to new media types frequently.  These activities are only useful if others can effectively understand and use the data that you archived.  Create quality metadata, take notes on your workflow, and generally document how you generate your data and what you do with it.
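
One lightweight way to start (a sketch only; the field names below are invented for illustration, not a formal metadata standard) is to keep a machine-readable "sidecar" file alongside each data file, assuming Python:

    # A minimal sketch: write a JSON sidecar file describing a data file.
    # The field names are illustrative; for real projects consider a formal
    # metadata standard such as EML, ISO 19115, or Dublin Core.
    import json

    metadata = {
        "title": "Clam shell chemistry measurements",  # hypothetical dataset
        "creator": "Your Name",
        "collected": "2011-06-15",
        "units": {"strontium": "mmol/mol Ca", "temperature": "degrees C"},
        "methods": "Samples rinsed in deionized water before analysis.",
    }

    with open("clam_chemistry.csv.meta.json", "w") as f:
        json.dump(metadata, f, indent=2)

Keeping the documentation right next to the data it describes makes it far more likely that the two will stay together through backups and media migrations.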

5. For the really important stuff, create hard copies. I encourage scientists to move as much of their process into digital formats as possible: eliminate paper lab notebooks and stop filling out "data sheets" with a pencil.  I discourage these paper-based methods because they imply future manual data entry, and because they make it much more difficult to keep data documentation with the data it describes.  This advice does not, however, preclude creating paper copies of your digital data.  In fact, printing off the most important information and storing those hard copies in an offsite location is generally a good idea.  Paper is capable of lasting much longer than digital formats, so you can be certain that your most important work will be available irrespective of the next amazing media type that emerges.

For more information, check out these great tips from the UK newspaper The Telegraph: How to stave off a digital ‘dark age’, and this piece from American Scientist: Avoiding the Digital Dark Age.

The Importance of Data Citation

Data citation. This is a phrase you are likely to hear a lot in the next few years.  The idea is simple enough: cite a data set, just like you would a journal article. Note: much of the content from this article was borrowed from Robert Cook of Oak Ridge National Laboratory and the DataCite website.

Why should you care about data citation? Here are a few reasons:

  1. Researchers can easily find data products associated with a publication.  If you’ve ever tried to re-use data from someone’s publication, you know how difficult it can be to find the raw data. Sometimes it involves using programs to generate data points from figures or tables in the PDF (I am not linking to any of these programs since I don’t think this is a good technique to employ). Other times you might have to contact the author directly to ask for their files.  In general, it can be very time consuming and frustrating, and often results in failure to obtain the data.  If data are cited properly, finding the data products associated with a publication becomes much easier.  As the data provider, you have the added advantage of not needing to respond to requests for your data; interested researchers can find the data easily because you cited them.
  2. You get credit for your data AND your publications. Often the time it takes to write a paper for publication is only a small fraction of the time it took to collect the underlying data.  If that’s the case, it would be great to get some credit for the actual data collection. You can put it on your CV in a section called "Data".  You can also include data in the set that weren’t used for the publication but might still be useful to others.
  3. Your data are discoverable via Web of Science. If you archive your data in a repository and get a digital object identifier, or DOI, for it (see Step 3 below), you can get citation metrics for your data AND your publications.
  4. You are enabling reproducibility of your results. Don’t be afraid of sharing your data and analyses: if others are convinced that what you did is valid and reproducible, your clout as a researcher is sure to rise.

Data citation involves three steps on the part of the researcher:

  1. Prepare your data so you can archive it (see the Best Practices tab for more information). This includes documenting your data, i.e. creating metadata, and preparing your data for long-term storage.
  2. Put your data somewhere. Ideally this would be in a long-term stable archive or data center (there’s a list of repositories available on the DataCite website), but it can also be on your departmental website, your personal website, or as supplemental material on a journal’s website.
  3. Tell people how to cite and use your data.  You can provide an example reference that includes typical information like your name, the year of the data set, the name of the data set, and where it is located (see the example below).  If you put your data in a repository, you can get a digital object identifier (available through services such as CDL’s EZID project), which can provide a way for others to find your data well into the future.
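
For instance, a reference following the general creator/year/title/publisher/identifier pattern used by DataCite might look like the following; the author, title, and DOI here are invented for illustration:

    Carlson, J. (2011): Clam Shell Chemistry of the Santa Barbara Basin.
    Scripps Institution of Oceanography. doi:10.0000/EXAMPLE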

The concept of data citation is right in line with the DCXL project’s goals.  One of the potential features for the add-in is to enable links to CDL’s EZID service for DOI generation. Another is to prompt the user to create good metadata, which is critical for making data citable.

This is a repost of the DCXL Blog “Love at First Cite”, first posted on November 4, 2011.

The Digital Dark Age, Part 2

Earlier this week I blogged about the concept of a Digital Dark Age.  This is a phrase that some folks use to describe a future scenario in which we are not able to read historical digital documents and multimedia because they have been rendered obsolete or were otherwise poorly archived.  But what does this mean for scientific data?

Consider that Charles Darwin’s notebooks were recently scanned and made available online.  This was possible because they were properly stored and archived, in a long-lasting format (in this case, on paper).  Imagine if he had taken pictures of his finch beaks with a camera and saved the digital images in obsolete formats.  Or ponder a scenario where he had used proprietary software to create his famous Tree of Life sketch.  Would we be able to unlock those digital formats today?  Probably not.  We might have lost those important pieces of scientific history forever.   Although it seems like software programs such as Microsoft Excel and MATLAB will be around forever, people probably said similar things about the programs Lotus 1-2-3 and iWeb.

“Darwin with Finches” by Diana Sudyka, from Flickr by Karen E James

It is a common misconception that things that are posted on the internet will be around “forever”.  While that might be true of embarrassing celebrity photos, it is much less likely to be true for things like scientific data.  This is especially the case if data are kept on a personal/lab website or archived as supplemental material, rather than being archived in a public repository (See Santos, Blake and States 2005 for more information).  Consider the fact that 10% of data published as supplemental material in the six top-cited journals was not available a mere five years later (Evangelou, Trikalinos, and Ioannidis, 2005).

Natalie Ceeney, chief executive of the National Archives, summed it up best in this quote from The Guardian’s 2007 piece on preventing a Digital Dark Age: “Digital information is inherently far more ephemeral than paper.”

My next post and final DDA installment will provide tips on how to avoid losing your data to the dark side.

The Digital Dark Age, Part 1

"This will be known as the Digital Dark Age."  The first time I heard this statement was at the Internet Archive, during the PDA 2012 Meeting (read my blog post about it here).  What did it mean?  What is a Digital Dark Age? Read on.

While serving in Vietnam, my father wrote letters to my grandparents about his life fighting a war in a foreign country.  One of his letters was sent to arrive in time for my grandfather’s birthday, and it contained a lovely poem that articulated my father’s warm feelings about his childhood, his parents, and his upbringing.  My grandparents kept the poem framed in a prominent spot in their home.  When I visited them as a child, I would read the poem written in my young dad’s handwriting, stare at the yellowed paper, and think about how far that poem had to travel to relay its greetings to my grandparents.  It was special: for its history, the people involved, and the fact that these people were intimately connected to me.

Now fast forward to 2012.  Imagine modern-day soldiers all over the world, emailing, making satellite phone calls, and chatting with their families via video conferencing.  When compared to snail mail, these modern communication methods are likely a much preferred way of staying in touch for those families.  But how likely is it that future grandchildren will be able to listen to those conversations, read those emails, or watch those video calls?  The answer: extremely unlikely.

These two scenarios sum up the concept of a Digital Dark Age: compared to 40 years ago, we are doing a terrible job of ensuring that future generations will be able to read our letters, look at our pictures, or use our scientific data.

You mean future generations won't be able to listen to my mix tapes?! From Flickr by newrambler

The Digital Dark Age “refers to a possible future situation where it will be difficult or impossible to read historical digital documents and multimedia, because they have been stored in an obsolete and obscure digital format.”  The phrase “Dark Age” is a reference to The Dark Ages, a period in history around the beginning of the Middle Ages characterized by a scarcity of historical and other written records at least for some areas of Europe, rendering it obscure to historians.  Sounds scary, no?

How can we remedy this situation? What are people doing about it? Most importantly, what does this mean for scientific advancement? Check out my next post to find out.

Fun Uses for Excel

"Excel can do WHAT?" Image from Friday (the movie), from newsone.com

It’s Friday! Better still, it’s Friday afternoon!  To honor all of the hard work we’ve done this week, let’s have some fun with Excel.  Check out these interesting uses for Excel that have nothing to do with your data:

Want to see some silly spreadsheet movies? Here ya go.

Excel Hero: Download .xls files that create nifty optical illusions.  Here’s one of them.

From PCWorld, Fun Uses for Excel, including a Web radio player that plays inside your worksheet (click to download the zip file and then select a station), or simulating dice rolls in case of a lack-of-dice emergency during Yahtzee.

Here are the results of a Google Image search for "Excel art":

excel art

Mona Lisa never looked so smart.  Want to know more? Check out the YouTube video tutorial or read Creating Art with Microsoft Excel from the blog Digital Inspiration.

Data Publication: An Introduction

The concept of data publication is rather simple in theory: rather than relying on journal articles alone for scholarly communication, let’s publish data sets as "first-class citizens" (hat tip to the DataCite group).  Data sets have inherent value that makes them standalone scholarly objects: they are more likely to be discovered by researchers in other domains and working on other questions if they are not associated with a specific journal and all of the baggage that entails.

Consider this example (taken from personal experience).  If you are a biologist interested in studying clam population connectivity, how likely are you to find the (extremely relevant) data related to clam shell chemistry that are associated with paleo-oceanography journals?  It took me several months of research for my graduate work before I discovered them.  If those datasets had been published in a repository, however, I would have located them much more quickly with a few well-chosen keywords and a quick web search.

Who would be against this idea, you ask?  It turns out data publication is similar to data management: no one is against the concept per se, but plenty of people balk at all of the work, angst, and effort involved in making it a reality.  There is also considerable debate about how we should proceed to make data publication the norm in scientific communication.  In fact, there is debate about whether we should even call it "data publication".

A few months back, Mark Parsons of the National Snow and Ice Data Center and Peter Fox of Rensselaer Polytechnic Institute wrote a paper titled "Is data publication the right metaphor?", with plans to publish it in the Data Science Journal. Before publication, however, they opened the paper up for comments on the web. This move sparked a lively debate among folks in the information, data, and libraries community, which I will leave you to explore on the Parsons blog, the Open Citations and Semantic Publishing blog post about this, and Bryan Lawrence’s comments on his wiki.

The basic argument is that the word "publication" insinuates that we are beholden to the current broken system of journal publication; the word itself carries too much baggage.  The opposing argument is that bureaucrats, funders, and institutions are familiar with the word "publication", and that familiarity will help ensure the success of data publication goals, regardless of whether we break the mold in the process.

Do you have thoughts on the subject? Email us, comment on this post below, or comment on the Parsons and Fox paper.

Wouldn't it be great if data were as easy to find, read, and store as books? "Faculty Wives Book Fair" courtesy of San Joaquin Valley Library System, from Calisphere

Why You Should Floss

No, I won’t be discussing proper oral hygiene. What I mean by "flossing" is actually "backing up your data".  Why the floss analogy? Here are the similarities between flossing and backing up your data:

  1. It’s undisputed that it’s important
  2. Most people don’t do it as often as they should
  3. You lie (to yourself, or your dentist) about how often you do it

Oral (and data) hygiene can be fun! From Calisphere, courtesy of UC Berkeley Bancroft Library

So think about backing up similarly to the way you think about flossing: you probably aren’t doing it enough.  In this post, I will provide some general guidance about backing up your data; as always, the advice will vary greatly depending on the types of data you are generating, how often they change, and what computational resources are available to you.

First, create multiple copies in multiple locations.  The old rule of thumb is original, near, far.  The first copy is your working copy of data; the second copy is kept near your original (this is most likely an external hard drive or thumb drive); the third is kept far from your original (off site, such as at home or on a server outside of your office building).  This is the important part: all three of these copies should be up-to-date.  Which brings me to my second point.
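
For the "near" copy, even a tiny script run regularly goes a long way.  Here is a minimal sketch, assuming Python; the source path and external-drive path are placeholders for your own setup:

    # A minimal sketch: copy a working data file to a "near" backup
    # (e.g., an external hard drive) with a date stamp in the file name.
    # Both paths below are placeholders.
    import shutil
    from datetime import date
    from pathlib import Path

    source = Path("data/field_samples.csv")              # working copy
    backup_dir = Path("/Volumes/ExternalDrive/backups")  # hypothetical drive

    backup_dir.mkdir(parents=True, exist_ok=True)
    dest = backup_dir / f"{date.today():%Y-%m-%d}_{source.name}"
    shutil.copy2(source, dest)  # copy2 preserves file timestamps
    print(f"Backed up {source} to {dest}")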

Second, back up your data more often.  I have had many conversations with scientists over the last few months, and I always ask, "How do you back up your data?"  Answers range, but most of them scare me silly.  For instance, there was a fifth-year graduate student who had all of her data on a six-year-old laptop, and only backed up once a month.  I get heart palpitations just typing that sentence.  Other folks have said things like "I use my external drive to back things up once every couple of months", or, worst-case scenario, "I know I should, but I just don’t back up".  It is strongly recommended that you back up every day. It’s a pain, right? There are two very easy ways to back up every day, and neither requires purchasing any hardware or software: (1) keep a copy on Dropbox, or (2) email yourself the data file as an attachment.  Note: these suggestions are not likely to work for large data sets.

Third, find out what resources are available to you. Institutions are becoming aware of the importance of good backup and data storage systems, which means there might be ways for you to back up your data regularly with minimal effort.  Check with your department or campus IT folks and ask about server space and automated backup service. If server space and/or backing up isn’t available, consider joining forces with other scientists to purchase servers for backing up (this is an option for professors more often than graduate students).

Finally, ensure that your backup plan is working.  This is especially important if others are in charge of data backup.  If your lab group has automated backup to a common computer, check to be sure your data are there, in full, and readable.  Ensure that the backup is actually occurring as regularly as you think it is.  More generally, you should be sure that if your laptop dies, or your office is flooded, or your home is burgled, you will be able to recover your data in full.
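
One simple way to spot-check a backup (a sketch, assuming Python; both file paths are placeholders) is to compare checksums between the original and the backed-up copy:

    # A minimal sketch: verify a backup by comparing SHA-256 checksums.
    # If the digests differ, the backup is incomplete or corrupted.
    # Both paths are placeholders.
    import hashlib
    from pathlib import Path

    def sha256(path: Path) -> str:
        """Return the SHA-256 hex digest of a file, read in chunks."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.hexdigest()

    original = Path("data/field_samples.csv")
    backup = Path("/Volumes/ExternalDrive/backups/2012-03-30_field_samples.csv")

    if sha256(original) == sha256(backup):
        print("Backup verified: checksums match.")
    else:
        print("WARNING: backup does not match the original!")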

For more information on backing up, check out the DataONE education module "Protected back-ups".

Tweeting for Science

At the risk of veering off course from this blog’s typical topics, I am going to post about tweeting.  This topic is timely given my previous post about the lack of social media use in Ocean Sciences, the blog post that it spawned at Words in mOcean, and the Twitter hash tag #NewMarineTweep. A grad school friend recently asked me what I like about tweeting (ironically, this was asked using Facebook).  So instead of touting my thoughts on Twitter to my limited Facebook friends, I thought I would post here and face the consequences of avoiding DCXL almost completely this week on the blog.

First, there’s no need to reinvent the wheel.  Check out these resources about tweeting in science:

That being said, I will now pontificate on the value of Twitter for science, in handy numbered list form.

  1. It saves me time.  This might seem counter-intuitive, but it’s absolutely true.  If you are a head-in-the-sand kind of person, this point might not be for you. But I like to know what’s going on in science, science news, the world of science publishing, science funding, etc. etc.  That doesn’t even include regular news or local events.  The point here is that instead of checking websites, digging through RSS feeds, or having an overfull email inbox, I have filtered all of these things through HootSuite.  HootSuite is one of several free services for organizing your Twitter feeds; mine looks like a bunch of columns arranged by topic.  That way I can quickly and easily check on the latest info, in a single location. Here’s a screenshot of my HootSuite page, to give you an idea of the possibilities: click to open the PDF: HootSuite_Screenshot
  2. It is great for networking.  I’ve met quite a few folks via Twitter that I probably never would have encountered otherwise.  Some have become important colleagues, others have become friends, and all of them have helped me find resources, information, and insight.  I’ve been given academic opportunities based on these relationships and connections.  How does this happen? The Twittersphere is intimate and small enough that you can have meaningful interactions with folks.  Plus, there’s tweetups, where Twitter folks meet up at a designated physical location for in-person interaction and networking.
  3. It’s the best way to experience a conference, whether or not you are physically there. This is what spawned that previous post about Oceanography and the lack of social media use.  I was excited to experience my first Ocean Sciences meeting with all of the benefits of Twitter, only to be disappointed at the lack of participation.  In a few words, here’s how conference (or any event) tweeting works:
    1. A hash tag is declared. It’s something short and pithy, like #Oceans2012. How do you find out about the tag? Usually the organizing committee tells you, or failing that, you rely on your Twitter network to let you know.
    2. Everyone who tweets about a conference, interaction, talk, etc. uses the hash tag in their tweet.
    3. Hash tags are ephemeral, but they allow you to see exactly who’s talking about something, whether you follow them or not.  They are a great way to find people on Twitter that you might want to network with… I’m looking at you, @rejectedbanana @miriamGoldste.
    4. If you are not able to attend a conference, you can "follow along" on your computer and get real-time feeds of what’s happening.  I’ve followed several conferences like this: over the course of the day, I will check in on the feed a few times and see what’s happening. It’s the next best thing to being there.

I could continue expounding on the greatness of Twitter, but as I said before, others have done a better job than I could (see links above).  No, it’s not for everyone. But keep in mind that you can follow people, hash tags, etc. without ever actually tweeting. You can reap the benefits of everything I mentioned above, except for the networking.  Food for thought.

My friend from WHOI, who also attended the Ocean Sciences meeting, emailed me this comment later:

…I must say those "#tweetstars" were pretty smug about their tweeting, like they were sitting at the cool kids’ table during lunch or something…

I countered that it was more like those tweeting at OS were incredulous at the lack of tweets, but yes, we are definitely the cool kids.