Clearing Up the Cloud of Confusion

The Cloud.  You have probably been hearing this phrase thrown around quite a bit lately. It reminds me of something straight out of Orwell’s dystopian classic 1984, but in this case Big Brother might be your friend.  If you are like I was about six months ago, you might be saying “What exactly IS The Cloud anyway??”  Here’s a very brief introduction to The Cloud and how it relates to DCXL.

The Cloud is a metaphor for the internet in cloud computing.  (Don’t you hate it when definitions refer back to the thing they are defining??)  Now let’s try to define cloud computing. According to the ever-helpful Wikipedia, cloud computing is

…the delivery of computing as a service rather than a product, whereby shared resources, software, and information are provided to computers and other devices as a utility over the internet.

Let’s make a bullet list of descriptors of cloud computing that are specific to science:

  • internet-based
  • shared software and data (these are kept on “The Cloud”, i.e. the internet)
  • end-users are not involved in configuring the computing

Jonathan Strickland from HowStuffWorks.com said it best:

In a cloud computing system, there’s a significant workload shift. Local computers no longer have to do all the heavy lifting when it comes to running applications. The network of computers that make up the cloud handles them instead.

Hardware and software demands on the user’s side decrease. The only thing the user’s computer needs to be able to run is the cloud computing system’s interface software, which can be as simple as a Web browser, and the cloud’s network takes care of the rest.

If you have used servers to store your work or run applications remotely, cloud computing will already feel familiar.  If you use Gmail, Dropbox, Flickr, or Evernote, you are already taking advantage of cloud computing.  The options for cloud computing are growing quickly; Google’s Apps for Business and Microsoft’s Azure are two examples.  Apple recently introduced iCloud, which is geared towards individuals using the cloud for their pictures, movies, documents, etc.

sunset clouds

Gratuitous pretty cloud shot. This was taken on the RV Atlantis during a research cruise to the equatorial Pacific, hence the oceanographic equipment.

So how does this relate to DCXL? One potential feature of DCXL is integration with cloud computing.  This might take the form of storing your data in the cloud and accessing it via DCXL, or the DCXL add-in for Excel might end up being offered as a suite of cloud-based analytics.
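To make the “heavy lifting happens in the cloud, only a thin client runs locally” idea a bit more concrete, here is a minimal sketch in Python of what reading a dataset stored in the cloud can look like. The URL and file name are invented for illustration, and this is not a preview of how DCXL will actually work; the point is simply that the local machine needs nothing more than a small script (or a browser), while the data live elsewhere.

    # A minimal sketch of "data in the cloud, thin client locally".
    # The URL below is hypothetical; substitute the address of a CSV file
    # hosted on whichever cloud storage service you actually use.
    import pandas as pd

    DATA_URL = "https://example-cloud-storage.org/my-lab/seawater_temps.csv"

    # pandas can read directly from a URL, so nothing but this script
    # (and an internet connection) needs to live on the local machine.
    data = pd.read_csv(DATA_URL)

    print(data.head())       # quick look at the first few rows
    print(data.describe())   # summary statistics, computed locally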

Read more about The Cloud from people with far more expertise than I have.

Open Science: What the Fuss is About

I advocate for open science.  I love the word open and all of the things that this word implies for science.  In keeping with last week’s post exploring what “data curation” means, here I touch very briefly on what open science means and how it relates to the Excel add-in we are developing.  Let me admit up front that I’m certainly not the expert on this subject.  There is a great post at The Open Science Project’s blog; KQED (San Francisco’s public media outlet) recently ran an article that discusses the topic; and even the prestigious journal Nature (ironically, a for-profit publication) ran an article about the benefits of open research in chemistry.  A quick Internet search for “open science” will give you a wealth of resources for exploring this topic further.

So what’s the big deal with open science?  I argue that it harkens back to one of the most foundational pillars of science: reproducibility.  If no one else is able to recreate your results, then how are we to believe you?  The current system for scholarly communication relies on journal publications, which succinctly summarize the immense amount of work for a given project into a 5-20 page manuscript. The chances of recreating results from a journal article alone are effectively zero.  Over the course of scientific history all indications are that people have been relatively honest in their reporting of experimental results and observations (otherwise we wouldn’t have progressed this far).  But wouldn’t it be nice to know for sure that the science was good?

Iceberg

Journal articles are just the tip of the iceberg when it comes to scholarly communication. Photo by Felton, from Flickr, graceinhim

Here’s where open science is such a fabulous idea.  It suggests that rather than limiting our scholarly communication to the publication of a few journal articles every year, we should communicate on a daily basis.  The Internet makes this possible with very little effort, and it lays the foundation for rapid advancement of science.  The basic idea is that you expose as much of your thought process as possible to the public.  This might mean keeping an online lab notebook (via WordPress or OpenWetWare), publishing your code for scripted analyses (a toy example of what that can look like follows the list below), sharing your data and workflow online, or taking advantage of the many sites that facilitate open science like rOpenSci, FigShare, SlideShare, ecologicaldata.org, myExperiment, EcologicalWebsDatabase, GeoCommons, and many, many others.  By making as much of your work public as possible, you reap benefits like

  • public comments and suggestions on your work
  • increasing opportunities for collaboration
  • having a rebuttal for naysayers (“Go download my data and code if you don’t believe me”)
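As a concrete (and deliberately tiny) illustration of “publishing your code”, here is the kind of self-contained Python script you could post alongside a shared data file. The file name and column names are invented for this example; the point is that anyone who downloads your data and your script can regenerate your numbers and figures exactly.

    # A toy, self-contained analysis script of the kind you might publish
    # alongside your data. The file and column names are hypothetical.
    import pandas as pd
    import matplotlib.pyplot as plt

    # The shared data file; readers download it from wherever you archived it.
    data = pd.read_csv("clam_growth.csv")

    # Reproduce the reported summary statistic...
    mean_growth = data.groupby("site")["shell_length_mm"].mean()
    print(mean_growth)

    # ...and the figure from the paper.
    mean_growth.plot(kind="bar")
    plt.ylabel("Mean shell length (mm)")
    plt.tight_layout()
    plt.savefig("figure_1.png", dpi=300)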

Of course, open science isn’t solely about reproducibility.  It also ensures/enables

  1. Trustworthiness
  2. Sharing methods and workflows
  3. Sharing data
  4. Making science possible for anyone, regardless of financial resources
  5. Promoting a community that is working towards common goals

How does the Excel add-in promote open science? First, the add-in is intended to be open source.  That means the code will be available for developers to take and mold into something they might think is more useful for their purposes; that is, they can reuse the code.  The add-in will also facilitate data sharing, since it will streamline the process of preparing data and submitting it to archives.  Also, it will be free to download, maximizing its accessibility for Excel users.

Open science is a good thing.  Exposure might be a bit scary at first, but we all stand to benefit from shedding light on our work, our thought process, and our data.

raccoon with glowing eyes

It might feel a bit strange at first, but exposing our work to the light is a good thing. Flickr, by Eliya

Curation: It’s in the Project Title, But What Does It Mean?

I am a newcomer to the world of libraries and information science.  Being the new kid on the block is a familiar feeling for me: I have always been one to seek out new and interesting approaches to my questions, whether that meant learning mathematical modeling and genetic sequencing techniques to explore clam populations or digging into the ins and outs of Freeganism.  I like learning new things.  So when I began working with questions related to scientific data, data sharing, and data reuse, I was comfortable asking “What does curation mean?”

In case you are in the same boat as I was, this post seeks out a good description of “data curation”.  First, let’s start with curation in general.  The word originates from the Latin cura, meaning “care”.  Most people have heard the word in reference to museums (e.g. a museum curator).  These curators are charged with caring for museum collections; there are also art curators who focus on art collections.

The Cure Boys Don't Cry

The ultimate Curator? www.mostlyposters.com

Here is a description specific to scientific data curation, from the Data Conservancy:

Data curation is a means to collect, organize, validate, and preserve data so that scientists can find new ways to address the grand research challenges that face society.

This description definitely touches on what we are interested in facilitating with the Excel add-in: most notably the organization and preservation of data.  Here is another take, summarized from Wikipedia: data curation entails

  1. Collecting verifiable data
  2. Providing capabilities for data search and retrieval
  3. Ensuring integrity of collected data
  4. Ensuring semantic and ontological continuity (i.e. making sure the data are described in a consistent way)

This description makes data curation sound like something reserved for libraries and data centers to tackle, which isn’t necessarily the case.  I like the description laid out by the UK-based Digital Curation Centre because it touches on broader concepts related to curation.  For them, data curation comprises:

  1. Data management
  2. Adding value to data (perhaps this means adding good contextual metadata? there is a small sketch of that after this list)
  3. Data sharing for reuse
  4. Data preservation for later re-use
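If “adding value to data” sounds abstract, here is one minimal, hypothetical example of what it can mean in practice: writing a small machine-readable data dictionary alongside a data file, so the next user (including future you) knows what each column means and in what units it was measured. The file name, columns, and values below are invented for illustration.

    # A hypothetical, minimal "data dictionary" saved alongside a data file.
    # Everything below (file name, columns, units) is invented for illustration.
    import json

    metadata = {
        "dataset": "clam_growth.csv",
        "collected_by": "A. Researcher",
        "collection_dates": "2010-06-01 to 2010-08-31",
        "columns": {
            "site": {"description": "Sampling site code", "units": None},
            "shell_length_mm": {"description": "Shell length at collection", "units": "mm"},
            "water_temp_c": {"description": "Water temperature at collection", "units": "degrees Celsius"},
        },
    }

    # Write the data dictionary next to the data file it describes.
    with open("clam_growth_metadata.json", "w") as f:
        json.dump(metadata, f, indent=2)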

The DCC’s description coincides nicely with the goals of the DCXL project: we are interested in helping scientists with each of the four points above.  Really, data curation means managing, describing, and preserving your data so that others might reuse it. So in summary: Go Forth and Curate!  (I couldn’t resist this Flickr result for the search “curation”):

"Go Forth and Curate!" cake

Curate Cake from Flickr by dolescum

Excel Tips and Tricks

I spent much of last week talking to scientists about Excel.  One of the questions I asked was “What drives you crazy about Excel?”  This definitely falls in the opening-a-can-of-worms category, but people were surprisingly clear and helpful in their comments.

Some of the complaints surfaced in multiple conversations, and I had to wonder whether the smart folks over at Microsoft Research had heard them before.  I also figured that if MSR had, they might have already addressed these complaints in Excel’s existing functionality.  So I did a little bit of poking around the Excel menus, and I was pleasantly surprised to find that Excel actually does a lot of the things people would like it to… here are a few neat tricks I discovered:

  1. Automatically saving as .xls: Microsoft introduced the .xlsx format in 2007; it is based on XML (eXtensible Markup Language; read more on Wikipedia). The .xlsx format came up quite a bit while talking with scientists, especially those who work with collaborators who have problems opening or using it.  A common wish: “I wish I could set Excel to automatically save as .xls instead of .xlsx.”  Good news! This is possible and very easy: just change your Excel preferences.
    1. On a Mac: In Excel, go to “Preferences…” in the “Excel” dropdown menu. In the “Sharing and Privacy” row, select “Compatibility”.  The second section of this menu has a dropdown menu where you can designate what format you would like your Excel files to be saved as by default.
    2. On a PC: in Excel, click on the “Office Button”.  Select the “Excel Options” button at the bottom of the menu that appears.  Select “Save” from the bar on the left. Under “Save workbooks” near the top, you can choose the default format from the dropdown menu.
  2. Prompting for workbook metadata: If you are terrible at documenting your data (like many of us are), you should think about turning on a feature in Excel that prompts you for workbook-level metadata (a scripted alternative is sketched after this list).
    1. On a Mac: In Excel, go to “Preferences…” in the “Excel” dropdown menu. In the “Authoring” row, select “General”.  There is a box you can check for “Prompt for Workbook Properties”.  Check that box. When you save your workbook, Excel will prompt you to fill in the workbook’s properties (title, author, keywords, and so on).
    2. On a PC: I’m still looking… but I’m sure it’s there somewhere.
  3. Regional settings: This is not technically an Excel setting, but it can influence how Excel handles your data, spell-checking, number formats, currency formats, etc.
    1. On a Mac: go to “System Preferences”. In the “Personal” row select “Language and Text”. You can change your language settings, how the dates, times, and numbers should be displayed, and your region. (I was surprised to learn that my laptop thought I was still in Edmonton, Alberta!)
    2. On a PC: Go to the Start menu, select the Control Panel, and open the “Regional and Language Options” Menu.
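For those who prefer scripts to menus, the same kind of workbook-level metadata can also be written programmatically. Here is a minimal sketch using openpyxl, a third-party Python library for reading and writing .xlsx files; it is not part of Excel or of the DCXL add-in, and the property values below are placeholders.

    # Setting workbook-level metadata from a script with openpyxl.
    # openpyxl is a third-party Python library, separate from Excel and DCXL;
    # the property values below are placeholders.
    from openpyxl import Workbook

    wb = Workbook()
    ws = wb.active
    ws.append(["site", "shell_length_mm"])   # a couple of example rows
    ws.append(["A", 42.3])

    # Workbook-level properties, similar to the fields in Excel's properties dialog
    wb.properties.title = "Clam growth measurements, summer 2010"
    wb.properties.creator = "A. Researcher"
    wb.properties.description = "Shell lengths by site; see the metadata file for methods."
    wb.properties.keywords = "clams; growth; intertidal"

    wb.save("clam_growth.xlsx")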

It’s not altogether surprising that scientists aren’t always aware of Excel’s potential.  Along with our less-than-ideal education about data management (see previous posts here and here), we are often left to trudge through Excel alone, learning tips and tricks along the way from classmates and colleagues.  More tips and tricks to come. Stay tuned!

No Scientist Left Behind: The Case for Data Education (Part II)

If you read my last post, you know that I am an advocate for better data education for scientists at all levels.  I focused on the need for better education of scientists-in-training (i.e. graduate students and postdocs), but this might actually be a bit late.  All science graduate students took classes with a mandatory laboratory component as undergraduates.  These same students are also likely to have taken courses in high school that involved laboratory experiments, data analysis, and rudimentary metadata generation (data documentation).

If our scientific training begins in high school and continues through our undergraduate courses, why should our data education and training not develop alongside more traditional skills, such as wearing safety goggles and close-toed shoes in the laboratory?

I know what you are thinking (because I’ve thought it myself): what about lab notebooks? Almost all science courses require students to keep some form of lab notebook; even in grammar school, students are often instructed to document their exercises on a lab worksheet.  The lab notebook is a mainstay in science education: document your work so others can verify or reproduce it.  I wholeheartedly agree with the importance of the notebook; however, I contend that this crucial piece of data education (the only data education most students outside of graduate school receive) does not always make an appearance in graduate school.  This is because the lab notebook has slowly fallen away with the digitization of scientific data and analysis.  Why keep an accurate, up-to-date notebook when all of your notes, analyses, data, and visualizations are on a computer?

Lab bench with lab notebook

What happens to all of these handwritten notes? From flickr.com proteinbiochemist, CC0

So what does this mean for data education of scientists? I believe that data education should start earlier.  I would suggest MUCH earlier (grammar school!), but certainly it should be part of the curriculum in undergraduate science courses.  Like the Logic course I was forced to take as part of my undergraduate liberal arts education, data management is at the core of so much of our daily lives that its applicability reaches far beyond what you might expect.

Next week I promise to return to the Excel add-in and where we are with the project. Stay tuned!

No Scientist Left Behind: The Case for Data Education (Part I)

I always assumed my advisor had a great data and computer file management system.  But when I asked her for a particular piece of information, it took her a week to find it and get it into a useable form for me.

This paraphrased statement was made by a Geology PhD student while I interviewed him about his Excel use and data practices.  It speaks to a larger problem: there is a lack of basic data management knowledge among scientists at all levels.  More often than some advisors would like to admit, they are just as confused, sloppy, or disorganized as their students when it comes to data management; they just hide it better.

Students and postdocs often assume data management and organization are things to figure out for themselves: they should develop their own organizational system, experiment with spreadsheet layouts, and perhaps occasionally contribute to organizing the data and files for their lab group.  This notion of trial-by-fire likely comes from the top down: the advisors of these scientists-in-training had to figure out how to manage their data without assistance, and therefore their students should too.  The consequence is that data management skills are sub-par all the way up the academic chain.

There is a case to be made for standing on the shoulders of giants, or not reinventing the wheel, or some other version of that colloquialism.  There is so much to learn in graduate school (and so many opportunities for failure): software, hardware, field techniques, laboratory techniques, standards and protocols, and instrument operation, not to mention coursework and comprehensive exams.  I argue that a young scientist’s time is better spent figuring out these myriad components of being a researcher rather than fumbling through how to manage and organize their data.

Just like parents give their children allowance to prepare them for handling money in the future, advisors should instruct their “children” in good data management by first learning a bit about good data practices themselves, and then training up their advisees by giving them data to organize, showing them the system used by the lab, and encouraging them to experiment with software or hardware that might mesh well with their own style of data organization.  Final words:

To advisors: Train up your students in good data management!

To scientists-in-training: Ask your advisor about their data management and organization schemes.  Determine whether their systems might work for you, and how you can improve on them to fit your needs.