Over the last couple of months I have seen a lot of scientists’ Excel spreadsheets: approximately 150 of them, at last count. All of this important information will feed into the requirements that I send off to the developers for the DCXL add-in, but in the meantime I am collating some very interesting information about how folks use Excel, for better or for worse.
It’s the defining moment in the interview: the sheepish opening of the spreadsheet. Scientists are often embarrassed by their data management, and tend to make a lot of excuses about why their spreadsheets look the way that they do. What’s interesting is that the people who are most worried about their current system are often better off than the others. At the end of the day, what matters is whether your system works for you and your needs.
There is one theme that keeps surfacing over the course of these interviews: scientists who were “handed down” spreadsheets for a project are often better off than those who start from scratch. This makes sense: someone has likely refined the spreadsheet’s design over time, after many hours of frustration. What can the average scientist learn from this? Take the time at the beginning of your project to design your spreadsheets well.
Benjamin Franklin wants you to think about your spreadsheet design before you start collecting your data. From Flickr by elycefeliz
Too often we start collecting data and throw things in a spreadsheet to be sorted out later. The file morphs over time, sometimes for the better and sometimes for the worse. We end up copying and pasting columns and tables to create more visually appealing layouts. We insert links to other cells, other spreadsheets, and other files to minimize potential copy errors. We move files and need to update links. Etcetera. Rather than follow this winding path, which often ends in monotonous reorganization of data sets, think very carefully at the beginning of your project about the most effective way to design your spreadsheets.
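One concrete design habit worth adopting up front is a “tidy” layout: one observation per row, one variable per column, and an explicit code for missing values instead of blank cells. A minimal sketch using Python’s csv module (the column names and values are invented for illustration):

```python
import csv

# One row per observation, one column per variable; "NA" marks missing values.
header = ["date", "site", "species", "count"]
rows = [
    ["2011-06-01", "A", "M. edulis", "14"],
    ["2011-06-01", "B", "M. edulis", "NA"],
]

with open("observations.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(rows)
```

A layout like this sorts, filters, and imports cleanly into analysis tools later, which is exactly the payoff of designing before collecting.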
It’s often easier said than done, but as Benjamin Franklin once said, an ounce of prevention is worth a pound of cure.
It’s American Thanksgiving this week, and I’m feeling a bit nostalgic and wistful. In honor of this occasion, I thought I would list a few of my favorite data-related websites, blogs, and services. Cue Maria and the von Trapps! Or John Coltrane, if you prefer.
- FlowingData. This is an incredible website full of interesting ideas about how to present data effectively. As data sets grow and science becomes increasingly online, we will all need to think hard about how best to display our results. This is a great place to stimulate your creativity.
Maria's favorite thing. From Flickr by outdoorPDK.
- EZID. This is a very nifty service offered by my own home base, California Digital Library. You can register an archived data set with EZID and get a DOI (digital object identifier) that uniquely references your data set. That means you can list it on your CV, cite it in your publications, and share it with collaborators. See my recent post about data citation for more ideas and info.
- DataCite. This is a great organization of librarians and other folks interested in promoting data citation. They are far ahead of the pack in terms of thinking about data citation and promotion of data publication.
- rOpenSci. This collaboration is all about creating open source R packages to facilitate science. I have a love/hate relationship with R, but that doesn’t change my belief that it’s incredibly powerful and useful.
- DataONE. What a fabulous idea: linking together existing repositories so you can search them ALL with one interface! Public release is set for January 2012. I can barely wait (and yes, I know that’s incredibly nerdy).
- ORCID. This is a non-profit that’s working on solving the problem of one name, many versions. It stands for “Open Researcher and Contributor ID”. The idea is for each researcher to get a unique personal identifier that is linked to all versions of their identity (for example Carly A. Strasser, CA Strasser, C.A. Strasser, Carly Ann Strasser, C. Strasser). It will help ensure researchers get credit for all of their work and increase discovery/funding/collaboration efficiency. This will be great especially for those interested in changing their names to something more catchy (e.g. Carly von Trapp). It’s still in early phases but stay tuned.
All of the items listed above are very relevant to the DCXL project: early indications are that the add-in will have capabilities to create data citations, link to DataONE, incorporate identifiers created by ORCID and EZID, and generate R-ready spreadsheets.
I can’t resist including a link to whiskers on kittens: if you haven’t seen the top 10 cat videos on the internet, take a break and check these out.
A few weeks back I posted some nifty Excel functions that you might not have been aware of. I’m certainly no expert in Excel, but I have heard lots of scientists tell me “I wish Excel would… [fill in the blank]”, when in fact Excel already has that capability. So here are a few more Excel tips to make your data organization a little bit easier.
- Track Changes. This feature in Microsoft Word is familiar to all academics: it’s the best way to give and get feedback from co-authors on manuscripts and grant proposals. Did you know you can track changes in Excel, too? On a Mac, go to the Tools menu and select “Track Changes…”, then “Highlight Changes”. This opens a new window where you can check “Track changes while editing”. Note that this requires that you share your workbook. There are a few other options available in this menu as well. After you’ve made your changes and saved them, you can select the “List changes on a new sheet” button, which produces a new tab with a history of the changes made. In my case, I changed my blank cells to “NA”, and the history sheet listed each of those edits. Pretty neat! Especially if you have a lot of collaborators working on your project. If you are a Windows user, the process looks a bit different but accomplishes the same thing: select the “Review” tab, and in the “Changes” section of the ribbon you can select “Track Changes”, enable sharing, and protect the workbook. For more from the Microsoft folks, check out their help file.
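Incidentally, the blank-to-“NA” edit itself is easy to script if your data live in a CSV file rather than a workbook. A minimal sketch in Python (the file names are hypothetical):

```python
import csv

def blanks_to_na(in_path, out_path):
    """Copy a CSV file, replacing empty cells with the explicit code "NA"."""
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        for row in reader:
            # A cell that is empty or whitespace-only becomes "NA".
            writer.writerow([cell if cell.strip() else "NA" for cell in row])
```

Scripting the change also leaves you a record of exactly what was done, which is the same bookkeeping benefit Track Changes provides inside Excel.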
- Prevent changes. Excel calls this “protecting”. You can protect particular sheets in a workbook, or the entire workbook. Do this by again going to the “Tools” menu, selecting “Protect…”, and choosing either sheet or workbook. You can even add a password. Why would you do this? Perhaps you want to prevent changes from being made to the “raw data” tab. Protect it! Do you have an undergrad lab technician entering values into your spreadsheet? Protect the sheets you don’t want them to change. If you are a Windows user, you have many more options for specifying protections, such as protecting formatting, selecting certain cells for protection, and preventing insertions and deletions. For more information, here’s Excel’s help file on the topic.
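The same “protect the raw data” instinct applies outside Excel, too: once a raw data file is final, you can strip its write permission so nothing modifies it in place. A small sketch using Python’s standard library (the path is hypothetical, and this is an operating-system-level analogue, not Excel’s own protection feature):

```python
import os
import stat

def protect_raw_data(path):
    """Remove write permission so the raw data file can't be edited in place."""
    mode = os.stat(path).st_mode
    # Clear the owner, group, and other write bits, leaving read access intact.
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
```

Anyone who genuinely needs to edit the file must consciously restore write permission first, which is a useful speed bump against accidental changes.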
- Create a “Chart” tab. Excel allows you to create a tab specifically for placing your graphs. This keeps you from muddying your lovely tabular data with inserted figures. To use this feature, first select the data you would like to use to create your figure. Then go to the “Insert” menu and select “Sheet” and “Chart Sheet”. This creates a new tab with Excel’s default graph for the data you selected; you can make changes to the type of graph and its format in the same way as if you had inserted it into the spreadsheet. I haven’t figured out how to do this on a PC, but I know it’s possible- I’ve seen it! A little help, Windows users?
In the previous post, I covered some basic steps for preventing data hangover; however, all of my suggestions pertained to actions you can take prior to data collection. What if the project is already well underway? How does one conduct retroactive data management? This is a much more complicated and tricky question, and the answer will vary with things like
- type(s) of data and their format(s)
- number of collaborators/contributors
- current organization and management system
- ideal organization and management system
- resources (personnel and financial)
Although listed last, resources are going to be the most accurate predictor of how much retroactive data management you can accomplish. Personnel resources are better interpreted as time: data management in general is quite time consuming, and it is certainly difficult to correct mistakes after a project is well underway. Time is therefore likely to be the biggest hurdle to retroactive data management.
So where should you start? My best suggestion is to pretend that you can travel back in time and create a data management plan. Carefully consider things like: What is the ideal organization scheme for file structure? Where should you archive the data? What metadata standards should you use? Build your data dictionary and describe what codes, units, ranges etc. apply to the data.
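The data dictionary part of that plan can live right alongside the data as code. A minimal sketch in Python (the column names, codes, units, and ranges are invented for illustration):

```python
# A data dictionary: for each column, record units plus either the set of
# allowed codes or the valid numeric range.
DATA_DICTIONARY = {
    "water_temp": {"units": "degrees C", "range": (-2.0, 40.0)},
    "site_code":  {"units": None, "codes": {"A", "B", "C"}},
}

def validate(column, value):
    """Check one value against the data dictionary; return True if it passes."""
    rules = DATA_DICTIONARY[column]
    if "codes" in rules:
        return value in rules["codes"]
    low, high = rules["range"]
    return low <= value <= high
```

Writing the dictionary down this explicitly makes the retroactive cleanup mechanical: run every existing value through it and flag whatever fails.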
Grab Michael J. Fox and get those datasets cleaned up! Photo from Flickr by F1RSTBORN
Use this data management plan to get a little bit closer to your ideal data situation. What steps can you take today? Tomorrow? Come up with a timeline and plan for working on your data management. If there are multiple people involved in the project, assign specific tasks related to re-organization and standardization.
If your personnel and/or time are limited, there are a few options:
- Consider asking for funds to facilitate data management. This is especially important if you are attempting to make sense of a large project spanning multiple years with an abundance of collaborators and data types. There are sometimes calls for proposals specifically for these types of “small-scale” data projects.
- Hire someone. This person might be from a local information school (although why limit it to local?). They may also be a graduate student in need of funding or a technician you can hire for a couple of months.
- Break down the gigantic task into small, manageable chunks. Often, data management is tedious and repetitive. I loved having projects like this on my desk while I was a graduate student- they were great for breaking up the more mentally challenging tasks, and provided a mental break while allowing me to still be productive. Consider replacing your YouTube time with data management time.
Although I don’t envy you the task of retroactive data management, you can be sure that the sense of satisfaction you will receive from well-managed data will make all of your efforts worthwhile.
What happened? This is a common question among co-eds, Mardi Gras celebrants, and many scientists. I’m not referring to alcohol-soaked evenings at conferences, but instead to the question that inevitably results from poor data management. What happened? How did I get here?
It might seem a bit trite to pose poor data management as a parallel to poor decisions about alcohol consumption, but actually the results are quite similar: you have a hard time remembering what happened, you regret the decisions that led to your predicament, and it’s rather embarrassing.
Is your data management record any better than notes on cocktail napkins? From Flickr by sillydog
How can you avoid it? The solution for data hangover is easy: good data management. This is often easier said than done, especially if the project and data collection are all well underway. Let’s start with the easier scenario, which is that the project has not yet started. Preventing data hangover is easiest at this point since good planning will avoid most of the problematic issues. A few general guidelines (for more ideas see Data Management 101 on this website and the Resources tab on my website):
- Choose a repository or data center for your long-term data storage. Some repositories allow you to store versions of your data and provide for access among collaborators during the project, in which case it is good to establish the relationship with the data manager before data are collected. Don’t know where to put your data? Check out DataCite’s list of repositories, check with your institution’s librarians (many institutions have repositories for their researchers, for example MIT’s DSpace), or ask senior colleagues in your field.
- Establish standards, e.g. a “data dictionary” that sets the parameters and terms used for your data. It is wise to start by determining the metadata standards used by the data center where you plan to archive your data (see #1).
- Assign roles and responsibilities to specific individuals for each component of the data life cycle: documentation of data collection, quality control measures, backup and storage, and long-term archiving.
- Create a budget to cover costs associated with good data documentation, including hardware, software and personnel. This eliminates many of the most common excuses for data hangover.
After a recent presentation at UC Davis, someone in the audience asked: what do you do if the project is already underway? How do you try to fix the problems? That is a bit trickier to answer, and good answers will vary widely. Solutions depend on things like the types of data that exist, how they are currently stored, and the degree to which they have already been documented. I began to draft potential steps for solving this conundrum, but it grew into a rather lengthy blog post of its own, so I will save that answer for next time. Stay tuned.
The DCXL add-in that this project will produce will reduce your chances for data hangover, although exactly which features will do so is not yet determined. Most likely, the add-in will assist with #2 above: establishing standards. It may also assist with #1, choosing a repository, with the potential for linking to a particular repository via the add-in. If you have other ideas about the way the add-in might be the cure for data hangover, please let me know.
Data citation. This is a phrase you are likely to hear a lot in the next few years. The idea is simple enough: cite a data set, just like you would a journal article. Note: much of the content from this article was borrowed from Robert Cook of Oak Ridge National Laboratory and the DataCite website.
Why should you care about data citation? Here are a few reasons:
- Researchers can easily find data products associated with a publication. If you’ve ever tried to re-use data from someone’s publication, you know how difficult it can be to find the raw data. Sometimes it involves using programs to generate data points from figures or tables in the PDF (I am not linking to any of these programs since I don’t think this is a good technique to employ). Other times you might have to contact the author directly to ask for their files. In general, it can be very time consuming and frustrating, and often results in failure to obtain the data. If data are cited properly, finding data products associated with a publication would be much easier. As the data provider, you have the added advantage of not needing to respond to any requests for your data; interested researchers can find it easily because you cited it.
- You get credit for your data AND your publications. Often the time it takes to write a paper for publication is only a small fraction of the time it took to collect the underlying data. If that’s the case, it would be great to get some credit for the actual data collection. You can put it on your CV in a section called “Data”. You can also include data in the set that wasn’t used for publication but might still be usable by others.
- Your data are discoverable via Web of Science. If you archive your data in a repository and get a digital object identifier, or DOI, for it (see Step 3 below), you can get citation metrics for your data AND your publications.
- You are enabling reproducibility of your results. Don’t be afraid of sharing your data and analyses: if others can confirm that what you did is valid and reproducible, your clout as a researcher is sure to grow.
Data citation involves three steps on the part of the researcher:
- Prepare your data so you can archive it (see the Best Practices tab for more information). This includes documenting your data, i.e. creating metadata, and preparing your data for long-term storage.
- Put your data somewhere. Ideally this would be in a long-term stable archive or data center (there’s a list of repositories available on the DataCite website), but it can also be on your departmental website, your personal website, or as supplemental material on a journal’s website.
- Tell people how to cite and use your data. You can provide an example reference that includes typical information like your name, the year of the data set, the name of the data set, and where it is located. If you put your data in a repository, you can get a digital object identifier (available through services such as CDL’s EZID project), which can provide a way for others to find your data well into the future.
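The example reference in step 3 is easy to generate consistently once you settle on a layout. A sketch in Python (the field order here is one common pattern for data citations, not a formal standard, and the sample values are invented):

```python
def format_data_citation(authors, year, title, repository, doi=None):
    """Assemble a simple data citation string from its typical components."""
    cite = f"{authors} ({year}). {title} [Data set]. {repository}."
    if doi:
        # A DOI resolves at doi.org, giving readers a stable link to the data.
        cite += f" https://doi.org/{doi}"
    return cite
```

For example, `format_data_citation("Strasser, C.", 2011, "Example data", "Example Repository", "10.1234/abcd")` yields a one-line reference you can paste straight into a “how to cite” note alongside the archived data.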
The concept of data citation is right in line with the DCXL project’s goals. One of the potential features for the add-in is to enable links to CDL’s EZID for DOI generation. Another is to prompt the user for creating good metadata, which is critical for making data citable.
Love at First Cite Tat: nerdy or awesome?