Software Carpentry and Data Management

About a year ago, I started hearing about Software Carpentry. I wasn’t sure exactly what it was, but I envisioned tech-types showing up at your house with routers, hard drives, and wireless mice to repair whatever software was damaged by careless fumblings. Of course, this is completely wrong. I now know that it is actually an ambitious and awesome project that was recently adopted by Mozilla and got a boost from the Alfred P. Sloan Foundation (how is it that they always seem to be involved in the interesting stuff?).

From their website:

Software Carpentry helps researchers be more productive by teaching them basic computing skills. We run boot camps at dozens of sites around the world, and also provide open access material online for self-paced instruction.

SWC got its start in the 1990s, when its founder, Greg Wilson, realized that many of the scientists trying to use supercomputers didn’t actually know how to build and troubleshoot their code, much less use things like version control. More specifically, most had never been shown how to do four basic tasks that are fundamentally important to any science involving computation (which is increasingly all science):

  • growing a program from 10 to 100 to 1,000 lines without creating a mess
  • automating repetitive tasks
  • basic quality assurance (the short sketch after this list gives a taste of these last two)
  • managing and sharing data and code
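
To make that concrete, here is a tiny, hypothetical Python sketch – not actual Software Carpentry lesson material – of what automating a repetitive task and adding basic quality assurance can look like. The file names and two-column CSV layout are made up for illustration.

    from pathlib import Path


    def mean(values):
        """Average a list of numbers – the kind of small, testable unit SWC encourages."""
        return sum(values) / len(values)


    def test_mean():
        # Basic quality assurance: a sanity check that can be run automatically
        # (boot camps typically introduce a proper test runner for this).
        assert mean([1, 2, 3]) == 2


    def summarize_all(data_dir="data"):
        # Automating a repetitive task: instead of opening each file by hand,
        # loop over every CSV in a folder (assumes two columns and a header row).
        for csv_file in sorted(Path(data_dir).glob("*.csv")):
            rows = csv_file.read_text().splitlines()[1:]
            values = [float(row.split(",")[1]) for row in rows]
            print(csv_file.name, mean(values))


    if __name__ == "__main__":
        test_mean()
        summarize_all()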

Software Carpentry is too cool for a reference to the Carpenters. From marshallmatlock.com (click for more).

Greg started teaching these topics (and others) at Los Alamos National Laboratory in 1998. After a bit of stop and start, he left a faculty position at the University of Toronto in April 2010 to devote himself to it full-time. Fast forward to January 2012, and Software Carpentry became the first project of what is now the Mozilla Science Lab, supported by funding from the Alfred P. Sloan Foundation.

This new incarnation of Software Carpentry has focused on offering intensive, two-day workshops aimed at grad students and postdocs. These workshops (which they call “boot camps”) are usually small – typically 40 learners – with low student-teacher ratios, ensuring that those in attendance get the attention and help they need.

Other than Greg himself, whose role is increasingly to train new trainers, Software Carpentry is a volunteer organization. More than 50 people are currently qualified to instruct, and the number is growing steadily. The basic framework for a boot camp is this:

  1. Someone decides to host a Software Carpentry workshop for a particular group (e.g., a flock of macroecologists, or a herd of new graduate students at a particular university). This can be fellow researchers, department chairs, librarians, advisors — you name it.
  2. Organizers round up funds to pay for travel expenses for the instructors and any other anticipated workshop expenses.
  3. Software Carpentry matches them with instructors according to the needs of their group; together, they and the organizers choose dates and open up enrollment.
  4. The boot camp itself runs eight hours a day for two consecutive days (though there are occasionally variations). Learning is hands-on: people work on their own laptops, and see how to use the tools listed below to solve realistic problems.

That’s it! They have a great webpage on how to run a boot camp, which includes checklists and thorough instructions on how to ensure your boot camp is a success. About 2,300 people have gone through a SWC boot camp, and the organization hopes to double that number by mid-2014.

The core curriculum for the two-day boot camp usually covers the Unix shell, version control, and basic programming in Python, with testing and databases as time allows.

Software Carpentry also offers over a hundred short video lessons online, all of which are CC-BY licensed (go to the SWC webpage for a hyperlinked list):

  • Version Control
  • The Shell
  • Python
  • Testing
  • Sets and Dictionaries
  • Regular Expressions
  • Databases
  • Using Access
  • Data
  • Object-Oriented Programming
  • Program Design
  • Make
  • Systems Programming
  • Spreadsheets
  • Matrix Programming
  • MATLAB
  • Multimedia Programming
  • Software Engineering

Why focus on grad students and postdocs? Because professors are often too busy with teaching, committees, and proposal writing to improve their software skills, while undergrads have less incentive to learn since they don’t yet have a longer-term project in mind. SWC is also playing a long game: today’s grad students are tomorrow’s professors, and the day after that, they will be the ones setting parameters for funding programs, editing journals, and shaping science in other ways. Teaching them these skills now is one way – maybe the only way – to make computational competence a “normal” part of scientific practice.

So why am I blogging about this? When Greg started thinking about training researchers to understand the basics of good computing practice and coding, he couldn’t have predicted the huge explosion in the availability of data, the proliferation of software for analyzing those datasets, or how little training researchers receive for this new era. I believe part of the reason funders stepped up to support the mission of Software Carpentry is that now, more than ever, researchers need these skills to do science successfully. Reproducibility and accountability are in greater demand, and data sharing mandates will likely morph into workflow sharing mandates. Ensuring reproducibility in analysis is next to impossible without the skills Software Carpentry’s volunteers teach.

My secret motive for talking about SWC? I want UC librarians to start organizing boot camps for groups of researchers on their campuses!

Software for Reproducibility Part 2: The Tools


From the Calisphere collection “American war posters from the Second World War”, Contributed by UC Berkeley Bancroft Library (Click on image for more information)

Last week I wrote about the workshop I attended (Workshop on Software Infrastructure for Reproducibility in Science), held in Brooklyn at the new Center for Urban Science and Progress, NYU. This workshop was made possible by the Alfred P. Sloan Foundation and brought together heavy-hitters from the reproducibility world who work on software for workflows. I provided some broad-strokes overviews last week; this week, I’ve created a list of some of the tools we saw during the workshop. Note: the level of detail for tools is consistent with my level of fatigue during their presentation!

Sumatra

  • Presenter: Andrew Davison
  • Short description: Sumatra is a Python package plus a set of interfaces (command-line, web, and graphical) for automatically recording the provenance of simulations and analyses – an “electronic lab notebook” kind of idea (a rough sketch of what such a record contains follows the quote below).
  • Sumatra needs to interact with version control systems, such as Subversion, Git, Mercurial, or Bazaar.
  • Future plans: Integrate with R, Fortran, C/C++, Ruby.

From the Sumatra website:

The solution we propose is to develop a core library, implemented as a Python package, sumatra, and then to develop a series of interfaces that build on top of this: a command-line interface, a web interface, a graphical interface. Each of these interfaces will enable: (1) launching simulations/analyses with automated recording of provenance information; and (2) managing a computational project: browsing, viewing, deleting simulations/analyses.
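
To give a feel for what gets recorded, here is a conceptual Python sketch – not Sumatra’s actual API – of the kind of provenance record an electronic-lab-notebook tool captures for each run. The script name and parameters are hypothetical, and the version lookup assumes the project lives in a Git repository.

    import json
    import platform
    import subprocess
    import sys
    from datetime import datetime, timezone


    def capture_provenance(script, parameters):
        """Assemble the metadata needed to reconstruct how a result was produced."""
        try:
            # Which exact version of the code ran (assumes git is available).
            code_version = subprocess.run(
                ["git", "rev-parse", "HEAD"],
                capture_output=True, text=True, check=True,
            ).stdout.strip()
        except (OSError, subprocess.CalledProcessError):
            code_version = "unknown"
        return {
            "script": script,
            "parameters": parameters,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "python_version": sys.version,
            "platform": platform.platform(),
            "code_version": code_version,
        }


    if __name__ == "__main__":
        # Hypothetical analysis script and parameters, for illustration only.
        print(json.dumps(capture_provenance("analysis.py", {"threshold": 0.05}), indent=2))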

Taverna

  • Presenter: Carole Goble
  • Short description: Taverna is an “execution environment,” i.e., a way to design and execute formal workflows.
  • Other details: Written in Java. Consists of the Taverna Engine (the workhorse), plus the Taverna Workbench (a desktop client) and Taverna Server (a remote workflow execution server), which sit on top of the Engine.
  • Links up with myExperiment, a publication environment for sharing workflows. It lets you put together the workflow, description, files, data, documents, etc., and upload and share them with others.

From the Taverna website:

Taverna is an open source and domain-independent Workflow Management System – a suite of tools used to design and execute scientific workflows and aid in silico experimentation.

IPython Notebook

  • Presenter: Brian Granger
  • IPython Notebook is a web-based computing environment and open document format for telling stories about code and data. It is an offshoot of the IPython project, which focuses on interactive computing.
  • Very code-centric: text (in Markdown), LaTeX, images, and code can all be interwoven. Linked up with GitHub.

Galaxy

  • Presenter: James Taylor
  • Galaxy is an open source, free web service integrating a wealth of tools, resources, etc. to simplify researcher workflows in genomics.
  • Their biggest challenge is archiving: they currently hold 1 PB of user data because of the abundance of sequence data involved.

From their website:

Galaxy is an open, web-based platform for data intensive biomedical research. Whether on the free public server or your own instance, you can perform, reproduce, and share complete analyses.

Madagascar

  • Presenter: Sergey Fomel
  • Geophysics-focused project management system.

From the Madagascar website:

Madagascar is an open-source software package for multidimensional data analysis and reproducible computational experiments. Its mission is to provide

  • a convenient and powerful environment
  • a convenient technology transfer tool

for researchers working with digital image and data processing in geophysics and related fields. Technology developed using the Madagascar project management system is transferred in the form of recorded processing histories, which become “computational recipes” to be verified, exchanged, and modified by users of the system.

VisTrails

  • Presenter: David Koop
  • VisTrails is an open-source scientific workflow and provenance management system that provides support for simulations, data exploration and visualization.
  • It was built with ideas of provenance and transparency in mind.  Essentially the research “trail” is followed as users generate and test a hypothesis.
  • Focuses on change-based provenance: it keeps track of all changes via a version tree.
  • There is also execution provenance – start and end times, where the run took place, and so on; this is instrument-specific metadata.
  • Crowdlabs.org is an associated social website for sharing workflows and provenance.

RCloud

  • No formal website; GitHub site only
  • Presenter: Carlos Scheidegger
  • RCloud was developed for programmers at AT&T Labs who use R. They were running interactive R sessions for exploratory data analysis, and their idea was: what if every R session were transparent and automatically versioned?
  • I know this is a bit of a thin description, but this is an ongoing, active project. I’m anxious to see where it goes next.

ReproZip 

  • No formal website, but see a 15 minute demo here
  • Presenter: Fernando Chirigati
  • The premise: few computational experiments are reproducible. To get closer to reproducibility, we need a record of the data description, the experiment specs, and a description of the environment (in addition to the code and data themselves).
  • ReproZip automatically and systematically captures the required provenance of existing experiments by “packing” them. How it works (a rough sketch of the packing idea follows this list):
    • ReproZip executes the experiment while SystemTap captures the provenance; each node of the process carries its own details (a provenance tree).
    • The necessary components are identified.
    • A specification of the workflow is generated and fed into a workflow system.
  • When you are ready to examine, explore, or verify the experiment, the package is extracted and ReproZip unpacks the experiment and its workflows.
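
The sketch below shows the general “packing” idea in plain Python – this is not ReproZip’s implementation, just an illustration of bundling code, data, and an environment description into one archive that someone else can later unpack and inspect. The file names are hypothetical.

    import json
    import platform
    import sys
    import zipfile
    from pathlib import Path


    def pack_experiment(archive_name, files):
        """Write the listed files plus a minimal environment description into one archive."""
        environment = {
            "python_version": sys.version,
            "platform": platform.platform(),
        }
        with zipfile.ZipFile(archive_name, "w") as archive:
            # Record the environment alongside the experiment itself.
            archive.writestr("environment.json", json.dumps(environment, indent=2))
            for path in files:
                archive.write(path, arcname=Path(path).name)


    if __name__ == "__main__":
        # Hypothetical experiment files, for illustration only.
        pack_experiment("experiment.zip", ["run_analysis.py", "input_data.csv"])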

Open Science Framework

  • Presenters: Brian Nosek & Jeff Spies
  • In brief, the OSF is a nifty way to track projects, work with collaborators, and link together tools from your workflow.
  • You simply go to the website and start a project (for free). Then add contributors and components to the project. Voila!
  • A neat feature – you can fork projects.
  • Provides a log of activities (i.e., version control) and access control (i.e., who can see your stuff).
  • As long as two software systems work with OSF, they can work together – OSF allows the APIs to “talk”.

From the OSF website:

The Open Science Framework (OSF) is part network of research materials, part version control system, and part collaboration software. The purpose of the software is to support the scientist’s workflow and help increase the alignment between scientific values and scientific practices.

RunMyCode

  • Presenter: Victoria Stodden
  • The idea behind this service is that researchers create “companion websites” associated with their publications. These websites allow others to implement the methods used in the paper.

From the RunMyCode website:

RunMyCode is a novel cloud-based platform that enables scientists to openly share the code and data that underlie their research publications. This service is based on the innovative concept of a companion website associated with a scientific publication. The code is run on a computer cloud server and the results are immediately displayed to the user.

Dexy

  • Presenter: Ana Nelson
  • Tagline: make | docs | sexy (what more can I say?)
  • Dexy is an open-source tool that lets those writing documents with underpinning code combine the two.
  • You can mix and match coding languages and documentation formats (e.g., Python, C, Markdown to HTML, data from APIs, WordPress, etc.).

From their website:

Dexy lets you continue to use your favorite documentation tools, while getting more out of them than ever, and being able to combine them in new and powerful ways. With Dexy you can bring your scripting and testing skills into play in your documentation, bringing project automation and integration to new levels.

DuraSpace 

  • Presenter: Jonathan Markow
  • DuraSpace’s mission is the access, discovery, and preservation of scholarly digital data. Their primary stakeholders and audience are libraries. As part of this role, DuraSpace is a steward of open source projects (Fedora, DSpace, VIVO).
  • Their main service is DuraCloud: an online storage and preservation service that does things like repair corrupt files, move online copies offsite, distribute content geographically, and scale up or down as needed.

Dataverse 

  • Presenter: Merce Crosas
  • Dataverse is a virtual archive for “research studies”. These research studies are containers for data, documentation, and code that are needed for reproducibility.

From their website:

A repository for research data that takes care of long term preservation and good archival practices, while researchers can share, keep control of and get recognition for their data.

Software for Reproducibility


The ultimate replication machine: DNA. Sculpture at Lawrence Berkeley School of Science, Berkeley CA. From Flickr by D.H. Parks.

Last week I thought a lot about one of the foundational tenets of science: reproducibility. I attended the Workshop on Software Infrastructure for Reproducibility in Science, held in Brooklyn at the new Center for Urban Science and Progress, NYU. This workshop was made possible by the Alfred P. Sloan Foundation and brought together heavy-hitters from the reproducibility world who work on software for workflows.

New to workflows? Read more about workflows in old blog posts on the topic, here and here. Basically, a workflow is a formalization of “process metadata”.  Process metadata is information about the process used to get to your final figures, tables, and other representations of your results. Think of it as a precise description of the scientific procedures you follow.
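
As a toy illustration (the step names and file names below are made up), process metadata can be as simple as an explicit, machine-readable list of the steps that turn raw data into a figure:

    # Each step of the analysis is declared explicitly, so the path from raw
    # data to final figure is recorded rather than remembered.
    workflow = [
        {"step": 1, "action": "download",  "input": "sensor_archive_url", "output": "raw.csv"},
        {"step": 2, "action": "clean",     "input": "raw.csv",            "output": "clean.csv"},
        {"step": 3, "action": "summarize", "input": "clean.csv",          "output": "summary.csv"},
        {"step": 4, "action": "plot",      "input": "summary.csv",        "output": "figure1.png"},
    ]

    for step in workflow:
        # A real workflow system would execute each step; here we just report it.
        print(f"Step {step['step']}: {step['action']} ({step['input']} -> {step['output']})")

A full workflow system like Taverna or VisTrails adds execution, provenance capture, and sharing on top of this basic idea.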

After sitting through demos and presentations on the different tools folks have created, my head was spinning, in a good way. A few of my takeaways are below. In my next Data Pub post, I will provide a list of the tools we discussed.

Takeaway #1: Reuse is different from reproducibility.

The end goal of documenting and archiving a workflow may be different for different people and systems. Reusing a workflow, for instance, is potentially much easier than exactly reproducing its results. Any researcher will tell you: exact reproducibility is virtually impossible. Of course, this differs a bit depending on discipline: anything involving a living thing (i.e., biology) is much more unpredictable, while engineering experiments are more likely to be spot-on when reproduced. The level of detail needed to reproduce results is likely to dwarf the information needed merely to reuse a workflow.

Takeaway #2: Think of reproducibility as archiving.

This was something Josh Greenberg said, and it struck a chord with me. It was said in the context of considering exactly how much stuff should be captured for reproducibility. Josh pointed out that there is a whole body of work out there addressing this very question: archival science.

Example: an archivist at a library gets boxes of stuff from a famous author who recently passed away. How does s/he decide what is important? What should be kept, and what should be thrown out? How should the items be arranged to ensure that they are useful? What metadata, context, or other information (like a finding aid) should be provided?

The situation with archiving workflows is similar: how much information is needed? What are the likely uses for the workflow? How much detail is too much? Too little? I like considering the issues around capturing the scientific process as similar to archival science scenarios – it makes the problem seem a bit more manageable.

Takeaway #3: High-quality APIs are critical for any tool developed.

We talked about MANY different tools. The one thing we could all agree on was that they should play nice with other tools. In the software world, this means having a nice, user-friendly Application Programming Interface (API) that basically tells two pieces of software how to talk to one another.
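
Here is a toy Python sketch of that idea – all of the names are hypothetical – in which one tool relies only on another tool’s documented interface, never on its internals:

    class ArchiveAPI:
        """The public interface one tool promises to the others."""

        def deposit(self, name, payload):
            # A real service would expose this over HTTP (e.g., a REST endpoint)
            # and return an identifier for the stored object.
            identifier = f"archive://{name}"
            print(f"stored {len(payload)} bytes as {identifier}")
            return identifier


    def publish_results(archive, results):
        """A different piece of software that relies only on the documented API."""
        return archive.deposit("results.csv", results)


    if __name__ == "__main__":
        print(publish_results(ArchiveAPI(), b"x,y\n1,2\n"))

As long as that deposit call stays stable, either side can change its internals without breaking the other – which is exactly what “playing nice” means here.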

Takeaway #4: We’ve got the tech-savvy researchers covered. Others? Not so much.

The software we discussed is very nifty. That said, many of these tools are geared towards researchers with some impressive tech chops. The tools focus on helping capture code-based work, and integrate with things like LaTeX, Git/GitHub, and the command line. Did I lose you there? You aren’t alone… many of the researchers I interact with are not familiar with these tools, and would therefore not be able to effectively use the software we discussed.

Takeaway #5: Closing the gap between the tools and the researchers that should use them is hard. But not impossible.

There are three basic approaches that we can take:

  1. Focus on better user experience design
  2. Emphasize researcher training via workshops, one-on-one help from experts, et cetera
  3. Force researchers to close the gap on their own. (i.e., Wo/man up).

The reality is that it’s likely to be some combination of these three. Those at the workshop recognized the need for better user interfaces, and some projects here at the CDL are focusing on extensive usability testing prior to release. Funders are beginning to see the value of funding new positions for “human bridges” to help sync up researcher skill sets with available tools. And finally, researchers are slowly recognizing the need to learn basic coding – note the massive uptake of R in the Ecology community as an example.