Git/GitHub: A Primer for Researchers

The Beastie Boys knew what’s up: Git it together.

I might be what a guy named Everett Rogers would call an “early adopter”. Rogers wrote a book back in 1962 called Diffusion of Innovations, in which he explains how and why technology spreads through cultures. The “adoption curve” from his book has been widely used to visualize the point at which a piece of technology or innovation reaches critical mass. It divides individuals into five categories depending on where in the curve they adopt a given piece of technology: innovators are the first, then early adopters, early majority, late majority, and finally laggards.

At the risk of vastly oversimplifying a complex topic, being an early adopter simply means that I am excited about new stuff that seems promising; in other words, I am confident that the “stuff” (GitHub, in this case) will catch on and be important in the future. Let me explain.

Let’s start with version control.

Before you can understand the power of GitHub for science, you need to understand the concept of version control. As one definition puts it, “Version control is a system that records changes to a file or set of files over time so that you can recall specific versions later.” We all deal with version control issues. I would guess that anyone reading this has at least one file on their computer with “v2” in the title. Collaborating on a manuscript is a special kind of version control hell, especially if the co-authors disagree about which system to use (e.g., LaTeX versus Microsoft Word). And figuring out the differences between two versions of an Excel spreadsheet? Good luck to you. The Wikipedia entry on version control makes a statement that brings versioning into focus:

The need for a logical way to organize and control revisions has existed for almost as long as writing has existed, but revision control became much more important, and complicated, when the era of computing began.

Ah, yes. The era of collaborative research, scripting languages, and big data does make this issue a bit more important and complicated. Enter Git. Git is a free, open-source distributed version control system, originally created for Linux kernel development in 2005. There are other version control systems, most notably Apache Subversion (aka SVN) and Mercurial. However, I posit that the existence of GitHub is what makes Git particularly interesting for researchers.
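
To make “figuring out the differences between two versions” a bit more concrete, here is a minimal sketch in Python. It is not Git; it only uses the standard library’s difflib module to show the kind of line-by-line change record that a version control system keeps for you. The two text snippets and the file names methods_v1.txt and methods_v2.txt are made up for illustration.

```python
import difflib

# Two hypothetical versions of the same methods paragraph.
version_1 = """Samples were collected weekly.
Temperature was recorded in Fahrenheit.
Data were analyzed in Excel.
"""

version_2 = """Samples were collected weekly.
Temperature was recorded in Celsius.
Data were analyzed in Python.
Outliers were removed before analysis.
"""


def describe_changes(old, new):
    """Return a unified diff of two text versions, one string per line."""
    return list(
        difflib.unified_diff(
            old.splitlines(),
            new.splitlines(),
            fromfile="methods_v1.txt",  # illustrative label only
            tofile="methods_v2.txt",    # illustrative label only
            lineterm="",
        )
    )


changes = describe_changes(version_1, version_2)
print("\n".join(changes))
```

Lines starting with “-” exist only in the older version, and lines starting with “+” exist only in the newer one. A version control system stores this kind of information for every saved change, so nobody has to compare files by eye.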

So what is GitHub?

GitHub is a web-based hosting service for projects that use the Git revision control system. It’s free (with a few conditions) and has been quite successful since its launch in 2008. Historically, version control systems were developed for and by software developers. GitHub was created primarily as a way to develop software projects efficiently, but its reach has been growing in the last few years. Here’s why.

Note: I am not going into the details of how Git works, its structure, or how to incorporate Git into your daily workflow. That’s a topic best left to online courses and Software Carpentry Bootcamps.

What’s in it for researchers?

At this point it is worth bringing up a great paper by Karthik Ram titled “Git can facilitate greater reproducibility and increased transparency in science”, which came out in 2013 in the journal Source Code for Biology and Medicine. Ram goes into much more detail about the power of Git (and GitHub by extension) for researchers. I am borrowing heavily from his section on “Use cases for Git in science” for the four benefits of Git/GitHub below.

1. Lab notebooks make a comeback. The age-old practice of maintaining a lab notebook has been challenged by the digital age. It’s difficult to keep all of the files, software, programs, and methods well-documented in the best of circumstances, never mind when collaboration enters the picture. I see researchers struggling to keep track of their various threads of thought and work, and I remember going through similar struggles myself. Enter online lab notebooks. A recent piece about digital lab notebooks provides a nice overview of this topic. To really get a feel for the power of using GitHub as a lab notebook, see GitHubber and ecologist Carl Boettiger’s site. The gist is this: GitHub can serve as a home for all of the different threads of your project, including manuscripts, notes, datasets, and methods development.

2. Collaboration is easier. You and your colleagues can work on a manuscript together, write code collaboratively, and share resources without the potential for overwriting each other’s work. No more v23.docx or file names with initials appended. Instead, a co-author can submit changes and document them with “commit messages” (read about them on GitHub).

3. Feedback and review are easier. The GitHub issue tracker allows collaborators (potential or current), reviewers, and colleagues to ask questions, notify you of problems or errors, and suggest improvements or new ideas.

4. Increased transparency. Using a version control system means you and others are able to see decision points in your work and understand why the project proceeded in the way that it did. The super savvy GitHubber can make the entire history of a manuscript available, from the first datapoint collected to the final submitted version, all traceable on your site. This is my goal for my next manuscript.
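
The ideas of commit messages (benefit 2) and a traceable history (benefit 4) can be sketched with a toy model in Python. This is an illustration of the concept under simplifying assumptions, not how Git or GitHub is implemented: the Commit and History classes, the author names, and the manuscript text are all hypothetical, and everything lives in memory for a single file.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Commit:
    """One recorded snapshot: the content, why it changed, and who changed it."""
    content: str
    message: str
    author: str
    parent_id: Optional[str]

    @property
    def commit_id(self) -> str:
        # Hash the snapshot together with its parent id, so each id depends on
        # the history before it (loosely inspired by Git, heavily simplified).
        payload = "\n".join(
            [self.parent_id or "", self.author, self.message, self.content]
        )
        return hashlib.sha1(payload.encode("utf-8")).hexdigest()[:8]


class History:
    """A toy, in-memory history for a single file (hypothetical helper)."""

    def __init__(self) -> None:
        self.commits: List[Commit] = []

    def commit(self, content: str, message: str, author: str) -> str:
        """Record a new snapshot and return its id."""
        parent = self.commits[-1].commit_id if self.commits else None
        new_commit = Commit(content, message, author, parent)
        self.commits.append(new_commit)
        return new_commit.commit_id

    def log(self) -> List[str]:
        """Newest-first list of 'id author: message' lines."""
        return [
            f"{c.commit_id} {c.author}: {c.message}"
            for c in reversed(self.commits)
        ]

    def checkout(self, commit_id: str) -> str:
        """Recall the content exactly as it was at a given commit."""
        for c in self.commits:
            if c.commit_id == commit_id:
                return c.content
        raise KeyError(f"no commit with id {commit_id}")


history = History()
first = history.commit(
    "Draft abstract.", "Start manuscript", author="A. Author"
)
second = history.commit(
    "Revised abstract with results.",
    "Add key results to abstract",
    author="B. Coauthor",
)

print("\n".join(history.log()))  # who changed what, and why
print(history.checkout(first))   # the first draft is still recoverable
```

Each change carries a message explaining the decision behind it, and every earlier version stays recoverable, which is what replaces the pile of v23.docx files.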

Final thoughts

Git can be an invaluable tool for researchers. It does, however, have a fairly high activation energy. That is, if you aren’t familiar with version control systems, are scared of the command line, or are married to GUI-heavy proprietary programs like Microsoft Word, you will be hard-pressed to use Git effectively in the ways I outline above. That said, spending the time and energy to learn Git and GitHub can make your life so. much. easier. I advise graduate students to learn Git (along with other great open tools like LaTeX and Python) as early in their grad careers as possible. Although it doesn’t feel like it, grad school is the perfect time to learn these systems. Don’t be a laggard; be an early adopter.

References and other good reads