Thursday, December 13, 2007

The role of bioinformatics in Open Notebook Science

At one time in my life, I must have subscribed to "what's going on in the blogosphere" email updates from Genome Technology. Recently they were promoting a blog post about an "open notebook" bioinformatics project by Pedro Beltrao. I think it's great that Pedro is testing out Google Code as a repository for developing bioinformatics applications and providing a forum for discussion, code versioning, and code releases. However, what surprised me a little was the title of the Genome Technology post: "Yes, it's OK to share your results".

From that title, you'd think that a bioinformatics person sharing their code was a rare event. It's true that raw data sharing is still a rarity amongst experimenters; the no-insider-information, real-time posting of raw experimental data pre-publication (i.e. Open Notebook Science) is an even greater rarity. However, if having public code repositories is the bioinformatics version of Open Notebook Science (ONS), then ONS is hardly new to bioinformatics. A search for "bioinformatics" on Sourceforge (the largest open source software repository, which is similar to the Google Code repository used for Pedro's project) yields 126 results, with projects dating back to 2001. I personally have a bioinformatics project on Sourceforge dating back to 2003.

But the truth is that openness is not a rare event in computer science or bioinformatics. Although it is pretty common to read a publication where the authors don't provide their source code, such publications are generally looked down upon and are typically less cited (if you make me email you for your code and another person has the same code free and easily accessible, why should I email you?). Bioinformatics already benefits from the open atmosphere that pervades computer science. When I write a new bioinformatics algorithm, I almost always take advantage of the vast amount of publicly available tutorials and software (from C libraries and Perl modules to bioinformatics-specific code like BioPerl and Bioconductor). Almost all of the large collaborative bioinformatics projects, like BioPerl and GBrowse, provide live versioned repositories using CVS or Subversion, and these projects have been around for years.

So is a project repository such as Sourceforge or Google Code the bioinformatics equivalent of Open Notebook Science? Yes, I believe it is. Congratulations, computer programmers and bioinformaticians of the world: you already experience the value created by open sharing. In fact, when I switched over from a pure bioinformatics job to become a hybrid scientist who spends half of the time doing experiments, I was shocked at how few tutorials there are on the internet to teach people experimental biology. All of these attitudes of data hiding seemed odd to me, so when I bumped into Jean-Claude's ONS article, it was great to finally see someone willing to dispel the fear-of-being-scooped myth and overcome the organizational hurdle of annotating your work well enough that others can read and understand your code (i.e. experiments). Computer science removed those myths many years ago via the heroic efforts of now-famous names like Richard Stallman and Linus Torvalds. People in computer science don't fear being scooped; they typically praise it. How many variants of internet browsers derive their code from the Mozilla project? Computer science has also created standards and social norms for code annotation; poorly annotated computer code is very much looked down upon by true hackers. In the future, a poorly annotated open lab notebook from an experimental biologist will be viewed in a similar light.

Summary so far:
  1. Bioinformatics is already open
  2. Experimentalists can learn from bioinformatics (as bioinformatics learned from computer science) that:
    1. openness does not lead to widespread, uncontrolled idea theft; furthermore, idea theft (if properly acknowledged) is actually the highest form of praise; if hundreds of people are using your data before you've even finished your project, congratulations, you're doing one hell-of-a-job as a scientist
    2. annotation standards must become a part of the social atmosphere of the open notebook science community; we must praise notebooks written well enough that any scientist in a similar field could immediately understand, interpret, and replicate the experiment from the notebook.
I think that bioinformatics has already done a great service for ONS by setting an example for experimentalists to follow. Therefore, I feel we do a disservice to the future of ONS by promoting open bioinformatics projects as great new contributions. Yes, they are contributions, but they are hardly new. It's a little bit like patting America on the back for lowering HIV death rates for yet another year, when we should really be focused on whether it's possible to do the same for Africa, where the heart of the problem lies.

So what is the role of bioinformatics in the future of Open Notebook Science?
First, bioinformatics programmers need to continue doing what they've been striving for since bioinformatics began:
  1. develop your code as open source projects on one of the standard code repositories, or at least put a link to your downloadable source code on a public website
  2. provide a README file, installation instructions, and a few example data files so that people can get up and running easily
  3. if possible, develop your code to work on a wide variety of platforms
  4. ensure that your code is annotated well enough that other programmers can read it (preferably using one of the standard formats like perldoc with Perl or doxygen with C)
Beyond this, bioinformatics will have a large role to play in the future of ONS for experimentalists. As a hacker myself, I find it clear that the current breed of software for ONS is far from ideal. In an ideal world, the software (web interface, GUI or whatever) I'm running to log my ONS experimental work would check that the experiment I'm about to run is at least sensible (e.g. the software should warn me if the buffer in my protocol is not compatible with the reaction I'm trying to run, or if the annealing temperature I'm using for my PCR is too low for the primers in the reaction). When I run a digestion, I want to know the success rate of everyone else who has ever run a similar digestion. I want to know their success rate with different buffers, and their success rate given how old their restriction enzyme is and the batch it is from. I want the raw sequence data I enter into the ONS to be viewable in a trace viewer, with an interface that automatically BLASTs the sequence against the species I'm interested in to help me figure out what I've sequenced. I want to upload all of my sequences in a single zip archive and have the software organize them for me, rather than uploading my files one at a time. I want scientists to be able to leave comments in my notebook, and I want them to be able to receive emails when someone replies to their comments (the lack of this feature kills many blog-based discussions). Like a wiki, I want to have the entire history of every file accessible to all readers. Like a blog, I want to provide RSS feeds so that all of the project's collaborators receive automagic updates when new experiments are added. I want all of this to be easy and intuitive. And someone who understands biology has to write the code...
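
To make the PCR example above a little more concrete, here is a minimal Python sketch of what such a sanity check might look like. Everything in it is my own assumption for illustration: the function names are hypothetical, the melting temperature comes from the simple Wallace rule (2 °C per A/T, 4 °C per G/C, only a rough guide for short primers), and the 5 °C margin below the lowest primer Tm is a rule of thumb rather than a standard. Real notebook software would want a better thermodynamic model.

```python
def wallace_tm(primer: str) -> float:
    """Rough melting temperature (deg C) by the Wallace rule.

    Assumption: the primer is a short DNA oligo made only of A, C, G, T.
    """
    seq = primer.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError(f"not a plain DNA primer: {primer!r}")
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2.0 * at + 4.0 * gc


def annealing_warnings(primers, annealing_c, max_gap_c=5.0):
    """Return a list of warning strings (empty if the setup looks sensible).

    max_gap_c is a hypothetical threshold: how far below the lowest
    primer Tm the annealing temperature may sit before we complain.
    """
    warnings = []
    lowest_tm = min(wallace_tm(p) for p in primers)
    if annealing_c < lowest_tm - max_gap_c:
        warnings.append(
            f"annealing temperature {annealing_c:.1f} C is more than "
            f"{max_gap_c:.1f} C below the lowest primer Tm ({lowest_tm:.1f} C)"
        )
    return warnings


if __name__ == "__main__":
    # Made-up primers, purely for demonstration.
    pair = ["ACGTACGTACGTACGTAC", "GGCCTTAAGGCCTTAAGG"]
    for message in annealing_warnings(pair, annealing_c=45.0):
        print("WARNING:", message)
```

The check returns a list rather than raising an error so that a notebook interface could show several warnings at once (buffer compatibility, enzyme age, and so on) without blocking the experimenter from logging the run.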


Jean-Claude Bradley said...

The question of what constitutes Open Notebook Science in non-experimental fields is not simple to answer. You are right that software has been created openly for a long time now. In fact, when I started doing this work I was using the term "Open Source Science", thinking that the analogy would be obvious, but it was not. Too many people assumed that meant Open Source Scientific Software.

That's why I started to use the term Open Notebook Science, bringing the focus to the laboratory notebook. My assumption was that anyone doing experimental science must keep a lab notebook. If that notebook is completely public you are doing ONS - if it isn't then you're doing something else.

When we started collaborating with people doing non-experimental work, like docking, things got a lot more challenging. I've tried to maintain "experiment-like" pages on our wiki with links to the libraries, algorithms and result files. But because so much information gets generated I don't think it is possible to capture all the mistakes like we can with a chemistry experiment.

I think the key distinction to make is again that of "no insider information". When a student and PI get together to write a paper, are they using only public data and files to construct that paper? Could someone else not in that group (human or otherwise) construct the equivalent of that same paper using only files made public by the research group? If so then I think they are doing ONS. If not then they are doing something else - it might be a form of Open Science but not ONS.

Pedro Beltrão said...

Terminology aside, I think bioinformatics should be playing a strong role in ONS because of its open source culture and tech aptitude. Still, it is not that easy to gain critical mass in a particular field of research to get several people around a discovery project.

On a related note, I just gave a talk at a PhD program meeting in Portugal about web resources for scientists. I ended the talk by mentioning open notebook science, and the very first question was about scooping. Most people find the concept interesting, but they also think it is very risky.