Thursday, December 13, 2007

The role of bioinformatics in Open Notebook Science

At one time in my life, I must have subscribed to "what's going on in the blogosphere" email updates from Genome Technology. Recently they were promoting a blog post about an "open notebook" bioinformatics project by Pedro Beltrao. I think it's great that Pedro is testing out Google Code as a repository for developing bioinformatics applications and providing a forum for discussion, code versioning, and code releases. However, what surprised me a little was the title of the Genome Technology post: "Yes, it's OK to share your results".

From that title, you'd think that a bioinformatics person sharing their code was a rare event. It's true that raw data sharing is still a rarity amongst experimenters; the no-insider-information, real-time posting of raw experimental data pre-publication (i.e. Open Notebook Science) is an even greater rarity. However, if having public code repositories is the bioinformatics version of Open Notebook Science (ONS), then ONS is hardly new to bioinformatics. A search for "bioinformatics" on Sourceforge (the largest open source software repository, which is similar to the Google Code repository used for Pedro's project) yields 126 results with projects dating back to 2001. I personally have a bioinformatics project in Sourceforge dating back to 2003.

But the truth is that openness is not a rare event in computer science or bioinformatics. Although it is pretty common to read a publication where the authors don't provide their source code, such publications are generally looked down upon and are typically less cited (if you make me email you for your code and another person has the same code free and easily accessible, why should I email you?). Bioinformatics already benefits from the open atmosphere that pervades computer science. When I write a new bioinformatics algorithm, I almost always take advantage of the vast amount of publicly available tutorials and software (from C libraries and perl modules to bioinformatics-specific code like bioperl and bioconductor). Almost all of the large collaborative bioinformatics projects like Bioperl and Gbrowse provide live versioned repositories using systems like cvs or subversion, and these projects have been around for years.

So is a project repository such as Sourceforge or Google Code the bioinformatics equivalent of Open Notebook Science? Yes, I believe it is. Congratulations, computer programmers and bioinformaticians of the world: you already experience the value created by open sharing. In fact, when I switched over from a pure bioinformatics job to become a hybrid scientist who spends half of the time doing experiments, I was shocked at how few tutorials there are on the internet to teach people experimental biology. All of these attitudes of data hiding seemed odd to me, so when I bumped into Jean-Claude's ONS article, it was great to finally see someone willing to dispel the fear-of-being-scooped myth and overcome the organizational hurdle requiring a level of annotation so that others could read and understand your code (i.e. experiments). Computer science removed those myths many years ago via the heroic efforts of now-famous names like Richard Stallman and Linus Torvalds. People in computer science don't fear being scooped; they typically praise it. How many variants of internet browsers derive their code from the Mozilla project? Computer science has also created standards and social norms for code annotation; poorly annotated computer code is very much looked down upon by true hackers. In the future, a poorly annotated open lab notebook from an experimental biologist will be viewed in a similar light.

Summary so far:
  1. Bioinformatics is already open
  2. Experimentalists can learn from bioinformatics (as bioinformatics learned from computer science) that:
    1. openness does not lead to widespread, uncontrolled idea theft; furthermore, idea theft (if properly acknowledged) is actually the highest form of praise; if hundreds of people are using your data before you've even finished your project, congratulations, you're doing one hell of a job as a scientist
    2. annotation standards must become a part of the social atmosphere of the open notebook science community; we must praise notebooks written well enough that any scientist in a similar field could immediately understand, interpret, and replicate the experiment from the notebook.
I think that bioinformatics has already done a great service for ONS by setting an example for experimentalists to follow. Therefore, I feel we do a disservice to the future of ONS by promoting open bioinformatics projects as great new contributions. Yes, they are contributions, but they are hardly new. It's a little bit like patting America on the back for lowering HIV mortality rates for yet another year, when we should really be focused on whether it's possible to do the same for Africa, where the heart of the problem lies.

So what is the role of bioinformatics in the future of Open Notebook Science?
First, bioinformatics programmers need to continue doing what they've been striving for since bioinformatics began:
  1. develop your code as open source projects on one of the standard code repositories, or at least put a link to your downloadable source code on a public website
  2. provide a README file, installation instructions, and a few example data files so that people can get up and running easily
  3. if possible, develop your code to work on a wide variety of platforms
  4. ensure that your code is annotated well enough that other programmers can read it (preferably using one of the standard formats like perldoc with perl or doxygen with C)
Beyond this, bioinformatics will have a large role to play in the future of ONS for experimentalists. As a hacker myself, I find it clear that the current breed of software for ONS is far from ideal. In an ideal world, the software (web interface, GUI or whatever) I'm running to log my ONS experimental work would check that the experiment I'm about to run is at least sensible (e.g. the software should warn me if the buffer in my protocol is not compatible with the reaction I'm trying to run, or if the annealing temperature I'm using for my PCR is too low for the primers in the reaction). When I run a digestion, I want to know the success rate of everyone else who has ever run a similar digestion. I want to know their success rate with different buffers, and I want to know their success rate given how old their restriction enzyme is and the batch it is from. I want the raw sequence data I enter into the ONS to be viewable in a traceviewer, with an interface that automatically BLASTs the sequence against the species I'm interested in to help me figure out what I've sequenced. I want to upload all of my sequences in a single zip archive and have the software organize them for me rather than uploading my files one at a time. I want scientists to be able to leave comments in my notebook, and I want them to be able to receive emails when someone replies to their comments (the lack of this feature kills many blog-based discussions). Like a wiki, I want to have the entire history of every file accessible to all readers. Like a blog, I want to provide RSS feeds so that all of the project's collaborators receive automagic updates when new experiments are added. I want all of this to be easy and intuitive. And someone who understands biology has to write the code...
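
To make the "is this experiment sensible?" idea a bit more concrete, here is a minimal sketch in Python of the PCR annealing temperature check. It is only an illustration, not a piece of any existing ONS software. It assumes the Wallace rule of thumb for short primers (Tm = 2 x (A+T) + 4 x (G+C), in degrees C), and the function names (wallace_tm, annealing_warnings) and the 5 degree margin are hypothetical choices of mine; a real tool would want a better melting temperature model and a threshold agreed on by the community.

```python
from typing import Dict, List


def wallace_tm(primer: str) -> int:
    """Estimate primer melting temperature (deg C) with the Wallace rule.

    Rule of thumb for short primers: Tm = 2*(A+T) + 4*(G+C).
    """
    seq = primer.upper()
    if not seq or set(seq) - set("ACGT"):
        raise ValueError("primer must be a non-empty string of A, C, G, T")
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def annealing_warnings(
    primers: Dict[str, str],
    annealing_c: float,
    max_gap_c: float = 5.0,  # assumed margin, chosen only for illustration
) -> List[str]:
    """Return one warning per primer whose estimated Tm is more than
    max_gap_c degrees above the planned annealing temperature."""
    warnings = []
    for name, seq in primers.items():
        tm = wallace_tm(seq)
        if annealing_c < tm - max_gap_c:
            warnings.append(
                f"{name}: annealing at {annealing_c} C is more than "
                f"{max_gap_c} C below the estimated Tm of {tm} C"
            )
    return warnings


if __name__ == "__main__":
    # Made-up primer sequence, used only to exercise the check.
    for message in annealing_warnings({"fwd": "GGGGCCCCAAAATTTT"}, 40.0):
        print(message)
```

The point of the sketch is the shape of the feature: the notebook software already knows the primers and the planned cycling conditions, so it can run a cheap sanity check before the experiment is logged rather than after it fails.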

Tuesday, December 4, 2007

Optimized ChIP Protocols

This page contains links to a chromatin immunoprecipitation protocol optimized with factorial and response surface methods.

The optimization resulted in two protocols that both have more than 10-fold higher throughput than the original. One of the protocols was optimized for speed and requires only 1.5 days to complete with a 46% average improvement in signal-to-noise ratio over the original protocol. The second protocol, optimized for signal-to-noise, requires 2.5 days to complete and achieves a 293% average improvement in signal-to-noise ratio over the original protocol. The optimizations were done in E. coli using three different transcription factors (two were primarily for validation).

Download the new ChIP Protocols (pdf)

Download the original ChIP Protocol (pdf)

The above links will always refer to the most recent version of the protocol if future improvements are added. If you have any questions or comments on the protocol, please post them to this blog.

All raw data and experiments that went towards this protocol optimization are in my Lab Notebook in the Chapter entitled: Towards a faster, more reliable ChIP protocol.

Older Versions

The protocol has a Change Log that describes what changes have occurred between versions of the protocol. Archived older versions are below:

ChIP Protocol Version 1.2

ChIP Protocol Version 1.1

ChIP Protocol Version 1.0