Thursday, December 13, 2007

The role of bioinformatics in Open Notebook Science

At one time in my life, I must have subscribed to "what's going on in the blogosphere" email updates from Genome Technology. Recently they were promoting a blog post about an "open notebook" bioinformatics project by Pedro Beltrao. I think it's great that Pedro is testing out Google Code as a repository for developing bioinformatics applications and providing a forum for discussion, code versioning, and code releases. However, what surprised me a little was the title of the Genome Technology post: "Yes, it's OK to share your results".

From that title, you'd think that a bioinformatics person sharing their code was a rare event. It's true that raw data sharing is still a rarity amongst experimenters; the no-insider-information, real-time posting of raw experimental data pre-publication (i.e. Open Notebook Science) is an even greater rarity. However, if having public code repositories is the bioinformatics version of Open Notebook Science (ONS), then ONS is hardly new to bioinformatics. A search for "bioinformatics" on Sourceforge (the largest open source software repository, which is similar to the Google Code repository used for Pedro's project) yields 126 results with projects dating back to 2001. I personally have a bioinformatics project in Sourceforge dating back to 2003.

But the truth is that openness is not a rare event in computer science or bioinformatics. Although it is pretty common to read a publication where the authors don't provide their source code, such publications are generally looked down upon and are typically less cited (if you make me email you for your code and another person has the same code for free and easily accessible, why should I email you?). Bioinformatics already benefits from the open atmosphere that pervades computer science. When I write a new bioinformatics algorithm, I almost always take advantage of the vast amount of publicly available tutorials and software (from C libraries and perl modules to bioinformatics-specific code like bioperl and bioconductor). Almost all of the large collaborative bioinformatics projects like Bioperl and Gbrowse provide live versioned repositories like cvs or subversion, and these projects have been around for years.

So is a project repository such as Sourceforge or Google Code the bioinformatics equivalent of Open Notebook Science? Yes, I believe it is. Congratulations, computer programmers and bioinformaticians of the world: you already experience the value created by open sharing. In fact, when I switched over from a pure bioinformatics job to become a hybrid scientist who spends half of the time doing experiments, I was shocked at how few tutorials there are on the internet to teach people experimental biology. All of these attitudes of data hiding seemed odd to me, so when I bumped into Jean-Claude's ONS article, it was great to finally see someone willing to dispel the fear-of-being-scooped myth and overcome the organizational hurdle of annotating your work well enough that others can read and understand your code (i.e. experiments). Computer science removed those myths many years ago via the heroic efforts of now-famous names like Richard Stallman and Linus Torvalds. People in computer science don't fear being scooped; they typically praise it. How many variants of internet browsers derive their code from the Mozilla project? Computer science has also created standards and social norms for code annotation; poorly annotated computer code is very much looked down upon by true hackers. In the future, a poorly annotated open lab notebook from an experimental biologist will be viewed in a similar light.

Summary so far:
  1. Bioinformatics is already open
  2. Experimentalists can learn from bioinformatics (as bioinformatics learned from computer science) that:
    1. openness does not lead to widespread, uncontrolled idea theft; furthermore, idea theft (if properly acknowledged) is actually the highest form of praise; if hundreds of people are using your data before you've even finished your project, congratulations, you're doing one hell of a job as a scientist
    2. annotation standards must become a part of the social atmosphere of the open notebook science community; we must praise notebooks written well enough that any scientist in a similar field could immediately understand, interpret, and replicate the experiment from the notebook.
I think that bioinformatics has already done a great service for ONS by setting an example for experimentalists to follow. Therefore, I feel we do a disservice to the future of ONS by promoting open bioinformatics projects as great new contributions. Yes, they are contributions, but they are hardly new. It's a little bit like patting America on the back for lowering HIV mortality rates for yet another year, when we should really be focused on whether it's possible to do the same for Africa, where the heart of the problem lies.

So what is the role of bioinformatics in the future of Open Notebook Science?
First, bioinformatics programmers need to continue doing what they've been striving for since bioinformatics began:
  1. develop your code as open source projects on one of the standard code repositories, or at least put a link to your downloadable source code on a public website
  2. provide a README file, installation instructions, and a few example data files so that people can get up and running easily
  3. if possible, develop your code to work on a wide variety of platforms
  4. ensure that your code is annotated well enough that other programmers can read it (preferably using one of the standard formats like perldoc with perl or doxygen with C)
Beyond this, bioinformatics will have a large role to play in the future of ONS for experimentalists. As a hacker myself, I can see that the current breed of software for ONS is far from ideal. In an ideal world, the software (web interface, GUI or whatever) I'm running to log my ONS experimental work would check that the experiment I'm about to run is at least sensible (e.g. the software should warn me if the buffer in my protocol is not compatible with the reaction I'm trying to run, or if the annealing temperature I'm using for my PCR is too low for the primers in the reaction). When I run a digestion, I want to know the success rate of everyone else that's ever run a similar digestion; I want to know their success rate with different buffers; I want to know their success rate given how old their restriction enzyme is and the batch it is from. I want the raw sequence data I enter into the ONS to be viewable in a traceviewer, with an interface that automatically BLASTs the sequence against the species I'm interested in to help me figure out what I've sequenced. I want to upload all of my sequences in a single zip archive and have the software organize it for me rather than uploading my files one at a time. I want scientists to be able to leave comments in my notebook, and I want them to be able to receive emails when someone replies to their comments (the lack of this feature kills many blog-based discussions). Like a wiki, I want to have the entire history of every file accessible to all readers. Like a blog, I want to provide RSS feeds so that all of the project's collaborators receive automagic updates when new experiments are added. I want all of this to be easy and intuitive. And someone that understands biology has to write the code...
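
To make the annealing-temperature check from the wish list above concrete, here is a minimal sketch in Python. It is only an illustration of the kind of sanity check I have in mind, not code from any existing ONS tool. It estimates primer melting temperature with the Wallace rule (2 degrees per A/T, 4 degrees per G/C), which is a rough approximation meant for short oligos. The function names, the 5 degree margin, and the primer sequences are all assumptions made up for the example.

```python
def wallace_tm(primer):
    """Rough melting temperature in deg C via the Wallace rule: 2*(A+T) + 4*(G+C)."""
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    if not seq or at + gc != len(seq):
        raise ValueError(f"primer must contain only A, C, G, T: {primer!r}")
    return 2.0 * at + 4.0 * gc


def annealing_warnings(primers, annealing_temp_c, margin_c=5.0):
    """Return a list of warning strings; the list is empty if nothing looks wrong.

    Assumption: annealing more than `margin_c` degrees below the lowest
    primer Tm counts as "too low". The 5 degree default is a rule of thumb.
    """
    lowest_tm = min(wallace_tm(p) for p in primers)
    warnings = []
    if annealing_temp_c < lowest_tm - margin_c:
        warnings.append(
            f"annealing temperature {annealing_temp_c:.1f} C is more than "
            f"{margin_c:.1f} C below the lowest primer Tm ({lowest_tm:.1f} C)"
        )
    return warnings


# Hypothetical primer pair, invented for illustration only.
forward = "ATGCATGCATGCATGCATGC"
reverse = "GGCCATATGGCCATATGGCC"
for message in annealing_warnings([forward, reverse], annealing_temp_c=50.0):
    print("WARNING:", message)
```

A real notebook tool would want a better Tm model and would pull the primers and cycling conditions from the logged protocol rather than from hard-coded strings, but the shape of the check would be similar.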

Tuesday, December 4, 2007

Optimized ChIP Protocols

This page contains links to a chromatin immunoprecipitation protocol optimized with factorial and response surface methods.

The optimization resulted in two protocols that both have more than 10-fold higher throughput than the original. One of the protocols was optimized for speed and requires only 1.5 days to complete with a 46% average improvement in signal-to-noise ratio over the original protocol. The second protocol, optimized for signal-to-noise, requires 2.5 days to complete and achieves a 293% average improvement in signal-to-noise ratio over the original protocol. The optimizations were done in E. coli using three different transcription factors (two were primarily for validation).
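
For readers unfamiliar with factorial designs, the basic idea is to vary several protocol factors at once and run every combination of their levels, instead of changing one factor at a time. The Python sketch below only enumerates the runs of a two-level full factorial design. The factor names and levels are hypothetical placeholders, not the factors actually used in this optimization (those are in the lab notebook), and the response surface step is not shown.

```python
from itertools import product


def full_factorial(factors):
    """Return one dict per run, covering every combination of factor levels."""
    names = list(factors)
    level_lists = [factors[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_lists)]


# Hypothetical factors and levels, chosen only to illustrate the layout.
factors = {
    "crosslink_minutes": [10, 20],
    "antibody_microliters": [1, 5],
    "wash_count": [2, 4],
}

runs = full_factorial(factors)
for run_number, run in enumerate(runs, start=1):
    print(run_number, run)
```

Three factors at two levels each give 2 x 2 x 2 = 8 runs, which is why factorial designs are usually kept to a handful of factors or trimmed to fractional designs.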

Download the new ChIP Protocols (pdf)

Download the original ChIP Protocol (pdf)

The above links will always refer to the most recent version of the protocol if future improvements are added. If you have any questions or comments on the protocol, please post them to this blog.

All raw data and experiments that went towards this protocol optimization are in my Lab Notebook in the Chapter entitled: Towards a faster, more reliable ChIP protocol.

Older Versions

The protocol has a Change Log that describes what has changed between versions of the protocol. Archived older versions are below:

ChIP Protocol Version 1.2

ChIP Protocol Version 1.1

ChIP Protocol Version 1.0

Wednesday, November 7, 2007

My first appearance in a scientific credit list

Tim Gardner (my PI) and Michael Molla wrote a guest blog post at PLoS on how science can learn from the movie industry. At the end of a movie, the role of everyone in the movie is clearly spelled out in the movie credits, while on scientific publications you only have an ordered list of names from which to try and infer each author's role.
Excerpt from their scientific credit list post:
There is a better system, and it's already in use in the film industry -- a credit list. Each person who contributed to a movie has a specific credit describing his or her contribution. If one's contribution fills more than one role, that person's name can appear more than once.

Apparently, I contributed enough to this particular scientific endeavor of Tim and Michael to earn a spot in the credits:

Excerpt from their scientific credit list post:
Roll Credits: Sometimes the Authorship Byline Isn't Enough

Michael Molla (1) and Tim Gardner (2)
Writer: M. Molla
Editor: T. Gardner
Readers: Jeff Hasty (3) Jeremiah Faith (4)

(1) Research Associate, Biomedical Engineering, Boston University
(2) Assistant Professor, Biomedical Engineering, Boston University
(3) Associate Professor, Department of Bioengineering, University of California, San Diego
(4) Ph.D. Candidate, Bioinformatics Program, Boston University

I've written before about how our current publishing systems are certainly falling behind their potential. And while revolutionary ideas about how science should be done may be the way of the future, I still think we can benefit now from these types of incremental improvements to our current system.
Excerpt from their scientific credit list post:
Such a research credit system would have huge benefits for one's career prospects; and it might encourage more effective collaborations. Moreover, these credits could easily be tracked by scientist or project in a database akin to the Internet Movie Database (IMDB). It could provide an alternative to the ever-so-important citation factors as a means of assessing one's scientific impact. And maybe one day there will even be an Academy Awards of Science.
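
As a rough illustration of how such credits could be tracked by scientist, here is a small Python sketch that stores the credit list from their post as (role, person) pairs and indexes it both ways. The data structure and function names are my own assumptions for the example, not anything proposed in the post.

```python
from collections import defaultdict

# Credits copied from the credit list quoted above, as (role, person) pairs.
CREDITS = [
    ("Writer", "M. Molla"),
    ("Editor", "T. Gardner"),
    ("Reader", "Jeff Hasty"),
    ("Reader", "Jeremiah Faith"),
]


def roles_by_person(credits):
    """Map each person to the list of roles they filled (a person may appear more than once)."""
    index = defaultdict(list)
    for role, person in credits:
        index[person].append(role)
    return dict(index)


def people_by_role(credits):
    """Map each role to the list of people credited with it."""
    index = defaultdict(list)
    for role, person in credits:
        index[role].append(person)
    return dict(index)


print(roles_by_person(CREDITS))
print(people_by_role(CREDITS))
```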

Here's hoping I win the Best Reader award at the 2008 Academy Awards of Science.

Wednesday, October 24, 2007

How this blog works

I've been blogging for about 6 months now, and it has definitely been more interesting and productive than I initially thought it would be. One thing I've found I don't like about blogging is that most blogs just throw stuff out there. I understand this is part of the blogginess of blogging, but it makes it really hard to get oriented. When you go to someone's blog for the first time, it often feels like random stuff is just being tossed onto the web. Only after following the blog for a while will you really figure out if the author has an overarching point.

So this post is just to help new folks orient themselves to my blog.

J's blog is primarily focused on developing and abiding by an Open Science system. Open Science is a relatively new idea, so things are changing as I go along. No one really has any standards yet, because Open Science people (and in particular Open Notebook Science people) are still trying to figure out best practices for science in the open.

That said, here is my current schema:
  1. Open Ideas: I try to blog all of the experimental ideas that I'm considering pursuing. The hope is that I can find other folks interested in the same things as myself, and if I don't pursue the ideas, perhaps they'll be of use to someone else. I maintain an index of these Open Ideas on this blog.
  2. Open Projects: The Open Ideas I decide to pursue become a chapter in J's Lab Notebook. Following the basic idea of Open Notebook Science, all of the raw data for the projects I pursue is publicly available in real time (updated nightly). The hope is that folks who might find my work useful don't have to wait two years until I publish it. I know following someone's experiments in raw form can be difficult, but similar to reading someone's computer code, I think we need some rules or general guidelines to make such tasks easier. I do not expect folks to read and follow the notebook as I go along. Rather, I expect folks to stumble upon the notebook through internet searches and such, whereupon they can email me if they're interested in more information or clarification of anything. I maintain an index of these Open Projects on this blog.
  3. Open Publishing: After I finish projects, I typically publish them in scientific journals. In the future, I hope to publish the failed or smaller experiments to this blog or to an archive. Because the current set of open access journals doesn't yet cover the entire range of experimental and computational biology subjects, I do not publish exclusively in Open Access journals (though I think Jonathan Eisen has some interesting ideas on this topic, I think it's a little early to limit yourself to only open access journals unless you're already well known [which Eisen is]). I maintain an index of these completed projects on this blog.
So if you're new to my blog and you'd like to learn more, you might start by skimming the short descriptions available in the indexes: J's Open Ideas index, J's Open Projects index, J's Open Publishing index.

J's Open Projects index

This post is an index with a one-line description of the projects I'm working on and the relevant chapter of the work in J's Lab Notebook. Read how this blog works for more general information.

1) Cheaper, faster, better ChIP
Description: Using statistical experimental design methods to shorten, cheapen, and optimize a Chromatin Immunoprecipitation protocol for experimentally determining transcription factor binding sites.
Detailed description: Factorial and response surface optimization of a chromatin immunoprecipitation protocol
Lab Notebook Chapter: Towards a faster, more reliable ChIP protocol.
Date Project Started: Apr 26, 2007