Thursday, December 13, 2007

The role of bioinformatics in Open Notebook Science

At one time in my life, I must have subscribed to "what's going on in the blogosphere" email updates from Genome Technology. Recently they were promoting a blog post about an "open notebook" bioinformatics project by Pedro Beltrao. I think it's great that Pedro is testing out Google Code as a repository for developing bioinformatics applications and providing a forum for discussion, code versioning, and code releases. However, what surprised me a little was the title of the Genome Technology post: "Yes, it's OK to share your results".

From that title, you'd think that a bioinformatics person sharing their code was a rare event. It's true that raw data sharing is still a rarity amongst experimenters; the no-insider-information, real-time posting of raw experimental data pre-publication (i.e. Open Notebook Science) is an even greater rarity. However, if having public code repositories is the bioinformatics version of Open Notebook Science (ONS), then ONS is hardly new to bioinformatics. A search for "bioinformatics" on Sourceforge (the largest open source software repository, which is similar to the Google Code repository used for Pedro's project) yields 126 results with projects dating back to 2001. I personally have a bioinformatics project in Sourceforge dating back to 2003.

But the truth is that openness is not a rare event in computer science or bioinformatics. Although it is pretty common to read a publication where the authors don't provide their source code, such publications are generally looked down upon and are typically less cited (if you make me email you for your code and another person has the same code for free and easily accessible, why should I email you?). Bioinformatics already benefits from the open atmosphere that pervades computer science. When I write a new bioinformatics algorithm, I almost always take advantage of the vast amount of publicly available tutorials and software (from C libraries and perl modules to bioinformatics-specific code like bioperl and bioconductor). Almost all of the large collaborative bioinformatics projects like Bioperl and Gbrowse provide live versioned repositories like cvs or subversion, and these projects have been around for years.

So is a project repository such as Sourceforge or Google Code the bioinformatics equivalent of Open Notebook Science? Yes, I believe it is. Congratulations computer programmers and bioinformaticians of the world, you already experience the value created by open sharing. In fact when I switched over from a pure bioinformatics job to become a hybrid scientist that spends half of the time doing experiments, I was shocked at how few tutorials there are on the internet to teach people experimental biology. All of these attitudes of data hiding seemed odd to me, so when I bumped into Jean-Claude's ONS article, it was great to finally see someone willing to dispel the fear-of-being-scooped myth and overcome the organizational hurdle requiring a level of annotation so that others could read and understand your code (i.e. experiments). Computer science removed those myths many years ago via the heroic efforts of those now famous names like Richard Stallman and Linus Torvalds. People in computer science don't fear being scooped; they typically praise it. How many variants of internet browsers derive their code from the Mozilla project? Computer science has also created standards and social norms for code annotation; poorly annotated computer code is very much looked down upon by true hackers. In the future, a poorly annotated open lab notebook from an experimental biologist will be viewed in a similar light.

Summary so far:
  1. Bioinformatics is already open
  2. Experimentalists can learn from bioinformatics (as bioinformatics learned from computer science) that:
    1. openness does not lead to widespread, uncontrolled idea theft; furthermore, idea theft (if properly acknowledged) is actually the highest form of praise; if hundreds of people are using your data before you've even finished your project, congratulations, you're doing one hell-of-a-job as a scientist
    2. annotation standards must become a part of the social atmosphere of the open notebook science community; we must praise notebooks written well enough that any scientist in a similar field could immediately understand, interpret, and replicate the experiment from the notebook.
I think that bioinformatics has already done a great service for ONS by setting an example for experimentalists to follow. Therefore, I feel we do a disservice to the future of ONS by promoting open bioinformatics projects as great new contributions. Yes they are contributions, but they are hardly new. It's a little bit like patting America on the back for lowering HIV mortality rates for yet another year, when we should really be focused on whether it's possible to do the same lowering for Africa, where the heart of the problem lies.

So what is the role of bioinformatics in the future of Open Notebook Science?
First, bioinformatics programmers need to continue doing what they've been striving for since bioinformatics began:
  1. develop your code as open source projects on one of the standard code repositories or at least put a link to your downloadable sourcecode on a public website
  2. provide a README file, installation instructions, and a few example data files so that people can get up and running easily
  3. if possible develop your code to work on a wide variety of platforms
  4. ensure that your code is annotated well-enough that other programmers can read it (preferably using one of the standard formats like perldoc with perl or doxygen with C)
Beyond this, bioinformatics will have a large role to play in the future of ONS for experimentalists. Being a hacker myself, it's clear that the current breed of software for ONS is far from ideal. In an ideal world, the software (web interface, GUI or whatever) I'm running to log my ONS experimental work would check that the experiment I'm about to run is at least sensible (e.g. the software should warn me if the buffer in my protocol is not compatible with the reaction I'm trying to run; or if the annealing temperature I'm using for my PCR is too low for the primers in the reaction). When I run a digestion, I want to know the success rate of everyone else that's ever run a similar digestion, I want to know their success rate with different buffers, I want to know their success rate given how old their restriction enzyme is and the batch it is from. I want the raw sequence data I enter into the ONS to be viewable in a traceviewer, with an interface that automatically BLASTs the sequence against the species I'm interested in to help me figure out what I've sequenced. I want to upload all of my sequences in a single zip archive and have the software organize it for me rather than uploading my files one at a time. I want scientists to be able to leave comments in my notebook, I want them to be able to receive emails when someone replies to their comments (the lack of this feature kills many blog-based discussions). Like a wiki, I want to have the entire history of every file accessible to all readers. Like a blog, I want to provide RSS feeds so that all of the project's collaborators receive automagic updates when new experiments are added. I want all of this to be easy and intuitive. And someone that understands biology has to write the code...
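
As a toy illustration of the kind of sanity check described above, here is a minimal Python sketch that warns when a PCR annealing temperature looks too low for the primers. It assumes the rough Wallace rule (2 degrees C per A/T plus 4 degrees C per G/C), which is only a crude estimate for short primers, and an arbitrary 10 degree margin. The function names and the threshold are hypothetical, and real notebook software would want a better melting temperature model.

```python
def wallace_tm(primer):
    """Rough melting temperature (deg C) via the Wallace rule: 2*(A+T) + 4*(G+C)."""
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def annealing_warnings(primers, annealing_temp, margin=10):
    """Return one warning per primer whose estimated Tm exceeds the annealing
    temperature by more than `margin` degrees (an arbitrary illustrative cutoff)."""
    warnings = []
    for name, seq in primers.items():
        tm = wallace_tm(seq)
        if annealing_temp < tm - margin:
            warnings.append(
                f"{name}: annealing at {annealing_temp} C is more than "
                f"{margin} C below the estimated Tm of {tm} C"
            )
    return warnings
```

Calling `annealing_warnings({"fwd": "ATGCATGCATGCATGCATGC"}, 45)` would flag the primer, since its estimated Tm under this rule is 60 C.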

Tuesday, December 4, 2007

Optimized ChIP Protocols

This page contains links to a chromatin immunoprecipitation protocol optimized with factorial and response surface methods.

The optimization resulted in two protocols that both have more than 10-fold higher throughput than the original. One of the protocols was optimized for speed and requires only 1.5 days to complete with a 46% average improvement in signal-to-noise ratio over the original protocol. The second protocol, optimized for signal-to-noise, requires 2.5 days to complete and achieves a 293% average improvement in signal-to-noise ratio over the original protocol. The optimizations were done in E. coli using three different transcription factors (two were primarily for validation).

Download the new ChIP Protocols (pdf)

Download the original ChIP Protocol (pdf)

The above links will always refer to the most recent version of the protocol if future improvements are added. If you have any questions or comments on the protocol, please post them to this blog.

All raw data and experiments that went towards this protocol optimization are in my Lab Notebook in the Chapter entitled: Towards a faster, more reliable ChIP protocol.

Older Versions

The protocol has a Change Log to describe what changes have occurred between versions of the protocol. Archived older versions are below:

ChIP Protocol Version 1.2

ChIP Protocol Version 1.1

ChIP Protocol Version 1.0

Wednesday, November 7, 2007

My first appearance in a scientific credit list

Tim Gardner (my PI) and Michael Molla wrote a guest blog post at PLoS on how science can learn from the movie industry. At the end of a movie, the role of everyone in the movie is clearly spelled out in the movie credits, while on scientific publications you only have a list of ordered names from which to try and infer the authors' role.
Excerpt from their scientific credit list post:
There is a better system, and it's already in use in the film industry -- a credit list. Each person who contributed to a movie has a specific credit describing his or her contribution. If one's contribution fills more than one role, that person's name can appear more than once.

Apparently, I contributed enough to this particular scientific endeavor of Tim and Michael to earn a spot in the credits:

Excerpt from their scientific credit list post:
Roll Credits: Sometimes the Authorship Byline Isn't Enough

Michael Molla (1) and Tim Gardner (2)
Writer: M. Molla
Editor: T. Gardner
Readers: Jeff Hasty (3) Jeremiah Faith (4)

(1) Research Associate, Biomedical Engineering, Boston University
(2) Assistant Professor, Biomedical Engineering, Boston University
(3) Associate Professor, Department of Bioengineering, University of California, San Diego
(4) Ph.D. Candidate, Bioinformatics Program, Boston University

I've written before about how our current publishing systems are certainly falling behind their potential. And while revolutionary ideas about how science should be done may be the way of the future, I still think we can benefit now from these types of incremental improvements to our current system.
Excerpt from their scientific credit list post:
Such a research credit system would have huge benefits for one's career prospects; and it might encourage more effective collaborations. Moreover, these credits could easily be tracked by scientist or project in a database akin to the Internet Movie Database (IMDB). It could provide an alternative to the ever-so-important citation factors as a means of assessing one's scientific impact. And maybe one day there will even be an Academy Awards of Science.

Here's hopin I win the Best Reader award at the 2008 Academy Awards of Science.

Wednesday, October 24, 2007

How this blog works

I've been blogging for about 6 months now, and it has definitely been more interesting and productive than I initially thought it would be. One thing I've found I don't like about blogging is that most blogs just throw stuff out there. I understand this is part of the blogginess of blogging, but it makes it really hard to get oriented. When you go to someone's blog for the first time, it often feels like random stuff is just being tossed onto the web. Only after following the blog for a while will you really figure out if the author has an overarching point.

So this post is just to help new folks orient themselves to my blog.

J's blog is primarily focused on developing and abiding by an Open Science system. Since Open Science is a relatively new idea, things are changing as I go along, and no one really has any standards yet, because Open Science people (and in particular Open Notebook Science People) are still trying to figure out best practices for science in the open.

That said, here is my current schema:
  1. Open Ideas: I try to blog all of the experimental ideas that I'm considering pursuing. The hope is that I can find other folks interested in the same things as myself, and if I don't pursue the ideas, perhaps they'll be of use to someone else. I maintain an index of these Open Ideas on this blog.
  2. Open Projects: The Open Ideas I decide to pursue become a chapter in J's Lab Notebook. Following the basic idea of Open Notebook Science, all of the raw data for the projects I pursue is publicly available in real time (updated nightly). The hope is that folks who might find my work useful don't have to wait two years until I publish it. I know following someone's experiments in raw form can be difficult, but similar to reading someone's computer code, I think we need some rules or general guidelines to make such tasks easier. I do not expect folks to read and follow the notebook as I go along. Rather, I expect folks to stumble upon the notebook through internet searches and such, whereupon they can email if they're interested in more information or clarification of anything. I maintain an index of these Open Projects on this blog.
  3. Open Publishing: After I finish projects, I typically publish them in scientific journals. In the future, I hope to publish the failed or smaller experiments to this blog or to an archive. Because the current set of open access journals doesn't yet cover the entire range of experimental and computational biology subjects, I do not publish exclusively in Open Access journals (though I think Jonathan Eisen has some interesting ideas on this topic, I think it's a little early to limit yourself to only open access journals unless you're already well known [which Eisen is] ). I maintain an index of these completed projects on this blog.
So if you're new to my blog and you'd like to learn more, you might start by skimming the short descriptions available in the indexes: J's Open Ideas index, J's Open Projects index, J's Open Publishing index.

J's Open Projects index

This post is an index with a one-line description of the projects I'm working on and the relevant chapter of the work in J's Lab Notebook. Read how this blog works for more general information.

1) Cheaper, faster, better ChIP
Description: Using statistical experimental design methods to shorten, cheapen, and optimize a Chromatin Immunoprecipitation protocol for experimentally determining transcription factor binding sites.
Detailed description: Factorial and response surface optimization of a chromatin immunoprecipitation protocol
Lab Notebook Chapter: Towards a faster, more reliable ChIP protocol.
Date Project Started: Apr 26, 2007

J's Open Publishing index

This post is an index with a one-line description of the projects I've published. Read how this blog works for more general information.

2) Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured experimental metadata
Description: Microarray compendia for E. coli, Shewanella, and yeast (currently 524, 530, and 14 arrays for each of these species respectively). The arrays are normalized together to allow all arrays for each species to be analyzed as a single group.
Blog post on this publication: When will gene expression data become collective knowledge?
Date work was published: Sept 18, 2007

1) Large-Scale Mapping and Validation of Escherichia coli Transcriptional Regulation from a Compendium of Expression Profiles
Description: How well can we computationally infer regulatory interactions between transcription factors and their targets using microarray data? Predictions from several algorithms were validated using the 3500 experimentally determined interactions in RegulonDB, plus an additional 300 interactions that were tested with ChIP. Most importantly, once we've reliably inferred these networks, what can we do with them?
Useful resources: The microarray data used in this paper is available at M3D. In addition, we have a supplemental site containing links to the algorithms and the RegulonDB known interactions we used in the publication.
Date work was published: January 9, 2007

J's Open Ideas index

This post is an index with a one-line description of the projects I'm considering working on. Read how this blog works for more general information.

3) Effect of sequence level mutations on transcription, translation, and noise
Description: A technique I think would allow us to determine the effect of millions of promoter variants on the rate of transcription and translation in single-cells.
Date idea was blogged: September 1, 2007

2) Mutations, gene passing, and the evolution of gut microbes
Description: Can we use gnotobiotic mice to obtain experimental estimates of mutation rates and gene transfer rates in different intestinal environments (e.g. under different stresses and with different combinations of microbes).
Date idea was blogged: June 26, 2007

1) Live imaging of host-microbe interactions
Description: Can we apply the deep-imaging 2-photon microscopy techniques, which have been developed over the past few years for imaging neuronal systems, to the imaging of the spatial distribution and interactions between different gut microbes and their host?
Date idea was blogged: June 12, 2007

Tuesday, October 23, 2007

When will gene expression data become collective knowledge?

Published research: this post describes some of my published research. The relevant publication is: Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured metadata, which describes the M3D database. This post serves as a place for folks to provide comments and suggestions for the future of the database and the future of expression data in general.

The focus of my PhD thesis has been network inference: how can we efficiently determine regulatory networks in prokaryotes (i.e. which transcription factors regulate which genes). Since mRNA concentration is the only data we can easily measure for all genes simultaneously, my PhD has also been an exploration of the potential and limitations of expression data. In joining Tim Gardner's lab at BU, I was a little reluctant to analyze microarray data, because I'd heard it was noisy and too easy to find whatever answer you were looking for. After 4+ years of working with gene expression data from microarrays, I'm convinced that it is a little noisy and that it is extremely easy to find whatever answer you're looking for. But with careful analysis, particularly regarding the statistics of large datasets and multiple hypothesis testing, microarrays hold an unparalleled wealth of knowledge about the dynamic, concerted actions of cells.

Genome sequencing has provided the cell's nouns. Microarrays are providing the cell's verbs. Currently, most people use the nouns as collective knowledge while the verbs are generated and analyzed in-house. How are we ever going to understand the language of life when everyone has to generate a personal set of verbs?

What I mean for genome sequencing is this: you want to knockout, clone, tag, or whatever a gene in species X. You look up the location of that gene in the genome browser for species X, download the surrounding sequence, and use your intuition or some primer design software to help you construct the chemicals (typically oligonucleotide primers) you need to experiment on your gene. Or maybe you're one of these folks that likes to take fancy trips with your science budget, so you're out in Hawaii to collect sea water for metagenomic sequencing. When you are sufficiently tanned and you head back home to sequence the seawater, the first thing you do is compare your seawater DNA sequence to all DNA sequence available for any species on the planet to see if you can find anything to help you figure out what was in your seawater.

What I mean for microarrays is this: you want to figure out what the cell is doing when you apply X to it (e.g. heat shock, acid shock, DNA damage, glucose growth, etc...). You run 3 chips in a standard condition and 3 chips in a standard condition plus X. You take the two conditions and run a statistical test - ttest, fold change, or FDR if you're getting fancy - to produce a list of N genes that changed expression when you did X. You write a 5 page paper where you publish the list and describe what those changes might mean.
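
For readers who have not run this kind of analysis, here is a minimal Python sketch of the per-gene comparison described above: a log2 fold change and a Welch t statistic for 3 standard chips versus 3 chips with X. The function names and the expression values are made up for illustration, the values are assumed to already be on a log2 scale, and turning the statistic into a p-value (let alone an FDR) would need a t-distribution, which is left out here.

```python
import math
from statistics import mean, variance


def log2_fold_change(control, treated):
    """Difference of mean log2 expression (treated minus control)."""
    return mean(treated) - mean(control)


def welch_t(control, treated):
    """Welch's t statistic for two small groups of log2 expression values."""
    se = math.sqrt(variance(control) / len(control)
                   + variance(treated) / len(treated))
    return (mean(treated) - mean(control)) / se


# Made-up log2 intensities for one gene: 3 standard chips, 3 chips with X.
control = [8.0, 8.2, 7.8]
treated = [10.0, 10.3, 9.7]
fold = log2_fold_change(control, treated)
t_stat = welch_t(control, treated)
```

In a real analysis this would be repeated for every gene on the chip, which is exactly where multiple hypothesis testing becomes a concern.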

Final summary of the microarray isolationist problem
With genome sequencing everyone is taking advantage of the wealth of collective sequencing knowledge to improve their own research. For microarrays, with few exceptions, knowledge is created and remains in isolation. One could make the case that microarrays are a relatively young technology and the collected knowledge has yet to accumulate, but microarrays were invented in 1995 - they're older than Google Inc.

How can we promote gene expression as collective knowledge?
For our network inference work, it was necessary to collect as many microarrays as possible for the species of interest (E. coli). Since Tim's lab is full of computer nerds like myself, from the start we decided to collect all of the data in a database. Originally, the database was just a storage dump that ensured that all the folks writing network inference algorithms (i.e. Boris Hayete and myself) would use the same starting dataset. Having the microarray datadump in a standard format and standard location certainly helped our network inference efforts. In the end, Boris developed a network inference algorithm - CLR - that currently remains the top performing algorithm on the E. coli microarrays (if you're interested in network inference and think you can top CLR - please try, here's a site to help you get started; and let me know how it goes!).

More recently, we improved our microarray database and published a piece, Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured metadata, that I hope starts to move gene expression into the collective knowledge space. In that paper we tried to address three problems that are currently hindering the wide-scale adoption of microarrays:
  1. the presence of platform-specific biases in expression data due to the use of many different microarray platforms in a compendium
  2. the lack of a uniformly applied normalization standard for expression datasets, even within a single expression platform. Different software algorithms are used by different labs for preprocessing and normalizing the raw microarray intensity values
  3. the incompleteness and inconsistency in the curation of metadata describing the details of each experimental condition.
To address point 1, we only allowed a single platform (Affymetrix) for each of the three species currently in the database (E. coli, Shewanella, and yeast). To address point 2, we collected unnormalized raw CEL files for all of the experiments and uniformly normalized them as a group with RMA. In our previous work, we found that this RMA normalization makes comparisons possible on microarrays of the same platform that are run in different laboratories (see the section "Verification of array data normalization and consistency" and Figure S5 in the supplement Protocol S1 to our network inference paper). And to address point 3, we generated human curated (and computationally validated) experimental metadata for each microarray publication—converting each chemical and growth attribute into a structured and computable set of experimental features with consistent naming conventions and units.
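
To give a feel for what normalizing arrays "as a group" means, here is a minimal Python sketch of quantile normalization, which is one of the steps inside RMA (alongside background correction and probe-level summarization, neither of which is shown). This is a simplified illustration with a hypothetical function name, not the code used to build M3D, and it breaks ties naively.

```python
def quantile_normalize(arrays):
    """Quantile-normalize equal-length lists of intensities (one list per array).

    Each value is replaced by the mean of the values sharing its rank across
    all arrays, so every array ends up with the same distribution. Ties are
    broken by position, which a real implementation would handle more carefully.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])  # probe indices, lowest first
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized
```

After this step, two arrays run in different labs have identical intensity distributions, so remaining differences between them reflect the ranking of probes rather than chip-wide brightness.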

I believe this Many Microbe Microarrays Database (M3D) provides the essential starting point for moving towards the use of microarrays as collective knowledge: a set of curated microarray datasets that have already proven useful in a large-scale application.

If you build it, some will come
M3D has been online for about a year now. The site currently gets around 200 unique visitors and 3000 hits a day. But based on the emails I get from folks, the audience is primarily computational folks like myself that are eager to test their new algorithm on a large, well-annotated experimental dataset.

But will the biologists ever arrive?

The wetlab biologists are the folks I would really like to begin adopting this resource. Since I do experimental work in E. coli, whenever I want to find out what a particular gene does or what genes might regulate it, I dig around in RegulonDB or EcoCyc to see what's known about the gene. How might expression data be incorporated into such a website to aid understanding the biology of E. coli? For one, if RegulonDB has published evidence that lexA transcriptionally regulates recA, it would be nice to see if the collective microarray knowledge currently supports the published evidence. This type of information could be provided by including a scatterplot of the expression values of the transcription factor -vs- those of its target (see the image below generated from M3D). If you generate this plot on M3D, you can also mouseover each point to receive the details of the experiment represented by the datapoint.
I think the key to adoption is probably integration with the currently available resources. Folks don't want to go to yet-another-website and figure out how to work it. Because of this, I've created a way for external websites to automatically include M3D generated images on their own websites (instructions). Here are a couple of examples; these are drawn on-the-fly from M3D rather than uploaded to this blog:

The problem with this approach is that if M3D goes down, the automatically generated images also fail on the remote website. So perhaps remote websites would want to automatically generate and locally cache the images.
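
A remote site could handle the caching idea with something like the following minimal Python sketch. The function names, the one-day refresh interval, and the cache layout are all assumptions for illustration; the image URL is whatever the remote site already embeds, and nothing here reflects an actual M3D interface.

```python
import os
import time
import urllib.request


def fetch_url(url):
    """Download the raw bytes at `url`."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()


def cached_image(url, cache_path, max_age_seconds=86400, fetch=fetch_url):
    """Return image bytes, refreshing the local copy when it is missing or stale.

    If the download fails (e.g. the remote server is down), fall back to the
    cached copy when one exists; otherwise re-raise the error.
    """
    fresh = (os.path.exists(cache_path)
             and time.time() - os.path.getmtime(cache_path) < max_age_seconds)
    if not fresh:
        try:
            data = fetch(url)
            with open(cache_path, "wb") as f:
                f.write(data)
            return data
        except OSError:  # urllib's URLError is a subclass of OSError
            if not os.path.exists(cache_path):
                raise
    with open(cache_path, "rb") as f:
        return f.read()
```

The `fetch` argument is there only so the download step can be swapped out, for example when testing without a network connection.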

Hey microarrays, what can you do for me?
So perhaps integration with other databases will increase the awareness of microarray data, but I really want people using the data. So the question is what would folks like to do with microarray data? Like a sequence database, can a microarray database allow scientists to better understand their own data? In general people don't want to just browse around NCBI, they want to BLAST their sequence to help them improve and publish their own work.

So I'd be really interested if anyone had ideas about applications that might allow more people to use the collective expression knowledge in M3D (or any other microarray database for that matter).

Here are some things that are already available in M3D:
  • what genes changed expression in condition X (z-test)
  • what genes changed relative expression between conditions X and Y (t-test or fold-change)
  • do genes in particular chromosomal regions tend to change expression as a group
Here are some things I'm considering adding:
  • array-blast: submit your raw microarray data (CEL file) and you get back a list of the most similar arrays in the database along with the condition information for those arrays
  • expression-based-function annotation: across the compendium, what is the effect of each experimental feature (e.g. glucose concentration is correlated with the expression of genes X,Y,Z)
  • array-changed-genes: submit your CEL file(s) and run a z-test against the entire compendium or a t-test against a particular subset of arrays in the compendium to determine the set of genes whose expression changed in your microarrays
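
To make the array-changed-genes idea a bit more concrete, here is a minimal Python sketch of a per-gene z-score of one submitted array against a compendium. The data layout (dictionaries keyed by gene name), the function name, and the cutoff of 3 are assumptions for illustration rather than a description of how M3D computes its z-test.

```python
from statistics import mean, stdev


def changed_genes(compendium, query, z_cutoff=3.0):
    """Flag genes whose expression in `query` is unusual relative to the compendium.

    compendium: dict mapping gene -> list of expression values across all arrays
    query:      dict mapping gene -> expression value in the submitted array
    Returns a dict of gene -> z-score for genes with |z| >= z_cutoff.
    """
    hits = {}
    for gene, value in query.items():
        background = compendium[gene]
        sd = stdev(background)
        if sd == 0:
            continue  # gene never varies in the compendium; z is undefined
        z = (value - mean(background)) / sd
        if abs(z) >= z_cutoff:
            hits[gene] = z
    return hits
```

The submitted array would need to be normalized together with the compendium first, otherwise chip-wide intensity differences would show up as spurious hits.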
How you can help!
If you have any ideas/suggestions for tools, applications, or anything else that might be done to a website like M3D to help folks use this collective expression knowledge, I'd like to hear your ideas (just leave a comment on this post).

Friday, October 19, 2007

Factorial and response surface optimization of a chromatin immunoprecipitation protocol

Research in progress: this post describes some of my ongoing research. The raw data and all experimental details are updated daily in J's Lab Notebook in the chapter entitled: Towards a faster, more reliable ChIP protocol.

The following text is largely taken from my PhD Oral Qualifier. I tried to blogify it a little, but it is still a little formal for a blog post. I also don't have many citations. Appropriate citations will be in the published version if I complete this project (if you have opinions about how we should deal with citations in very preliminary results please post a comment - I'd like to hear your opinion).

Short Version of What I'm trying to do: Chromatin Immunoprecipitation (ChIP) is often used to experimentally verify or discover transcription factor binding sites. In my experience, ChIP is lengthy, costly, and noisy. I'm trying to use statistical experimental design techniques to shorten, cheapen, and reduce the noise of the ChIP procedure. I'd really like it if ChIP were simple enough to become a standard technique that all experimentalists learn (i.e. like a miniprep and PCR), so we can really start to determine the transcriptional regulatory network structure of many organisms.

In general, I think there is a lot of unnecessary folklore in our experimental procedures, and the methods I'm applying here would be applicable to almost any experimental protocol optimization - if broadly applied, experimental biology would be a much less time-consuming endeavor.

Longer Version of What I'm trying to do:
We plan to optimize and shorten the chromatin immunoprecipitation (ChIP) protocol for in vivo validation of transcription factor targets. Verifying a transcription factor's genomic binding regions with ChIP requires: 1) fixing the transcription factor to the regions of the genome it binds via a crosslinking agent like formaldehyde, 2) cell lysis, 3) chromatin shearing (to enable isolation of only small regions of DNA bound by the transcription factor), and 4) multiple washes to remove background noise [1]. Once the ChIP procedure is complete, the DNA bound by the transcription factor should be enriched relative to unbound DNA. This enrichment can be assayed by qPCR, microarray, or DNA sequencing (less common), providing confirmation of the transcription factor binding sites (and therefore presumably, the gene targets of the transcription factors).

ChIP is used by numerous labs across many model organisms, yet the ChIP protocol is anything but standardized; ChIP protocols are as numerous as the number of investigators using the technique, suggesting that we are far from an optimal protocol. The ChIP protocol we previously used to validate network inference targets in E. coli [2] required almost a week of long experimental days to go from cells to verified transcription factor targets. Because of this length, the procedure is error-prone and only tractable to the most experienced bench scientists. We aim to use modern statistical methods of experimental design to optimize the ChIP protocol [3]. In particular, we will use fractional factorial designs to screen for unnecessary steps that can be removed to shorten the protocol. In addition, we will optimize the protocol steps that have the most significant influence on the enrichment of known transcription factor targets to improve the signal to noise ratio of the ChIP procedure.

Successful completion of this work will result in a markedly shorter and more effective ChIP protocol for verifying transcription factor targets. The new protocol will make verification of transcription factor binding sites approachable and practical to a wider range of bench scientists, promoting the experimental validation of future network inference predictions. In addition, the knowledge gained by an in-depth analysis of the ChIP technique will help optimize the protocol for different tasks such as highly parallel sequencing of ChIP DNA for transcription factor target discovery. Finally, the ChIP protocol optimization highlights the untapped experimenter efficiency potential these statistical methods could unleash on molecular biology if these experimental design techniques were broadly applied to experimental protocols.


Most experimental protocols can be represented mathematically as y = f(q) where y is the product resulting from the protocol and q are the parameters of the protocol. In a PCR experiment for example, y would represent the yield of DNA (e.g. in micrograms), while q represents the parameters of the reaction (e.g. concentrations of template, primers, magnesium chloride, etc...). The statistics of experimental design contains numerous methods to expedite the empirical optimization of y through the intelligent exploration of q (for two excellent books on experimental design see [3,4]).

Fractional factorial methods. For each experimental protocol, there are thousands of parameters, (q), whose values could be altered in an infinite number of combinations to potentially optimize the protocol output (y). For example, with PCR we could alter the melting temperature, the duration at the melting temperature, the amount of each primer, and the variant of Taq. On another level, changing the tubes, pipettes, the PCR machine, and the experimenter could also lead to changes in the output, y, of our PCR reaction. The first step in experimental design is to identify the parameters that contribute most to the output, so that they can be further optimized.

Fractional factorial methods provide an efficient way to screen these parameters (Note: parameters are termed factors in experimental design). Traditional factor screening methods take a one-at-a-time approach. For example to optimize a PCR protocol, you might try the reaction with and without DMSO, with various concentrations of magnesium chloride, or with different annealing temperatures. Reliable determination of the effect of each of these factors (qi) on the PCR output (y) requires several replicates for each tested factor level. Because of this replication, a large number of experiments is required to test a small number of factors with a one-at-a-time approach. Fractional factorial methods screen many factors at the same time and remove the need for time-consuming and expensive replication. An example fractional factorial design for optimizing a PCR protocol might look like:

annealing temp   primer concentration   hot start   extension time
56C              150 nM                 no          30 seconds
62C              150 nM                 no          90 seconds
56C              600 nM                 no          90 seconds
62C              600 nM                 no          30 seconds
56C              150 nM                 yes         90 seconds
62C              150 nM                 yes         30 seconds
56C              600 nM                 yes         30 seconds
62C              600 nM                 yes         90 seconds

For efficiency reasons, factors in factorial designs are typically only sampled at two states. Experimenter intuition plays a role in these designs via the selection of the initial set of factors to screen and in the selection of the values of the two states to test for each factor.
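The eight runs in the table above follow a standard half-fraction pattern: the first three factors take every low/high combination, and the extension time is set by the product of the other three coded levels. The sketch below rebuilds the table under that reading. The -1/+1 coding and the function names are assumptions made for illustration, not part of the original protocol.

```python
from itertools import product

# Factor names and (low, high) levels taken from the example table.
FACTORS = {
    "annealing temp": ("56C", "62C"),
    "primer concentration": ("150 nM", "600 nM"),
    "hot start": ("no", "yes"),
    "extension time": ("30 seconds", "90 seconds"),
}


def half_fraction_design():
    """Build a 2^(4-1) design: full factorial in A, B, C with D = A*B*C."""
    runs = []
    # product() varies its last element fastest, so unpack as (c, b, a)
    # to make the annealing temperature alternate fastest, as in the table.
    for c, b, a in product((-1, 1), repeat=3):
        d = a * b * c  # fourth factor is aliased with the ABC interaction
        runs.append((a, b, c, d))
    return runs


def decode(run):
    """Translate coded levels (-1 = low, +1 = high) into real settings."""
    return tuple(
        levels[(code + 1) // 2] for code, levels in zip(run, FACTORS.values())
    )


design = half_fraction_design()
for run in design:
    print(decode(run))
```

Eight runs cover four factors here, where a full factorial would need sixteen. The price is that the extension time effect cannot be separated from the three-way interaction of the other factors.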

The result of a fractional factorial can be represented in a table listing the effect size and p-value for each tested factor. For example, an analysis of our PCR fractional factorial data might yield:

factor                 effect (change in mg)   p-value
annealing temp         27                      0.001
primer concentration   -1                      0.6
hot start              2                       0.5
extension time         10                      0.05

From the results in the table above, the experimenter might decide to focus their efforts on further optimization of the annealing temperature to increase the PCR yield, rather than on the three other tested factors that had little effect on our PCR output.
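As a rough illustration of where the effect column comes from, the main effect of a two-level factor can be estimated as the mean output at its high level minus the mean output at its low level. The yields below are invented, chosen only so that the effects match the example table. The p-values are not computed, since they need an estimate of experimental error that this sketch does not model.

```python
FACTOR_NAMES = ("annealing temp", "primer concentration", "hot start", "extension time")

# Coded 2^(4-1) design (-1 = low, +1 = high), one tuple per run.
design = [
    (-1, -1, -1, -1), (1, -1, -1, 1), (-1, 1, -1, 1), (1, 1, -1, -1),
    (-1, -1, 1, 1), (1, -1, 1, -1), (-1, 1, 1, -1), (1, 1, 1, 1),
]

# Hypothetical PCR yields for the eight runs (illustration only).
yields = [31.0, 68.0, 40.0, 57.0, 43.0, 60.0, 32.0, 69.0]


def main_effects(design, yields):
    """Main effect per factor: mean(y at high level) - mean(y at low level)."""
    effects = []
    for j in range(len(design[0])):
        high = [y for run, y in zip(design, yields) if run[j] == 1]
        low = [y for run, y in zip(design, yields) if run[j] == -1]
        effects.append(sum(high) / len(high) - sum(low) / len(low))
    return effects


effects = dict(zip(FACTOR_NAMES, main_effects(design, yields)))
for name, effect in effects.items():
    print(f"{name}: {effect:+.1f}")
```

Every run contributes to every effect estimate, which is why a factorial screen needs far fewer experiments than testing one factor at a time.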

Response surface methods. In a localized region, our function of interest y = f(q) can be fit using first (linear) and second order models. Fitting these models allows us to obtain a prediction of the parameter landscape of our function. Response surface methods use these models to estimate the most efficient path to the peak of the model (i.e. the maximum value of y). It is at this peak where our experimental protocol is optimized (or at least locally optimal). Response surface methods are relatively time consuming, so fractional factorial methods are typically used to screen for factors to be later optimized by response surface methods.

Research Plan:

For our ChIP protocol, we want to optimize the enrichment, y, of DNA bound to our transcription factor of interest. At the same time, we want to shorten the protocol as much as possible, so that the laborious protocol becomes more manageable. For this study, we will calculate y as the change in enrichment of genes known to be bound by our transcription factor relative to the enrichment for randomly chosen genes (which are presumably not bound by our transcription factor). We calculate this relative enrichment from qPCR data. For each known target gene and random target gene, we first calculate their enrichment from an immunoprecipitation reaction with and without antibody as N = log((1+Ei)^(Ci+Ui)), where Ei is the median efficiency of the PCR primers for gene i, Ci is the qPCR Ct value for the DNA enriched using the correct antibody for the transcription factor regulating gene i, and Ui is the qPCR Ct value for the DNA enriched without using an antibody for the transcription factor regulating gene i. We then calculate the increase in enrichment of our known targets relative to the random targets as y = mean(Nk) - mean(Nr), where Nk is the ChIP enrichment for the known targets and Nr is the ChIP enrichment for our random targets. Our goal is to maximize the value of y in the most directed manner possible, using statistical methods coupled with intuition rather than intuition alone.

We will initially use fractional factorial methods to screen a large number of factors of potential importance to the ChIP protocol. For tested factors that are not found to be significant, we will select the factor state that requires the shortest time. For example if a 10 min incubation and a 2 hr incubation produce insignificant changes in y, we can save 1 hr 50 min by using a 10 min incubation. Factors found to be significant in the fractional factorial screen will be optimized using response surface methods.
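The selection rule in the paragraph above can be sketched as a small function. The 0.05 threshold, the data layout, and the p-values are assumptions for illustration. Only the 10 min versus 2 hr incubation example comes from the text.

```python
ALPHA = 0.05  # assumed significance threshold


def shorten_protocol(screen):
    """Split screened factors into fixed fast settings and factors to optimize.

    `screen` maps a factor name to its screening p-value and to the time
    (in minutes) that each of its two tested levels takes.
    """
    fixed, to_optimize = {}, []
    for name, result in screen.items():
        if result["p_value"] < ALPHA:
            # Significant effect on y: hand over to response surface methods.
            to_optimize.append(name)
        else:
            # No detectable effect on y: keep whichever level is quicker.
            minutes = result["minutes"]
            fixed[name] = min(minutes, key=minutes.get)
    return fixed, to_optimize


# Hypothetical screening results (illustration only).
screen = {
    "incubation": {"p_value": 0.6, "minutes": {"10 min": 10, "2 hr": 120}},
    "shearing time": {"p_value": 0.01, "minutes": {"short": 10, "long": 30}},
}
fixed, to_optimize = shorten_protocol(screen)
print(fixed)        # settings to lock at the faster level
print(to_optimize)  # factors that still need optimization
```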

Preliminary Results:

Note: these should be taken with caution, since I've not written the paper yet and haven't really sat down to analyze all of the results in detail yet.

We will use fractional factorial experimental designs to screen for unnecessary steps and factors that can be removed or shortened in the ChIP procedure.

Thus far, we have screened twenty-three factors in the ChIP protocol. By choosing the fastest and cheapest alternatives for factors that did not significantly alter the enrichment of known targets relative to random targets (y = mean(Nk) - mean(Nr)), we were able to reduce the cost of the protocol by three-quarters and to cut the total procedure time in half (from 5 work days to 2.5). The four most significant factors were formaldehyde concentration, shearing time, antibody concentration, and bead concentration.

Factors that have a significant influence on the enrichment of known transcription factor targets will be optimized using response surface methods.

We plan to optimize all four of the most significant factors in the ChIP protocol. As an initial step, we focused on the optimization of the antibody and bead concentrations. We assume that values of these parameters taken in a local area will result in smooth changes in y that can be modeled with first and second order models (Figure 1). We can then use these models to efficiently direct us towards the optimal values of our bead and antibody concentrations.

Figure 1. A hypothetical response surface describing the enrichment of our ChIP procedure as a function of the antibody and bead concentrations. By sequential experimentation and model refinement, response surface methods can locally define this surface and efficiently lead to local optima of the parameters to maximize ChIP enrichment.

From the fractional factorial screening experiments above, we have already obtained four initial points in our surface for bead and antibody concentration (i.e. LA+LB, LA+HB, HA+LB, HA+HB where L = low, H = high, A = antibody concentration, and B = bead concentration). Unfortunately, we do not yet know the surface, so we can't know where our points lie on the surface. However, we can fit a plane using the data for these four combinations of antibody and bead concentration (e.g. P = a0 + a1x1 + a2x2, where x1 and x2 are the concentrations of antibody and bead respectively and ai are the regression coefficients). If we assume that the local area around our points is a linear plane, we can use the ai coefficients to estimate the direction of steepest ascent. In our hypothetical example in Figure 1, our four combinations might land us in the cyan region. A plane fit through these points can then be traversed in the direction of steepest ascent to efficiently direct our future parameter value selections towards the red peak.
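A minimal sketch of the plane fit and the steepest ascent direction follows. It assumes the two concentrations are expressed in coded units (-1 = low, +1 = high) instead of raw concentrations, which reduces the least-squares fit for a 2x2 factorial to simple averages. The enrichment values are invented for illustration.

```python
import math


def fit_plane_2x2(y_ll, y_hl, y_lh, y_hh):
    """Least-squares plane P = a0 + a1*x1 + a2*x2 for a 2x2 factorial.

    x1 (antibody) and x2 (bead) are coded as -1 (low) and +1 (high).
    Each argument name gives the antibody level first, then the bead level.
    """
    a0 = (y_ll + y_hl + y_lh + y_hh) / 4
    a1 = ((y_hl + y_hh) - (y_ll + y_lh)) / 4  # half the antibody main effect
    a2 = ((y_lh + y_hh) - (y_ll + y_hl)) / 4  # half the bead main effect
    return a0, a1, a2


def steepest_ascent_direction(a1, a2):
    """Unit vector along the gradient (a1, a2) of the fitted plane."""
    norm = math.hypot(a1, a2)
    if norm == 0:
        raise ValueError("flat plane: no direction of ascent")
    return a1 / norm, a2 / norm


# Hypothetical enrichment (y) at LA+LB, HA+LB, LA+HB, HA+HB.
a0, a1, a2 = fit_plane_2x2(1.0, 2.0, 4.0, 5.0)
dx1, dx2 = steepest_ascent_direction(a1, a2)

# Candidate settings along the path, starting from the design center (0, 0).
path = [(step * dx1, step * dx2) for step in (1, 2, 3)]
print(a0, a1, a2)
print(path)
```

With these made-up numbers the bead coefficient is three times the antibody coefficient, so each step along the path raises the coded bead level three times as fast as the coded antibody level.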

We fit such a plane to our bead and antibody concentration factorial data, and we chose new concentrations of these two factors along the direction of steepest ascent. These new concentrations led to a marked increase in the enrichment of our ChIP procedure (Figure 2a). It appeared that we had not yet reached saturation, so we tried an additional set of points further along the path of steepest ascent (Figure 2b). Although these new datapoints indicated that we might be close to the saturation point for these bead and antibody concentrations, the method of steepest ascent had pushed us into an expensive optimum, with almost three times the commonly used amount of beads for the ChIP procedure. We hypothesized that the amount of crosslinked DNA was saturating our bead and antibody at low concentration, necessitating the use of large amounts of bead and antibody.

Figure 2: (A) Antibody and bead concentrations were optimized using the direction of steepest ascent determined by a linear model. (B) Further concentrations were tested to determine if we had reached a saturation point.

To test this saturation hypothesis, we performed a factorial design using bead concentration, antibody concentration, and sheared chromatin concentration as factors. By using one-fourth of the typical DNA concentration, we were able to improve our enrichment using lower amounts of bead and antibody (Figure 3). With this lower concentration of DNA, we should be able to estimate more cost-effective optima for the bead and antibody concentrations.

Figure 3: A factorial design was run using bead concentration, antibody concentration, and crosslinked chromatin concentration. For visualization purposes, the values for low chromatin concentration are plotted to the left of those with high chromatin (shifting them slightly along the x-axis), even though both experiments used the same concentration of bead. By using less crosslinked chromatin, we obtain larger enrichment using the standard concentrations of bead and antibody. The results suggest that at the standard values for these concentrations, the beads and antibody are saturated with chromatin.


[1] Lee TI, Johnstone SE, and Young RA. Chromatin immunoprecipitation and microarray-based analysis of protein location. Nat Protoc, 1(2):729-748, 2006.
[2] Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, and Gardner TS. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol, 5(1):e8, 2007.
[3] Box GEP, Hunter JS, and Hunter WG. Statistics for Experimenters. Wiley-Interscience, 2nd edition, 2005.
[4] Box G and Draper N. Empirical Model-Building and Response Surfaces. John Wiley and Sons, 1987.

Sunday, September 23, 2007

Center for Contributory Science

This article describes the Center for Contributory Science (CFCS), an imaginary journal, which I envision as the next generation scientific literature. See my previous post for the motivation for a next generation scientific literature.

The CFCS submission process

Submitting a paper

Scientists are encouraged to submit rigorous scientific research for publication in the journal. Choose the subject category appropriate for your paper. Based on your chosen categories, an editor with appropriate expertise will be randomly assigned to your paper.

Authors must first review / contribute

All authors must be registered CFCS users. Before manuscript submission is finalized every author on your manuscript (including the corresponding author) must do one of the following:
  • If there are any papers in Limbo in a category the author feels qualified in, the author must review a paper. Authors on the same manuscript submission can't review the same paper. (Papers are presented oldest to newest to prevent any one paper from remaining in Limbo for too long. See the section below, "the status of a manuscript", for details on Limbo)
  • If an author claims they are not qualified for any of the papers in their qualified category, the papers are placed in the author's public "not qualified for" list along with an optional comment by the author (this public acknowledgment prevents people from always claiming they aren't qualified to review papers).
  • If the author does not feel qualified to review any of the available manuscripts in Limbo, the author must score any 3 papers in Purgatory or Heaven with a thumbs up or down and a corresponding comment to each score (see below for definitions of Purgatory and Heaven).
The above features of CFCS aim to ensure
  1. there are at least as many reviewers as there are papers (and most likely many more)
  2. authors along for the ride at least have to contribute to the review process
  3. professors can't get out of reviews (and get credit for reviews) by sending work to their students
    1. students get credit for their review work, getting their name out early
  4. if you want to submit 100 papers in a year, you must be willing to review 100 as well
For details about how the review process works at CFCS, please see the section "The CFCS Reviewer Process" below.

Authors decide a direction for their manuscript

Upon completion of the review/contribution requirement by all of the manuscript's authors, the manuscript submission will be finalized. Authors may then send their submitted manuscript on Purgatory track or on Heaven track (see below for details on Purgatory and Heaven).

Editors decide paper status

Editors decide if a Purgatory track paper goes to Purgatory or if a Heaven track paper goes to Limbo. This editorial step is simply to weed out complete rubbish before it goes to review. Almost every manuscript should pass this minor screening.

The CFCS reviewer process

Reviewing a paper for CFCS works in a similar manner to most contemporary journals. However, reviews are not anonymous and are publicly visible with the manuscript upon submission. Reviewers do not have to be authors. Any user can do a review to get a credit, so they can later submit a manuscript without having to review. Ideally in the CFCS system, few if any reviewers must be asked to review a manuscript by the editor.

In general, reviewers choose the manuscripts they want to review from the set of all manuscripts in Limbo they feel qualified to review. Each manuscript in Limbo requires four separate reviews. Upon receiving the authors' revised manuscript and response to the reviewer comments, each reviewer places a vote to decide if a manuscript belongs in Heaven or Purgatory. The reviewed manuscript goes to Heaven if the manuscript gets at least three out of four reviewers suggesting the manuscript for Heaven. In the case of a tie, the editor holds the tie-breaking vote, which he casts upon reading all four reviews.
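The voting rule above can be written out as a short function. This is only a sketch of the rule as described here. The string labels and the function name are made up for illustration.

```python
def decide_destination(reviewer_votes, editor_vote=None):
    """Return "Heaven" or "Purgatory" for a manuscript leaving Limbo.

    reviewer_votes: four strings, each "Heaven" or "Purgatory".
    editor_vote: only consulted when the reviewers split two against two.
    """
    if len(reviewer_votes) != 4:
        raise ValueError("each manuscript in Limbo requires four reviews")
    heaven = sum(vote == "Heaven" for vote in reviewer_votes)
    if heaven >= 3:
        return "Heaven"
    if heaven == 2:
        if editor_vote is None:
            raise ValueError("a 2-2 tie needs the editor's tie-breaking vote")
        return editor_vote
    return "Purgatory"


print(decide_destination(["Heaven", "Heaven", "Heaven", "Purgatory"]))
print(decide_destination(["Heaven", "Heaven", "Purgatory", "Purgatory"], "Purgatory"))
```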

All reviews and the authors' responses to the reviews are publicly available alongside the final manuscript. Both the original and the revised manuscript drafts are available as well.

The CFCS editor process

Editors are either
  1. reviewers whose quality reviews have gained them a large reviewer impact score and who agree to the job
  2. invited editors (if there aren't enough high ranked reviewers)

Editors decide if a Purgatory track paper goes to Purgatory or is sent to Earth. Editors decide if a Heaven track paper goes to Limbo or is sent to Earth. The main job of the editor is to eliminate rubbish (pseudoscience and just bad science). Editors must also decide if the subject categories selected by the authors are appropriate. Most importantly, editors hold the tie-breaking vote when there are two Heaven votes and two Purgatory votes from the four reviewers. In cases with no tie, the reviewers alone decide the final destination of the manuscript.

The CFCS user process

Any registered user can score and comment on any papers, comments, and reviews besides their own. A reader cannot score a paper, comment, or review without leaving a comment to explain their score. Scores and comments are publicly available with the manuscript and on the users' CFCS page.

All users have a reviewer impact score, a comment impact score, and an author impact score.

The status of a manuscript

The status of a paper follows two of the key ideas of CFCS: 1) information is always public; and 2) information is never deleted. Everything that happens to a paper on its route to Heaven is recorded and posted for all to see. All reviewer comments, all responses to reviewer comments, and both versions of the manuscript are available for download.

A publication search in CFCS can be limited to certain types of papers (to allow for example only peer-reviewed work) or it can draw from all of the CFCS library.


Heaven is the pinnacle of CFCS. Manuscripts in Heaven have been peer-reviewed by four reviewers, the authors have responded to the reviewer comments to improve their manuscript, and the manuscript received a majority Heaven vote from the reviewers. Voting is carried out by the four reviewers plus the editor. The votes do not become public (to the reviewers or the editors) until all the votes are in (to prevent biased voting). Papers in Heaven are charged a modest processing fee to allow them to be uniformly typeset in the style of the journal. Typeset papers are submitted to PubMed. The four reviewers set the initial paper impact score with their votes. These initial seed scores count double a normal reader-submitted score. Once entering Heaven, the manuscript can be scored and commented on by all readers of CFCS to adjust each manuscript's impact score.


Manuscripts for peer-review work are initially sent to Limbo. A manuscript remains in Limbo until it has received the necessary number of reviews, responded to those reviews, and been voted into Heaven. Failure to respond to the reviewers (within a fixed time) and failure to receive a majority vote result in the manuscript being sent to Purgatory.


Purgatory track manuscripts only need to pass the editor's inspection (otherwise they go to Earth). Purgatory is an option for works where the authors don't want to go through peer-review. Examples of good pieces for Purgatory include reviews and reports of failed experiments. Upon entering Purgatory, the manuscript can be scored by all CFCS readers to determine its paper impact score.


Manuscripts not passing the editor's initial quality screen go to Earth. Authors get one petition to get out of Earth and back into Limbo or Purgatory.


Manuscripts discovered to be fraudulent go to Hell (perhaps papers where the equation to word ratio is greater than one belong here too?).

  • score: a vote by a CFCS reader; a score can be positive (thumbs-up) or negative (thumbs-down); reviewer comments, reader comments, and manuscripts can all be scored by all CFCS readers; all scores must be accompanied by a comment where the reader explains their reasoning for the score
  • review: similar to the current scientific literature, a review in CFCS aims to strengthen the quality, rigor, and focus of the submitted manuscript; reviews are publicly viewable with the manuscript, as is the authors' response to the review
  • comment: a comment is a CFCS reader's written opinion of a manuscript, review, or another person's comment
  • reviewer impact score: for each individual, this metric is determined by the number of positive scores minus the number of negative scores from CFCS users for all of the reviews written by the individual
  • comment impact score: for each individual, this metric is determined by the number of positive scores minus the number of negative scores from CFCS users for all of the comments written by the individual
  • paper impact score: for each manuscript, this metric is determined by the number of positive scores minus the number of negative scores from CFCS users for that particular manuscript
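The impact scores defined above all reduce to the same arithmetic, sketched here. Votes are represented as +1 (thumbs-up) and -1 (thumbs-down). Mapping a reviewer's Heaven vote to +1 and a Purgatory vote to -1 for the double-weighted seed score is an assumption, as are the vote counts in the example.

```python
def impact_score(reader_votes, seed_votes=()):
    """Positive minus negative scores, with seed votes counted double.

    reader_votes: iterable of +1 / -1 scores from CFCS readers.
    seed_votes: the reviewers' initial +1 / -1 votes on a Heaven paper;
        leave empty for reviewer and comment impact scores.
    """
    return sum(reader_votes) + 2 * sum(seed_votes)


# Hypothetical paper: 126 thumbs-up and 26 thumbs-down from readers.
reader_votes = [1] * 126 + [-1] * 26
print(impact_score(reader_votes))                 # readers only
print(impact_score(reader_votes, [1, 1, 1, -1]))  # plus reviewer seed votes
```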

towards a richer scientific literature

The scientific review and publication process has received increasing attention over the last ten years; internet technologies have changed the way we search and read science; open access has changed our ability to share science; and highly publicized fraud cases have reminded us that our system has inherent flaws that may prove difficult to fix. Articles in this domain discuss the positive impact of open access, the growing problem of gift-authorship, and the burden on the review system caused by scientists who increasingly opt for the top-down system of paper submission (i.e. submit to the top journals first and submit to increasingly lower impact factor journals as you get rejected and reedit your manuscript). I'm typically underwhelmed by the solutions proposed by such articles as they tend to send the message that grass roots revolution, "We're not gonna take it! No, we ain't gonna take it!", is needed to fix the system from the bottom-up. We should tell our deans, department chairs, PIs, etc... that we don't want to be ranked by our h-index, number of citations, and journal impact factors.

But the reality is that we're all doing the best we can with the system that we have. If the system does not change, I guarantee that if I have my own lab someday, I'll submit my papers to the best journal I think they have a shot-in-hell of getting into. The truth is that when I submit to a good journal, I think that my paper belongs there. It's just the editors that incorrectly label my work as not novel enough. It's just the reviewers, defending their territory, that incorrectly label my work as lacking rigor because I don't know that the correct term for mismatches on the end of an RNA:RNA duplex is dangling-ends not shaggy-ends (I still like my term better, mystery reviewer man).

In my opinion, we have three problems with our current system:

1) editors are not qualified to judge what will be a high impact paper
I don't care if the editor is an active scientist or a full time editor. I don't care if he has two Nobel prizes. I don't care if he is related to Nostradamus. Besides the obvious, have-to-be-cited papers like complete genome sequences, it's impossible to know what research done today will still be important 5 years from now. So why do we make this the first hurdle to publication?

2) reviewers are helpful but are too focused on self-preservation to do the best job
Please don't make me cite your paper, because in some obscure way you thought of my idea first. Please don't steal my result, because I can't easily identify you. Please don't nail me to a cross and treat me like an idiot, because I'm wearing a blindfold. The temptation is too strong. I noticed this in my own reviews, so I read something to tame my ego before starting and submitting every review I write.

3) we have no good way to quickly judge papers, journals, and scientists
Impact factors and h-indexes were designed to help, not hinder, science. Particularly in the USA, we strive for a meritocracy. Thus, we need some metric for sorting journals and scientists. I think most people would agree that the GRE, LSAT, and MCAT are poor predictors of a person's graduate school potential, but what else can a medical school with 3000 applications for 30 spots do? Perhaps there's no metric we can invent that is better than the opinion of human experts, but expert panels and opinions also suck a lot of time that could be used to do science.

To me, most of the other issues with our current publishing process derive from these three problems. Professors schmooze with editors at conferences, so that the editors will hopefully predict the future more favorably on their next submission. Reviewers reject valuable papers, because impact factor leery editors stress their journal's high rejection rate and the importance of novelty. Professors provide and receive gift-authorship, because they need a high h-index, lots of citations, and visibility in big journals to keep their jobs, get higher pay, and retain the respect of their peers.

We are only human
The writers of the US constitution and the great economists of the world accept our humanness and try to develop government and market systems that thrive because of and despite our human attributes. Checks-and-balances keep the government's power in check, while elections provide change as a society's goals evolve. Free market economic ideas allow efficient prices and economic growth, while federal monetary policies keep things like inflation from getting out of hand.

How can we integrate checks-and-balances into scientific review?
Since transitions are often the trickiest part, let's assume we're starting over from scratch with the scientific publication process. I think we can adapt ideas from, Slashdot, and Digg to create a better system. People have already mentioned or even tried some of these things, but so far nothing has struck me as likely to be successful. Journals are dabbling with these ideas, trying out one or two, but it is the combination of all of these in one journal that I think has a chance of adoption and of really modernizing the publication process. For example, few people are going to use the rating system at PLoS One, because 1) it involves unnecessary work; and 2) it requires written public criticism of another scientist's work. Reason 2 alone will keep most people away, as flaky scientific egos are easily hurt, and science is a particularly bad field in which to accidentally burn your bridges. So a workable system would somehow need to compel people to comment and create an atmosphere where written criticism is the norm (and thus less dangerous; more like verbal criticism at a talk; note that good critical scientific debates do occur once in a while in the blogosphere - here's a good example on Steven Salzberg's blog).

In my opinion,

1) the new system must be comment/ratings rich
Readers can rate reviewers and papers. Similar to, readers should be able to give an article a thumbs up or a thumbs down. The final score of a paper is just the sum of the up and down thumbs (e.g. 126 people like your article and 26 don't, your article has a score of 100). With these scores you can find the papers receiving the most attention (sum of up and down thumbs), most positive attention, and most negative attention.

2) reviewers and commenters are reviewed
If someone on writes an idiotic review, there's a nice ReviewNotHelpful button you can click to make sure fewer people waste time reading the review in the future. Slashdot has a similar, though more advanced, commenter scoring system. We need a similar button to rate the ratings in the scientific publication process.

3) the best set of reviewers in each subject category are invited to be editors
Rather than having a good-ole-boy pass the editorial torch to his former student, we can allow the hard working thoughtful reviewers to be our judges.

4) the new system must be completely open
No one is anonymous and all information is public. As reviewers accept a paper for review, their name should become publicly associated with the article. When they submit their review, the review should become available for everyone to see. The reviewer's score (determined by other people rating the reviewer) and all of their previous reviews and comments should also be available.

5) nothing is destroyed
There should be no such thing as a rejected paper that no one sees. Trash science should be labeled as such by the community review and commenting system but not deleted. One man's trash might be another man's treasure.

6) review or comment is a prerequisite to submission
Before a paper goes to the editor, all authors on the paper must review another paper in the journal. A paper with 50 authors contributes 50 reviews before going to review. A professor that slaps his name on 100 publications a year must be willing to write 100 reviews a year. If the professor has their student write the review for them, they will at least know they are putting their own reputation on the line, because the review is associated with the professor's name, and the review is public. If there are papers to be reviewed, they must choose a paper if the paper is in their subject area. Otherwise, they must comment on a certain number of reviewers or papers (e.g. at least three). By forcing comments, you alleviate the laziness factor, which I think will cause other rating systems like PLoS One to fail. We barely have enough time as it is to read a paper, much less leave a comment on it. But if doing so is a prerequisite to publication, we'll do so. And if we know that our comments will be publicly available and associated with our name, we'll make sure not to write rubbish.

the ranking problems we don't need to worry about
Two problems with internet ratings systems are that they thrive on sensationalism and that they collect rubbish comments (e.g. youtube comments are often just idiots making fun of the people in the movie). Since a good reputation is vitally important to a scientist, we needn't worry too much about rubbish comments. I also think that scientists already have averse reactions to flashy papers driven more towards publicity than science, so perhaps a commenting system will actually reduce sensationalism.

I've written up the details of a hypothetical journal that incorporates these scientific publishing ideas in a separate blog article.

What I read before I write a review

Writing an anonymous scientific review can make even the tamest human take a jab or two at their blind-folded peer. Because of this, I'm a fan of moving towards open peer review where we can treat each other like humans.

I noticed this aggressive tendency in myself when I first started being asked to write reviews four years ago. To make sure I don't step beyond where I'd like to be as a reviewer (i.e. critical and honest but not aggressive), I read the following text before starting and before submitting every review.

When reviewing papers

  • don't be evil
  • start with a compliment
    • say the positive general comments before you say the negative general comments. If you don't have positive comments, read it again. The editor probably wouldn't give you total crap.
  • don't nitpick too much just to feel powerful
  • try to say things you'd like to be told if it were your paper (i.e. comments to strengthen the manuscript not belittle the authors)
  • number the comments so the authors can easily refer to them if they resubmit
  • don't be evil

Saturday, September 1, 2007

Effect of sequence level mutations on transcription, translation, and noise

One of the main biological questions explored when DNA sequencing first became a practical laboratory technique was how the nucleotides in a gene's promoter and ribosomal binding site define the gene's interactions with transcription factors and the ribosome translation apparatus. At least in prokaryotes, these interactions largely determine the levels of transcript and protein available for each gene, and thus provide the crucial information of how a genome regulates itself.

This early work resulted in many of the promoter analysis tools that are still widely used today. In particular, the promoters were often analyzed in terms of information content, and this information content was visualized using sequence logos. These sequence logos are still the most popular way to display DNA binding sites. I'm not sure why this field tapered off a little. My guess is that the people in this field had maxed out the information that was affordably obtainable with the available technologies.
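For readers who have not seen how information content is computed, here is a minimal sketch for a set of aligned DNA sites. Each position scores 2 bits minus its Shannon entropy, which is the quantity that sets the stack height in a sequence logo. The small-sample correction usually applied in practice is left out, and the example sites are made up.

```python
import math
from collections import Counter


def information_content(sites):
    """Per-position information (in bits) for equal-length aligned DNA sites."""
    bits = []
    for pos in range(len(sites[0])):
        counts = Counter(site[pos] for site in sites)
        total = sum(counts.values())
        # Shannon entropy of the base frequencies at this position
        entropy = -sum(
            (n / total) * math.log2(n / total) for n in counts.values()
        )
        bits.append(2.0 - entropy)  # 2 bits is the maximum for 4 bases
    return bits


# Made-up aligned sites: position 0 is fully conserved, position 1 is random.
sites = ["AA", "AC", "AG", "AT"]
print(information_content(sites))
```

A fully conserved position carries the maximum of 2 bits, while a position where all four bases are equally likely carries none.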

But as most biotech-loving biologists know, the times are a-changin' in biotech, and we have new sequencing technologies that enable drastically larger sequencing studies. Importantly, we are faced with several quite different sequencing methods (unlike the previous era in sequencing biotech, which was almost exclusively driven by ABI's advances in Sanger sequencing). I think with these new technologies coming online, it's time we dusted off our promoters and started figuring out how they work.

What we still must learn about promoters

For several promoters (ideally for all promoters), we need to exhaustively determine how base-pair changes in the promoter lead to changes in the amount of transcription and translation. We must determine this information across time, so we can also determine the rates of transcription and translation. Finally, we must determine these values at the level of single-cells, so that we can also obtain information about the noise inherent in each promoter sequence.

In the early 90s, these types of analyses were at least partially undertaken with populations of cells and 100-200 promoter mutations. Now we must study several promoters, with millions of different mutations for each promoter, and with multiple single-cell replicates of each mutation so we can estimate noise.

What is this new level of promoter knowledge good for?

We need to understand to what extent it is possible to build a computational model to predict translation and transcription from sequence alone. Such a model could act like a molecular biologist's version of Hardy-Weinberg equilibrium. That is, if a promoter does not fit the model, it would suggest that there is some additional regulation (e.g. small RNA) that is not explained by the binding of transcription factors and the ribosome. In addition, the ability to screen vast numbers of promoter variants could be of huge value to forward biological engineering. Synthetic biologists often tune their human-created networks using directed evolution. While directed evolution is a very powerful and massively parallel way to optimize a genetic system, the human who created the system in the first place has limited control over the final result. For example, it may be that a network evolved to generate ethanol from cellulose is extremely noisy and could be made more efficient by fine-tuning the promoters in a more intelligent design.

An idea for a massively parallel method to determine the effect of sequence level mutations on transcription, translation, and noise

I think the tools are already available to determine the effect of sequence level mutations on transcription, translation, and noise in single-cells. One approach I've thought about is shown in the figure on the right (click to see an easier to read/print pdf version of the figure). The idea takes a cell-in-emulsion approach (see Directed evolution of polymerase function by compartmentalized self-replication) and combines it with the polony sequencing method pioneered in the Church lab (see Accurate multiplex polony sequencing of an evolved bacterial genome).

The first step (top right) is to synthesize a known promoter with a large number of random nucleotides. This is very similar to the method used by Stormo's lab many years ago (see Quantitative analysis of ribosome binding sites in E.coli), except that with modern sequencing methods we can drastically increase the number of random sites that we explore. A GFP reporter is placed directly after the promoter so that we can measure the amount of protein generated. Since the vast majority of mutations will probably result in little to no expression, it may be useful to also add a bactericidal antibiotic resistance gene after the GFP to provide an easy way to get rid of unproductive promoters (for some studies, you would probably not want to remove these low-output promoters).
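To get a feel for how fast the library grows with the number of random positions, here is a small sketch. The template is purely hypothetical (the textbook -35 and -10 consensus hexamers with a randomized 17 bp spacer), and the helper names are my own; nothing here is a tested design.

```python
import random

def library_size(template):
    """Number of distinct sequences encoded by a template with N wildcards."""
    return 4 ** template.count("N")

def random_variant(template, rng):
    """Fill each N with a random base; all other characters are kept as-is."""
    return "".join(rng.choice("ACGT") if ch == "N" else ch for ch in template)

# Hypothetical template: consensus hexamers flanking 17 randomized positions.
template = "TTGACA" + "N" * 17 + "TATAAT"
rng = random.Random(0)  # fixed seed so the example is repeatable

print("possible variants:", library_size(template))
for _ in range(3):
    print(random_variant(template, rng))
```

Each extra N multiplies the library by four, so even a short randomized stretch quickly outruns what a single experiment can sample completely.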

The second step (top left) is to take a dynal bead and attach a primer to amplify the promoter, a second primer to amplify the GFP sequence, and an anti-GFP antibody.

Next, the bead and the bacteria are placed together into an emulsion. In the emulsion solution, we also need to include reverse primers for the promoter and GFP sequence, reverse transcriptase, and PCR reagents. You would need to mess around a little with the dilutions and concentrations of beads and cells to maximize the case where you have only one bead and one cell in each emulsion.
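As a rough guide for that dilution tinkering, the sketch below assumes beads and cells land in droplets independently and follow Poisson statistics. That is my assumption, not something measured, and real emulsions may well deviate from it. The function names are invented for the example.

```python
import math

def poisson_pmf(k, mean):
    """Probability of exactly k items in a droplet when the average is `mean`."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def fraction_one_bead_one_cell(mean_beads, mean_cells):
    """Fraction of droplets holding exactly one bead and exactly one cell.

    Assumes beads and cells are loaded independently (Poisson loading).
    """
    return poisson_pmf(1, mean_beads) * poisson_pmf(1, mean_cells)

# Scan a few hypothetical average loadings (items per droplet).
for mean in (0.1, 0.5, 1.0, 2.0):
    frac = fraction_one_bead_one_cell(mean, mean)
    print(f"mean beads = mean cells = {mean}: {frac:.3f} of droplets usable")
```

Under this assumption the usable fraction peaks when both averages are one per droplet, but droplets with two or more cells also become more common at that loading, so a more dilute cell suspension may be the safer trade.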

Now we have the cells isolated into separate chambers, and we have one bead with each bacterium. This bead will provide the source of our future information readout. Also remember that by synthesizing our promoters with N's, we have actually generated a huge library of different promoters, so each emulsion will have a different variant of our promoter. We then lyse the cells. I'm not sure of the best way to lyse the cells, but in the figure I just assumed we used extreme heat. Because of the next step, it may be wise to use a gentler method to lyse them, such as placing a protein that will cause cell lysis (e.g. lysozyme or ccdB) on a heat-inducible promoter, so you'd only need to heat the cells up to 42C rather than 95C. Once the cells are lysed, the GFP expressed from the synthetic promoter should diffuse around the emulsion until it meets and binds the anti-GFP antibody attached to our dynal bead.

We've got the protein on our bead; now we need to attach the DNA. Since one of the things we want to measure is transcript concentration, which is mRNA, we need to do a reverse transcription reaction. Reverse transcriptase is not very heat stable, which is why I stressed above that we might want to lyse our cells more gently than by heating them to 95C. However, Superscript III from Invitrogen is pretty heat stable, so that might be worth a shot too. Since we include a reverse primer to our GFP sequence in the emulsion, we should have a fairly specific reverse transcription.

Finally, we need to attach the DNA to our bead. We can do so by running a multiplex PCR reaction for a few cycles. Since the forward primers are on the dynal bead, the PCR reaction results in the DNA being stuck to the bead.

And now for the fun part, let's measure protein concentration, transcript concentration, and determine the promoter sequence for our single cells (bottom row of the figure). To do this we lay our beads out on a microscope slide or some type of microfluidic device. We can measure the protein concentration directly as GFP fluorescence. Next we measure the transcript concentration as the amount of GFP cDNA attached to the bead. For increased accuracy, we might want to measure this concentration using cycled PCR reactions (like qPCR on a microscope). The concentration of the promoter sequence and GFP sequence attached to the bead can be measured in each round using two different molecular beacons. The concentration of promoter sequence can be used to normalize the concentration of the transcript (i.e. to try and remove artifacts due to the variance in emulsion sizes, emulsion PCR efficiency, and to remove the background transcript value that is due to amplification from the DNA rather than cDNA sequence). This bead-based DNA quantitation can borrow some ideas from the BEAMing method (see BEAMing: single-molecule PCR on microparticles in water-in-oil emulsions).

Now that we've measured protein concentration and transcript concentration, it is time to determine the promoter sequence responsible for these concentrations. In some ways, this is the most difficult step. But in practice, it may be the easiest step, as the polony sequencing method does exactly that.

With the size of the beads and the massively parallel nature of this protocol, it should be possible to have the same sequence appear multiple times, allowing the estimation of noise for the tested promoters.
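Once the per-bead measurements exist, the bookkeeping could look something like the sketch below. It assumes each bead yields a promoter sequence, a GFP signal, a transcript signal, and a promoter DNA signal. It divides transcript by promoter DNA as a simple stand-in for the normalization described above, and it uses the coefficient of variation as the noise measure, which is one common choice and not the only one. The record layout, function name, and numbers are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

def summarize_by_promoter(beads):
    """Group bead measurements by promoter sequence and estimate noise.

    beads: iterable of (promoter_seq, gfp_signal, transcript_signal,
    promoter_dna_signal) tuples. This layout is hypothetical.
    """
    grouped = defaultdict(list)
    for seq, protein, transcript, promoter_dna in beads:
        if promoter_dna <= 0:
            continue  # cannot normalize a bead with no promoter signal
        grouped[seq].append((protein, transcript / promoter_dna))

    summary = {}
    for seq, values in grouped.items():
        if len(values) < 2:
            continue  # a noise estimate needs at least two replicates
        proteins = [p for p, _ in values]
        avg = mean(proteins)
        summary[seq] = {
            "n": len(values),
            "mean_protein": avg,
            # coefficient of variation as the noise measure (an assumption)
            "protein_cv": stdev(proteins) / avg if avg > 0 else float("nan"),
            "mean_norm_transcript": mean(t for _, t in values),
        }
    return summary

# Made-up readings for illustration only.
beads = [
    ("TTGACA", 100.0, 40.0, 10.0),
    ("TTGACA", 120.0, 55.0, 11.0),
    ("TTGACA", 80.0, 27.0, 9.0),
    ("TTGTCA", 5.0, 2.0, 10.0),  # single replicate, so it is skipped
]
for seq, stats in summarize_by_promoter(beads).items():
    print(seq, stats)
```

Promoters seen on only one bead are dropped here because a single reading says nothing about noise; how many replicates are actually needed is an open question.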

Again, I haven't tried any of this stuff, and I'm not sure it'll work. I just wanted to throw the idea out there in case someone else is thinking about this problem too.

Open questions with this idea
  1. how quantitative is emulsion PCR and how is noise influenced by the size of the emulsion
    1. can we increase quantitative accuracy of our mRNA concentration by running very few emulsion cycles and then running a microscope based qPCR on our bead
  2. how strongly does the GFP bind to our bead (e.g. when we break the emulsions, can GFP move from one bead to the next? we can test this by using mCherry in one sample, GFP in another, and then breaking the emulsions together to see if any beads have both proteins attached)
  3. is crowding on the bead going to cause problems? that bead has a lot on it. does this bias our results in unpredictable ways?
  4. there are a lot of steps. long protocols can lead to excess experimenter-derived error and slow the technique's adoption