Wednesday, October 24, 2007

How this blog works

I've been blogging for about 6 months now, and it has definitely been more interesting and productive than I initially thought it would be. One thing I've found I don't like about blogging is that most blogs just throw stuff out there. I understand this is part of the blogginess of blogging, but it makes it really hard to get oriented. When you go to someone's blog for the first time, it often feels like random stuff is just being tossed onto the web. Only after following the blog for a while will you really figure out if the author has an overarching point.

So this post is just to help new folks orient themselves to my blog.

J's blog is primarily focused on developing and abiding by an Open Science system. Since Open Science is a relatively new idea, things are changing as I go along, and no one really has any standards yet, because Open Science people (and in particular Open Notebook Science People) are still trying to figure out best practices for science in the open.

That said, here is my current schema:
  1. Open Ideas: I try to blog all of the experimental ideas that I'm considering pursuing. The hope is that I can find other folks interested in the same things as myself, and if I don't pursue the ideas, perhaps they'll be of use to someone else. I maintain an index of these Open Ideas on this blog.
  2. Open Projects: The Open Ideas I decide to pursue become a chapter in J's Lab Notebook. Following the basic idea of Open Notebook Science, all of the raw data for the projects I pursue is publicly available in real time (updated nightly). The hope is that folks who might find my work useful don't have to wait two years until I publish it. I know following someone's experiments in raw form can be difficult, but similar to reading someone's computer code, I think we need some rules or general guidelines to make such tasks easier. I do not expect folks to read and follow the notebook as I go along. Rather, I expect folks to stumble upon the notebook through internet searches and such, whereupon they can email me if they're interested in more information or clarification of anything. I maintain an index of these Open Projects on this blog.
  3. Open Publishing: After I finish projects, I typically publish them in scientific journals. In the future, I hope to publish the failed or smaller experiments to this blog or to an archive. Because the current set of open access journals doesn't yet cover the entire range of experimental and computational biology subjects, I do not publish exclusively in Open Access journals (though I think Jonathan Eisen has some interesting ideas on this topic, I think it's a little early to limit yourself to only open access journals unless you're already well known [which Eisen is] ). I maintain an index of these completed projects on this blog.
So if you're new to my blog and you'd like to learn more, you might start by skimming the short descriptions available in the indexes: J's Open Ideas index, J's Open Projects index, J's Open Publishing index.

J's Open Projects index

This post is an index with a one-line description of the projects I'm working on and the relevant chapter of the work in J's Lab Notebook. Read how this blog works for more general information.

1) Cheaper, faster, better ChIP
Description: Using statistical experimental design methods to shorten, cheapen, and optimize a Chromatin Immunoprecipitation protocol for experimentally determining transcription factor binding sites.
Detailed description: Factorial and response surface optimization of a chromatin immunoprecipitation protocol
Lab Notebook Chapter: Towards a faster, more reliable ChIP protocol.
Date Project Started: Apr 26, 2007

J's Open Publishing index

This post is an index with a one-line description of the projects I've published. Read how this blog works for more general information.

2) Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured experimental metadata
Description: Microarray compendia for E. coli, Shewanella, and yeast (currently 524, 530, and 14 arrays for each of these species respectively). The arrays are normalized together to allow all arrays for each species to be analyzed as a single group.
Blog post on this publication: When will gene expression data become collective knowledge?
Date work was published: Sept 18, 2007

1) Large-Scale Mapping and Validation of Escherichia coli Transcriptional Regulation from a Compendium of Expression Profiles
Description: How well can we computationally infer regulatory interactions between transcription factors and their targets using microarray data? Predictions from several algorithms were validated using the 3500 experimentally determined interactions in RegulonDB, and an additional 300 interactions were tested with ChIP. Most importantly, once we've reliably inferred these networks, what can we do with them?
Useful resources: The microarray data used in this paper is available at M3D. In addition, we have a supplemental site containing links to the algorithms and the RegulonDB known interactions we used in the publication.
Date work was published: January 9, 2007

J's Open Ideas index

This post is an index with a one-line description of the projects I'm considering working on. Read how this blog works for more general information.

3) Effect of sequence level mutations on transcription, translation, and noise
Description: A technique I think would allow us to determine the effect of millions of promoter variants on the rate of transcription and translation in single-cells.
Date idea was blogged: September 1, 2007

2) Mutations, gene passing, and the evolution of gut microbes
Description: Can we use gnotobiotic mice to obtain experimental estimates of mutation rates and gene transfer rates in different intestinal environments (e.g. under different stresses and with different combinations of microbes)?
Date idea was blogged: June 26, 2007

1) Live imaging of host-microbe interactions
Description: Can we apply the deep-imaging 2-photon microscopy techniques, which have been developed over the past few years for imaging neuronal systems, to the imaging of the spatial distribution and interactions between different gut microbes and their host?
Date idea was blogged: June 12, 2007

Tuesday, October 23, 2007

When will gene expression data become collective knowledge?

Published research: this post describes some of my published research. The relevant publication is: Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured metadata, which describes the M3D database. This post serves as a place for folks to provide comments and suggestions for the future of the database and the future of expression data in general.

The focus of my PhD thesis has been network inference: how can we efficiently determine regulatory networks in prokaryotes (i.e. which transcription factors regulate which genes). Since mRNA concentration is the only data we can easily measure for all genes simultaneously, my PhD has also been an exploration of the potential and limitations of expression data. In joining Tim Gardner's lab at BU, I was a little reluctant to analyze microarray data, because I'd heard it was noisy and too easy to find whatever answer you were looking for. After 4+ years of working with gene expression data from microarrays, I'm convinced that it is a little noisy and that it is extremely easy to find whatever answer you're looking for. But with careful analysis, particularly regarding the statistics of large datasets and multiple hypothesis testing, microarrays hold an unparalleled wealth of knowledge about the dynamic, concerted actions of cells.

Genome sequencing has provided the cell's nouns. Microarrays are providing the cell's verbs. Currently, most people use the nouns as collective knowledge while the verbs are generated and analyzed in-house. How are we ever going to understand the language of life when everyone has to generate a personal set of verbs?

What I mean for genome sequencing is this: you want to knockout, clone, tag, or whatever a gene in species X. You look up the location of that gene in the genome browser for species X, download the surrounding sequence, and use your intuition or some primer design software to help you construct the chemicals (typically oligonucleotide primers) you need to experiment on your gene. Or maybe you're one of these folks that likes to take fancy trips with your science budget, so you're out in Hawaii to collect sea water for metagenomic sequencing. When you are sufficiently tanned and you head back home to sequence the seawater, the first thing you do is compare your seawater DNA sequence to all DNA sequence available for any species on the planet to see if you can find anything to help you figure out what was in your seawater.

What I mean for microarrays is this: you want to figure out what the cell is doing when you apply X to it (e.g. heat shock, acid shock, DNA damage, glucose growth, etc...). You run 3 chips in a standard condition and 3 chips in a standard condition plus X. You take the two conditions and run a statistical test (t-test, fold change, or FDR if you're getting fancy) to produce a list of N genes that changed expression when you did X. You write a 5 page paper where you publish the list and describe what those changes might mean.

Final summary of the microarray isolationist problem
With genome sequencing everyone is taking advantage of the wealth of collective sequencing knowledge to improve their own research. For microarrays, with few exceptions, knowledge is created and remains in isolation. One could make the case that microarrays are a relatively young technology and the collected knowledge has yet to accumulate, but microarrays were invented in 1995 - they're older than Google Inc.

How can we promote gene expression as collective knowledge?
For our network inference work, it was necessary to collect as many microarrays as possible for the species of interest (E. coli). Since Tim's lab is full of computer nerds like myself, from the start we decided to collect all of the data in a database. Originally, the database was just a storage dump that ensured that all the folks writing network inference algorithms (i.e. Boris Hayete and myself) would use the same starting dataset. Having the microarray datadump in a standard format and standard location certainly helped our network inference efforts. In the end, Boris developed a network inference algorithm - CLR - that currently remains the top performing algorithm on the E. coli microarrays (if you're interested in network inference and think you can top CLR - please try, here's a site to help you get started; and let me know how it goes!).

More recently, we improved our microarray database and published a piece, Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured metadata, that I hope starts to move gene expression into the collective knowledge space. In that paper we tried to address three problems that are currently hindering the wide-scale adoption of microarrays:
  1. the presence of platform-specific biases in expression data due to the use of many different microarray platforms in a compendium
  2. the lack of a uniformly applied normalization standard for expression datasets, even within a single expression platform. Different software algorithms are used by different labs for preprocessing and normalizing the raw microarray intensity values
  3. the incompleteness and inconsistency in the curation of metadata describing the details of each experimental condition.
To address point 1, we only allowed a single platform (Affymetrix) for each of the three species currently in the database (E. coli, Shewanella, and yeast). To address point 2, we collected unnormalized raw CEL files for all of the experiments and uniformly normalized them as a group with RMA. In our previous work, we found that this RMA normalization makes comparisons possible on microarrays of the same platform that are run in different laboratories (see the section "Verification of array data normalization and consistency" and Figure S5 in the supplement Protocol S1 to our network inference paper). And to address point 3, we generated human curated (and computationally validated) experimental metadata for each microarray publication—converting each chemical and growth attribute into a structured and computable set of experimental features with consistent naming conventions and units.

I believe this Many Microbe Microarrays Database (M3D) provides the essential starting point for moving towards the use of microarrays as collective knowledge: a set of curated microarray datasets that have already proven useful in a large-scale application.

If you build it, some will come
M3D has been online for about a year now. The site currently gets around 200 unique visitors and 3000 hits a day. But based on the emails I get from folks, the audience is primarily computational folks like myself that are eager to test their new algorithm on a large, well-annotated experimental dataset.

But will the biologists ever arrive?

The wetlab biologists are the folks I would really like to begin adopting this resource. Since I do experimental work in E. coli, whenever I want to find out what a particular gene does or what genes might regulate it, I dig around in RegulonDB or EcoCyc to see what's known about the gene. How might expression data be incorporated into such a website to aid understanding the biology of E. coli? For one, if RegulonDB has published evidence that lexA transcriptionally regulates recA, it would be nice to see if the collective microarray knowledge currently supports the published evidence. This type of information could be provided by including a scatterplot of the expression values of the transcription factor -vs- those of its target (see the image below generated from M3D). If you generate this plot on M3D, you can also mouseover each point to receive the details of the experiment represented by the datapoint.
I think the key to adoption is probably integration with the currently available resources. Folks don't want to go to yet-another-website and figure out how to work it. Because of this, I've created a way for external websites to automatically include M3D generated images on their own websites (instructions). Here are a couple of examples; these are drawn on-the-fly from M3D rather than uploaded to this blog:

The problem with this approach is that if M3D goes down, the automatically generated images fail on the remote website as well. So perhaps remote websites would want to automatically generate and locally cache the images.

Hey microarrays, what can you do for me?
So perhaps integration with other databases will increase the awareness of microarray data, but I really want people using the data. So the question is what would folks like to do with microarray data? Like a sequence database, can a microarray database allow scientists to better understand their own data? In general people don't want to just browse around NCBI, they want to BLAST their sequence to help them improve and publish their own work.

So I'd be really interested if anyone had ideas about applications that might allow more people to use the collective expression knowledge in M3D (or any other microarray database for that matter).

Here are some things that are already available in M3D:
  • what genes changed expression in condition X (z-test)
  • what genes changed relative expression between conditions X and Y (t-test or fold-change)
  • do genes in particular chromosomal regions tend to change expression as a group
Here are some things I'm considering adding:
  • array-blast: submit your raw microarray data (CEL file) and you get back a list of the most similar arrays in the database along with the condition information for those arrays
  • expression-based-function annotation: across the compendium, what is the effect of each experimental feature (e.g. glucose concentration is correlated with the expression of genes X,Y,Z)
  • array-changed-genes: submit your CEL file(s) and run a z-test against the entire compendium or a t-test against a particular subset of arrays in the compendium to determine the set of genes whose expression changed in your microarrays
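To make the z-test idea in the lists above concrete, here is a minimal sketch of scoring one submitted array against a compendium. It is not M3D's actual implementation: the function name, the data layout (a dict from gene name to expression values), the threshold of 3, and all the expression numbers are hypothetical, and a real tool would also need to normalize the new array with the compendium and correct for multiple hypothesis testing.

```python
from statistics import mean, stdev

def changed_genes(compendium, new_array, threshold=3.0):
    """Return genes whose expression on new_array is unusual for the compendium.

    compendium: dict mapping gene -> list of expression values across arrays
    new_array:  dict mapping gene -> expression value on the submitted array
    threshold:  absolute z-score cutoff (an arbitrary illustrative choice)
    """
    hits = {}
    for gene, values in compendium.items():
        mu = mean(values)
        sd = stdev(values)  # sample standard deviation across the compendium
        if sd == 0:
            continue  # a gene that never varies cannot be scored
        z = (new_array[gene] - mu) / sd
        if abs(z) >= threshold:
            hits[gene] = z
    return hits

# Hypothetical log-scale expression values for two E. coli genes.
compendium = {
    "recA": [8.0, 8.2, 7.9, 8.1, 8.0],
    "lexA": [9.0, 9.5, 8.5, 9.2, 8.8],
}
new_array = {"recA": 11.0, "lexA": 9.1}

hits = changed_genes(compendium, new_array)
for gene, z in hits.items():
    print(f"{gene}: z = {z:.1f}")
```

With these made-up numbers only recA is flagged, since its value on the new array sits far outside its spread in the compendium.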
How you can help!
If you have any ideas/suggestions for tools, applications, or anything else that might be done to a website like M3D to help folks use this collective expression knowledge, I'd like to hear your ideas (just leave a comment on this post).

Friday, October 19, 2007

Factorial and response surface optimization of a chromatin immunoprecipitation protocol

Research in progress: this post describes some of my ongoing research. The raw data and all experimental details are updated daily in J's Lab Notebook in the chapter entitled: Towards a faster, more reliable ChIP protocol.

The following text is largely taken from my PhD Oral Qualifier. I tried to blogify it a little, but it is still a little formal for a blog post. I also don't have many citations. Appropriate citations will be in the published version if I complete this project (if you have opinions about how we should deal with citations in very preliminary results please post a comment - I'd like to hear your opinion).

Short Version of What I'm trying to do: Chromatin Immunoprecipitation (ChIP) is often used to experimentally verify or discover transcription factor binding sites. In my experience, ChIP is lengthy, costly, and noisy. I'm trying to use statistical experimental design techniques to shorten, cheapen, and reduce the noise of the ChIP procedure. I'd really like it if ChIP were simple enough to become a standard technique that all experimentalists learn (i.e. like a miniprep and PCR), so we can really start to determine the transcriptional regulatory network structure of many organisms.

In general, I think there is a lot of unnecessary folklore in our experimental procedures, and the methods I'm applying here would be applicable to almost any experimental protocol optimization - if broadly applied, experimental biology would be a much less time-consuming endeavor.

Longer Version of What I'm trying to do:
We plan to optimize and shorten the chromatin immunoprecipitation (ChIP) protocol for in vivo validation of transcription factor targets. Verifying a transcription factor's genomic binding regions with ChIP requires: 1) fixing the transcription factor to the regions of the genome it binds via a crosslinking agent like formaldehyde, 2) cell lysis, 3) chromatin shearing (to enable isolation of only small regions of DNA bound by the transcription factor), and 4) multiple washes to remove background noise [1]. Once the ChIP procedure is complete, the DNA bound by the transcription factor should be enriched relative to unbound DNA. This enrichment can be assayed by qPCR, microarray, or DNA sequencing (less common), providing confirmation of the transcription factor binding sites (and therefore presumably, the gene targets of the transcription factors).

ChIP is used by numerous labs across many model organisms, yet the ChIP protocol is anything but standardized; ChIP protocols are as numerous as the number of investigators using the technique, suggesting that we are far from an optimal protocol. The ChIP protocol we previously used to validate network inference targets in E. coli [2] required almost a week of long experimental days to go from cells to verified transcription factor targets. Because of this length, the procedure is error-prone and only tractable to the most experienced bench scientists. We aim to use modern statistical methods of experimental design to optimize the ChIP protocol [3]. In particular, we will use fractional factorial designs to screen for unnecessary steps that can be removed to shorten the protocol. In addition, we will optimize the protocol steps that have the most significant influence on the enrichment of known transcription factor targets to improve the signal to noise ratio of the ChIP procedure.

Successful completion of this work will result in a markedly shorter and more effective ChIP protocol for verifying transcription factor targets. The new protocol will make verification of transcription factor binding sites approachable and practical to a wider range of bench scientists, promoting the experimental validation of future network inference predictions. In addition, the knowledge gained by an in-depth analysis of the ChIP technique will help optimize the protocol for different tasks such as highly parallel sequencing of ChIP DNA for transcription factor target discovery. Finally, the ChIP protocol optimization highlights the untapped experimenter efficiency potential these statistical methods could unleash on molecular biology if these experimental design techniques were broadly applied to experimental protocols.


Most experimental protocols can be represented mathematically as y = f(q) where y is the product resulting from the protocol and q are the parameters of the protocol. In a PCR experiment for example, y would represent the yield of DNA (e.g. in micrograms), while q represents the parameters of the reaction (e.g. concentrations of template, primers, magnesium chloride, etc...). The statistics of experimental design contains numerous methods to expedite the empirical optimization of y through the intelligent exploration of q (for two excellent books on experimental design see [3,4]).

Fractional factorial methods. For each experimental protocol, there are thousands of parameters, (q), whose values could be altered in an infinite number of combinations to potentially optimize the protocol output (y). For example, with PCR we could alter the melting temperature, the duration at the melting temperature, the amount of each primer, and the variant of Taq. On another level, changing the tubes, pipettes, the PCR machine, and the experimenter could also lead to changes in the output, y, of our PCR reaction. The first step in experimental design is to identify the parameters that contribute most to the output, so that they can be further optimized.

Fractional factorial methods provide an efficient way to screen these parameters (Note: parameters are termed factors in experimental design). Traditional factor screening methods take a one-at-a-time approach. For example to optimize a PCR protocol, you might try the reaction with and without DMSO, with various concentrations of magnesium chloride, or with different annealing temperatures. Reliable determination of the effect of each of these factors (qi) on the PCR output (y) requires several replicates for each tested factor level. Because of this replication, a large number of experiments is required to test a small number of factors with a one-at-a-time approach. Fractional factorial methods screen many factors at the same time and remove the need for time-consuming and expensive replication. An example fractional factorial design for optimizing a PCR protocol might look like:

annealing temp | primer concentration | hot start | extension time
56C | 150 nM | no | 30 seconds
62C | 150 nM | no | 90 seconds
56C | 600 nM | no | 90 seconds
62C | 600 nM | no | 30 seconds
56C | 150 nM | yes | 90 seconds
62C | 150 nM | yes | 30 seconds
56C | 600 nM | yes | 30 seconds
62C | 600 nM | yes | 90 seconds

For efficiency reasons, factors in factorial designs are typically only sampled at two states. Experimenter intuition plays a role in these designs via the selection of the initial set of factors to screen and in the selection of the values of the two states to test for each factor.
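One way to read the eight-run PCR design above: the first three factors form a full two-level factorial, and the extension time is set to the product of the other three in coded units (-1 = low, +1 = high). That reading is an inference from the table, not something stated in the text, and the function names below are made up for illustration. A minimal sketch that regenerates the table this way:

```python
from itertools import product

# Low/high settings copied from the example design table.
levels = {
    "annealing temp": ("56C", "62C"),
    "primer concentration": ("150 nM", "600 nM"),
    "hot start": ("no", "yes"),
    "extension time": ("30 seconds", "90 seconds"),
}

def half_fraction():
    """Build the 8-run half fraction of a 2^4 design in coded units.

    The first three factors run through a full 2^3 factorial; the fourth
    factor is defined as the product of the first three.
    """
    runs = []
    # product() varies its last element fastest, so unpacking in reverse
    # makes annealing temp alternate fastest, matching the table order.
    for hot, primer, temp in product((-1, 1), repeat=3):
        ext = temp * primer * hot
        runs.append((temp, primer, hot, ext))
    return runs

def decode(run):
    """Map coded -1/+1 values back to the real factor settings."""
    return tuple(levels[name][(code + 1) // 2] for name, code in zip(levels, run))

design = half_fraction()
for run in design:
    print(decode(run))
```

A full factorial on four two-level factors would need 16 runs; this half fraction needs 8, and every factor still appears four times at each level.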

The result of a fractional factorial can be represented in a table listing the effect size and p-value for each tested factor. For example, an analysis of our qPCR fractional factorial data might yield:

factor | effect (change in mg) | p-value
annealing temp | 27 | 0.001
primer concentration | -1 | 0.6
hot start | 2 | 0.5
extension time | 10 | 0.05

From the results in the table above, the experimenter might decide to focus their efforts on further optimization of the annealing temperature to increase the PCR yield, rather than on the three other tested factors that had little effect on our qPCR output.
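For a two-level design, the effect of a factor is usually estimated as the mean response at its high level minus the mean response at its low level. Here is a minimal sketch of that calculation for the eight-run PCR design. The yields are invented so that the effects come out equal to the example table; they are not real data, and the p-values are not computed because that needs an estimate of experimental error that this toy example does not have.

```python
from statistics import mean

factors = ["annealing temp", "primer concentration", "hot start", "extension time"]

# The eight runs of the example design in coded units (-1 = low, +1 = high).
design = [
    (-1, -1, -1, -1), (1, -1, -1, 1), (-1, 1, -1, 1), (1, 1, -1, -1),
    (-1, -1, 1, 1), (1, -1, 1, -1), (-1, 1, 1, -1), (1, 1, 1, 1),
]

# Hypothetical yields, one per run, made up for illustration only.
yields = [40.0, 77.0, 49.0, 66.0, 52.0, 69.0, 41.0, 78.0]

def main_effects(design, response):
    """Effect of each factor = mean(response at high) - mean(response at low)."""
    effects = {}
    for j, name in enumerate(factors):
        high = [y for run, y in zip(design, response) if run[j] == 1]
        low = [y for run, y in zip(design, response) if run[j] == -1]
        effects[name] = mean(high) - mean(low)
    return effects

effects = main_effects(design, yields)
for name, effect in effects.items():
    print(f"{name}: {effect:+.1f}")
# annealing temp: +27.0
# primer concentration: -1.0
# hot start: +2.0
# extension time: +10.0
```

Because each factor is at its high level in exactly half of the runs, all eight measurements contribute to every effect estimate, which is where the efficiency over one-at-a-time testing comes from.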

Response surface methods. In a localized region, our function of interest y = f(q) can be fit using first (linear) and second order models. Fitting these models allows us to obtain a prediction of the parameter landscape of our function. Response surface methods use these models to estimate the most efficient path to the peak of the model (i.e. the maximum value of y). It is at this peak where our experimental protocol is optimized (or at least locally optimal). Response surface methods are relatively time consuming, so fractional factorial methods are typically used to screen for factors to be later optimized by response surface methods.

Research Plan:

For our ChIP protocol, we want to optimize the enrichment, y, of DNA bound to our transcription factor of interest. At the same time, we want to shorten the protocol as much as possible, so that the laborious protocol becomes more manageable. For this study, we will calculate y as the change in enrichment of genes known to be bound by our transcription factor relative to the enrichment for randomly chosen genes (which are presumably not bound by our transcription factor). We calculate this relative enrichment from qPCR data. For each known target gene and random target gene, we first calculate their enrichment from an immunoprecipitation reaction with and without antibody as N = log((1+Ei)^(Ui-Ci)), where Ei is the median efficiency of the PCR primers for gene i, Ci is the qPCR Ct value for the DNA enriched using the correct antibody for the transcription factor regulating gene i, and Ui is the qPCR Ct value for the DNA enriched without using an antibody for the transcription factor regulating gene i. We then calculate the increase in enrichment of our known targets relative to the random targets as y = mean(Nk) - mean(Nr), where Nk is the ChIP enrichment for the known targets and Nr is the ChIP enrichment for our random targets. Our goal is to maximize the value of y in the most directed manner possible, using statistical methods coupled with intuition rather than intuition alone.
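The calculation of y can be sketched in a few lines. This sketch rests on two assumptions that are not confirmed by the text: that the per-gene enrichment is N = log((1+Ei)^(Ui-Ci)), i.e. the Ct difference between the no-antibody and antibody reactions sits in the exponent, and that the log is the natural log. The function names and all Ct and efficiency values are hypothetical.

```python
from math import log
from statistics import mean, median

def enrichment(primer_efficiencies, ct_with_ab, ct_without_ab):
    """Per-gene enrichment N, assuming N = log((1 + E) ** (U - C)).

    primer_efficiencies: replicate efficiency estimates for the gene's primers
    ct_with_ab:    C, the Ct for DNA enriched with the antibody
    ct_without_ab: U, the Ct for DNA enriched without an antibody
    """
    e = median(primer_efficiencies)
    # log((1 + e) ** (U - C)) rewritten as (U - C) * log(1 + e)
    return (ct_without_ab - ct_with_ab) * log(1.0 + e)

def relative_enrichment(known, random_genes):
    """y = mean(Nk) - mean(Nr); each gene is (efficiencies, C, U)."""
    nk = [enrichment(*gene) for gene in known]
    nr = [enrichment(*gene) for gene in random_genes]
    return mean(nk) - mean(nr)

# Hypothetical qPCR data: known targets come up 3 cycles earlier with
# antibody than without, random genes show no difference.
known = [([0.95, 1.0, 1.05], 24.0, 27.0), ([0.9, 1.0, 1.1], 25.0, 28.0)]
random_genes = [([1.0, 1.0, 1.0], 26.0, 26.0), ([1.0, 1.0, 1.0], 27.0, 27.0)]

y = relative_enrichment(known, random_genes)
print(f"y = {y:.3f}")
```

With a median primer efficiency of 1.0 (perfect doubling per cycle), a 3-cycle difference corresponds to an 8-fold enrichment, so y here is log(8).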

We will initially use fractional factorial methods to screen a large number of factors of potential importance to the ChIP protocol. For tested factors that are not found to be significant, we will select the factor state that requires the shortest time. For example if a 10 min incubation and a 2 hr incubation produce insignificant changes in y, we can save 1 hr 50 min by using a 10 min incubation. Factors found to be significant in the fractional factorial screen will be optimized using response surface methods.

Preliminary Results:

Note: these should be taken with caution, since I've not written the paper yet and haven't really sat down to analyze all of the results in detail.

We will use fractional factorial experimental designs to screen for unnecessary steps and factors that can be removed or shortened in the ChIP procedure.

Thus far, we have screened twenty-three factors in the ChIP protocol. By choosing the fastest and cheapest alternatives for factors that did not significantly alter the enrichment of known targets relative to random targets (y = mean(Nk) - mean(Nr)), we were able to reduce the cost of the protocol by three-quarters and to cut the total procedure time in half (from 5 work days to 2.5). The four most significant factors were formaldehyde concentration, shearing time, antibody concentration, and bead concentration.

Factors that have a significant influence on the enrichment of known transcription factor targets will be optimized using response surface methods.

We plan to optimize all four of the most significant factors in the ChIP protocol. As an initial step, we focused on the optimization of the antibody and bead concentrations. We assume that values of these parameters taken in a local area will result in smooth changes in y that can be modeled with first and second order models (Figure 1). We can then use these models to efficiently direct us towards the optimal values of our bead and antibody concentrations.

Figure 1. A hypothetical response surface describing the enrichment of our ChIP procedure as a function of the antibody and bead concentrations. By sequential experimentation and model refinement, response surface methods can locally define this surface and efficiently lead to local optima of the parameters to maximize ChIP enrichment.

From the fractional factorial screening experiments above, we have already obtained four initial points in our surface for bead and antibody concentration (i.e. LA+LB, LA+HB, HA+LB, HA+HB where L = low, H = high, A = antibody concentration, and B = bead concentration). Unfortunately, we do not yet know the surface, so we can't know where our points lie on the surface. However, we can fit a plane using the data for these four combinations of antibody and bead concentration (e.g. P = a0 + a1x1 + a2x2, where x1 and x2 are the concentrations of antibody and bead respectively and ai are the regression coefficients). If we assume that the local area around our points is a linear plane, we can use the ai coefficients to estimate the direction of steepest ascent. For instance, in our hypothetical example in Figure 1, our four combinations might land us in the cyan region. A plane fit through these points can then be traversed in the direction of steepest ascent to efficiently direct our future parameter value selections towards the red peak.
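The plane fit and the steepest ascent step can be sketched as follows. This version works in coded units (-1 = low, +1 = high) and converts back to concentrations afterwards, which is a common convention when the two factors have different scales but is a choice made here, not something the text specifies. The concentrations, enrichment values, and function names are all hypothetical.

```python
from statistics import mean

# Hypothetical low/high concentrations (arbitrary units).
antibody = {"low": 1.0, "high": 3.0}
bead = {"low": 10.0, "high": 30.0}

# Hypothetical enrichment y at the four corners, keyed by coded
# (antibody, bead) levels: LA+LB, HA+LB, LA+HB, HA+HB.
corners = {(-1, -1): 1.0, (1, -1): 1.6, (-1, 1): 1.4, (1, 1): 2.0}

def fit_plane(corners):
    """Least-squares plane P = a0 + a1*x1 + a2*x2 in coded units.

    For a balanced 2x2 factorial the least-squares fit reduces to averages:
    each slope is half the difference between the mean response at the
    high and low level of that factor.
    """
    a0 = mean(corners.values())
    a1 = (mean(y for (x1, _), y in corners.items() if x1 == 1)
          - mean(y for (x1, _), y in corners.items() if x1 == -1)) / 2
    a2 = (mean(y for (_, x2), y in corners.items() if x2 == 1)
          - mean(y for (_, x2), y in corners.items() if x2 == -1)) / 2
    return a0, a1, a2

def steepest_ascent_path(a1, a2, steps, step_size=1.0):
    """Points along the gradient (a1, a2), starting from the design center."""
    norm = (a1 ** 2 + a2 ** 2) ** 0.5
    return [(k * step_size * a1 / norm, k * step_size * a2 / norm)
            for k in range(1, steps + 1)]

def to_real(coded, levels):
    """Convert a coded value back to a concentration."""
    center = (levels["high"] + levels["low"]) / 2
    half_range = (levels["high"] - levels["low"]) / 2
    return center + coded * half_range

a0, a1, a2 = fit_plane(corners)
path = steepest_ascent_path(a1, a2, steps=3)
for x1, x2 in path:
    print(f"antibody {to_real(x1, antibody):.2f}, bead {to_real(x2, bead):.1f}")
```

Each printed pair is a candidate set of concentrations to test next; in practice one keeps stepping along the path until the measured enrichment stops improving, then fits a new model around that point.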

We fit such a plane to our bead and antibody concentration factorial data, and we chose new concentrations of these two factors along the direction of steepest ascent. These new concentrations led to a marked increase in the enrichment of our ChIP procedure (Figure 2a). It appeared that we had not yet reached saturation, so we tried an additional set of points further along the path of steepest ascent (Figure 2b). Although these new datapoints indicated that we might be close to the saturation point for these bead and antibody concentrations, the method of steepest ascent has pushed us into an expensive optimum, with almost three times the commonly used amount of beads for the ChIP procedure. We hypothesized that the amount of crosslinked-DNA was saturating our bead and antibody at low concentration - necessitating the use of large amounts of bead and antibody.

Figure 2: (A) Antibody and bead concentrations were optimized using the direction of steepest ascent determined by a linear model. (B) Further concentrations were tested to determine if we had reached a saturation point.

To test this saturation hypothesis, we performed a factorial design using bead concentration, antibody concentration, and sheared chromatin concentration as factors. By using one-fourth of the typical DNA concentration, we were able to improve our enrichment using lower amounts of bead and antibody (Figure 3). With this lower concentration of DNA, we should be able to estimate more cost-effective optima for the bead and antibody concentrations.

Figure 3: A factorial design was run using bead concentration, antibody concentration, and crosslinked chromatin concentration. For visualization purposes, the values for low chromatin concentration are plotted to the left of those with high chromatin (shifting them slightly along the x-axis), even though both experiments used the same concentration of bead. By using less crosslinked chromatin, we obtain larger enrichment using the standard concentrations of bead and antibody. The results suggest that at the standard values for these concentrations, the beads and antibody are saturated with chromatin.


[1] Lee TI, Johnstone SE, and Young RA. Chromatin immunoprecipitation and microarray-based analysis of protein location. Nat Protoc, 1(2):729-748, 2006.
[2] Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, Kasif S, Collins JJ, and Gardner TS. Large-scale mapping and validation of Escherichia coli transcriptional regulation from a compendium of expression profiles. PLoS Biol, 5(1):e8, 2007.
[3] Box GEP, Hunter JS, and Hunter WG. Statistics for Experimenters. Wiley-Interscience, 2nd edition, 2005.
[4] Box GEP and Draper NR. Empirical Model-Building and Response Surfaces. John Wiley and Sons, 1987.