Saturday, September 1, 2007

Effect of sequence level mutations on transcription, translation, and noise

One of the main biological questions explored when DNA sequencing first became a practical laboratory technique was how the nucleotides in a gene's promoter and ribosome binding site define the gene's interactions with transcription factors and the ribosome's translation apparatus. At least in prokaryotes, these interactions largely determine the levels of transcript and protein available for each gene, and thus provide the crucial information of how a genome regulates itself.

This early work resulted in many of the promoter analysis tools that are still widely used today. In particular, the promoters were often analyzed in terms of information content, and this information content was visualized using sequence logos. These sequence logos are still the most popular way to display DNA binding sites. I'm not sure why this field tapered off a little. My guess is that the people in this field had maxed out the information that was affordably obtainable with the available technologies.
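For readers who haven't seen how the information content behind a sequence logo is computed, here is a minimal sketch. It assumes a uniform background (2 bits maximum per DNA position) and skips the small-sample correction that logo tools usually apply. The example sites are made up for illustration.

```python
import math
from collections import Counter

def information_content(sites):
    """Per-position information content (in bits) for aligned, equal-length DNA sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    bits = []
    for i in range(length):
        counts = Counter(s[i] for s in sites)
        total = sum(counts.values())
        # Shannon entropy of the observed base frequencies at this position
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        # 2 bits = log2(4) is the maximum for DNA under a uniform background
        bits.append(2.0 - entropy)
    return bits

# Made-up aligned binding sites, purely for illustration
sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
ic = information_content(sites)
print([round(b, 2) for b in ic])
```

Fully conserved positions score 2 bits, and positions where the bases vary score less. The height of each column in a sequence logo is this per-position value.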

But as most biotech-loving biologists know, the times are a-changin' in biotech, and we have new sequencing technologies that enable drastically larger sequencing studies to be undertaken. Importantly, we are faced with several quite different sequencing methods (unlike the previous era in sequencing biotech, which was almost exclusively driven by ABI's advances in Sanger sequencing). I think with these new technologies coming online, it's time we dusted off our promoters and started figuring out how they work.

What we still must learn about promoters

For several promoters (ideally for all promoters), we need to exhaustively determine how base-pair changes in the promoter lead to changes in the amount of transcription and translation. We must determine this information across time, so we can also determine the rates of transcription and translation. Finally, we must determine these values at the level of single-cells, so that we can also obtain information about the noise inherent in each promoter sequence.

In the early 90s, these types of analyses were at least partially undertaken with populations of cells and 100-200 promoter mutations. Now we must study several promoters, with millions of different mutations for each promoter, and with multiple single-cell replicates of each mutation so we can estimate noise.

What is this new level of promoter knowledge good for?

We need to understand to what extent it is possible to build a computational model to predict translation and transcription from sequence alone. Such a model could act like a molecular biologist's version of Hardy-Weinberg equilibrium. That is, if a promoter does not fit the model, it would suggest that there is some additional regulation (e.g. small RNA) that is not explained by the binding of transcription factors and the ribosome. In addition, the ability to screen vast numbers of promoter variants could be of huge value to forward biological engineering. Synthetic biologists often tune their human-created networks using directed evolution. While directed evolution is a very powerful and massively parallel way to optimize a genetic system, the human that created the system in the first place has limited control over the final result. For example, it may be that a network evolved to generate ethanol from cellulose is extremely noisy and could be made more efficient by fine-tuning the promoters in a more intelligent design.
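The Hardy-Weinberg-style use of such a model could look something like the sketch below. It assumes we already have model predictions in hand (the model itself is not shown, and the promoter names, numbers, and 2-fold cutoff are all hypothetical). It simply flags promoters whose measured expression is far from the prediction.

```python
import math

def flag_unexplained(measured, predicted, max_log2_fold=1.0):
    """Return promoter names whose measured expression deviates from the model
    prediction by more than max_log2_fold (in log2 units)."""
    flagged = []
    for name, value in measured.items():
        log2_fold = math.log2(value / predicted[name])
        if abs(log2_fold) > max_log2_fold:
            flagged.append(name)
    return flagged

# Hypothetical measurements and predictions (arbitrary expression units)
measured = {"p1": 100.0, "p2": 10.0, "p3": 90.0}
predicted = {"p1": 110.0, "p2": 80.0, "p3": 100.0}
candidates = flag_unexplained(measured, predicted)
print(candidates)
```

Promoters that land in the flagged list would be the candidates for some extra layer of regulation that the sequence-only model doesn't capture.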

An idea for a massively parallel method to determine the effect of sequence level mutations on transcription, translation, and noise

I think the tools are already available to determine the effect of sequence level mutations on transcription, translation, and noise in single-cells. One approach I've thought about is shown in the figure on the right (click to see an easier to read/print pdf version of the figure). The idea takes a cell-in-emulsion approach (see Directed evolution of polymerase function by compartmentalized self-replication) and combines it with the polony sequencing method pioneered in the Church lab (see Accurate multiplex polony sequencing of an evolved bacterial genome).

The first step (top right) is to synthesize a known promoter with a large number of random nucleotides. This is very similar to the method used by Stormo's lab many years ago (see Quantitative analysis of ribosome binding sites in E.coli), except that with modern sequencing methods we can drastically increase the number of random sites that we explore. A GFP reporter is placed directly after the promoter so that we can measure the amount of protein generated. Since the vast majority of mutations will probably result in little to no expression, it may be useful to also add a bactericidal antibiotic resistance gene after the GFP to provide an easy way to get rid of unproductive promoters (for some studies, you would probably not want to remove these low-output promoters).
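To get a feel for how quickly the library grows with the number of random positions, here is a small sketch. The template is hypothetical (consensus -35 and -10 boxes with a fully randomized 17-base spacer), and it assumes each N is filled uniformly at random, which real oligo synthesis may not quite achieve.

```python
import random

def library_size(template):
    """Number of distinct sequences a template containing N positions can encode."""
    return 4 ** template.count("N")

def sample_variant(template, rng):
    """Fill each N in the template with a uniformly random base."""
    return "".join(rng.choice("ACGT") if base == "N" else base for base in template)

# Hypothetical template: fixed -35 and -10 boxes around 17 randomized bases
template = "TTGACA" + "N" * 17 + "TATAAT"
rng = random.Random(0)  # seeded only so the example is repeatable

size = library_size(template)
variant = sample_variant(template, rng)
print(size)
print(variant)
```

With 17 random positions there are 4^17 (about 1.7 x 10^10) possible variants, so even a massively parallel experiment would sample only a fraction of the space unless fewer positions are randomized.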

The second step (top left) is to take a dynal bead and attach a primer to amplify the promoter, a second primer to amplify the GFP sequence, and an anti-GFP antibody.

Next, the bead and the bacteria are placed together into an emulsion. In the emulsion solution, we also need to include reverse primers for the promoter and GFP sequence, reverse transcriptase, and PCR reagents. You would need to mess around a little with the dilutions and concentrations of beads and cells to maximize the case where you have only one bead and one cell in each emulsion.
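As a rough guide for those dilutions, here is a sketch that assumes beads and cells land in droplets independently and follow Poisson statistics (a common simplification, not something I've verified for this setup). It scans average loadings to find the combination that maximizes droplets with exactly one cell and exactly one bead.

```python
import math

def poisson_pmf(k, lam):
    """Probability of exactly k items in a droplet when the average is lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def frac_one_cell_one_bead(mean_cells, mean_beads):
    """Fraction of droplets with exactly one cell and one bead,
    assuming independent Poisson loading of cells and beads."""
    return poisson_pmf(1, mean_cells) * poisson_pmf(1, mean_beads)

# Scan average loadings from 0.1 to 3.0 per droplet in steps of 0.1
grid = [(c / 10, b / 10) for c in range(1, 31) for b in range(1, 31)]
best = max(grid, key=lambda cb: frac_one_cell_one_bead(*cb))
best_fraction = frac_one_cell_one_bead(*best)
print(best, round(best_fraction, 4))
```

Under these assumptions the best you can do is an average of one cell and one bead per droplet, which gives about 13.5% useful droplets. In practice you might load fewer cells than that, trading some yield for fewer droplets that contain two different promoter variants.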

Now we have the cells isolated into separate chambers, and we have one bead with the bacteria. This bead will provide the source of our future information readout. Also remember that by synthesizing our promoters with N's, we actually have generated a huge library of different promoters, so each emulsion will have a different variant of our promoter. We then lyse the cells. I'm not sure of the best way to lyse the cell. But in the figure, I just assumed we used extreme heat. Because of the next step, it may be wise to use a gentler method to lyse them, such as placing a protein that will cause cell lysis (e.g. lysozyme or ccdB) on a promoter that is heat inducible, so you'd only need to heat the cells up to 42C rather than 95C. Once the cells are lysed, the GFP expressed from the synthetic promoter should diffuse around the emulsion until it meets and binds the anti-GFP antibody attached to our dynal bead.

We've got the protein on our bead, now we need to attach the DNA. Since one of the things we want to measure is transcript concentration, which is mRNA, we need to do a reverse transcription reaction. Reverse transcriptase is not very heat stable, which is why I stressed above that we might want to lyse our cells more gently than by heating them to 95C. However, Superscript III from Invitrogen is pretty heat stable, so that might be worth a shot too. Since we include a reverse primer to our GFP sequence in the emulsion, we should have a fairly specific reverse transcription.

Finally, we need to attach the DNA to our bead. We can do so by running a multiplex PCR reaction for a few cycles. Since the forward primers are on the dynal bead, the PCR reaction results in the DNA being stuck to the bead.

And now for the fun part, let's measure protein concentration, transcript concentration, and determine the promoter sequence for our single cells (bottom row of the figure). To do this, we lay our beads out on a microscope slide or some type of microfluidic device. We can measure the protein concentration directly as GFP fluorescence. Next we measure the transcript concentration as the amount of GFP cDNA attached to the bead. For increased accuracy, we might want to measure this concentration using cycled PCR reactions (like qPCR on a microscope). The concentration of the promoter sequence and GFP sequence attached to the bead can be measured in each round using two different molecular beacons. The concentration of promoter sequence can be used to normalize the concentration of the transcript (i.e. to try to remove artifacts due to the variance in emulsion sizes and emulsion PCR efficiency, and to remove the background transcript value that is due to amplification from the DNA rather than the cDNA sequence). This bead-based DNA quantitation can borrow some ideas from the BEAMing method (see BEAMing: single-molecule PCR on microparticles in water-in-oil emulsions).
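One way the normalization could work is sketched below. The assumptions are mine and untested: the promoter beacon signal reflects only plasmid DNA, the GFP beacon signal reflects plasmid DNA plus cDNA, and both targets amplify and report with equal efficiency. The function name and numbers are made up for illustration.

```python
def normalized_transcript(gfp_signal, promoter_signal):
    """Estimate transcript level per DNA template on one bead.

    Assumes (hypothetically) that promoter_signal comes from plasmid DNA only,
    that gfp_signal comes from plasmid DNA plus cDNA, and that both beacons
    report with equal efficiency.
    """
    if promoter_signal <= 0:
        raise ValueError("promoter signal must be positive to normalize")
    # Subtract the DNA-derived background, then scale by the DNA amount
    cdna_signal = max(gfp_signal - promoter_signal, 0.0)
    return cdna_signal / promoter_signal

# Made-up bead readings in arbitrary fluorescence units
print(normalized_transcript(300.0, 100.0))
print(normalized_transcript(80.0, 100.0))
```

If the two beacons turn out to have different efficiencies, a calibration factor measured on beads carrying DNA only (no cells) would need to be applied before the subtraction.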
Now that we've measured protein concentration and transcript concentration, it is time to determine the promoter sequence responsible for these concentrations. In some ways, this is the most difficult step. But in practice, it may be the easiest step, as the polony sequencing method does exactly that.

With the size of the beads and the massively parallel nature of this protocol, it should be possible to have the same sequence appear multiple times, allowing the estimation of noise for the tested promoters.
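Here is a minimal sketch of that noise estimate. It uses the squared coefficient of variation (variance divided by mean squared), which is one common definition of expression noise but not the only one. The bead records and the minimum of three replicates are hypothetical choices.

```python
from collections import defaultdict
from statistics import mean, variance

def noise_by_promoter(beads, min_replicates=3):
    """Group bead readings by promoter sequence and return the squared
    coefficient of variation (variance / mean^2) for each sequence."""
    groups = defaultdict(list)
    for sequence, gfp in beads:
        groups[sequence].append(gfp)
    noise = {}
    for sequence, values in groups.items():
        if len(values) < min_replicates:
            continue  # too few single-cell replicates to estimate noise
        m = mean(values)
        if m > 0:
            noise[sequence] = variance(values) / m ** 2
    return noise

# Made-up (promoter sequence, GFP fluorescence) pairs, one per bead
beads = [("AAA", 10.0), ("AAA", 12.0), ("AAA", 8.0), ("CCC", 5.0), ("CCC", 7.0)]
noise = noise_by_promoter(beads)
print(noise)
```

The same grouping works for the transcript measurement, which would let you compare noise at the mRNA level and the protein level for each promoter variant.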

Again, I haven't tried any of this stuff, and I'm not sure it'll work. I just wanted to throw the idea out there in case someone else is thinking about this problem too.

Open questions with this idea
  1. How quantitative is emulsion PCR, and how is noise influenced by the size of the emulsion?
    1. Can we increase the quantitative accuracy of our mRNA concentration by running very few emulsion cycles and then running a microscope-based qPCR on our bead?
  2. How strongly does the GFP bind to our bead? (E.g. when we break the emulsions, can GFP move from one bead to the next? We can test this by using mCherry in one sample, GFP in another, and then breaking the emulsions together to see if any beads have both proteins attached.)
  3. Is crowding on the bead going to cause problems? That bead has a lot on it. Does this bias our results in unpredictable ways?
  4. There are a lot of steps. Long protocols can lead to excess experimenter-derived error and slow the technique's adoption.
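The mCherry/GFP mixing test from question 2 could be scored with something like the sketch below. The thresholds and readings are hypothetical, and in a real experiment the thresholds would have to come from beads that saw no protein at all.

```python
def cross_contamination_rate(beads, gfp_threshold, mcherry_threshold):
    """Fraction of protein-carrying beads that show both GFP and mCherry.

    beads is a list of (gfp_signal, mcherry_signal) pairs, one per bead.
    """
    labeled = 0
    both = 0
    for gfp, mcherry in beads:
        has_gfp = gfp > gfp_threshold
        has_mcherry = mcherry > mcherry_threshold
        if has_gfp or has_mcherry:
            labeled += 1
            if has_gfp and has_mcherry:
                both += 1
    return both / labeled if labeled else 0.0

# Made-up readings: GFP-only, mCherry-only, doubly labeled, and empty beads
readings = [(500.0, 10.0), (20.0, 450.0), (400.0, 300.0), (5.0, 8.0)]
rate = cross_contamination_rate(readings, gfp_threshold=100.0, mcherry_threshold=100.0)
print(rate)
```

A rate near zero would suggest the protein stays put once the emulsions are broken. A high rate would mean the protein measurement can't be trusted on a per-bead basis.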


Jean-Claude Bradley said...

Awesome post - are you in a position to do some of this work now? I hope you get some good feedback from other researchers working in your field.

J said...

I'm trying to finish grad school, so I probably won't pursue this idea anytime soon.

Based on your feedback to my Tips (rules?) for Open Notebook Science post, I'm going to use this blog for the milestones and key problems.

I figure I'll have: 1) untested ideas. By putting my untested ideas out in the open, hopefully it'll help me bump into people working on similar stuff. For the select few untested ideas that I decide to pursue, I'll write a short 2) in progress post that links to my open notebook. Then for the in progress stuff that I 3) publish, I'll add a final post to link to the paper.