Tuesday, October 23, 2007

When will gene expression data become collective knowledge?

Published research: this post describes some of my published research. The relevant publication is: Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured metadata, which describes the M3D database. This post serves as a place for folks to provide comments and suggestions for the future of the database and the future of expression data in general.



The focus of my PhD thesis has been network inference: how can we efficiently determine regulatory networks in prokaryotes (i.e. which transcription factors regulate which genes). Since mRNA concentration is the only data we can easily measure for all genes simultaneously, my PhD has also been an exploration of the potential and limitations of expression data. In joining Tim Gardner's lab at BU, I was a little reluctant to analyze microarray data, because I'd heard it was noisy and too easy to find whatever answer you were looking for. After 4+ years of working with gene expression data from microarrays, I'm convinced that it is a little noisy and that it is extremely easy to find whatever answer you're looking for. But with careful analysis, particularly regarding the statistics of large datasets and multiple hypothesis testing, microarrays hold an unparalleled wealth of knowledge about the dynamic, concerted actions of cells.

Genome sequencing has provided the cell's nouns. Microarrays are providing the cell's verbs. Currently, most people use the nouns as collective knowledge while the verbs are generated and analyzed in-house. How are we ever going to understand the language of life when everyone has to generate a personal set of verbs?

What I mean for genome sequencing is this: you want to knock out, clone, tag, or whatever a gene in species X. You look up the location of that gene in the genome browser for species X, download the surrounding sequence, and use your intuition or some primer design software to help you construct the chemicals (typically oligonucleotide primers) you need to experiment on your gene. Or maybe you're one of those folks who like to take fancy trips with your science budget, so you're out in Hawaii to collect seawater for metagenomic sequencing. When you are sufficiently tanned and you head back home to sequence the seawater, the first thing you do is compare your seawater DNA sequence to all DNA sequence available for any species on the planet to see if you can find anything to help you figure out what was in your seawater.

What I mean for microarrays is this: you want to figure out what the cell is doing when you apply X to it (e.g. heat shock, acid shock, DNA damage, glucose growth, etc.). You run 3 chips in a standard condition and 3 chips in a standard condition plus X. You take the two conditions and run a statistical test - t-test, fold change, or FDR if you're getting fancy - to produce a list of N genes that changed expression when you did X. You write a 5-page paper where you publish the list and describe what those changes might mean.
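As a rough illustration of that 3-versus-3 workflow, here is a minimal Python sketch that scores each gene with Welch's t statistic and a log2 fold change. The gene names, intensities, and cutoffs are all made up for the example, and a real analysis would convert the statistics to p-values and correct for multiple testing.

```python
import math
from statistics import mean, variance

def welch_t(control, treated):
    """Welch's t statistic for two small groups of log2 expression values.

    Assumes at least one group has non-zero variance.
    """
    se = math.sqrt(variance(control) / len(control) + variance(treated) / len(treated))
    return (mean(treated) - mean(control)) / se

def log2_fold_change(control, treated):
    """Difference of mean log2 expression (treated minus control)."""
    return mean(treated) - mean(control)

def changed_genes(expression, min_abs_t=4.0, min_abs_lfc=1.0):
    """Return names of genes passing both (hypothetical) cutoffs, sorted by name."""
    hits = []
    for gene, (control, treated) in expression.items():
        t = welch_t(control, treated)
        lfc = log2_fold_change(control, treated)
        if abs(t) >= min_abs_t and abs(lfc) >= min_abs_lfc:
            hits.append(gene)
    return sorted(hits)

# Made-up log2 intensities: 3 chips in the standard condition, 3 chips with X added.
toy = {
    "geneA": ([8.0, 8.1, 7.9], [10.0, 10.2, 9.9]),
    "geneB": ([6.0, 6.2, 5.9], [6.1, 6.0, 6.2]),
}
print(changed_genes(toy))
```

With only three chips per condition the variance estimates are shaky, which is one reason a list produced this way is so easy to over-interpret.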

Final summary of the microarray isolationist problem
With genome sequencing everyone is taking advantage of the wealth of collective sequencing knowledge to improve their own research. For microarrays, with few exceptions, knowledge is created and remains in isolation. One could make the case that microarrays are a relatively young technology and the collected knowledge has yet to accumulate, but microarrays were invented in 1995 - they're older than Google Inc.

How can we promote gene expression as collective knowledge?
For our network inference work, it was necessary to collect as many microarrays as possible for the species of interest (E. coli). Since Tim's lab is full of computer nerds like myself, from the start we decided to collect all of the data in a database. Originally, the database was just a storage dump that ensured that all the folks writing network inference algorithms (i.e. Boris Hayete and myself) would use the same starting dataset. Having the microarray datadump in a standard format and standard location certainly helped our network inference efforts. In the end, Boris developed a network inference algorithm - CLR - that currently remains the top performing algorithm on the E. coli microarrays (if you're interested in network inference and think you can top CLR - please try, here's a site to help you get started; and let me know how it goes!).

More recently, we improved our microarray database and published a piece, Many Microbe Microarrays Database: uniformly normalized Affymetrix compendia with structured metadata, that I hope starts to move gene expression into the collective knowledge space. In that paper we tried to address three problems that are currently hindering the wide-scale adoption of microarrays:
  1. the presence of platform-specific biases in expression data due to the use of many different microarray platforms in a compendium
  2. the lack of a uniformly applied normalization standard for expression datasets, even within a single expression platform. Different software algorithms are used by different labs for preprocessing and normalizing the raw microarray intensity values
  3. the incompleteness and inconsistency in the curation of metadata describing the details of each experimental condition.
To address point 1, we only allowed a single platform (Affymetrix) for each of the three species currently in the database (E. coli, Shewanella, and yeast). To address point 2, we collected unnormalized raw CEL files for all of the experiments and uniformly normalized them as a group with RMA. In our previous work, we found that this RMA normalization makes comparisons possible on microarrays of the same platform that are run in different laboratories (see the section "Verification of array data normalization and consistency" and Figure S5 in the supplement Protocol S1 to our network inference paper). And to address point 3, we generated human curated (and computationally validated) experimental metadata for each microarray publication—converting each chemical and growth attribute into a structured and computable set of experimental features with consistent naming conventions and units.
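For readers who have not met RMA before: it combines background correction, quantile normalization, and median-polish summarization of probe sets. The sketch below illustrates only the quantile normalization step, in plain Python on made-up intensities. It is not the code used to build M3D, and ties are broken by probe order as a simplification.

```python
def quantile_normalize(arrays):
    """Force every array to share the same intensity distribution.

    arrays: list of equal-length lists, one list of probe intensities per chip.
    """
    n_probes = len(arrays[0])
    # Sort each chip, then average across chips at each rank.
    sorted_arrays = [sorted(a) for a in arrays]
    rank_means = [sum(s[r] for s in sorted_arrays) / len(arrays) for r in range(n_probes)]

    normalized = []
    for a in arrays:
        # Probe indices ordered from lowest to highest intensity on this chip.
        order = sorted(range(n_probes), key=lambda i: a[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            # Replace each intensity with the cross-chip mean for its rank.
            out[probe] = rank_means[rank]
        normalized.append(out)
    return normalized

# Two made-up chips with three probes each.
chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
normalized = quantile_normalize(chips)
print(normalized)  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After this step every chip contains the same set of values, just assigned to different probes, which is why normalizing the whole compendium as one group matters: the shared distribution depends on every chip that goes in.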

I believe this Many Microbe Microarrays Database (M3D) provides the essential starting point for moving towards the use of microarrays as collective knowledge: a set of curated microarray datasets that have already proven useful in a large-scale application.

If you build it, some will come
M3D has been online for about a year now. The site currently gets around 200 unique visitors and 3000 hits a day. But based on the emails I get from folks, the audience is primarily computational folks like myself that are eager to test their new algorithm on a large, well-annotated experimental dataset.

But will the biologists ever arrive?

The wetlab biologists are the folks I would really like to begin adopting this resource. Since I do experimental work in E. coli, whenever I want to find out what a particular gene does or what genes might regulate it, I dig around in RegulonDB or EcoCyc to see what's known about the gene. How might expression data be incorporated into such a website to aid understanding the biology of E. coli? For one, if RegulonDB has published evidence that lexA transcriptionally regulates recA, it would be nice to see if the collective microarray knowledge currently supports the published evidence. This type of information could be provided by including a scatterplot of the expression values of the transcription factor -vs- those of its target (see the image below generated from M3D). If you generate this plot on M3D, you can also mouseover each point to receive the details of the experiment represented by the datapoint.
I think the key to adoption is probably integration with the currently available resources. Folks don't want to go to yet-another-website and figure out how to work it. Because of this, I've created a way for external websites to automatically include M3D generated images on their own websites (instructions). Here are a couple of examples; these are drawn on-the-fly from M3D rather than uploaded to this blog:




The problem with this approach is that if M3D goes down, the automatically generated images fail on the remote website as well. So perhaps remote websites would want to generate the images automatically and cache them locally.
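One way a remote site might do that caching is sketched below in Python. Everything here is hypothetical: the function name, the cache key, and the one-day freshness window are assumptions, and `fetch` stands in for whatever code actually requests the image from M3D.

```python
import os
import tempfile
import time

def cached_image(key, fetch, cache_dir, max_age_seconds=86400):
    """Return image bytes, preferring a fresh local copy over a remote fetch.

    fetch is any zero-argument callable that returns the image bytes
    (for example, a wrapper around an HTTP request to the remote site).
    """
    path = os.path.join(cache_dir, key)
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < max_age_seconds:
        with open(path, "rb") as f:
            return f.read()
    try:
        data = fetch()
    except Exception:
        # Remote site is down: fall back to a stale copy if one exists.
        if os.path.exists(path):
            with open(path, "rb") as f:
                return f.read()
        raise
    with open(path, "wb") as f:
        f.write(data)
    return data

def failing_fetch():
    """Simulates the remote site being unavailable."""
    raise ConnectionError("remote site unavailable")

cache_dir = tempfile.mkdtemp()
first = cached_image("lexA_recA.png", lambda: b"fake-png-bytes", cache_dir)
# The remote site is now "down", but the cached copy is still served.
second = cached_image("lexA_recA.png", failing_fetch, cache_dir)
# Even an expired copy is served when the fetch fails.
third = cached_image("lexA_recA.png", failing_fetch, cache_dir, max_age_seconds=0)
```

Serving a stale image when the fetch fails trades freshness for availability, which seems like the right trade for plots that only change when new arrays are added.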

Hey microarrays, what can you do for me?
So perhaps integration with other databases will increase the awareness of microarray data, but I really want people using the data. So the question is what would folks like to do with microarray data? Like a sequence database, can a microarray database allow scientists to better understand their own data? In general people don't want to just browse around NCBI, they want to BLAST their sequence to help them improve and publish their own work.

So I'd be really interested if anyone had ideas about applications that might allow more people to use the collective expression knowledge in M3D (or any other microarray database for that matter).

Here are some things that are already available in M3D:
  • what genes changed expression in condition X (z-test)
  • what genes changed relative expression between conditions X and Y (t-test or fold-change)
  • do genes in particular chromosomal regions tend to change expression as a group
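To make the first item concrete, a z-test of this kind asks how far a gene's expression in condition X sits from its typical level across the compendium. The following Python sketch shows the idea on made-up values. It is an assumption about the general approach, not a description of how M3D computes its statistic.

```python
from statistics import mean, stdev

def z_score(value, compendium_values):
    """Standard deviations between one measurement and the compendium mean."""
    return (value - mean(compendium_values)) / stdev(compendium_values)

# Made-up log2 expression of one gene across five compendium arrays,
# plus its value in the condition of interest.
compendium_values = [7.0, 8.0, 9.0, 8.0, 8.0]
z = z_score(11.0, compendium_values)
print(round(z, 2))  # 4.24
```

The larger the compendium, the better the estimate of a gene's typical range, which is one concrete payoff of pooling arrays from many labs.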
Here are some things I'm considering adding:
  • array-blast: submit your raw microarray data (CEL file) and you get back a list of the most similar arrays in the database along with the condition information for those arrays
  • expression-based-function annotation: across the compendium, what is the effect of each experimental feature (e.g. glucose concentration is correlated with the expression of genes X,Y,Z)
  • array-changed-genes: submit your CEL file(s) and run a z-test against the entire compendium or a t-test against a particular subset of arrays in the compendium to determine the set of genes whose expression changed in your microarrays
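As a sketch of how the array-blast idea might work, the Python below ranks compendium arrays by Pearson correlation with a submitted expression profile. The array names and values are made up, the choice of Pearson correlation as the similarity measure is an assumption, and a real version would first normalize the submitted CEL file the same way as the compendium.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def array_blast(query, compendium, top_k=2):
    """Return the top_k most similar arrays as (name, correlation) pairs."""
    scored = [(pearson(query, profile), name) for name, profile in compendium.items()]
    scored.sort(reverse=True)  # highest correlation first
    return [(name, round(r, 3)) for r, name in scored[:top_k]]

# Made-up log2 expression for four genes on three compendium arrays.
compendium = {
    "heat_shock_rep1": [9.0, 5.0, 7.0, 3.0],
    "glucose_rep1": [4.0, 8.0, 5.0, 9.0],
    "heat_shock_rep2": [8.5, 5.5, 7.2, 3.1],
}
query = [9.2, 4.8, 6.9, 3.3]
hits = array_blast(query, compendium)
print(hits)
```

Returning the condition metadata alongside each hit is what would make this useful, in the same way that a BLAST hit is only informative because the matching sequence is annotated.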
How you can help!
If you have any ideas/suggestions for tools, applications, or anything else that might be done to a website like M3D to help folks use this collective expression knowledge, I'd like to hear your ideas (just leave a comment on this post).

1 comment:

Anonymous said...

Hi! I'm a biologist. I was searching for info on Shewanella making nanowires. Ah! Practical use of gene expression data! The Biomimicry Guild is interested in how to use Nature's ideas in sustainable manner and to reduce toxicity of industries...including nanotechnology. Can we mimic Shewanella's nanowires? Maybe off subject a bit...but thought you'd like to know that a biologist has read your posting. Cheers!