The next time you bite into a crisp juicy apple and the tart juices spill out around your tongue, remember the honeybee. Our fall harvest depends heavily on honeybees carrying pollen from plant to plant. Luscious fruits and vegetables wouldn't grace our table, were it not for the honeybees and other pollinators.
Lately, though, the buzz about our furry little helpers hasn't been good. Honeybees have been dying, victims of a new disease called "colony collapse disorder," with the US alone losing a large number of hives in recent years.
Researchers have speculated about possible causes ranging from cell phone towers, genetically modified food, and pesticides to the stress of too much travel, since beekeepers cart their hives around the country to catch the flowering plants.
Metagenomics is a term that's used to describe the process of obtaining nucleic acids (RNA and/or DNA), determining the sequence of the nucleotides, and identifying the source of the RNA or DNA by comparing it to known sequences. This technique can be used to answer many different questions and discover lots of new things. We're using it in our bioinformatics class right now and will be using it in the spring as well. (If you want to try it yourself, I'm making the data and some of the tools available; contact me at digitalbio at gmail dot com.)
Yesterday, I listened to a Science/AAAS webinar to learn more about how metagenomics has been used to investigate the mystery of the dying honeybees and to identify a possible suspect (if you're interested, the tape and slides are here). The speakers were W. Ian Lipkin, M.D., from Columbia University, New York, NY, and Michael Egholm, PhD, from 454 Life Sciences.
In this project, the researchers gathered samples of bees that had died from colony collapse disorder, isolated DNA and RNA, and sequenced them. According to the webinar, they generated hundreds of thousands of short sequences, averaging 250 bases long.
The informatics part of the project was largely glossed over. In short, they grouped similar sequences together (clustering), assembled sequences into longer sequences (contigs), and used programs like blastx and blastn to identify the sequences by comparing them to a database of sequences. Through this process, they found sequences that showed a strong statistical association with the bees that were killed. These sequences came from the Israeli acute paralysis virus (IAPV).
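For readers who want a feel for the "identify by comparison" step, here is a minimal Python sketch. It is not what the researchers ran: they used blastx and blastn, while this toy version simply counts shared short subsequences (k-mers) between a read and each reference. The function names (`kmers`, `classify`), the thresholds, and all of the sequences are made up for illustration.

```python
from collections import Counter


def kmers(seq, k=8):
    """Return the set of all overlapping k-letter substrings of seq."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def classify(read, references, k=8, min_shared=3):
    """Return the name of the reference sharing the most k-mers with read.

    Returns None when no reference shares at least min_shared k-mers.
    This is a crude stand-in for a real alignment search such as BLAST.
    """
    read_kmers = kmers(read, k)
    best_name, best_shared = None, 0
    for name, ref_seq in references.items():
        shared = len(read_kmers & kmers(ref_seq, k))
        if shared > best_shared:
            best_name, best_shared = name, shared
    return best_name if best_shared >= min_shared else None


# Hypothetical reference "database": invented sequences, not real genomes.
references = {
    "toy_virus": "ATGGCGTACGTTAGCCTAGGATCCGATTACGCA",
    "toy_bee_gene": "TTGACCGGTAAACTGCATGCCGTTAAGGCTTCA",
}

# Hypothetical reads: two are fragments of the references, one matches nothing.
reads = [
    "GTACGTTAGCCTAGGATC",
    "CCGGTAAACTGCATGCCG",
    "AAAAAAAAAAAAAAAAAA",
]

# Tally how many reads were assigned to each source (None = unidentified).
counts = Counter(classify(read, references) for read in reads)
print(counts)
```

A real search tolerates mismatches and gaps and reports a statistical score for each hit, which this sketch does not attempt; the point is only that each read gets labeled by the known sequence it most resembles, and the labels are then tallied.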
I was curious to know how they managed to store and work with all those data files - I think they said they have over 400,000 flowgrams. Where did they put them all? How did they evaluate the quality of the flowgrams?
No one mentioned that.
Dr. Egholm said that Sanger dideoxy sequencing remains the gold standard, in terms of quality and read length, but that 454 will replace it because you can gather so much more data. In fact, it seemed that in order to identify the bee-killing culprit, they did need a very sensitive method. In one slide, only 65 reads out of 97,435 were identified as viral sequences. And, of those sequences, I think they said that 14 were from IAPV.
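To put those numbers in perspective, here is a quick back-of-the-envelope calculation in Python. The read counts are the ones from the slide as I noted them; the IAPV count in particular is from memory, so treat it as an assumption.

```python
# Read counts as noted from the webinar slide.
total_reads = 97_435
viral_reads = 65
iapv_reads = 14  # recalled from the talk, so this figure is uncertain

# Fraction of all reads that were viral, and that were IAPV specifically.
viral_fraction = viral_reads / total_reads
iapv_fraction = iapv_reads / total_reads

print(f"Viral reads: {viral_fraction:.3%} of the total, "
      f"about 1 in {total_reads / viral_reads:,.0f}")
print(f"IAPV reads:  {iapv_fraction:.3%} of the total, "
      f"about 1 in {total_reads / iapv_reads:,.0f}")
```

In other words, well under a tenth of a percent of the reads were viral, and only a fraction of those pointed at IAPV, which is why the sheer volume of data mattered.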
Once they had sorted through the haystack of reads and picked out IAPV, of course, they took pools of bees and tested them directly for the presence of the virus. They found that 83% of the bees that had died from colony collapse disorder were infected with IAPV, while IAPV could only be found in 5% of the bees obtained from healthy colonies.
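One rough way to express how lopsided those two percentages are is a prevalence ratio or an odds ratio. The Python sketch below uses only the two percentages quoted above. The sample sizes weren't something I wrote down, so it says nothing about statistical significance, and the variable names are my own.

```python
def odds(p):
    """Convert a proportion (between 0 and 1) into odds."""
    return p / (1 - p)


# Proportions of bees testing positive for IAPV, as quoted in the webinar.
p_ccd = 0.83      # bees from colonies with colony collapse disorder
p_healthy = 0.05  # bees from healthy colonies

# How many times more common IAPV was in the CCD bees.
prevalence_ratio = p_ccd / p_healthy

# Odds of infection in CCD bees relative to odds in healthy bees.
odds_ratio = odds(p_ccd) / odds(p_healthy)

print(f"Prevalence ratio: {prevalence_ratio:.1f}")
print(f"Odds ratio:       {odds_ratio:.1f}")
```

Both numbers describe the strength of an association only. A confidence interval or significance test would need the actual counts of bees tested, and neither measure shows that the virus causes the disease.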
Koch's postulates, unfortunately, are a bit difficult to satisfy in many cases. Statistical associations between the presence of a pathogen and the existence of a disease are often the best we can do.
At the end, they discussed the potential application of metagenomic analysis to the problem of identifying other diseases with unknown causes. Perhaps the puzzle of diseases like autism and diabetes could be solved, if we could sequence everything that's present and find out if there are pathogens hiding in the plumbing.
What new pathogens will we find when we've sequenced the world? Whatever we find, I hope we can treat them.