One of my colleagues has a two-part series on FinchTalk (starting today) that discusses uncertainty in measurement and what that uncertainty means for current and Next Generation DNA sequencing technologies.
I've been running into this uncertainty myself lately.
I have always known that DNA sequencing errors occur. This is why people build tools for measuring the error rate and why quality measurements are so useful for deciding which data to use and which data to believe. But some of the downstream consequences didn't really hit home for me until a recent project, in which students clone and sequence uncharacterized genes from genomic DNA. My part of the project was to do some research and write the bioinformatics section of the student lab manual.
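To make the idea of a quality measurement concrete, here is a minimal sketch that assumes the quality values are Phred-scaled, which is common for chromatogram base-callers but is my assumption rather than something this project requires. On that scale, a quality value Q corresponds to an estimated error probability of 10^(-Q/10). The function name is made up for illustration.

```python
def phred_to_error_probability(q):
    """Convert a Phred-scaled quality value to an estimated error probability."""
    return 10 ** (-q / 10)


# Assumption: quality values follow the Phred convention, Q = -10 * log10(P).
for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: about 1 error in {round(1 / p)} bases")
```

Read this way, a base called at Q20 is expected to be wrong about once in a hundred calls, which is why low-quality stretches of a read are the first place to look when two reads disagree.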
One of the steps in this process involves using shorter DNA sequences to reconstruct a longer sequence of DNA that we call a contig. We call this process "DNA sequence assembly," and we have to do it because a single sequencing read is too short to cover the whole region we want.
This time, however, things are a bit different from my past experience, in part because we have far less data. For many reasons, the quality of the student-generated chromatograms tends to be low, with only 25-50% of the files containing usable data. This means that each student or lab group has only about three to four reads to assemble into a contig. In some cases, it also means that they might get sequence from only a single strand.
Since I've been testing the project to find out how things will work for the students, I've been doing many of these assemblies with different small data sets and reviewing the results. It's been quite surprising to realize how frequently errors occur.
I'm finding the errors in two different ways. First, I can detect errors when I look at the assemblies. In the case below, I found a position where one read had a deletion relative to the other. When I reviewed the trace in FinchTV, I could see that the base-caller had missed that A. When I find errors like that, I edit the reads in FinchTV to fix the sequence of bases and save my changes back to the iFinch database.
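The kind of check I do by eye in the assembly view can be sketched in a few lines. This is only an illustration, not how FinchTV or iFinch works: it assumes the two reads have already been aligned, with "-" marking a gap, and the sequences and function name are invented for the example.

```python
def find_discrepancies(aligned_a, aligned_b):
    """Return (column, base_a, base_b) for each column where two aligned reads differ."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned reads must have the same length")
    return [
        (col, a, b)
        for col, (a, b) in enumerate(zip(aligned_a, aligned_b), start=1)
        if a != b
    ]


# Hypothetical aligned reads; the second one is missing an A at column 5.
read_1 = "GATTACAGGATCC"
read_2 = "GATT-CAGGATCC"

for col, a, b in find_discrepancies(read_1, read_2):
    kind = "possible missed base" if "-" in (a, b) else "mismatch"
    print(f"column {col}: {a} vs {b} ({kind})")
```

A report like this only says where the reads disagree. Deciding which read is right still means going back to the trace.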
The other place where I detect errors is the step where we compare our proposed genomic sequence to a set of reference mRNAs. In this case, when I look at the blastn results, I can sometimes see alignments that look like this:
In this case, you can see that all of the sequences below my query (shown at the top) have an extra T or C that my query is missing. Again, I go to FinchTV and review the trace to find out if there should be another base in my read that somehow got missed.
I know it's strange, but despite all the assemblies I've done, it's working with these small assemblies that has really impressed on me the need for lots of redundant data. Now I know what people mean when they say that they minimize errors by collecting more data. I think one of the benefits of this project is that students are going to learn why many of us are excited about Next Generation sequencing technology. The more data we collect, the more we can confirm our results.
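Here is a rough sketch of why redundancy helps, using a simple majority vote at each column of a set of aligned reads. This is a simplification I'm using for illustration: real assemblers also weigh quality values, and the reads and function name here are invented.

```python
from collections import Counter


def consensus(aligned_reads):
    """Majority vote per column; 'N' marks columns with no single most common call."""
    result = []
    for column in zip(*aligned_reads):
        counts = Counter(column).most_common()
        top_base, top_count = counts[0]
        tied = len(counts) > 1 and counts[1][1] == top_count
        result.append("N" if tied else top_base)
    # Drop columns where the winning call is a gap.
    return "".join(result).replace("-", "")


# Hypothetical aligned reads ("-" marks a gap).
two_reads = ["GATTACA", "GATT-CA"]
three_reads = two_reads + ["GATTACA"]

print(consensus(two_reads))    # the disputed column cannot be resolved
print(consensus(three_reads))  # a third read breaks the tie
```

With two reads, a disagreement leaves an ambiguous position that someone has to resolve by looking at the traces. With three, the extra read settles it, provided the errors in different reads don't all fall at the same position.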
I'm certain, in the future, we won't be quite as uncertain.