If you've read the previous posts on this topic, here and here, you're probably aware by now that I have this weird (okay, maybe fanatical) obsession with data. Or at least, with knowing if my data are right so I can get on with life, do the analysis and figure out the results.
My results from last week suggested that re-processing chromatogram data (from the ABI 3730) with phred was probably a bad idea, but still, I only had one data point and I really wanted to know if anyone had done a more thorough study and compared larger numbers of chromatograms.
Naturally, someone had.
And of course it was ABI. And the results aren't even new (except to me, I guess).
ABI and their collaborators at the Washington University and Baylor College of Medicine genome centers presented this work as a poster at the Advances in Genome Biology and Technology (AGBT) meeting at Marco Island in 2004 (1).
They looked at basecalling performance with data from 20,000 chromatograms and concluded that:
1. KB produced fewer errors.
2. KB was able to call more bases, which resulted in longer reads.
It certainly puts my quick conclusion from one chromatogram to shame. Oh, why oh why don't I ever read those user bulletins?
Never mind that. ABI kindly gave me permission to post some of their data (2):
These box-and-whisker plots show the results from three sets of chromatograms: those basecalled with the KB basecaller (on top, in blue); those from ABI instruments (without KB) that were re-processed by phred (in the middle, in red); and those that were first processed with KB, and then with phred (on the bottom, in green). That last one is the method I used the other day with my one chromatogram.
In each case, they compared the read sequences that were obtained with a reference sequence in order to determine the error rate.
(What is a read? A read is a DNA sequence that's been obtained from a chromatogram file. The chromatogram file has lots of extra information like the kind of matrix, the run time, the name of the base calling program, the peak heights, etc. A read sequence only contains the sequence of bases: ATAGAGCTCATCGATCATCTACGTA... etc.)
We can evaluate reads in a few ways.
- We can look at the number of high quality bases (Q20, Q30, Q40).
- We can look at the length of the read after trimming off the bad stuff.
- And, we can compare the read to a known sequence and count the number of differences.
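If you like to see these things as code, here's a rough sketch of the first and third measurements. It's my own illustration, not anything ABI or phred actually runs. The function names are made up, the quality values are invented, and the comparison assumes the read has already been lined up with the reference with no insertions or deletions (real comparisons use a proper alignment). The one fixed fact is the phred scale itself: a quality value Q corresponds to an error probability of 10^(-Q/10), so Q20 means a 1 in 100 chance that the base is wrong.

```python
def error_probability(q):
    """Convert a phred-scaled quality value to an error probability."""
    return 10 ** (-q / 10)


def count_high_quality(quals, thresholds=(20, 30, 40)):
    """Count how many bases meet or exceed each quality threshold."""
    return {f"Q{t}": sum(1 for q in quals if q >= t) for t in thresholds}


def count_mismatches(read, reference):
    """Count positions where the read differs from the reference.

    Assumption: both strings are already aligned and the same length.
    Insertions and deletions are NOT handled here.
    """
    if len(read) != len(reference):
        raise ValueError("read and reference must be the same length")
    return sum(1 for a, b in zip(read, reference) if a != b)


# Hypothetical toy data, invented for illustration only.
quals = [12, 25, 33, 41, 45, 38, 22, 9]
read = "ATAGAGCT"
reference = "ATAGCGCT"

counts = count_high_quality(quals)
mismatches = count_mismatches(read, reference)
print(counts)                          # bases at or above Q20, Q30, Q40
print(mismatches, "mismatch(es)")      # differences from the reference
print(mismatches / len(read))          # error rate for this toy read
```

Dividing the number of differences by the read length gives an error rate, which is the sort of number plotted in the figure.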
Part A in the figure shows the length of the read sequence after trimming the poor-quality bases (less than Q20) from both the 5' and 3' ends. In each case, it appears that the KB basecaller gave longer reads. In this figure, it looks like the mean values were around 650, 775, and 950 bases for reads from short, medium, and long runs, respectively.
Part B shows the error rates. For the rapid runs (top), it looks like phred has a slightly lower mean error rate when it's used to re-process KB-called data. KB and re-processed KB data appear to be tied for the medium-length runs, and KB wins with the long runs.
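To make the trimming idea concrete, here's a minimal sketch that strips bases below Q20 from each end of a read and reports what's left. This is a deliberately naive version that I'm assuming for illustration. I don't know the exact trimming rule used for the poster, and real trimming tools usually look at windows of bases rather than one base at a time. The function name and the data are hypothetical.

```python
def trim_low_quality_ends(bases, quals, min_q=20):
    """Strip bases with quality below min_q from the 5' and 3' ends.

    Simplified assumption: trimming stops at the first base from each
    end that meets the threshold. Low-quality bases in the middle of
    the read are kept.
    """
    if len(bases) != len(quals):
        raise ValueError("bases and quals must be the same length")
    start, end = 0, len(quals)
    while start < end and quals[start] < min_q:
        start += 1                      # walk in from the 5' end
    while end > start and quals[end - 1] < min_q:
        end -= 1                        # walk in from the 3' end
    return bases[start:end], quals[start:end]


# Hypothetical toy read, invented for illustration only.
bases = "ACATAGAGCTCA"
quals = [5, 10, 25, 30, 40, 18, 35, 45, 22, 21, 12, 8]

trimmed_bases, trimmed_quals = trim_low_quality_ends(bases, quals)
print(trimmed_bases, len(trimmed_bases))
```

Notice that the one Q18 base in the middle survives. Only the ends get trimmed, so the trimmed read length is a measure of how long the usable stretch is, not a promise that every base inside it is good.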
To quote ABI:
...since phred replaces (and ignores) the initial called sequence, re-processing KB-analyzed samples with phred will, on average, degrade the accuracy of the analysis in terms of actual sequence error. Analysis improvements provided by KB algorithm outlined above will be essentially lost.
There you have it, the end of this read and this sequence of posts at the same time. Time to move on to the next generation.