How did the human genome ever get finished if every one of the three billion bases had to be reviewed by human eyes?
In the early days of the human genome project, laboratory personnel routinely scanned printed copies of chromatograms, editing and reviewing all DNA sequences by eye. For more background, see the post on qualitative measures of DNA quality.
Later on, when genome sequencing turned into a race and the pace of DNA sequencing began to increase, some genome centers realized that it was too expensive and time-consuming to have Ph.D. scientists, or even technicians, review all the printed chromatograms by eye and manually edit files.
Editing sequence files is still a common practice in some labs, but this usually depends on the volume of sequencing that the lab carries out.
It became clear that better methods were needed.
Who is this "phred" and what is his formula?
One of the first and most popular programs for assessing sequence quality was, and still is, "phred." Phred (the name comes from "Phil's read editor") was written by Phil Green at the University of Washington (1-3). After a chromatogram file has been processed by the software in a sequencing instrument, it can be evaluated by Phred. Phred uses information about the shape of a peak, the spacing between peaks, and the height of a peak to calculate a quality score for every base in a DNA sequence. The quality score is obtained by taking the base-10 logarithm of the probability that the base call was an error and multiplying it by negative ten.
The formula for a Phred score is this: Q = -10 log10 P(error)
So, for example, if there is a 1 in 10 chance of an error (P = 0.10), the Phred quality score (usually just called a "Phred score") would be 10. A 1 in 100 chance of an error would give a quality score of 20, a 1 in 1000 chance of an error would give 30, and so on.
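If you'd like to check the arithmetic yourself, here is a minimal Python sketch of the formula. The function name `phred_score` is my own invention for illustration; it is not part of Phred or any sequencing software.

```python
import math


def phred_score(p_error: float) -> float:
    """Return Q = -10 * log10(P) for an error probability P in (0, 1].

    Hypothetical helper for illustration; not part of Phred itself.
    """
    if not 0.0 < p_error <= 1.0:
        raise ValueError("p_error must be greater than 0 and at most 1")
    return -10.0 * math.log10(p_error)


# The three examples from the text: 1 in 10, 1 in 100, 1 in 1000.
for p in (0.1, 0.01, 0.001):
    print(f"P(error) = {p}: Q = {phred_score(p):.0f}")
```

Each tenfold drop in the error probability adds 10 to the score, which is why the scale is so easy to read at a glance.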
There are other programs for determining quality values, too. Newer DNA sequencing instruments from ABI even come equipped with basecalling software that assigns its own quality values, like the KB basecaller.
Let's see some Phred scores
I was lazy today and obtained Phred values for a sequence file by uploading my chromatogram file to a Finch Server (www.geospiza.com), which can run Phred automatically when sequences are uploaded. Licenses for Phred can be obtained from the UW, and I could run it on my own computer, since it runs UNIX, but like I said, I'm lazy.
FinchTV tells us that the quality values are 13 (shown on the right) and 10. A score of 10 means a 1 in 10 chance that the base-calling software made a mistake, and a score of 13 means about a 1 in 20 chance. The data for these two bases still aren't very good, but now I know just how bad they are.
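To go from a quality score back to an error probability, turn the formula around: P(error) = 10^(-Q/10). Here is a small Python sketch that does this for the two scores above. The function name `error_probability` is made up for this example.

```python
def error_probability(q: float) -> float:
    """Invert Q = -10 * log10(P) to get P = 10 ** (-Q / 10).

    Hypothetical helper for illustration; not part of Phred or FinchTV.
    """
    return 10.0 ** (-q / 10.0)


# The two quality values reported for the questionable bases.
for q in (13, 10):
    p = error_probability(q)
    print(f"Q = {q}: P(error) = {p:.3f}, about 1 in {1 / p:.0f}")
```

A score of 13 works out to an error probability of about 0.05, and a score of 10 to exactly 0.1.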
Can Phred improve my data or at least tell me more about bad data?
No. It's a computer program, not a miracle worker. Even Phred can't turn bad data into good data, but it can tell us which parts of the sequence are good and which parts are not. We can identify regions and bases that are questionable.
We can see more about this below, in a Finch Server quality graph for this chromatogram file, which shows how the quality varies throughout the sequence. The middle of the sequence looks pretty good, but there are regions of lower-quality sequence on each end.
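If you have the per-base quality values as a list of numbers, you can get a rough picture of the same pattern without a graph. The sketch below averages the scores in a sliding window and flags windows that fall under a cutoff. The quality values are made up, the window size of 3 and the cutoff of 20 are arbitrary assumptions, and `window_means` is a hypothetical helper. This is not how the Finch Server draws its graph, and averaging Phred scores directly is a simplification, since the scale is logarithmic.

```python
def window_means(quals, window=5):
    """Return the mean quality of each full window of consecutive bases.

    Hypothetical helper for illustration only.
    """
    if window < 1 or window > len(quals):
        raise ValueError("window must be between 1 and len(quals)")
    return [
        sum(quals[i:i + window]) / window
        for i in range(len(quals) - window + 1)
    ]


# Made-up quality values: poor at both ends, good in the middle.
quals = [8, 10, 13, 22, 35, 40, 42, 45, 41, 38, 30, 18, 12, 9]
CUTOFF = 20  # assumed threshold, chosen only for this example

for start, mean_q in enumerate(window_means(quals, window=3)):
    label = "ok" if mean_q >= CUTOFF else "low"
    print(f"bases {start + 1}-{start + 3}: mean Q = {mean_q:.1f} ({label})")
```

With these invented numbers, the windows at both ends come out "low" and the ones in the middle come out "ok", which is the shape described above.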
Overall, as more and more people have started using DNA sequencing to learn about their ancestry and to ask about their genetic likelihood of developing disease, it's becoming more and more important to know just how good the data are.
I will write more later about how we use quality information, but for now, the next time you look at a DNA sequence, like AAGATAGATAGAT, ask yourself: Which parts of that sequence can we be confident about? And how confident can we be?
1. Brent Ewing, LaDeana Hillier, Michael C. Wendl, and Phil Green. 1998. "Base-Calling of Automated Sequencer Traces Using Phred. I. Accuracy Assessment." Genome Res. 8: 175-185.
2. Brent Ewing and Phil Green. 1998. "Base-Calling of Automated Sequencer Traces Using Phred. II. Error Probabilities." Genome Res. 8: 186-194.
3. Peter Richterich. 1998. "Estimation of Errors in 'Raw' DNA Sequences: A Validation Study." Genome Res. 8: 251-259.