During the past few Fridays (or at least here and here), we've been looking at a paper from China that reported some β-lactamase sequences that were supposedly from Streptococcus pneumoniae. The amazing thing about these particular sequences is that β-lactamase has never been seen in S. pneumoniae before, making this a rather significant (and possibly scary) discovery.
If it's correct.
This sequence was identified as β-lactamase through a blastn search at the NCBI. And in fact, it was correct to conclude that this sequence is β-lactamase. There are only three bases that differ between this β-lactamase and the one from a common E. coli cloning vector.
This picture shows the two sequences aligned to each other with dots representing identical bases. I colored the different bases yellow to make the differences easier to spot.
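If you'd like to make this kind of dot display yourself, here is a minimal sketch in Python. It assumes the two sequences are already aligned and the same length. The function name `dot_alignment` and the two short sequences are made up for illustration; they are not the real β-lactamase data from the paper or the vector.

```python
def dot_alignment(reference: str, query: str) -> tuple[str, list[int]]:
    """Render the query with '.' wherever it matches the reference.

    Returns the rendered string and the 0-based positions that differ.
    Both inputs must already be aligned to the same length.
    """
    if len(reference) != len(query):
        raise ValueError("sequences must already be aligned to the same length")
    rendered = []
    mismatches = []
    for i, (r, q) in enumerate(zip(reference.upper(), query.upper())):
        if r == q:
            rendered.append(".")      # identical base: show a dot
        else:
            rendered.append(q)        # different base: show the query base
            mismatches.append(i)
    return "".join(rendered), mismatches


# Toy sequences invented for this example (NOT real sequence data).
vector_seq = "ATGAGTATTCAACATTTCCG"
sample_seq = "ATGAGTCTTCAACATTTCAG"

dots, positions = dot_alignment(vector_seq, sample_seq)
print(vector_seq)
print(dots)
# Report 1-based positions, the way sequence coordinates are usually written.
print(f"{len(positions)} differing base(s) at positions {[p + 1 for p in positions]}")
```

The fewer letters that show through the dots, the more alike the two sequences are, which is exactly why a near-perfect match to a cloning vector should make you nervous.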
The problem is that there are only three bases that differ between this sequence and one from a common cloning vector (what is a vector?). And, as others have pointed out, that same vector is also used to produce Taq polymerase, an enzyme used in the procedure for identifying the sequence. PCR is very sensitive (what is PCR?), and not just for detecting the DNA that you want to see. It's quite good at detecting contaminants as well.
You can see in the blast results below that quite a few sequences matched cloning vectors pretty well.
In fact, I wonder if those top two E. coli sequences from "clinical isolates," in Russia and France, were real results or just more PCR contamination. I'm suspicious of that Acinetobacter sequence from China, too.
And I'm suspicious of the Klebsiella sequence from St. Louis, Mo. That one was part of a Klebsiella genome project and annotated by a nice friendly computer that probably doesn't have one skeptical program on its hard drive.
I think it's a problem that as we sequence more stuff, we end up with more sequences in the database whose identification isn't confirmed by any other kind of supporting data.
One of the hard things about science is that you can't just do the experiments that will give you results you want to see. You also have to think of all kinds of other possible explanations for your results - and test them.
So, now I've explained why I don't believe some of the results. Well, actually I do believe that the sequences are properly identified; I just don't believe that they really originated from the bacterial species on record.
I'll give you all one more chance to think of some other experiments that could be done to test whether the sequences really came from the bacteria that are listed in GenBank.
And sequencing more DNA is NOT the answer.