Sometimes when you go digging through the databases, you find unexpected things.
I wanted to find out how many different organisms make insulin, so I used a database at the NCBI called Blink. Blink is a database of protein BLAST search results. Using it can save you lots of time because it organizes BLAST results from all the organisms in the non-redundant protein sequence database, but the link to it is a little hard to find.
If I used the Taxonomy report link, I would see a list summarizing all 218 species with proteins similar to insulin.
Matching sequences are also organized into larger groups. As you can see below, there were 1359 results from animals (metazoans), some results classified as "Other" (those came from protein structures), and, curiously, one result identified as "Plant."
Now, I know that people have put insulin genes into plants as a way of producing this drug, but I hadn't read anything about plants making insulin before.
What else could I do? I had to click the link and see where it led.
I chose "Only plants."
Then, I clicked the BLAST score (232) to see how well the human preproinsulin matched the plant protein.
The alignment looked good. You can see in the image above that over half the amino acids are identical, and the E value, at 3 x 10^-31, shows me there's very little chance of getting a match this good from random sequences. The middle part of the protein is missing from the plant sequence, but that's the part that gets cut out during processing, so that's okay.
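If you'd like to check an "over half identical" claim yourself, here's a minimal Python sketch of a percent identity calculation. The two aligned strings are made-up toy sequences, not the real alignment from the image, and the function name is my own. I'm assuming one common convention: identical columns divided by all aligned columns, gaps included. Other tools may count differently.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent of aligned columns where both sequences have the same residue.

    Both inputs must already be aligned (same length, '-' marks a gap).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    columns = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue  # a column with gaps in both rows carries no information
        columns += 1
        if a == b:
            matches += 1

    if columns == 0:
        raise ValueError("no aligned columns to compare")
    return 100 * matches / columns


# Toy example only: these strings are invented for illustration.
toy_query = "GIVEQCC-TSIC"
toy_subject = "GIVDQCCATS-C"
print(percent_identity(toy_query, toy_subject))  # 9 of 12 columns match
```

Percent identity alone doesn't tell you whether a match is meaningful, which is why the E value matters too.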
These results bothered me.
Clue 1: Over 202 plant genomes have been sequenced and are in the NCBI database. Why was there only one result from a plant? That seemed strange. If plants really make insulin, there should be more results from plant sequences.
So, I clicked the accession number (PIR A59151) to learn more about this protein sequence from jack beans and where it came from.
Clue 2: The next clue to a possible problem came from the title of the paper:
"Jack bean seed coat contains a protein with complete sequence homology to bovine insulin."
Not only was this sequence the only sequence from any plant that matched insulin, it was supposedly identical to insulin from a cow!
I decided to look for the paper, and I found that this group had published a second paper in 2003 with this title:
"A protein with amino acid sequence homology to bovine insulin is present in the legume Vigna unguiculata (cowpea)."
Clue 3: Not only was this group the only lab to ever find a plant insulin protein, the protein they sequenced was identical to cow insulin, and they found it twice.
As I skimmed through the papers, I learned that this group obtained the sequence of these proteins by protein sequencing, and they were using bovine insulin in the lab for other experiments.
This sounded fishy.
Clue 4: To be complete, I also used tblastn to translate the DNA from all 202 plant genomes and compare the predicted protein sequences to the insulin protein sequence. Suffice it to say, none of the 202 plant genomes contained genes for bovine insulin.
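To give a feel for what "translate the DNA and compare" means, here's a toy Python sketch that translates a DNA string in all six reading frames and reports which frames contain a given peptide. This is not tblastn itself, which scores local alignments instead of looking for exact matches. The function names and the short DNA string are made up for illustration, and I'm assuming the standard genetic code.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate from the first base; unknown codons become 'X', stops '*'."""
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(dna: str) -> dict:
    """Return translations keyed by frame name: +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


def frames_containing(dna: str, peptide: str) -> list:
    """Exact-match search only; a real search tolerates mismatches and gaps."""
    return [
        name
        for name, protein in six_frame_translations(dna).items()
        if peptide in protein
    ]


# Invented example: one extra base, then codons for G-I-V-E-Q, then a stop.
toy_dna = "A" + "GGCATTGTGGAACAG" + "TAA"
print(frames_containing(toy_dna, "GIVEQ"))
```

Searching both strands in all three offsets matters because a gene can sit on either strand and start at any position.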
It seems pretty clear that the jack bean insulin sequence was obtained from a bit of contaminating cow protein.
Why is this sequence in the NCBI database if it's misidentified?
The presence of the cow insulin sequence from jack beans illustrates an important point about the NCBI database. It's an archive. Sequences get entered that aren't always right and they can persist.
The take-home message is this: when a result seems too unlikely to be true, it probably isn't true.
Be sure to look for other lines of evidence before deciding a result is correct.
It seems to me that sequences like this should have some kind of red flag or notation to indicate a potential problem. Until that time, examples like these can be good case studies for biology students.
1. Oliveira, A.E.A., Machado, O.L.T., Gomes, V.M., Xavier-Neto, J., Pereira, A.C.P., Vieira, J.G.H., Fernandes, K.V.S. and Xavier-Filho, J. "Jack bean seed coat contains a protein with complete sequence homology to bovine insulin." Protein Pept. Lett. 6, 15-21 (1999).
2. Venancio, T.M., Oliveira, A.E.A., Silva, L.B., Machado, O.L.T., Fernandes, K.V.S. and Xavier-Filho, J. "A protein with amino acid sequence homology to bovine insulin is present in the legume Vigna unguiculata (cowpea)." Brazilian Journal of Medical and Biological Research 36, 1167-1173 (2003).