No more delays! BLAST away!
Time to blast. Let's see what it means for sequences to be similar.
First, we'll plan our experiment. When I think about digital biology experiments, I organize the steps in the following way:
A. Defining the question
B. Making the data sets
C. Analyzing the data sets
D. Interpreting the results
I'm going to intersperse my results with a few instructions so you can repeat the things that I've done below. I've seen some people write that only experts should be analyzing data, but I disagree. Okay, expert input is important for the final interpretation, but there's no compelling reason to keep anyone from evaluating public data for themselves.
A. What question are we going to ask?
The question we're going to address today is:
How similar are the 2009 H1N1 sequences to each other?
Why do we care? I think it would be a good frame of reference.
B. Make the data set(s)
1. Go to the NCBI influenza resources database.
Every time I visit, they have more sequences!
(But they still don't have the sequences from Mexico! Why are those sequences missing?)
2. I decided to look at the influenza nucleotide segments that code for the hemagglutinin protein (HA). This is the H part of the strain name in H1N1. (why nucleotides?)
To get the sequences, I searched for any Human Influenza A nucleotide sequences between 2009-03-01 and 2009-05-01. (How to do the search.)
I limited the search to H1 nucleotide sequences and required that the sequences be full length (you need all the sequences to be the same length to do good comparisons, and some of the new sequences are only partial sequences).
This gave me 10 H1 sequences.
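If you download sequences in FASTA format and want to apply the full-length requirement yourself, here is a minimal Python sketch of the idea. It is not what the NCBI search does internally. The record names, the short toy sequences, and the min_length cutoff are all made up for illustration; you would need to pick a cutoff that suits the real HA segment.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {record id: sequence}."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The record id is the first word after the '>' character.
            current_id = line[1:].split()[0]
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line.upper())
    return {rid: "".join(parts) for rid, parts in records.items()}


def keep_full_length(records, min_length):
    """Keep only the records that are at least min_length nucleotides long."""
    return {rid: seq for rid, seq in records.items() if len(seq) >= min_length}


# Toy data: made-up ids and short made-up sequences, NOT real HA segments.
toy_fasta = """>seqA toy full-length record
ATGAAGGCAATACTAGTAGT
TCTGCTATAT
>seqB toy partial record
ATGAAGGCAAT
>seqC toy full-length record
ATGAAGGCAATACTAGTAGTTCTGCTATAC
"""

records = parse_fasta(toy_fasta)
# min_length=30 is a hypothetical cutoff chosen to fit the toy data.
full_length = keep_full_length(records, min_length=30)
print(sorted(full_length))  # ['seqA', 'seqC']
```

The partial record (seqB) is dropped, which is the same effect as ticking the full-length option in the search form.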
C. Doing the analyses
I did two things to look at similarity. First, since I wanted to look at full-length sequences, I downloaded the accession numbers and compared them with BLAST. Second, I downloaded the sequences and used JalView and ClustalW to make an image so you can better see the similarity.
1. Go to the BLAST nucleotide page.
I selected the checkbox for aligning two or more sequences since I only want to compare these sequences to each other. Then, I used one of the accession numbers as the query and all ten as the subject sequences.
Here are the accession numbers so you can do this yourself:
Full length human H1 subject sequences: CY039527
I had all these in a file, so I uploaded the file.
2. Then, I clicked BLAST.
I found that all the new sequences (in my data set) were between 99% and 100% identical to the one I used as a query. That one, my positive control, was of course 100% identical to itself (surprise!).
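To make "percent identical" concrete, here is a small Python sketch that counts matching positions between two sequences that have already been aligned. It is a simplified illustration, not BLAST's own code: it assumes the two strings are the same length, with '-' marking gaps, and the sequences are toy data rather than real HA segments.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of aligned columns where both sequences have the same base.

    Assumes the sequences are already aligned (same length, '-' for gaps).
    A column with a gap in only one sequence counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue  # skip columns where both sequences have a gap
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no columns to compare")
    return 100.0 * matches / compared


# Toy sequences, made up for illustration.
query = "ATGAAGGCAATACTAGTAGT"
subject = "ATGAAGGCAATACTAGTGGT"  # differs from the query at one position

print(percent_identity(query, query))    # 100.0 (the positive control)
print(percent_identity(query, subject))  # 95.0
```

One mismatch in 20 positions gives 95% here. Over a full-length HA segment, a handful of differences works out to the 99% range.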
Changing the formatting to a query-anchored alignment shows us a little bit about the similarity. I'm only showing part of the alignment in the image below, but you can see from the few positions with differences that the sequences are pretty similar.
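If you haven't seen a query-anchored view before, the idea is that each subject sequence is lined up under the query and only the differences stand out. Here is a rough Python sketch of that idea, using a dot for every position that matches the query. It is my own simplified imitation with toy sequences, not BLAST's formatter, and it assumes the sequences are already aligned to the same length.

```python
def query_anchored_view(query, subjects):
    """Return text rows showing each subject relative to the query.

    Positions identical to the query are drawn as '.', and positions
    that differ show the subject's own base.
    """
    lines = [f"{'query':<8}{query}"]
    for name, seq in subjects.items():
        if len(seq) != len(query):
            raise ValueError(f"{name} is not aligned to the query")
        row = "".join("." if s == q else s for q, s in zip(query, seq))
        lines.append(f"{name:<8}{row}")
    return lines


# Toy sequences, made up for illustration.
query = "ATGAAGGCAA"
subjects = {
    "seqB": "ATGAAGGCAA",  # identical to the query
    "seqC": "ATGACGGCAT",  # two differences
}

for line in query_anchored_view(query, subjects):
    print(line)
# query   ATGAAGGCAA
# seqB    ..........
# seqC    ....C....T
```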
These HA sequences are all really similar. That's why people are convinced that it's the same strain of virus, at least in Texas and California.
But, how does this compare to other HA proteins?
There's a really nice, user-friendly program called "JalView" that I like to use when working with multiple sequence alignments. JalView has some web connections that you can use to do multiple alignments with the ClustalW or Muscle algorithms. It's easy to edit, add, or group sequences. And, most of all, I love the coloring options.
In JalView, I added a human H3 nucleotide sequence for comparison and did a multiple alignment with ClustalW. Three pairs of HA sequences were identical, so I put those into groups. Then, I colored the sequences by identity to make the differences stand out.
The top picture shows part of the sequence, and the bottom image shows the nucleotide sequences from the entire HA segment. Groups of identical sequences appear as dark bars and any differences appear as white lines.
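I did the grouping by hand in JalView, but finding sets of identical sequences is easy to script too. Here is a minimal Python sketch, assuming you already have the sequences in a dictionary keyed by name. The names and sequences below are toy data, not the real accessions.

```python
from collections import defaultdict


def group_identical(records):
    """Return lists of record ids that share exactly the same sequence.

    Sequences that are unique in the data set are left out.
    """
    groups = defaultdict(list)
    for rid, seq in records.items():
        groups[seq.upper()].append(rid)  # compare case-insensitively
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Toy data, made up for illustration.
toy_records = {
    "seqA": "ATGAAG",
    "seqB": "ATGAAG",
    "seqC": "ATGACG",
    "seqD": "atgacg",  # same as seqC apart from letter case
    "seqE": "ATGTTT",  # unique, so it ends up in no group
}

print(group_identical(toy_records))  # [['seqA', 'seqB'], ['seqC', 'seqD']]
```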
D. Interpreting the results
From these analyses, we can see that:
- The H1 nucleotide sequences from the April 2009 outbreak (in California and Texas, at least) are at least 99% identical.
- The H3 sequence is very, very different from the H1 nucleotide sequences (only about 50% identical).
Next - what do we see when we blast the whole database?