Next Generation Sequencing adds thousands of new genes


Sandra Porter

I had the good fortune on Thursday to hear a fascinating talk on deep transcriptome analysis by Chris Mason, Assistant Professor at the Institute for Computational Biomedicine at Cornell University.

Several intriguing observations were presented during the talk.  I'll present the key points first and then discuss the data.

These data concern the human transcriptome, and at least some of the results are supported by follow-on studies with data from the pig-tailed macaque.

Some of the most interesting points from Mason's talk were:

  1. A large fraction of the existing genome annotation is wrong.
  2. We have far more than 30,000 genes, perhaps as many as 88,000. 
  3. About ten thousand genes use over 6 different sites for polyadenylation.
  4. 98% of all genes are alternatively spliced.
  5. Several thousand genes are transcribed from the "anti-sense" strand.
  6. Lots of genes don't code for proteins.  In fact, most genes don't code for proteins.

Mason also described the discovery of 26,187 new genes that were present in at least two different tissue types.  

This shakes things up a bit. 

What data supports these claims? And what does this have to do with the definition of a gene?

The data and analyses come from work that Mason has been involved in during the past few years (1-5). Much of the data came from the SEQC consortium, a group established to evaluate the reproducibility of Next Generation Sequencing technologies.  The SEQC project was initiated by the same group (MAQC) that examined reproducibility in microarrays (4).  The transcriptomes in these experiments were characterized using NGS RNA-Seq data from Roche (454), Helicos, Illumina (formerly Solexa), and LifeTech (formerly ABI) from 16 different human tissues.  Some of the analyses came from a collaboration with Geospiza (3).

Background: What is a transcriptome?

A transcriptome is the complete collection of all the RNA molecules in a cell. Figure 1 shows many types of RNA that have been classified so far.  All of these molecules are called transcripts since they're produced by transcription. 

I think it's interesting that 11 different types of RNA are shown below and only one type codes for protein (mRNA). 


Fig. 1.  RNA drawing from FinchTalk used with permission from Todd Smith, Geospiza, Inc. (6).

What have we been measuring and how did we get so many things wrong?

Mason began the seminar by reminding us that until 2009, our knowledge of the human transcriptome was based on a small number of cDNA libraries of questionable quality.

To put this information in perspective, I'm including a table that summarizes the total number of sequences in dbEST in 2009.  At that time, about 8 million sequences were available from humans.  It should also be noted that many of these cDNA libraries came from tumors or other unusual tissue types, which may have altered the composition of their transcriptomes relative to normal tissues. 

Fig. 2.  Image from FinchTalk used with permission from Geospiza.

Eight million sequences sounds like quite a bit and it does represent 4-5 Gigabases of transcriptome sequence data.  Today, however, we have over 100 times more.  SEQC alone has obtained 600 Gb of RNA sequence data from sixteen human tissues and tens of billions of RNA molecules.  All this extra data has given us a much more comprehensive picture of the activities inside a cell and the ways the human genome gets put to use.
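As a quick sanity check on those numbers, here is a small back-of-the-envelope calculation. It uses only the figures quoted above and assumes that "Gb" means gigabases in both cases; the variable names are my own.

```python
# Figures quoted in the post (assumption: "Gb" = gigabases, i.e. 1e9 bases)
est_count = 8_000_000                  # human sequences in dbEST in 2009
est_bases_low = 4_000_000_000          # lower estimate: 4 Gb
est_bases_high = 5_000_000_000         # upper estimate: 5 Gb
seqc_bases = 600_000_000_000           # 600 Gb of RNA-Seq data from SEQC

# Implied average length of a dbEST sequence
mean_len_low = est_bases_low / est_count
mean_len_high = est_bases_high / est_count

# How many times more sequence SEQC produced than all of the 2009 human ESTs
fold_low = seqc_bases / est_bases_high
fold_high = seqc_bases / est_bases_low

print(f"Implied EST length: {mean_len_low:.0f}-{mean_len_high:.0f} bases")
print(f"SEQC vs. dbEST: {fold_low:.0f}-{fold_high:.0f} times more bases")
```

Under these assumptions the average EST works out to roughly 500-625 bases, and SEQC alone supplies about 120-150 times the bases in the 2009 human ESTs, which is consistent with "over 100 times more."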

Collecting and analyzing more data has emphasized how little we knew before and how much has changed.

The larger volume of data has also led to the conclusion that many of the annotations in RefSeq and Ensembl are incorrect or at least incomplete.  Even in June 2009, comparing AceView with the data from MAQC and RefSeq indicated that many exons were missing (5).  Mason pointed out that there are at least twice as many exons as were thought, and that many more transcripts are spliced in new ways and polyadenylated at different locations.

How do we identify genes in RNA-Seq data?

There are several data analysis pipelines that researchers use. Each pipeline is specific for a particular type of analysis and there can be many steps depending on the research question.  The slide in the seminar had 20-50 little boxes of different operations. 

The pipelines that I've used and am most familiar with identify protein-coding genes by aligning RNA-Seq data to annotated data from sources like RefSeq.  After generating the alignments, the number of aligning sequences is counted for each position.  Since each alignment represents a transcript, the alignments allow us to count the number of RNA molecules produced from every gene.
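The counting step can be sketched in a few lines of Python. This is only a toy illustration, not the code from any real pipeline: the gene names, coordinates, and the count_reads_per_gene function are all made up, and real tools also have to deal with spliced reads, strand, and reads that map to more than one place.

```python
from collections import Counter

# Hypothetical annotation: gene name -> (chromosome, start, end), 1-based inclusive
genes = {
    "GENE_A": ("chr1", 100, 500),
    "GENE_B": ("chr1", 800, 1200),
}

# Hypothetical alignments: (chromosome, start, end) for each aligned read
alignments = [
    ("chr1", 120, 170),
    ("chr1", 450, 500),
    ("chr1", 490, 540),   # hangs off the end of GENE_A but still overlaps it
    ("chr1", 900, 950),
    ("chr1", 600, 650),   # falls between the two genes
]

def count_reads_per_gene(genes, alignments):
    """Count reads that overlap exactly one annotated gene."""
    counts = Counter({name: 0 for name in genes})
    unassigned = 0
    for chrom, start, end in alignments:
        # Two intervals overlap when each one starts before the other ends
        hits = [
            name
            for name, (g_chrom, g_start, g_end) in genes.items()
            if chrom == g_chrom and start <= g_end and end >= g_start
        ]
        if len(hits) == 1:
            counts[hits[0]] += 1
        else:
            unassigned += 1   # no gene, or ambiguous between several genes
    return counts, unassigned

counts, unassigned = count_reads_per_gene(genes, alignments)
print(dict(counts), "unassigned:", unassigned)
```

The reads that end up in the unassigned pile are the interesting ones for the rest of this story, since they don't match anything in the annotation.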

If the sequences do not align to RefSeq, they might be identified through alignments to other databases, such as the databases for microRNAs, Ensembl, or AceView.

An alternative approach described in Mason's seminar is to take the RNA-Seq data and assemble it. From the assembled data, Mason's group found that thousands of repetitive elements are expressed in a tissue-specific manner.  In the non-repetitive DNA, they found about 26,000 new "genes."  Many of these new genes were expressed at low levels and transcribed from the opposite strand of known genes, in regions of introns. Further, these new genes do not code for proteins and their function is unknown.
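To make "transcribed from the opposite strand of known genes, in regions of introns" a little more concrete, here is a minimal sketch of how an assembled transcript could be tested for that pattern. It is not Mason's method; the Gene and Transcript types, the coordinates, and the is_antisense_intronic function are hypothetical and only illustrate the idea.

```python
from typing import List, NamedTuple, Tuple

class Gene(NamedTuple):
    name: str
    chrom: str
    strand: str                       # "+" or "-"
    exons: List[Tuple[int, int]]      # sorted, 1-based inclusive (start, end)

class Transcript(NamedTuple):
    chrom: str
    start: int
    end: int
    strand: str

def introns(gene: Gene) -> List[Tuple[int, int]]:
    """Introns are the gaps between consecutive exons."""
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(gene.exons, gene.exons[1:])
    ]

def is_antisense_intronic(tx: Transcript, gene: Gene) -> bool:
    """True if tx is on the opposite strand and sits entirely inside one intron."""
    if tx.chrom != gene.chrom or tx.strand == gene.strand:
        return False
    return any(i_start <= tx.start and tx.end <= i_end
               for i_start, i_end in introns(gene))

# Hypothetical known gene with three exons on the plus strand
known = Gene("GENE_A", "chr1", "+", [(100, 200), (1000, 1100), (2000, 2100)])

novel = Transcript("chr1", 300, 700, "-")        # inside the first intron, opposite strand
same_strand = Transcript("chr1", 300, 700, "+")  # same location, same strand as the gene
exon_overlap = Transcript("chr1", 150, 400, "-") # opposite strand but overlaps an exon

print(is_antisense_intronic(novel, known))
print(is_antisense_intronic(same_strand, known))
print(is_antisense_intronic(exon_overlap, known))
```

Only the first transcript passes both tests. Note that a check like this depends entirely on the existing annotation, so it inherits whatever errors the annotation contains.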

How do all these new transcripts fit into our definition of a gene?

At one time we considered genes to be regions of DNA that coded for proteins.  That definition changed when we realized that ribosomes contained non-coding RNA and expanded with the realization that there are many types of enzymatically active RNA molecules inside a cell.  tRNAs, microRNAs, and assorted regulatory RNAs have given further insights into the many roles that RNAs play.  Now, we know from Mason's work and others (7) that most of the RNA in a cell doesn't code for proteins.

We also used to dismiss some transcripts as pseudogenes and other transcripts as "noise."  Maybe we were wrong, but it's still not clear which transcripts are encoded by genes and which represent "noise."

Maybe every DNA region that produces a transcript is a gene.

One thing is certain.  We won't be able to count the number of genes in a cell until we can agree on what a gene is.


1.  Chris Mason's seminar 1-6-2011,  UW Systems Biology speaker series.

2.  Marioni, J., Mason, C., Mane, S., Stephens, M., & Gilad, Y. (2008). RNA-seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Research, 18 (9), 1509-1517 DOI: 10.1101/gr.079558.108

3.  Mason CE, Zumbo P, Sanders S, Folk M, Robinson D, Aydt R, Gollery M, Welsh M, Olson NE, & Smith TM (2010). Standardizing the next generation of bioinformatics software development with BioHDF (HDF5). Advances in Experimental Medicine and Biology, 680, 693-700 PMID: 20865556

4.  Shi, L., et al. (2010). The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nature Biotechnology, 28 (8), 827-838 DOI: 10.1038/nbt.1665

5.  Mane, S., Evans, C., Cooper, K., Crasta, O., Folkerts, O., Hutchison, S., Harkins, T., Thierry-Mieg, D., Thierry-Mieg, J., & Jensen, R. (2009). Transcriptome sequencing of the Microarray Quality Control (MAQC) RNA reference samples using next generation sequencing. BMC Genomics, 10 (1) DOI: 10.1186/1471-2164-10-264

6.  Todd Smith.  May, 2009 FinchTalk.  Small RNAs get smaller.

7.  Kapranov, P., St. Laurent, G., Raz, T., Ozsolak, F., Reynolds, C., Sorensen, P., Reaman, G., Milos, P., Arceci, R., Thompson, J., & Triche, T. (2010). The majority of total nuclear-encoded non-ribosomal RNA in a human cell is 'dark matter' un-annotated RNA. BMC Biology, 8 (1) DOI: 10.1186/1741-7007-8-149