Super Long Introns of Euarchontoglires
There was another weird result in my exon/intron boundary analysis. As we move toward the genes of less diverse species, intron lengths are shown to increase. However, according to my findings, at the level of Euarchontoglires (also called Supraprimates) this increase is very sharp and looks unexpected. So I looked at the exon and intron lengths of each gene in each taxonomic rank to see what gives Euarchontoglires genes such long introns.
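For reference, per-gene intron lengths like these can be derived from exon coordinates alone. The sketch below is only an illustration: it assumes 1-based, inclusive (start, end) exon coordinates on one strand, and the function names and example coordinates are made up rather than taken from the analysis script.

```python
from statistics import mean


def intron_lengths(exons):
    """Return intron lengths for a list of (start, end) exon coordinates.

    Assumes 1-based, inclusive coordinates and non-overlapping exons.
    """
    ordered = sorted(exons)
    # Each intron spans the gap between one exon's end and the next exon's start.
    return [
        next_start - prev_end - 1
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def mean_intron_length(exons):
    """Average intron length, or 0.0 for single-exon transcripts."""
    lengths = intron_lengths(exons)
    return mean(lengths) if lengths else 0.0


# Hypothetical transcript with three exons and therefore two introns.
example_exons = [(100, 200), (301, 400), (1001, 1100)]
print(intron_lengths(example_exons))
print(mean_intron_length(example_exons))
```

A transcript whose average exceeds a chosen cutoff (10,000 bp, for instance) can then be flagged for a closer look.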
As you can see in the graph above, Euarchontoglires introns are very long compared to the rest. So I pulled out the Euarchontoglires genes whose introns are longer than 10,000 bp on average, which are:
When I checked their summaries on Ensembl, I saw that most of them have transcripts that are not protein coding, and these tend to have longer introns than protein-coding transcripts do.
So a solution might be to retrieve the biotypes of the transcripts and filter out the ones that are not protein coding, because in this project we're focusing on protein-coding genes.
Remember the Ensembl API? Getting biotypes with it is really easy. All I needed to do was add the following to my script:
So I got my data with its biotype information and filtered out the transcripts that are not protein coding. Later, when I repeated the analysis with the new data, the unexpected peak in Euarchontoglires intron lengths was gone.
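To make the filtering step concrete, here is a minimal Python sketch of the idea. It is not the Ensembl API script itself: the records, transcript IDs, and intron lengths are invented for illustration, and the only assumption about Ensembl is that protein-coding transcripts carry the biotype string "protein_coding".

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (taxonomic_rank, transcript_id, biotype, mean_intron_length_bp)
records = [
    ("Euarchontoglires", "T1", "protein_coding", 2500.0),
    ("Euarchontoglires", "T2", "lncRNA", 45000.0),
    ("Euarchontoglires", "T3", "processed_transcript", 30000.0),
    ("Eutheria", "T4", "protein_coding", 2000.0),
    ("Eutheria", "T5", "protein_coding", 3000.0),
]


def keep_protein_coding(rows):
    """Drop every transcript whose biotype is not 'protein_coding'."""
    return [row for row in rows if row[2] == "protein_coding"]


def mean_by_rank(rows):
    """Average the per-transcript mean intron length within each taxonomic rank."""
    grouped = defaultdict(list)
    for rank, _transcript_id, _biotype, length in rows:
        grouped[rank].append(length)
    return {rank: mean(lengths) for rank, lengths in grouped.items()}


print("before filtering:", mean_by_rank(records))
print("after filtering: ", mean_by_rank(keep_protein_coding(records)))
```

With these made-up numbers, the two non-coding transcripts are what inflate the Euarchontoglires average, and removing them brings it back in line with the other rank.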
There is still a lot to be done, of course, but this is how I solved this particular issue.