Convert Gene Symbols to Entrez IDs in R
By Güngör Budak
Bioinformatics studies usually use gene symbols as identifiers (IDs) because they are more recognizable than other IDs such as Entrez IDs. However, some analyses and tools do not accept gene symbols: a gene often has more than one symbol, which makes it harder to implement a method that works with them. In such cases you need to convert between identifier types, which is a very common task in bioinformatics.
For this task, I have been using the org.Hs.eg.db Bioconductor package, which has worked very well so far. It is a genome-wide annotation for human, primarily based on mapping using Entrez Gene identifiers.
Open the R console (or RStudio and go to its console) and use the following commands to install and load the package:
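# install (current Bioconductor releases use BiocManager; older ones used biocLite)
if (!requireNamespace('BiocManager', quietly = TRUE)) install.packages('BiocManager')
BiocManager::install('org.Hs.eg.db')

# load
library('org.Hs.eg.db')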
Run columns(org.Hs.eg.db) to see the identifier types available in this package. There are quite a few, including Ensembl IDs, UniProt IDs, protein families and GO annotations:
columns(org.Hs.eg.db)
 [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT"
 [5] "ENSEMBLTRANS" "ENTREZID" "ENZYME" "EVIDENCE"
 [9] "EVIDENCEALL" "GENENAME" "GO" "GOALL"
[13] "IPI" "MAP" "OMIM" "ONTOLOGY"
[17] "ONTOLOGYALL" "PATH" "PFAM" "PMID"
[21] "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG"
[25] "UNIGENE" "UNIPROT"
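If you are not sure which of these types your own IDs belong to, it can help to look at what the package accepts as input. The following is a minimal sketch, assuming org.Hs.eg.db is installed and loaded as shown earlier; the variable names key_types and symbol_keys are only illustrative.

```r
library(org.Hs.eg.db)

# Identifier types that can be given as the input type (keytype)
key_types <- keytypes(org.Hs.eg.db)
key_types

# All gene symbols known to the package; print just the first few
symbol_keys <- keys(org.Hs.eg.db, keytype = 'SYMBOL')
head(symbol_keys)
```

The exact lists depend on the version of the package you have installed, so your output may differ from the one shown in this post.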
Let's make a sample gene symbol list to work with and do the conversion using mapIds, which takes four arguments: the first is the annotation object itself, the second is the vector of identifiers (symbols in this case), the third is the identifier type we want to convert to, and the last is the identifier type of the second argument:
# you will have your own list here
symbols <- c('AHNAK', 'BOD1L1', 'HSPB1', 'SMARCA4', 'TRIM28')

# use mapIds method to obtain Entrez IDs
mapIds(org.Hs.eg.db, symbols, 'ENTREZID', 'SYMBOL')
'select()' returned 1:1 mapping between keys and columns
 AHNAK BOD1L1 HSPB1 SMARCA4 TRIM28
 "79026" "259282" "3315" "6597" "10155"
As you can see, mapIds returned the Entrez gene IDs for the given gene symbols.
You can assign the result to a variable and use it wherever you want.
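As a rough sketch of what that can look like, the example below stores the result, checks for symbols that could not be mapped, and builds a small lookup table. It assumes org.Hs.eg.db is installed; 'NOT_A_GENE' is a made-up symbol added only to show an unmapped key, and the names entrez, unmapped and mapping are illustrative.

```r
library(org.Hs.eg.db)

# 'NOT_A_GENE' is a hypothetical symbol, included to show an unmapped key
symbols <- c('AHNAK', 'TRIM28', 'NOT_A_GENE')

# Same call as before, with the arguments named for readability
entrez <- mapIds(org.Hs.eg.db, keys = symbols,
                 column = 'ENTREZID', keytype = 'SYMBOL')

# Symbols without a match come back as NA
unmapped <- names(entrez)[is.na(entrez)]

# Two-column table holding only the successful mappings
mapping <- data.frame(symbol = names(entrez),
                      entrez_id = unname(entrez),
                      stringsAsFactors = FALSE)
mapping <- mapping[!is.na(mapping$entrez_id), ]
```

Note that mapIds returns only the first match by default when a key maps to several IDs; its multiVals argument (for example multiVals = 'list') controls that behavior.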
Check out the org.Hs.eg.db reference manual for more information.