> Gene Function Annotations > Orthologous gene mapping across species

Orthologous gene mapping across species

1.1.1 biomaRt
1.1.2 homologene
1.1.3 g:Orth
1.1.4 OmaDB
1.1.5 EggNOG
1.1.6 OrthoVenn2
1.1.7 OrthoDB
1.1.8 ORCAN
1.1.9 KEGG Orthology
1.1.10 InParanoid
1.1.11 COG

This document was provided to support the orthologous gene mapping across species (especially non-model taxa) for the downstream data analysis.

Breaking: A summary of eleven R packages and web services tools, including the number of species, package installation, operating vignettes.

1.1.1 biomaRt

Introduction: In recent years a wealth of biological data has become available in public data repositories. Easy access to these valuable data resources and firm integration with data analysis is needed for comprehensive bioinformatics data analysis. The biomaRt package, provides an interface to a growing collection of databases implementing the BioMart software suite. The package enables retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas or write complex SQL queries. Examples of BioMart databases are Ensembl, Uniprot and HapMap. These major databases give biomaRt users direct access to a diverse set of data and enable a wide range of powerful online queries from R.

Installation: To install this package, start R (version “4.2”) and enter:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
    BiocManager::install("biomaRt")

Aplication: This vignette details provide a number of example usecases that can be used as the basis for specifying your own queries. biomaRt provides a number of functions that are tailored to work specifically with the BioMart instances provided by Ensembl.

For older versions of R, please refer to the appropriate Bioconductor release.

1.1.2 homologene

Introduction: A wrapper for the homologene database by the National Center for Biotechnology Information (‘NCBI’). It allows searching for gene homologs across species. The package also includes an updated version of the homologene database where gene identifiers and symbols are replaced with their latest (at the time of submission) version and functions to fetch latest annotation data to keep updated. Given a list of genes and a taxid, returns a data frame inlcuding the genes and their corresponding homologues.

More species can be added on request.

Installation: Install the latest version of this package by entering the following in R:

 install.packages("homologene")

library(devtools)
install_github('oganm/homologene')

Aplication: Given a list of genes and a taxid, returns a data frame inlcuding the genes and their corresponding homologues according to the manual.

Examples:

homologene::homologene(c('Eno2','17441'), inTax = 10090, outTax = 9606)

##   10090 9606 10090_ID 9606_ID
## 1  Eno2 ENO2    13807    2026
## 2   Mog  MOG    17441    4340

1.1.3 g:Orth

Introduction: g:Orth (part of the g:Profiler) translates gene identifiers between organisms. We provide orthologous gene mappings based on the information retrieved from the 199 Ensembl database.

Installation: To install this package, start R (version “4.2”) and enter:

install.packages("gprofiler2")

Aplication: g:Orth translates gene identifiers between organisms. For example, to convert gene list between mouse (source_organism = mmusculus) and human (target_organism = hsapiens):

library(gprofiler2)
gorth(query = c("Klf4", "Sox2", "71950"), source_organism = "mmusculus",
target_organism = "hsapiens", numeric_ns = "ENTREZGENE_ACC")

##   input_number input         input_ensg ensg_number ortholog_name
## 1            1  Klf4 ENSMUSG00000003032       1.1.1          KLF4
## 2            2  Sox2 ENSMUSG00000074637       2.1.1          SOX2
## 3            3 71950 ENSMUSG00000012396       3.1.1         NANOG
## 4            3 71950 ENSMUSG00000012396       3.1.2       NANOGP8
##     ortholog_ensg
## 1 ENSG00000136826
## 2 ENSG00000181449
## 3 ENSG00000111704
## 4 ENSG00000255192
##                                                          description
## 1           Kruppel like factor 4 [Source:HGNC Symbol;Acc:HGNC:6348]
## 2 SRY-box transcription factor 2 [Source:HGNC Symbol;Acc:HGNC:11195]
## 3                 Nanog homeobox [Source:HGNC Symbol;Acc:HGNC:20857]
## 4    Nanog homeobox retrogene P8 [Source:HGNC Symbol;Acc:HGNC:23106]

Web server: g:Orth g:Orth

1.1.4 OmaDB

Introduction: OmaDB is a wrapper for the REST API for the Orthologous MAtrix project (OMA) which is a database for the inference of orthologs among complete genomes. OMA’s inference algorithm consists of three main phases. First, to infer homologous sequences (sequences of common ancestry), all-against-all Smith-Waterman alignments are computed and significant matches are retained. Second, to infer orthologous pairs (the subset of homologs related by speciation events), mutually closest homologs are identified based on evolutionary distances, taking into account distance inference uncertainty and the possibility for differential gene losses (for more details, see Roth et al 2008). Third, these orthologs are clustered in two different ways, which are useful for different purposes: 1) We identify cliques of orthologous pairs (“OMA groups” ), which are useful as marker genes for phylogenetic reconstruction and tend to be very specific; 2) We identify hierarchical orthologous groups (“HOGs”), groups of genes defined for particular taxonomic ranges and identify all genes that have descended from a common ancestral gene in that taxonomic range.

Species information

Type	Number
species	2496
proteins	18871762
OMA groups	962065
deepest level HOGs	846849
proteins in a HOG	14029907 (74.3%)
proteins in an OMA Group	13481594 (71.4%)

Installation: To install this package, start R (version “4.2”) and enter:

  if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
    BiocManager::install("OmaDB")

For older versions of R, please refer to the appropriate Bioconductor release.

Aplication: This little vignette shows you how to get started with the OmaDB package. For more details on the OMA project.

Another useful function of the OmaDB package is its functionality to exactly and partially match sequences. A case of sequence as a example was provided in this vignette. Chapter page

1.1.5 EggNOG

Introduction: A hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. The initial step in the EggNOG pipeline is the clustering of the 9.6 million proteins from 2031 genomes. Homology comparisons are executed by the SIMAP initiative and processed by the EggNOG orthology prediction pipeline. Orthologous groups are automatically generated by dividing species space into ‘core’ species, which are central for defining orthologous groups using the strict triangular criterion, and ‘periphery’ species.

Web server: EggNOG v5.0. Chapter page

1.1.6 OrthoVenn2

Introduction: OrthoVenn is a powerful web platform for the comparison and analysis of whole-genome orthologous clusters among up to 12 species. A larger number of species, including 142 vertebrates, 71 metazoa, 65 protists, 94 fungi, 57 plants and 111 bacteria species, have been added in OrthoVenn2.

Installation: The tool provides three stand-alone installation versions:

Install Docker: The release of OrthoVenn2 as a Docker provides an isolated and self-contained package without the need to install dependencies and change environmental settings.

Visit Official install document to install docker on your server. It supports different systems, such as unbuntu, centos, mac, windows

Install Database: The analysis result data was stored in mongodb, so run the command below to install and start mongodb.

your_mongo_data_path is your mongo data path on your local server , if not exist it will create automatically.

sudo docker run --name venns-mongo -v /your_mongodb_path:/data/db -d -p   27017:27017 mongo:latest

Install Web API/Front: Web API is the main analysis program, so install and run it.

your_work_path is your mongo data path on your local server , if not exist it will create automatically.

sudo docker run -d --name venns -p 6001:6001 --link venns-mongo -v  /your_work_path:/data/orthovenn2 -e MONGO_HOST=venns-mongo lufang0411/orthovenn2:latest

At last, install and run the Web Front server to load the UI.

sudo docker run --name venns-front -p 9999:80 --link venns -v /your_work_path:/data/orthovenn2 -e API_HOST=venns -e API_PORT=6001 -d    lufang0411/orthovenn2-front:latest

You can use command sudo docker ps to check is all the three service start successfully. If all the sercies is ok, congratulations, you can visit http://localhost:9999 or http://your-machine-ip:9999 to use the local version ortho_venns program.

Web server: OrthoVenn2. Chapter page

1.1.7 OrthoDB

Introduction: OrthoDB presents a catalog of orthologous protein-coding genes across vertebrates, arthropods, fungi, plants, and bacteria. The database of orthologs presents available protein descriptors, together with Gene Ontology and InterPro attributes, which serve to provide general descriptive annotations of the orthologous groups, and facilitate comprehensive orthology database querying. OrthoDB also provides computed evolutionary traits of orthologs, such as gene duplicability and loss profiles, divergence rates, sibling groups, and gene intron-exon architectures.
OrthoDB can be queried using a gene name, identifier, annotation keywords, protein sequence, etc. Then it indexed many relevant identifiers of proteins and genes, including UniProtKB, Ensembl, InterPro, KEGG, GenBank, RefSeq, etc.

Eukaryotes	Prokaryotes
1,271	6,013

Web server: OrthoDB. Chapter page

1.1.8 ORCAN

Introduction: ORCAN (ORtholog sCANner) is a web-based meta-server for one-click evolutionary and functional annotation of protein sequences. The server combines information from the most popular orthology-prediction resources, including four tools and four online databases. Functional annotation utilizes five additional comparisons between the query and identified homologs, including: sequence similarity, protein domain architectures, functional motifs, Gene Ontology term assignments and a list of associated articles. Furthermore, the server uses a plurality-based rating system to evaluate the orthology relationships and to rank the reference proteins by their evolutionary and functional relevance to the query.

Chapter page Web server: ORCAN

1.1.9 KEGG Orthology

Introduction: The KO (KEGG Orthology) database is a database of molecular functions represented in terms of functional orthologs, includig 8504 species(813 eukaryotes,7291 bacteria, 400 archaea). A functional ortholog is manually defined in the context of KEGG molecular networks, namely, KEGG pathway maps, BRITE hierarchies and KEGG modules. Each node of the network, such as a box in the KEGG pathway map, is given a KO identifier (called K number) as a functional ortholog defined from experimentally characterized genes and proteins in specific organisms, which are then used to assign orthologous genes in other organisms based on sequence similarity.

Web server: KEGG Orthology KEGGOrthology

1.1.10 InParanoid

Introduction: InParanoid is a program for automatic identification of orthologs while differentiating between inparalogs and outparalogs. An InParanoid cluster is seeded by a reciprocally bestmatching ortholog pair, around which inparalogs are gathered independently, while outparalogs are excluded. The InParanoid database is a collection of pairwise ortholog groups aiming to include all ‘completely sequenced’ eukaryotic genomes.

Type	Number
Species	273
Species_pairs	37128
Ortholog_groups	79717666
Proteins_processed	3718323
Orthologous_proteins	2999062

Web server: InParanoid

1.1.11 COG

Introduction: The Clusters of Orthologous Genes (COG) database, also referred to as the Clusters of Orthologous Groups of proteins, was created in 1997 and went through several rounds of updates, most recently, in 2022. The current release includes 4877 COGs.

Web server: COG COG