This document was provided to support the orthologous gene mapping across species (especially non-model taxa) for the downstream data analysis.
Breaking: A summary of eleven R packages and web services tools, including the number of species, package installation, operating vignettes.
Introduction: In recent years a wealth of biological data has become available in public data repositories. Easy access to these valuable data resources and firm integration with data analysis is needed for comprehensive bioinformatics data analysis. The biomaRt package, provides an interface to a growing collection of databases implementing the BioMart software suite. The package enables retrieval of large amounts of data in a uniform way without the need to know the underlying database schemas or write complex SQL queries. Examples of BioMart databases are Ensembl, Uniprot and HapMap. These major databases give biomaRt users direct access to a diverse set of data and enable a wide range of powerful online queries from R.
Installation: To install this package, start R (version “4.2”) and enter:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("biomaRt")
Aplication: This vignette details provide a number of example usecases that can be used as the basis for specifying your own queries. biomaRt provides a number of functions that are tailored to work specifically with the BioMart instances provided by Ensembl.
For older versions of R, please refer to the appropriate Bioconductor release.
Introduction: A wrapper for the homologene database by the National Center for Biotechnology Information (‘NCBI’). It allows searching for gene homologs across species. The package also includes an updated version of the homologene database where gene identifiers and symbols are replaced with their latest (at the time of submission) version and functions to fetch latest annotation data to keep updated. Given a list of genes and a taxid, returns a data frame inlcuding the genes and their corresponding homologues.
Available species are: Homo sapiens| Mus musculus | Rattus norvegicus | Danio rerio | Caenorhabditis elegans | Drosophila melanogaster | Rhesus macaque |.
More species can be added on request.
Installation: Install the latest version of this package by entering the following in R:
install.packages("homologene")
library(devtools)
install_github('oganm/homologene')
Aplication: Given a list of genes and a taxid, returns a data frame inlcuding the genes and their corresponding homologues according to the manual.
Examples:
homologene::homologene(c('Eno2','17441'), inTax = 10090, outTax = 9606)
## 10090 9606 10090_ID 9606_ID
## 1 Eno2 ENO2 13807 2026
## 2 Mog MOG 17441 4340
Introduction: g:Orth (part of the g:Profiler) translates gene identifiers between organisms. We provide orthologous gene mappings based on the information retrieved from the 199 Ensembl database.
Installation: To install this package, start R (version “4.2”) and enter:
install.packages("gprofiler2")
Aplication: g:Orth translates gene identifiers between organisms. For example, to convert gene list between mouse (source_organism = mmusculus) and human (target_organism = hsapiens):
library(gprofiler2)
gorth(query = c("Klf4", "Sox2", "71950"), source_organism = "mmusculus",
target_organism = "hsapiens", numeric_ns = "ENTREZGENE_ACC")
## input_number input input_ensg ensg_number ortholog_name
## 1 1 Klf4 ENSMUSG00000003032 1.1.1 KLF4
## 2 2 Sox2 ENSMUSG00000074637 2.1.1 SOX2
## 3 3 71950 ENSMUSG00000012396 3.1.1 NANOG
## 4 3 71950 ENSMUSG00000012396 3.1.2 NANOGP8
## ortholog_ensg
## 1 ENSG00000136826
## 2 ENSG00000181449
## 3 ENSG00000111704
## 4 ENSG00000255192
## description
## 1 Kruppel like factor 4 [Source:HGNC Symbol;Acc:HGNC:6348]
## 2 SRY-box transcription factor 2 [Source:HGNC Symbol;Acc:HGNC:11195]
## 3 Nanog homeobox [Source:HGNC Symbol;Acc:HGNC:20857]
## 4 Nanog homeobox retrogene P8 [Source:HGNC Symbol;Acc:HGNC:23106]
Web server: g:Orth
Introduction: OmaDB is a wrapper for the REST API for the Orthologous MAtrix project (OMA) which is a database for the inference of orthologs among complete genomes. OMA’s inference algorithm consists of three main phases. First, to infer homologous sequences (sequences of common ancestry), all-against-all Smith-Waterman alignments are computed and significant matches are retained. Second, to infer orthologous pairs (the subset of homologs related by speciation events), mutually closest homologs are identified based on evolutionary distances, taking into account distance inference uncertainty and the possibility for differential gene losses (for more details, see Roth et al 2008). Third, these orthologs are clustered in two different ways, which are useful for different purposes: 1) We identify cliques of orthologous pairs (“OMA groups” ), which are useful as marker genes for phylogenetic reconstruction and tend to be very specific; 2) We identify hierarchical orthologous groups (“HOGs”), groups of genes defined for particular taxonomic ranges and identify all genes that have descended from a common ancestral gene in that taxonomic range.
Type | Number |
---|---|
species | 2496 |
proteins | 18871762 |
OMA groups | 962065 |
deepest level HOGs | 846849 |
proteins in a HOG | 14029907 (74.3%) |
proteins in an OMA Group | 13481594 (71.4%) |
Installation: To install this package, start R (version “4.2”) and enter:
if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("OmaDB")
For older versions of R, please refer to the appropriate Bioconductor release.
Aplication: This little vignette shows you how to get started with the OmaDB package. For more details on the OMA project.
Another useful function of the OmaDB package is its functionality to exactly and partially match sequences. A case of sequence as a example was provided in this vignette.
Introduction: A hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. The initial step in the EggNOG pipeline is the clustering of the 9.6 million proteins from 2031 genomes. Homology comparisons are executed by the SIMAP initiative and processed by the EggNOG orthology prediction pipeline. Orthologous groups are automatically generated by dividing species space into ‘core’ species, which are central for defining orthologous groups using the strict triangular criterion, and ‘periphery’ species.
Web server: EggNOG v5.0.
Introduction: OrthoVenn is a powerful web platform for the comparison and analysis of whole-genome orthologous clusters among up to 12 species. A larger number of species, including 142 vertebrates, 71 metazoa, 65 protists, 94 fungi, 57 plants and 111 bacteria species, have been added in OrthoVenn2.
Installation: The tool provides three stand-alone installation versions:
Visit Official install document to install docker on your server. It supports different systems, such as unbuntu, centos, mac, windows
your_mongo_data_path
is your mongo data path on your local server , if not exist it will create automatically.
sudo docker run --name venns-mongo -v /your_mongodb_path:/data/db -d -p 27017:27017 mongo:latest
your_work_path
is your mongo data path on your local server , if not exist it will create automatically.
sudo docker run -d --name venns -p 6001:6001 --link venns-mongo -v /your_work_path:/data/orthovenn2 -e MONGO_HOST=venns-mongo lufang0411/orthovenn2:latest
sudo docker run --name venns-front -p 9999:80 --link venns -v /your_work_path:/data/orthovenn2 -e API_HOST=venns -e API_PORT=6001 -d lufang0411/orthovenn2-front:latest
You can use command sudo docker ps
to check is all the three service start successfully. If all the sercies is ok, congratulations, you can visit http://localhost:9999
or http://your-machine-ip:9999
to use the local version ortho_venns program.
Web server: OrthoVenn2.
Introduction: OrthoDB presents a catalog of orthologous protein-coding genes across vertebrates, arthropods, fungi, plants, and bacteria. The database of orthologs presents available protein descriptors, together with Gene Ontology and InterPro attributes, which serve to provide general descriptive annotations of the orthologous groups, and facilitate comprehensive orthology database querying. OrthoDB also provides computed evolutionary traits of orthologs, such as gene duplicability and loss profiles, divergence rates, sibling groups, and gene intron-exon architectures.
OrthoDB can be queried using a gene name, identifier, annotation keywords, protein sequence, etc. Then it indexed many relevant identifiers of proteins and genes, including UniProtKB, Ensembl, InterPro, KEGG, GenBank, RefSeq, etc.
Eukaryotes | Prokaryotes |
---|---|
1,271 | 6,013 |
Web server: OrthoDB.
Introduction: ORCAN (ORtholog sCANner) is a web-based meta-server for one-click evolutionary and functional annotation of protein sequences. The server combines information from the most popular orthology-prediction resources, including four tools and four online databases. Functional annotation utilizes five additional comparisons between the query and identified homologs, including: sequence similarity, protein domain architectures, functional motifs, Gene Ontology term assignments and a list of associated articles. Furthermore, the server uses a plurality-based rating system to evaluate the orthology relationships and to rank the reference proteins by their evolutionary and functional relevance to the query.
Web server: ORCAN
Introduction: The KO (KEGG Orthology) database is a database of molecular functions represented in terms of functional orthologs, includig 8504 species(813 eukaryotes,7291 bacteria, 400 archaea). A functional ortholog is manually defined in the context of KEGG molecular networks, namely, KEGG pathway maps, BRITE hierarchies and KEGG modules. Each node of the network, such as a box in the KEGG pathway map, is given a KO identifier (called K number) as a functional ortholog defined from experimentally characterized genes and proteins in specific organisms, which are then used to assign orthologous genes in other organisms based on sequence similarity.
Web server: KEGG Orthology
Introduction: InParanoid is a program for automatic identification of orthologs while differentiating between inparalogs and outparalogs. An InParanoid cluster is seeded by a reciprocally bestmatching ortholog pair, around which inparalogs are gathered independently, while outparalogs are excluded. The InParanoid database is a collection of pairwise ortholog groups aiming to include all ‘completely sequenced’ eukaryotic genomes.
Type | Number |
---|---|
Species | 273 |
Species_pairs | 37128 |
Ortholog_groups | 79717666 |
Proteins_processed | 3718323 |
Orthologous_proteins | 2999062 |
Web server: InParanoid
Introduction: The Clusters of Orthologous Genes (COG) database, also referred to as the Clusters of Orthologous Groups of proteins, was created in 1997 and went through several rounds of updates, most recently, in 2022. The current release includes 4877 COGs.
Web server: COG