Interpreting the relationship between molecular mechanisms and phenotypes, in addition to the GO and pathway annotation, other functional attributes also provide a window into the genes, such as, diseases-associated gene annotations, human/mouse phenotypes, hallmark genes, oncogenic signature, immunologic signature, cell type signature, chromosome cytoband, subcellular locations, developmental stage, gene variant, literatures, domains annotaions, etc.
Breaking: The comprehensive gene/protein annotation databases are summarized in this document, including the number of species, package installation, operating vignettes.
Introduction: InterPro is the world’s most comprehensive resource for protein family and domain information. InterPro integrates protein signatures from 13 member databases, which use a variety of different methods to classify proteins. Each of the databases has a particular focus (e.g. protein domains defined from structure, or full length protein families with shared function).
InterPro integrates signatures from the following 13 member databases: CATH, CDD, HAMAP, MobiDB Lite, Panther, Pfam, PIRSF, PRINTS, Prosite, SFLD, SMART, SUPERFAMILY AND NCBIfam.
The CATH-Gene3D database describes protein families and domain architectures in complete genomes. Protein families are formed using a Markov clustering algorithm, followed by multi-linkage clustering according to sequence identity. Mapping of predicted structure and sequence domains is undertaken using hidden Markov models libraries representing CATH and Pfam domains.
CDD is a protein annotation resource that consists of a collection of annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI-curated domain models, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases.
HAMAP stands for High-quality Automated and Manual Annotation of Proteins. HAMAP profiles are manually created by expert curators. They identify proteins that are part of well-conserved protein families or subfamilies.
MobiDB offers a centralized resource for annotations of intrinsic protein disorder. The database features three levels of annotation: manually curated, indirect and predicted. The different sources present a clear tradeoff between quality and coverage. By combining them all into a consensus annotation, MobiDB aims at giving the best possible picture of the “disorder landscape” of a given protein of interest.
PANTHER is a large collection of protein families that have been subdivided into functionally related subfamilies, using human expertise. These subfamilies model the divergence of specific functions within protein families, allowing more accurate association with function, as well as inference of amino acids important for functional specificity. Hidden Markov models (HMMs) are built for each family and subfamily for classifying additional protein sequences.
Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains.
PIRSF protein classification system is a network with multiple levels of sequence diversity from superfamilies to subfamilies that reflects the evolutionary relationship of full-length proteins and domains.
PRINTS is a compendium of protein fingerprints. A fingerprint is a group of conserved motifs used to characterise a protein family or domain.
PROSITE is a database of protein families and domains. It consists of biologically significant sites, patterns and profiles that help to reliably identify to which known protein family a new sequence belongs.
SFLD (Structure-Function Linkage Database) is a hierarchical classification of enzymes that relates specific sequence-structure features to specific chemical capabilities.
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures.
SUPERFAMILY is a library of profile hidden Markov models that represent all proteins of known structure. The library is based on the SCOP classification of proteins: each model corresponds to a SCOP domain and aims to represent the entire SCOP superfamily that the domain belongs to.
Introduction: UCSC Genome Browser database is a web-based viewer for genome sequence data and annotations, the ability to juxtapose annotations of many types is one of the reasons for the popularity. The gene annotations including clinical variant interpretation, chromosome band location, pseudogene annotation, domain, cis-regulatory elements combination, organs/tissue/cell types annotation and gene interactions from curated databases and text-mining.
Web server: UCSC
Introduction: Molecular Signatures Database (MSigDB) is a resource of tens of thousands of annotated gene sets for use with GSEA software, divided into Human and Mouse collections.The 33196 gene sets in the Human Molecular Signatures Database (MSigDB) are divided into 9 major collections, and several sub-collections.
Web server: MSigDB
Introduction: HumanBase is a “one stop shop” for biological researchers interested in data-driven predictions of gene expression, function, regulation, and interactions in human, particularly in the context of specific cell types/tissues and human disease. This resource is not merely a public database of primary genomics data or biological literature. The data-driven integrative analyses (i.e. algorithms that “learn” from large genomic data collections) presented in HumanBase are especially powerful because they separate signal from noise in large biological data collections to reach beyond “existing biological knowledge” represented in the biological literature to identify novel associations that are not biased toward well-studied areas of biomedical research.
Type | Resource number |
---|---|
Ontology definitions | 3 |
Gene annotations | 3 |
Co-expression | 980 |
Interaction | 4 |
Genetic and chemical perturbations | 1 |
MicroRNA targets | 1 |
TF binding | 1 |
Web server: HumanBase
HumanBase from the following six analysis tools:
Functional gene networks In order to leverage the vast collections of raw, noisy genomic data, they must be integrated, summarized, and presented in a biologically informative manner. We provide a means of mining tens of thousands of whole-genome experiments by way of functional interaction networks. Each interaction network represents a body of data, probabilistically weighted and integrated, focused on a particular biological question.
Tissue-specific gene networks (GIANT) HumanBase builds genome-scale functional maps of human tissues by integrating a collection of data sets covering thousands of experiments contained in more than 14,000 distinct publications. We automatically assess each data set for its relevance to each of 144 tissueand cell lineage–specific functional contexts. The resulting functional gene maps provide a detailed portrait of protein function and interactions in specific human tissues and cell lineages ranging from B lymphocytes to the renal glomerulus and the whole brain. This approach allows HumanBase to profile the specialized function of genes in a high-throughput manner, even in tissues and cell lineages for which no or few tissue-specific data exist.
Functional module detection HumanBase applies community detection to find cohesive gene clusters from a provided gene list and a selected relevant tissue. Genes within a cluster share local network neighborhoods and together form a cohesive, specific functional module. Module detection enables systematic association of genes even functionally uncharacterized genes to specific processes and phenotypes represented in the detected modules. Functional modules are identified with tissue-specific networks, which predict gene interactions from massive data collections. Thus the discovered modules potentially capture higher-order tissue-specific function.
GWAS re-prioritization (NetWAS) Tissue-specific networks provide a new means to generate hypotheses related to the molecular basis of human disease. In NetWAS, the statistical associations from a standard GWAS guide the analysis of functional networks. NetWAS, in conjunction with tissue-specific networks, effectively reprioritizes statistical associations from GWAS to identify disease-associated genes. This reprioritization method is driven by GWAS discovery and does not depend on prior disease knowledge.
Chromatin effects (Sei / DeepSEA) DeepSEA is a deep learning-based algorithmic framework for predicting the chromatin effects of sequence alterations with single nucleotide sensitivity. DeepSEA can accurately predict the epigenetic state of a sequence, including transcription factors binding, DNase I sensitivities and histone marks in multiple cell types, and further utilize this capability to predict the chromatin effects of sequence variants and prioritize regulatory variants. Sei provides a global map from any sequence to regulatory activities, as represented by 40 sequence classes by integrating predictions for 21,907 chromatin profiles.
Tissue-specific variant effects (ExPecto) ExPecto makes highly accurate cell-type-specific predictions of gene expression solely from DNA sequence. With ExPecto, the tissue-specific impact of gene transcriptional dysregulation can be systematically probed ‘in silico’, at a scale not yet possible experimentally. ExPecto leverages deep learning-based sequence models trained on chromatin profiling data, and integrated with spatial transformation and regularized linear models.
Introduction: DISGENET plus is a discovery platform that integrates human gene and variant-disease associations from various expert curated databases and the scientific literature. Information extracted from expert-curated resources and directly from the literature using state-of-the-art text mining technologies (92 % F-score). All disease classes included supporting applications in any therapeutic area. And the DISGENET database also provided the CLINICAL BIOMARKERS app. The Clinical Biomarkers App contains more than 3,000 biomarkers measured in 43,000 Clinical Trials involving 2,600 conditions.
Type | Resource number |
---|---|
Gene and variant-disease associations | More than 2 millions |
Gene | 24,643 |
Genomic variants | 565,000 |
Diseases and phenotypes | 34,000 |
Installation:The database offers the Cytoscape command for scripting purposes. Install the latest version of this package by entering the following in R:
library(devtools)
install_bitbucket("ibi_group/disgenet2r")
Web server: DISGENET, DISGENET plus
Introduction: VaProS, VAriation effect on PROtein Structure and function, is a new data cloud for Structural Life Science and is the core technology to lead the collaboration between the discipline in Structural Biology and the whole Life Sciences. Led by the initiative of National Institute of Genetics, VaProS has been developed around the Integrated Structural Biology Database at Institute for Protein Research in Osaka University, together with the selected outcomes from Protein 3000 Project, Targeted Proteins Research Program, Genome Network Project and Cell Innovation Project.
Web server: VaProS,
Introduction: A database of up-to-date proteins related to autophagy (self-digestion process in eukaryotic cells). The information here includes from protein sequences to protein-protein interactions and is human curated.
Web server: Autophagy Database,
Introduction: MitoCarta3.0 is an inventory of 1136 human and 1140 mouse genes encoding proteins with strong support of mitochondrial localization, now with sub-mitochondrial compartment and pathway annotations. MitoCarta3.0, released 2020, uses manual literature curation to revise the previous MitoCarta2.0 inventory (78 added and 100 removed genes), provide annotation of sub-mitochondrial localization, and assign genes to a custom ontology of 149 mitochondrial pathways.
Web server: MitoCarta3.0,
Introduction: OMIM is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is freely available and updated daily. The full-text, referenced overviews in OMIM contain information on all known mendelian disorders and over 16,000 genes. OMIM focuses on the relationship between phenotype and genotype.
Type | Resource number |
---|---|
Gene description | 16,874 |
Gene and phenotype, combined | 27 |
Phenotype description, molecular basis known | 6,509 |
Phenotype description or locus, molecular basis unknown | 1,510 |
Other, mainly phenotypes with suspected mendelian basis | 1,750 |
Web server: OMIM,
Introduction: The Human Gene Mutation Database (HGMD) constitutes a comprehensive collection of published germline mutations in nuclear genes that are thought to underlie, or are closely associated with human inherited disease. The database contains in excess of 289,000 different gene lesions identified in over 11,100 genes manually curated from 72,987 articles published in over 3100 peer-reviewed journals.
Web server: HGMD
Introduction: CancerSEA is the first dedicated database that aims to comprehensively resolve distinct functional states of cancer cells at the single-cell level. It portrays a cancer single-cell functional state atlas, involving 14 functional states (including stemness, invasion, metastasis, proliferation, EMT, angiogenesis, apoptosis, cell cycle, differentiation, DNA damage, DNA repair, hypoxia, inflammation and quiescence) of 41,900 cancer single cells from 25 cancer types. CancerSEA allows users to query which functional states the gene or gene list of interest is related to in different cancers. Furthermore, it provides PCG/lncRNA repertoires frequently associated with functional states at single-cell resolution across all cancer types, in a specific cancer type and in individual cancer single-cell datasets. Finally, CancerSEA provides a user-friendly interface for comprehensively searching, browsing, visualizing and downloading functional state activity profiles, PCGs/lncRNAs expression profiles and 14 functional state signatures.
Web server: CancerSEA
Introduction: The Human Phenotype Ontology (HPO) provides a standardized vocabulary of phenotypic abnormalities encountered in human disease. Each term in the HPO describes a phenotypic abnormality, such as Atrial septal defect. The HPO is currently being developed using the medical literature, Orphanet, DECIPHER, and OMIM. HPO currently contains over 13,000 terms and over 156,000 annotations to hereditary diseases. The HPO project and others have developed software for phenotype-driven differential diagnostics, genomic diagnostics, and translational research.
Application: The Human Phenotype Ontology tutorial can be found at https://hpo.jax.org/app/resources/clinician-guide.
Web server: The Human Phenotype Ontology
Introduction: GeneCards is a searchable, integrative database that provides comprehensive, user-friendly information on all annotated and predicted human genes. The knowledgebase automatically integrates gene-centric data from ~150 web sources, including genomic, transcriptomic, proteomic, genetic, clinical and functional information.
Application: The GeneCards tutorial can be found at https://www.genecards.org/Guide.
Web server: GeneCards
Introduction: GeneALaCart generates a file of GeneCards annotations associated with your list of genes. For each query, you supply the “batch” of gene identifiers and select the annotations that interest you; GeneALaCart then extracts the information from GeneCards, and produces a customized results file.
Web server: GeneALaCart